SQL Troubles

28 January 2010

💎SQL Reloaded: Query Writing I (Best Practices)

In general, when creating queries, the following best practices should be considered:
- Use ANSI-compliant syntax as much as possible.
- Use aliases for all the tables.
- Design for performance reusability and security.
- Reduce unnecessary network traffic.
- Group together the attributes coming from the same table.
- Rename columns in order to avoid confusion, though don’t overreact with this practice.
- Use uniform coding style, formatting and naming conventions
- Use indexed join predicates.
- Learn about the strengths/weaknesses of each feature before using it.
- Use versioning & keep older versions. - Test the queries.
- Write unit tests first [2].
- Handle missing values (NULLs).
- Document database objects inline and specific documents (e.g. Data Dictionaries, Functional Specifications).
- Refactor code
- Use SQL tunings tools.
- Write tiny chunks of code: encapsulate formulas and business logic in functions [2], avoid inline scalar functions.
- Defensive coding: use exception handling, consider all scenarios.

Things to avoid:
- Complex expressions in search conditions [1].
- Join predicates on expressions [1].
- Expressions over columns in local predicated [1].
- Data types mismatches on join columns [1].
- Non-equality join predicates [1].
- Unnecessary outer joins [1].
- Redundant predicates [1].
- Multiple aggregations with DISTINCT.
- Build queries dynamically unless necessary.
- Techniques that use full-table scan: functions that don’t use/perform poor on indexes, wildcards at the beginning of a word.
- Use more attributes/records than needed (particular case: Use * instead of specifying the attributes).
- Using nested views.
- Rely entirely on the code created by wizards and other automation tools.
- UNION unless really needed: use UNION ALL.
- UNIONS instead of conditional-base code (e.g. CASE, DECODE) or self-joins.
- Using temporary tables.
- Server side cursors.
- Procedural queries (e.g. loops, cursors) rather than using set-based queries.
- Redundant logic/code.
- Negations on constraints.
- Code facilitating SQL injection: use parameterized objects.
- Hard-coding values.
- Undocumented functionality.
- Use GROUP BY on final sub-query when it could be used in a sub-query.
- Multiple self-joins/joins to same table instead of GROUP BY.
- Repetitive calls to the same function and same parameters.
- Use constants in ORDER BY clause.
- Create too many versions of the same query.

Note:
Please note that there are situations and situations, a technique not recommended in general could prove to offer better performance than alternatives (e.g. : recursive simulation + temporary table vs. hierarchical self-joins on SQL Server 2000 ), while for others there are several aspects that need to be considered, for example the trade in performance vs. reusability. Even if most of the database vendors adhere to SQL ANSI standard, in the end each functionality could be implemented differently and vendors could provide additional functionality, therefore could be considered specific best practices for each functionality/vendor.

References:
[1] IBM (2010) Best Practice - Writing and Tuning Queries for Optimal Performance [link]
[2] Oracle (2009) Cleaning Up PL/SQL Practices, by S. Feuerstein [link]

Resources:
Microsoft TechNet. (2010). SQL Server - Best Practices [link]

27 January 2010

🧭Business Intelligence: Cognos & SQL Server 2008 - The First Report (Test Drive)

Yesterday evening I installed on my computer the IBM Cognos Express (V9) software, the newest Cognos version, available with a trial period of 30 days. The first attempt to install it was not errorless though as I had to observe several attempts later when I got the same error, the error was not fatal. So, once the instalation complete, the Express Manager started, finding out that I have to install manually the Manager, Advisor, Reporter and Xcelerator - quite easy to do.

Now I was ready to test the tool and create several reports, for this needing to create first a data source - a connection to an existing database with several tables on which the reports will be based. SQL Server 2008's AdventureWorks database seemed to be adequate for this test, in plus I know there are many SQL Server professionals already using it for testing or tutorials.

Step 1: Creating the SQL Server account

Reporting tools need in general only read access to data sources and as a good practice to enforce security and address the special requirements the reports are coming with, so it makes sense to create a new User for this. Open Microsoft SQL Server Management Studio and from the root navigate to Security/New/Login…, this action brining Login – New window in which must be filled

a. the Login name (e.g. cognosuser)
b. the authentification – because Cognos doesn’t supports Windows-based authentication, choose SQL Server authentication, enter the Password and confirm it. By default the ‘Enforce password expiration’ is checked, though this option is not needed for the current requirement so unchek it.
c. Select the Default database, the database you’re intending to base your reports on (e.g. AdventureWorks)
d. Select the ‘Default Language’ (e.g. English)

Navigate to ‘Use Mapping’ tab and in ‘User mapped to this login’ choose the databases you would like to give access (e.g. AdventureWorks) and for each database selected you’ll have to select the schemas in scope (e.g. HumanResources, Person, Production, Purchasing and Sales), while in’Database role membership for’ select the db_datareader database level role name, then you can click ‘Ok’.

It’s always a good idea to test the new account before attempting to use it from an external application, this allows catching the eventual issues early in the process, saving your and others’ time!

Step 2: Creating a Data Source

Start the IBM Cognos Express Manager and from Data click on ‘Add…’ button, action that will bring the ‘Enter Data Source Information’ in which you’ll have to select the ‘Data Source Type’, in this case Microsoft SQL Server (SQL 2005 Native Client), fill the ‘Name’ (e.g. AdventureWorks), ‘Server Name’ (here comes your SQL Server name), ‘Database Name’ (e.g. AdventureWorks ), ‘User ID’ (e.g. cognosuser) and password (e.g. something you must guess). As can be seen from the below image, there is also a Databae Object control, which will be populated once you entered the before mentioned information and clicked on ‘Retrieve Database Objects’ button. Select one of the database objects (e.g. AdventureWorks ) and click ‘Ok’, Cognos creating the Data (Source) in the background.

Step 3: Creating a Report

Instead of using Report Studio, the standard tool for designing reports, for simple reports or queries could be used instead the Query Studio, an easy-to-use authoring tool, the creation of a report resuming to a simple drag and drop. For this you’ll have to open the Query Studio from the Launch item found on the top menu, this action opening a new browser window, in a next step following to select the package (actually the data source) you want to use. In ‘Insert Data’ mode select a table from the list (e.g. Location) and drag and drop it into the right section, all table’s columns being shown together with a sample set of data.

You could double click on each column and give it a meaningful name – is a good practice to separate the words with spaces though not really necessarily. By clicking on report’s Title you could change its actual value and add a Subtitle if necessary. In ‘Change Layout’ you could change report’s layout by applying different Font, Border and Conditional Styles, much like in Excel, show row numbers by clicking the ‘Show row numbers’ from Change Layout/Set Web Page Size, etc. Save the report once you’ve done the changes, the report will look something like the one from the below image:

Step 4: Running the Report

From ‘Run Report’ Menu item you could choose to run the report mainly in three modes – ‘Run with All Data’, ‘Preview with Limited Data’ or ‘Preview with No Data’. Additionally you can export the report to PDF (Portable Document Format), several versions of Excel (2000, 2002 and 2007), CSV (Comma Separated Values) and XML (eXtensible Markup Language).

Happy coding!

25 January 2010

🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)

Data Management Series

Barry Boehm defines structuredness as 'the degree to which a system or component possesses a definite pattern of organization of its interdependent parts' [1], which transposed to data refers to the 'pattern of organization' that can be observed in data, mainly the format in which the data are stored at macro-level (file or any other type of digital containment) or micro-level (tags, groupings, sentences, paragraphs, tables, etc.), emerging thus several levels of structure of different type.

From the various sources in which data are stored - databases, Excel files and other types of data sheets, text files, emails, documentation, meeting minutes, charts, images, intranet or extranet web sites, can be derived multiple structures coexisting in the same document, some of them quite difficult to perceive. From the structuredness point of view data can be categorized as structured, semi-structured and unstructured.

In general, the term structured data refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations. Unstructured data refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined, while semi-structured data refers to islands of structured data stored with unstructured data, or vice versa.

From this perspective, according to [3], database and file systems, data exchange formats are example of semi-structured data, though from a programmers’ perspective the databases are highly structured, and same for XML files. As also remarked by [2] the terms of structured data and unstructured data are often used ambiguously by different interest groups, in different contexts – web searching, data mining, semantics, etc.

Data structuredness is important especially when is considered the processing of data with the help of machines, the correct parsing of data being highly dependent on the knowledge about the data structure, either defined beforehand or deducted. The more structured the data and the more evident and standardized the structure, the easier should be to process the data. Merrill Lynch estimates that 85% of the data in an organization are in unstructured form, most probably this number referring to semi-structured data too. To make such data available in a structured format is required an important volume of manual work combined eventually with reliable data/text mining techniques, a fact that reduces considerably the value of such data.

Text, relational, multidimensional, object, graph or XML-based DBMS are in theory the most easily to process, map and integrate though that might not be so simple as it looks given the different architectures vendors come with, the fact that the structures evolve over time. To bridge the structure and architectural differences, many vendors make it possible to access data over standard interfaces (e.g. ODBC), though there are also systems that provide only proprietary interfaces, making data difficult to obtain in an automated manner. There are also other types of technical issues related mainly to the different data types and data formats, though such issues can be easily overcome.

In the context of Data Quality, the structuredness dimension refers to the degree the structure in which the data are stored matches the expectations, the syntactic set of rules defining it, being considered across the whole set of records. Even a minor inadvertence in the structure of a record could lead to processing errors and unexpected behavior. The simplest example is a delimited text file - if any of the character sets used to delimit the structure of the file is available in the data itself, then there are high chances that the file will be parsed incorrectly, or the parsing will fail unless the issues are corrected.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Barry W Boehm et al (1978) "Characteristics of software quality"
[2] The Register (2006) "Structured data is boring and useless", by D. Nortfolk (link)
[3] P Wood (?) "Semi-structured Data"

23 January 2010

🧊🎡Data Warehousing: ETL (Part I: It's All about Synchronization)

Data Warehousing Series

In a previous post on ETL: SSIS packages vs. SQL code I mentioned that there are three types of synchronization - of scope, business logic and data. The first type targets synchronizing the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data.

The synchronization of scope is achieved by enforcing the same set of constraints to related data elements with the purpose of keeping the referential integrity between the various data elements. They involve parent-child relations like Purchase orders (PO) Headers/Lines or reference/referenced relations like PO Lines and Products or Vendors. Such relationships can involve datasets coming from distinct systems, usually involving different architecture - e.g., Products or Bill of Materials master data that could come from a Product Management Information (PMI) system.

The synchronization of business logic is usually a consequence of the synchronization of scope and concerns the set of business rules that must be applied in the logic used. For example, is need to consider Products or Vendors only for the open POs. Therefore, that would involve the duplication of logic for open POs across multiple database objects, logic that needs to be synchronized accordingly. One can reference directly the object encapsulating the logic for open POs, though that would create a recursive reference as the logic references the Products and/or Vendors as well. With one or two tricks this can be avoided.

Synchronizations can involve local logic, e.g. using exchange rates conversions and other transformations where a unique user-defined function could allow achieve this. No matter of the techniques used, one must make sure that all the logic is kept in synch.

The synchronization of data requires working with the same unaltered data in the various threads of processing, either SSIS packages or SQL code; that's quite simple as concept, though not always straightforward to achieve and occasionally ignored. For this purpose, it makes sense to load the data into an intermediary database or staging area, even for slowly changes data sources like data warehouses, and base the business logic on the respective local dataset(s) rather than loading the data repeatedly for each package from the source, which can also involve considerable additional network traffic. Why the alternative is not the best approach?

Supposing that two packages A and B are scheduled to run at the same time, even if the requests are sent simultaneously to the database, the simultaneously is relative, because most of the time the requests are queued, in general following to be processed in the order they came, though this depends on legacy system’s architecture and settings. Supposing that there is some time elapsed between the times the two requests are processed, there are good chances that one record was created, deleted, or updated in between.

The impact of changes in data could be minimum though strange situations might result with unpredictable impact. There are chances to find the issue when loading the data in the destination system, this if adequate validation is performed, though there are good chances for the issues to remain undiscovered until later, with all consequences resulting from this. Therefore, one should also build validation logic separately when feasible.

It should be targeted to cover all three types of synchronization even if it complicates the design of the solutions, as it's recommended in general to apply defensive architecting/programming. The increase in complexity is relative if one considers the effort needed for troubleshooting, plumbing or even redesign that needs to be done to fix the issues.

Previous Post <<||>> Next Post

Created: Jan-2010, Last Reviewed: Mar-2024

🧊🎡Data Warehousing: ETL (SSIS packages vs. SQL code)

I’ve been working with DTS (Data Transformation Services) packages for the past 7-8 years, finding SQL Server’s 2000 functionality pretty useful in handling ETL (Extract, Transform, Load) tasks, especially when importing and exporting data between multiple data sources. The functionality provided by the DTS platform was basic, though it could be extended in custom packages with code developed in VB or by using ActiveX tasks in VBScript/Jscript, when the SQL-based logic was not enough. In addition, all three types of coding could make calls to .dlls, and thus operating system, vendor or in-house built libraries or simple functions could be (re)used.

Starting with SQL Server 2005, DTS was replaced by the SSIS (SQL Server Integration Services) component, becoming with SSRS (SQL Reporting Services) and SSAS (SQL Server Analysis Services) integrant part of Microsoft's data platform. Besides providing a new architecture, SSIS extended the functionality previously available, bringing more flexibility in constructing the packages, their elements and data manipulation. SSIS became with each new version of SQL Server (2008, 2012, 2014, 2016, 2017, 2019) a powerful ETL tool that can be easily used for Data Warehousing, Data Migration or Data Integration projects.

In what concerns ETL we could say there are two main philosophies - have as much of the business logic in the (ETL) package, or use the package mainly for loading data from the various sources and have the business logic in the database as SQL-based code. As always, each of the two philosophies has its own advantages and disadvantages, though I would consider a third philosophy – design for performance, reuse and maintainability, resulting thus a hybrid between the first two mentioned philosophies.

There are several other factors that need to be considered when building ETL solutions - synchronization, testability, security, stability, scalability, complexity, learning curve, etc. I would say that there is no perfect receipe, no architecture matching all requirements, each solution coming with its needs and constraints, sometimes being a good idea to go one level of abstraction above the requirements, while other times is better to stick to the requirements and problem at hand.

Performance

Designing for performance resumes in choosing the architecture/methods that provides the best performance individually and as a whole – either using package-based functionality, SQL-based functionality or a combination of both. In general SQL code is best suited for query-based data manipulation, while packages are better suited for sequential or workflow-based processing of data, though there could be exceptions and exceptions. Often it’s a good idea to test the performance of all alternative approaches via a prototype, even if in time the developer arrives to a good knowledge of the methods that suit best from a performance point of view.

Reuse

Each of the two architectures allow a lower or higher degree of reuse using parameters, variables and compartmenting of code, maybe with a plus for SQL code which has in theory a greater maturity and flexibility than the package-based functionality, allowing a wider range of reuse resulted from the compartmenting of code in the various supported objects (stored procedures, functions, views or tables).

Maintainability

Maintainability, the easiness of modifying packages or code, is an important factor because few are the cases in which the logic doesn’t change over time, many projects having to deal with change in requirements, sometimes implying a 180 degrees change of the overall perspective. Therefore it's important to be able to modify the code/package and even change the architecture with a minimum of effort.

Refactoring

Refactoring resumes in modifying the code without changing the functional behavior in order to improve the performance, readability, maintainability or extensibility, minimize the use of resources, remove redundant code, adhere to standards, best practices or naming conventions. It is said that there is always place for improvement, performance being the dimension with the most important impact in the world of databases. Refactoring is not always necessary, though it’s a good practice for achieving high quality solutions.

Synchronization

I would say there are three types of synchronization – of scope, business logic and data, the first targeting to synchronize the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data. Considering the example of Invoices – Headers and Lines, the synchronization of scope would resume to apply the same constraints for the two, assuring that there is no Invoice Header without Lines, and vice-versa; the data synchronization between the two referring to the fact that the data between the two data sets should be consistent, there should be no change in Invoice Headers not reflected also at Line level (e.g. total amount matching between Lines and Header). Business logic synchronization refers typically to the use of the same set of data for similar purposes, if several transformations were used for Invoice Headers, they should be reflected accordingly also at Invoice Line level. Synchronization it’s actually quite an important topic, therefore I will reconsider it in a further post.

Testing & Troubleshooting

I find that the business logic implemented in SQL code is much easier to test than the logic implemented in packages, because in the first situation each object and step could be in theory tested individually/progressively, being thus much easier to troubleshoot. Troubleshooting packages logic can be quite complex because is not always possible to view the input/output for each intermediary step.

Complexity

Complexity is reflected in the easiness of understanding the logic broken down to pieces and as a whole. Packages are highly visual, being in theory easier to identify and understand the steps and their flow, even by non-technical people, while SQL-code might need auxiliary representations for the same purpose (e.g. data flow diagram) and need a higher level of expertize.

Security

Security is always a sensitive and complex topic, and in general it resumes to how secure is the code and sensitive information stored (e.g. user name & password, data), who has access to execute it and the context in which the code is run. This can become easily quite a complex topic, being highly dependent on the architecture used.

Stability

We can discuss about platform and design stability, which can be often a matter of perception and experience. Both SSIS and database engines could be considered as stable development environments, the later having in theory a greater maturity and flexibility, flexibility that could be easily brought to extreme creating bad coding monsters (e.g. loop calls to .dll libraries), impacting thus design’s stability, which is correlated to the adequate use of functionality, techniques and resources – each technology has its does and don’ts, strength and weakness.

Deployment

Deployment of business logic on the server can be quite easy or quite complex, depending on the overall architecture, the number of configurable items in scope, and the complexity of the dependencies existing between them. The deployment usually resumes at copying the code from one location to another, installing the eventual dependencies and configuring the objects for use.

Scalability

Scalability in this context refers mainly to the degree the business logic can cope with the increased volume of records, and not necessarily with the number of requests, though this aspect could be considered too, after case. SSIS and the database engines are designed to be highly scalable, though there are architectures and architectures, good uses and misuses of techniques. Designing for performance in theory equates with good scalability, unless the requirements makes it difficult to have a scalable solution.

Learning curve

The learning curve of technologies is always an important factor that needs to be considered in development, as it reflects how much time an average developer needs to master the basic/average/complex functionality provided by the respective technology. For ETL development is in general requested an average knowledge of both - SSIS architecture, respectively SQL-based programming - though it’s not easy to acquire both problem-solving mindsets. A SSIS developer usually attempts using as much as possible the functionality provided by SSIS, while a SQL developer the SQL-based functionality. In the end, it’s important to know how to balance between the two.

Previous Post <<||>> Next Post

19 January 2010

🧭Business Intelligence: Enterprise Reporting (Part III: Levels of Accuracy)

Business Intelligence

Correlated with the level of detail (aka granularity), another aspect that needs to be considered in reports is results’ levels of accuracy – how much the data reflect the reality at each level of detail. Because the integration between modules of the same or different systems brought some of the attributes from one module into the other (e.g. Document Numbers, UIDs, Dates), the reports can consider the respective attributes in queries without involving the modules they come from. Thus for example in order to create an AP (Account Payables) Invoices report, could be sufficient to use the GL (General Ledger) Date stored in AP without using the GL data directly; on the other side, a more accurate report is obtain when using the AP and GL data, such a report could include also the manual postings made in GL and not available in AP.

Typically the data from one module that are stored in other modules, should be in synch, though there could be exceptions and exceptions. In the end it is developer’s task to understand the level of accuracy/detail the users need; in general it doesn’t always make sense to provide the highest level of accuracy/detail when the user needs only some rough number, as higher level of accuracy/details equate with more effort and more tables added to a report.

When considering the different levels of accuracy/detail within reports, given two modules/entities A and B, we might end up created a set of reports with:
a.  all records from A
b.  all records from B;
c.  all records from A and matching records from B, and vice-versa;
d.  all records from A and aggregated data from B when the level of detail is changed.
e.  aggregated data from A vs. aggregated data from B in case of many-to-many cardinality, the aggregation being made at a level of detail in which the amounts are not duplicated.
f.  mismatches between A and B for the same scope (it can be shown at different levels of details).

These scenarios would allow Users to choose the report with the needed level of detail/accuracy, though covering all existing scenarios could be quite expensive, not to neglect the fact that there are reports spanning more than 2 modules. On the other side, two reports for A and B would be sufficient as long as the reference attributes between modules are provided, falling in Users’ task to match and aggregate the data, though also this alternative could prove to be problematic because of the volume of data, synchronization issues between data sets or lack of adequate skill set, and all related issues. The first solution is the ideal, the second is workable, though in the end it depends what makes the users happy.

Previous Post <<||>> Next Post

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part VI: Referential Integrity)

Data Management Series

Referential integrity, when considered as data quality dimension, refers to the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value). Typically, that's assured by design in Database Management Systems (DBMS) using a feature called referential integrity that defines an explicit relationship between the two tables that makes sure that the values remain valid during database changes. Thus, when a record is inserted or updated and a value is provided for the reference value, the system makes sure that the referenced value is valid, otherwise it throws a referential integrity error. A similar error is thrown when one attempts to delete the record with the referenced value as long is referenced by a table on which the relationship was explicitly defined.

Using referential integrity is a recommended technique for assuring the overall integrity of the data in a database, though there are also exceptions when that's not enforced for all tables (e.g. data warehouses) or only for exceptions (e.g. interface tables where records are imported as they are, attribute whose values references data from multiple tables). Therefore, even if there are tables with the referential integrity enforced, don't make the assumption that it applies to all tables!

In relational DBMS there are three types of integrity mentioned – entity, referential and domain integrity. Entity integrity demands that all the tables must have a primary key that contains no Null values. The referential integrity demands that each non-null value of a foreign key must match the value of a primary key [1], while the domain integrity demands that the type of an attribute should be restricted to a certain data type, the format should be restricted by using constraints, rules or range of possible values [2].

Even if not mandatory, all three types of integrity are quintessential for reliable relational databases. When the referential integrity is not enforced at database level or at least in code, when a record from a table is deleted and a foreign key it’s still pointing to it, fact that could lead to unexpected disappearance of records from the system’s UI even if the records are still available.

During conversions or data migrations is important to assure that the various sets loaded match the referential and domain integrity of the database in which the data will be loaded, otherwise the records not respecting the mentioned type of integrity will be rejected. The rejection itself might not be a problem for several records, though when it happens at large scale, then the situations changes dramatically, especially when the system gives no adequate messages for the cause or rejection. A recommended approach is to assure that the scope is synchronized between the various data elements, and that the referential integrity of datasets is validated before the data are loaded in the destination database.

There are several sources (e.g. [3]) that consider Codd’s referential integrity constraint as a type of consistency, in the support of this idea could be mentioned the fact that referential integrity could be used to solve data consistency issues by bringing the various LOV in the systems. Referential integrity is mainly an architectural concept even if it involves the 'consistency' of foreign key/primary key pairs.

Note:
Expect the unforeseeable! It’s always a good idea to check whether the referential integrity is kept by a system – there are so many things that could go wrong! In data migration solutions, data warehouses and more general analytical solutions is a good idea to have in place mechanisms that check for this kind of issues.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Halpin. T. (2001) Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. Morgan Kaufmann Publishers. ISBN 1-55860-672-6.
[2] MSDN. 2009. Data Integrity. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms184276.aspx (Accessed: 18 January 2009)

[3] Lee Y.W., Pipino L.L., Funk J.D., Wang R.Y. (2006) "Journey to Data Quality", MIT Press. ISBN: 0-262-12287-1

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

Data Management Series

IEEE defines consistency in general as "the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component" [4]. In respect to data, consistency can be defined thus as the degree of uniformity and standardization of data values among systems or data repositories (aka cross-system consistencies), records within the same repository (aka cross-record consistency), or within the same record at different points in time (temporal consistency) (see [2], [3]).

Unfortunately, uniformity, standardization and freedom can be considered as data quality dimensions as well and they might even have broader scope than the one provided by consistency. Moreover, the definition requires further definitions for the concept to be understood, which is not ideal.

Simply put, consistency refers to the extent (data) values are consistent in notation, respectively the degree to which the data values across different contexts match. For example, one system uses "Free on Board" while another systems uses "FOB" to refer to the point obligations, costs, and risk involved in the delivery of goods shift from the seller to the buyer. The two systems refer to the same value by two different notations. When the two systems are integrated, because of the different values used, the rules defined in the target system might make the record fail because it is expecting another value. Conversely, other systems may import the value as it is, leading thus to two values used in parallel for the same meaning. This can happen not only to reference data, but also to master data, for example when a value deviates slightly from the expected value (e.g. misspelled). A more special case is when one of the systems uses case sensitive values (usually the target system, though there can be also bidirectional data integrations).

One solution for such situations would be to "standardize" the values across systems, though not all systems allow to easily change the values once they have been set. Another solution would be to create a mapping as part the integration, though to maintain such mappings for many cases is suboptimal, but in the end, it might be the only solution. Further systems can be impacted by these issues as well (e.g. data warehouses, data marts).

It's recommended to use a predefined list of values (LOV) - a data dictionary, an ontology or any other type of knowledge representation form that can be used to 'enforce' data consistency. ‘Enforce’ is maybe not the best term because the two data sets could be disconnected from each other, being in Users’ responsibility to ensure the overall consistency, or the two data sets could be integrated using specific techniques. In many cases is checked the consistency of the values taken by one attribute against an existing LOV, though for example for data formed from multiple segments (e.g. accounts) each segment might need to be checked against a specific data set or rule generator, such mechanisms implying multi-attribute mappings or associational rules that specify the possible values.

As highlighted also by [1], there are two aspects of consistency: the structural consistency in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency in which each value has a unique meaning (e.g. only n/a is allowed to highlight missing values). It should be targeted to have the data semantically consistent, to avoid confusion, accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, they allow ensuring the semantic consistency of concepts/entities, though for most of the cases simple single or multi-attribute lists of values are enough.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Chapman A.D. (2005) "Principles of Data Quality", version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen
[2] David Loshin (2009) Master Data Management

[3] IEEE (1990) "IEEE Standard Glossary of Software Engineering Terminology"

[4] DAMA International (2010) "The DAMA Dictionary of Data Management" 1st Ed.

17 January 2010

🗄️Data Management: Data Quality Dimensions (Part IV: Accuracy)

Data Management Series

Accuracy refers to the extent data is correct, matching the reality with an acceptable level of approximation. Correctness, the value of being correct, same as reality are vague terms, in many cases they are a question of philosophy, perception, having a high degree of interpretability. However, in what concerns data they are typically the result of measurement, therefore a measurement of accuracy relates to the degree the data deviate from physical laws, logics or defined rules, though also this context is a swampy field because, utilizing a well-known syntagm, everything is relative.

From a scientific point of view, we try to model the reality with mathematical models which offer various level of approximation, the more we learn about our world, the more flaws we discover in the existing models, it’s a continuous quest for finding better models that approximate the reality. Things don’t have to be so complicated, for basic measurements there are many tools out there that offer acceptable results for most of the requirements, on the other side, as requirements change, better approximations might be required with time.

Another concept related with the ones of accuracy and measurement systems is the one of precision, and it refers to degree repeated measurements under unchanged conditions lead to the same results, further concepts associated with it being the ones of repeatability and reproducibility. Even if the accuracy and precision concepts are often confounded a measurement system can be accurate but not precise or precise but not accurate (see the target analogy), a valid measurement system targeting thus both aspects. Accuracy and precision can be considered dimensions of correctedness.

Coming back to accuracy and its use in determining data quality, typically accuracy it’s strong related to the measurement tools used, for this being needed to do again the measurements for all or a sample of the dataset and identify whether the requested level of accuracy is met, approach that could involve quite an effort. The accuracy depends also on whether the systems used to store the data are designed to store the data at the requested level of accuracy, fact reflected by the characteristics of data types used (e.g. precision, length).

Given the fact that a system stores related data (e.g. weight, height, width, length) that could satisfy physical, business of common-sense rules could be used rules to check whether the data satisfy them with the desired level of approximation. For example, being known the height, width, length and the composition of a material (e.g. metal bar) could be determined the approximated weight and compared with entered weight, if the difference is not inside of a certain interval then most probably one of the values were incorrect entered. There are even simpler rules that might apply, for example the physical dimensions must be positive real values, or in a generalized formulation - involve maximal or minimal limits that lead to identification of outliers, etc. In fact, most of the time determining data accuracy resumes only at defining possible value intervals, though there will be also cases in which for this purpose are built complex models and specific techniques.

There is another important aspect related to accuracy, time dependency of data – whether the data changes or not with time. Data currency or actuality refers to the extent data are actual. Given the above definition for accuracy, currency could be considered as a special type of accuracy because when the data are not actual then they don’t reflect reality. If currency is considered as a standalone data quality dimension, then accuracy refers only to the data that are not time dependent.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

🧭Business Intelligence: Enterprise Reporting (Part II: Levels of Detail)

Working with data mainly from frontend (User Interface), in general Users have limited knowledge on how the data are physically stored in the various systems existing in an organization. When a new report is needed they point out the attributes they know from the screens they are working with, falling in developers’ duties to figure out whether the “soup of attributes” makes really sense and find a workable solution. Once the developer has identified the attributes and the tables they are stored in, he/she can go on and create the query/queries on which the report will be based upon. For this task is important to know how the various tables relate to each other, in other words knowing the attributes which can be used to link the tables or which is the relational path to link the table, and relations’ cardinality, which reflects how the number of records of a table or data set changes when is joined with another table or data set.

Between two tables and extensively two data sets the relations could have any of the four types of cardinality – many-to-many (represented as m:n), one-to-one (1:1), one-to-many (1:n) and the reverse many-to-one (n:1). It could be given a definition for each of the cardinality types, though it’s easier to remember that in the x-to-y compounds, between two tables or data sets A and B, x has the value ‘one’ when the number of references from table B to any record from A is maximum 1 (there could be records not referenced at all), and ‘many’ when at least a record could be referenced more than once; the same logic applies for y but inversing the tables’ perspective.

The level of detail (LOD) of a report (aka data granularity) is directly correlated with the changes in cardinality - by adding a data set to an existing data set, if the cardinality of one-to-many and many-to-many is implied then the level of detail changes too. In other words if a record in a given data set is referenced more than once in the new added table then the LOD changes. A simple example of change of LOD occurs in parent-child/master-details relations. For example if the Invoice Lines are added to a data set based on Invoice Header then the LOD changes from Invoice Headers to Invoice Lines. If the Payments are added to the resulted data set then the LOD changes from Invoice Lines to Payments only if there are multiple Payments for an Invoice (one-to-many or many-to-many cardinalities), otherwise LOD remains the same.

Summarizing, the level of detail of a report is the lowest level at which a cardinality change occurs at entity level. It can be discussed about lower or higher LOD in relation to a given level of detail, for example Invoice Payments have a lower LOD than Invoice Lines, while Invoice Header a higher LOD.

Why is it necessary to introduce the LOD especially when considering a relatively complex notion as cardinality?! It worth defining it mainly because the term is used when defining/mitigating reporting requirements, its use being intuitive rather than well-defined and understood. When creating reports it’s important to find the adequate LOD for a report and know what methods can be used in order to pull data that would normally change reports’ LOD without actually changing reports’ LOD.

The limitations imposed by the need to report at a certain LOD higher than the one implied by the row data can be overcome with the help of aggregate functions used with GROUP BY constructs. Such constructs make it possible to bring into a report data found at lower LOD than the reporting LOD by grouping the data based on a set of attributes. The aggregate functions provide functionality to calculate for a set of values the maximum, minimum, sum, count of records, standard deviation, and other statistical values, all of them working on numeric data types, while maximum/minimum and count of records work also with dates or alphanumeric data types.

For example using aggregate functions the Invoice amounts can be aggregated and shown at Invoice Header level, however, in contrast, it’s not possible to do the same with quantities, because typically each Invoice Line refers to a specific Product, and as apples can’t be counted with peaches, this means that in order to see the quantities, the level of detail has to be changed from Invoice Header to Invoice Line. The same technique could be applied to similar parent-child (master-details) relations covering one-to-many or one-to-one cardinality, and also many-to-many relations that involve higher reporting complexity.

Direct many-to-many relations are seldom, involving mainly data sets, the attributes used in relation appearing more than once in each dataset (at least once). Taking an example from ERP world, there could be multiple Receipts and Invoices for the same Purchase Order, thus if there is no direct match between Receipts and Invoices to identify uniquely which Invoice belong to each Receipt, a report involving all the three entities could be barely usable, records being duplicated (e.g. 2 Invoices vs. 3 Receipts result in 6 records in a such scenario). For such reports to be usable the LOD needs to be change at a higher level at least on one side, thus reconsidering the mentioned example a report could show the Purchase Order details and aggregating the Invoice & Receipt Quantities/Amounts at Purchase Order level, or show detailed Invoices with aggregated Receipt information, respectively detailed Receipts with aggregated Invoice information.

Unfortunately even if the aggregate functions are quite handy they have their own limitations, being difficult to use them in order to answer to questions like “what was the last product sold for each customer”, “which was latest invoice placed for each customer”, at least not without considerable effort from developer’s side. Fortunately database vendors implemented their own more specialized type of aggregate functions - analytic functions in Oracle and window functions in SQL Server, they allowing modeling such questions with a report. The downside for developers is that each database vendor comes with its own philosophy, so techniques and features working in one database might not work in another.

Previous Post <<||>> Next Post

15 January 2010

🕋Data Warehousing: Dimension Hierarchy (Definitions)

"An arrangement of members of a dimension into levels based on parent-child relationships, such as Year, Quarter, Month, and Day or Country, Region, State or Province, and City. Members in a hierarchy are arranged from more general to more specific." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A navigation path that allows the user to move from summarized to detailed information with each level of the hierarchy represented by a different attribute." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A series of master-detail relationships within a dimension. Many front-end tools require hierarchy definitions in order to support drilling features. Some aggregate navigation and construction tools, such as Oracle's materialized views, require explicit declaration of a hierarchy if a dimension is to be partially summarized by an aggregate fact table." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A logical tree structure that organizes the members of a dimension such that each member has one parent member and zero or more child members." (Microsoft, "SQL Server 2012 Glossary", 2012)

14 January 2010

🗄️Data Management: Data Cleansing (Part I: An Introduction)

Data Management Series

Even if often Data Cleansing or Data Scrubbing is considered synonymous to Data Cleaning, it actually includes the Data Validation and Data Cleaning process, where Data Validation is the process of checking data quality against a set of rules (referred as data quality rules) in a given set of data, and Data Cleaning or Data Correction is the process of correcting the data quality issues identified in Data Validation. Data Cleansing consists in general of the following steps:

Step 1: Defining the scope

The scope definition resumes at the identification of elements, attributes and the records that need to be cleaned. If is intended only to improve overall data quality and nothing else, the focus will be mainly on the data elements and attributes perceived as having low data quality. For data conversion the elements in scope are defined based on the requirements of the destination system and overall business requirements are defined the attributes in scope, an exercise that needs to be performed is the data mapping between the destination on one side and the legacy system(s) and the other data sources (e.g. Excel, MS Access files) on the other side.

From case to case in scope for cleansing might be all the records available in the legacy system(s) for a given set of data elements and attributes, or only a fraction of them, usually the focus is mainly the active and open records, and eventually some historical records. By active records are referred those records not cancelled or disabled, either master and transactional data , by open records the (transactional) records not marked as closed, records on which an action is still pending (e.g. pending Invoice Receipt, Payment, etc.), while by historical records are referred the already closed transactional data (e.g. Purchase Orders, Sales Orders, etc.).

Given the importance of master data for the business they are often the starting point for cleaning, they being also the data elements with the longest issue resolution, the transactional data being in theory easier to clean. The two types of data need to be synchronized the referent data related to the transactional data in scope needs to be loaded too.

Step 2: Defining the data quality rules and error types.

The data quality rules are defined for the attributes in scope and the data quality dimensions considered that apply, typically data accuracy, completeness, conformity, consistency, duplication and referential integrity. The issues are eventually further categorized and prioritized.

Step 3: Validating the data quality rules

The data quality rules are validated against the actual data sets, the output of the rules are consolidated in reports that follow to be interpreted. The number of errors are reported and stored for historical trending in case several iterations are needed for data cleaning.

Step 4: Identifying the data quality issues.

The reports are reviewed by the data owners, the issues being identified and the rules validated, further steps being performed in order to identify the root causes for the discovered issues. This usually leads to the calibration of data quality rules as new scenarios (e.g. exceptions, new error types) can be discovered.

Step 5: Correcting the data quality issues.

The issues are mitigated and corrected, issues’ resolution time, the time needed to correct the issues, is directly dependent on the volume of issues, their complexity, the number of resources allocated to the task, on whether the issues are corrected in the source or on a copy of the data, on whether the cleansing is done manually or specific software tools are used. Specific Data Cleaning software tools can help decreasing considerably the volume of manual work involved during the process, being preferred to use automatic and semiautomatic data cleansing methods whenever is possible.

Step 6: Validating the changes against the same data quality rules.

Once the issues are corrected the data sets are validated against the same rules used at Step 3, the Steps 3 to 5 being repeated until the expected data quality level is achieved. As already highlighted in a previous post, Data Cleansing is not a one-time endeavor, but an iterative process, new iterations being requested by changes occurred in rules or in scope, by the sequencing/scheduling of cleansing work or by reporting frequency (centered or not along the important milestones).

Step 7: Documenting/Reporting error results and validation outcomes, if needed.

Starting with data elements, attributes, and the logic used to select the records in scope for cleaning, and ending with the data quality reports with the issues and the improvement trends it makes sense to document everything and summarize it in a set of best practices and lesson learned sessions.

Step 8: Modify processes/procedures to reduce issues occurrence.

There are chances to discover that the errors found during validation could be avoided by modifying the existing processes, procedures or standards, or enforcing new validation rules in the legacy systems. When the changes are complex might be started even individual projects to address them.

13 January 2010

🧭Business Intelligence: Enterprise Reporting (Report Types)

Have you ever wondered how many types of reports are there? In Information Systems (IS) nomenclature I found the following different types of reports considered:

Standard reports – reports that are coming with a software application/package, as opposed of custom reports, reports created on customers’ request.

Ad-hoc reports – reports built usually to satisfy one-time requests, though they can easily evolve to a standard report.

Graphic reports- reports providing graphical visualization of data with the help of charts

Transactional reports (OLAP reports) – reports built in transactional systems, containing up-to-date data.

Analytic reports (OLAP reports) – reports built in an OLAP environment, containing data desynchronized from the OLTP environment, the data being refreshed on a periodic basis.

Predictive reports – reports relying on powerful DM models and predictive technique.

Parameterized reports – reports whose output is based on a set of predefined parameters.

Linked reports – reports that provide an access point to other reports.

Snapshot reports – reports that contain data retrieved at a specific point of time.

Cached reports – reports saved in order to improve the performance by reducing the number of requests to the database/report engine.

Click-through reports – reports whose display is based on interactive data selection

Drilldown reports – a set of reports on the same topic showing data at different levels of details, the navigation being made from higher to lower level of details.

Drill-through reports – reports accessible through a hyperlink from the original report.

Sub-reports – a report contained in the body of another report, allowing for example the display of parent/child or header/lines relations.

Metric-based reports – reports supposed to encompass the various types of business metrics; they can be further categorized in:
Health Metrics – reports designed to show the health of a system in terms of its usage and the adherence to the processes defined.
Growth Metrics - reports designed to show the growth of a system in terms of data, transaction or amount volume.
KPI (Key Process Indicator) reports – reports designed to measure an organization progress towards set organizational goals.
LPI (Lean Process Indicator) reports – reports designed to reflect business’ progress toward Lean Management organizational goals.

Dashboards – reports offering an eye-bird view of several key performance indicators.

Another characterization of reports can be based on the functional department for which the report is created, thus we can speak of financial reports, operational reports, sourcing reports, (global) supply chain reports, marketing reports, maintenance reports, etc.

Note:
The term of financial report might refer in special to financial statements.

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management Series

Completeness refers to the extent to which there are missing data in a dataset, fact reflected in the number of the missing values, also referred as empty (when an empty string or default values is used) or 'Nulls' (aka unknown values), and/or in the number of missing records.

The missing values are typically considered in report to mandatory attributes, attributes that need a not-Null value for each record, though after case might be applied to non-mandatory attributes (optional attributes) too, for example when is intended to understand whether the attributes are adequately maintained or not. It’s interesting that [1] considers also the inapplicable attributes referring to the attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable attributes (relevant) can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon business context and not necessarily upon the physical structure containing the attribute, in other words an attribute could be optional as per database schema and mandatory per business rules.

'Missing records' can be a misleading term because is used in several contexts, however within data completeness context it refers only to the cases not covered by data integrity. For example in parent-child table relations the header data was entered though the detail data is missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, but just the parent without child data (1:n cardinality).

A mixed example occurs when the same entity is split across several tables at the same level of detail. One of the tables must function as a parent, falling in the previous mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently on how many dimensions an error was raised; (2) the error is counted for each dimension. For (2) when the referential integrity failed, an error is raised also for each mandatory attribute.

Both examples are dealing with explicit data referents – the 'parent' data, though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed, though also for this case the referents could be made explicit, falling in the previous mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions, a set of actions performed as a single unit of work, they allow saving the parent data only if the child data were saved successfully, though such type of constraints is not always necessary.

Data should be cleaned when feasible in the source system(s) and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the clean values back into the respective systems.

In data migrations or similar scenarios, the completeness in particular and data quality in general must be judged against the target system(s) and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this sounds like a technical debt, likely improbably to be addressed later. Moreover, one should prioritize the effort and consider first the attributes which are needed for the good functioning of the target system(s).

Ideally, for the mandatory fields should be applied data validation techniques in the source systems, when feasible.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management"

🗄️Data Management: Data Quality Dimensions (Part II: Conformity)

Data Management Series

Conformity or format compliance, as named by [1], refers to the extent data are in the expected format, each attribute being associated with a set of metadata like type (e.g. text, numeric, alphanumeric, positive), length, precision, scale, or any other formatting patterns (e.g. phone number, decimal and digit grouping symbols).

Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, same as different date formats could be used alternatively (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and general applications settings because the same attribute could be stored, processed and represented in different formats. Therefore, it’s preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, whenever is possible, the format should not be confusing.

For example all the dates in a data set or in a set of data sets being object of the same global context (e.g. data migration, reporting) should have the same format, being preferred a format of type dd-mon-yyyy which, ignoring the different values the month could have for different language settings, it lets no space for interpretations (e.g. 01-10-2009 vs. 10-01-2009). There are also situations in which the constraints imposed by the various applications used restraints the flexibility of working adequately with the local computer formats.

If for decimal and dates there are a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because excepting the format masks that could be used during data entry, the adherence to a format depends entirely on the Users and whether they applied the formatting standards defined. In the absence of standards, Users might come with their own encoding, and even then, they might change it over time.

The use of different encodings could be also required by the standards within a specific country, organization, or other type of such entity. All these together makes from alphanumeric attributes the most often candidate for data cleaning, and the business rules used can be quite complex, needing to handle each specific case. For example, the VAT code could have different length from country to country, and more than one encoding could be used reflecting the changes in formatting policy.

In what concerns the format, the alphanumeric attributes offer greater flexibility than the decimal and date attributes, and their formatting could be in theory ignored unless they are further parsed by other applications. However, considering that such needs change over time, it’s advisable to standardize the various formats used within an organization and use 'standard' delimiters for formatting the various chunks of data with a particular meaning within an alphanumeric attribute, fact that could reduce considerably the volume of overwork needed in order to cleanse the data for further processing. An encoding could be done without the use of delimiters, e.g. when the length of each chunk of data is the same, though chunk length-based formatting could prove to be limited when the length of a chunk changes.

Note:
Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter though they are often used as delimiters when exporting the data to text files, therefore it’s better to use a dash (“-”) or even a combinations of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as delimiter.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.

SQL Troubles

Pages

28 January 2010

💎SQL Reloaded: Query Writing I (Best Practices)

27 January 2010

🧭Business Intelligence: Cognos & SQL Server 2008 - The First Report (Test Drive)

25 January 2010

🗄️Data Management: Data Quality Dimensions (Part VII: Structuredness)

23 January 2010

🧊🎡Data Warehousing: ETL (Part I: It's All about Synchronization)

🧊🎡Data Warehousing: ETL (SSIS packages vs. SQL code)

19 January 2010

🧭Business Intelligence: Enterprise Reporting (Part III: Levels of Accuracy)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part VI: Referential Integrity)

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

17 January 2010

🗄️Data Management: Data Quality Dimensions (Part IV: Accuracy)

🧭Business Intelligence: Enterprise Reporting (Part II: Levels of Detail)

15 January 2010

🕋Data Warehousing: Dimension Hierarchy (Definitions)

14 January 2010

🗄️Data Management: Data Cleansing (Part I: An Introduction)

13 January 2010

🧭Business Intelligence: Enterprise Reporting (Report Types)

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

🗄️Data Management: Data Quality Dimensions (Part II: Conformity)

About Me