SQL Troubles: data quality

03 April 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 157: The Transformative Power of High‑Quality Training Data in AI)

Prompt Engineering Series

Prompt: "write a post of 600 words on the impact of consistent and high‑quality training data on AI"

Introduction

Artificial Intelligence (AI) has advanced at a breathtaking pace, but beneath every impressive model - whether it’s diagnosing diseases, generating natural language, or predicting customer behavior - lies a simple truth: AI is only as good as the data it learns from. While algorithms often get the spotlight, the real engine of progress is the quality and consistency of the training data that shapes them. When data is clean, representative, and reliable, AI systems flourish. When it’s inconsistent or flawed, even the most sophisticated models struggle.

Why Data Quality Matters More Than Model Complexity

At its core, machine learning is pattern recognition. Models learn by identifying relationships in the data they’re fed. If that data is noisy, biased, or incomplete, the patterns the model learns will be distorted. This leads to:

Lower accuracy
Unpredictable behavior
Poor generalization to real‑world scenarios

High‑quality data, on the other hand, gives models a clear, stable foundation. It reduces ambiguity, sharpens decision boundaries, and allows the model to focus on meaningful signals rather than statistical 'static'. In many cases, improving data quality yields bigger performance gains than tweaking model architecture.

Consistency: The Unsung Hero of Reliable AI

Consistency in training data is just as important as quality. When data is collected or labeled using different standards, the model receives mixed messages. Imagine teaching a child math using three different definitions of multiplication - they’d be confused, and so is your model.

Consistent data ensures:

Uniform labeling practices
Aligned definitions and categories
Stable distributions across time

This is especially crucial in domains like healthcare, finance, and autonomous systems, where inconsistent data can lead to dangerous or costly errors.

Better Data = Better Learning

When training data is both high‑quality and consistent, AI models learn faster and more effectively. They require fewer training cycles, less computational power, and less manual intervention. The model’s internal representations become more coherent, which improves:

Accuracy
Robustness
Explainability

This is why organizations that invest in data governance, annotation standards, and quality control often outperform those that focus solely on model development.

Reducing Bias and Increasing Fairness

Bias in AI is almost always a data problem. If certain groups or scenarios are underrepresented - or represented inaccurately - the model will inherit those imbalances. High‑quality data practices help mitigate this by ensuring:

Diverse and representative samples
Balanced class distributions
Transparent labeling criteria

Fairness isn’t just a moral imperative; it’s a performance issue. Models trained on biased data are less reliable and more prone to failure when deployed in diverse environments.

The Competitive Advantage of Data Excellence

In today’s AI‑driven landscape, companies that treat data as a strategic asset gain a significant edge. High‑quality training data leads to:

Faster deployment cycles
Lower maintenance costs
More trustworthy AI systems
Better user experiences

It also enables continuous improvement. When new data is collected with the same standards as the old, models can be retrained seamlessly, keeping them aligned with evolving real‑world conditions.

The Bottom Line

Consistent, high‑quality training data isn’t just a technical requirement - it’s the foundation of responsible, effective, and scalable AI. As models grow more powerful, the importance of data quality grows with them. Organizations that invest in rigorous data practices today are building the AI systems that will lead tomorrow.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate the feature's ability to answer standard general questions, independently of whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

25 March 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 148: How Narrow Data Limits Exposure to Ethical Diversity in AI)

Prompt Engineering Series

Prompt: "write a post of 600 words on how Narrow Data Reinforces Historical Inequities in AI"

Introduction

Artificial Intelligence (AI) systems increasingly participate in decisions and interactions that carry ethical weight - moderating content, assisting with customer support, guiding recommendations, and shaping how people access information. Yet AI does not possess moral intuition or cultural awareness. Its 'ethical understanding' is entirely learned from patterns in the data it is trained on. When that data is narrow - reflecting only a limited set of cultural norms, moral frameworks, or social values - the model’s ability to navigate ethical diversity becomes shallow and incomplete. Narrow data doesn’t just reduce accuracy; it restricts the model’s capacity to behave responsibly across different communities and contexts.

1. Narrow Data Embeds a Single Ethical Perspective

Ethical norms vary widely across cultures, religions, and societies. What one community considers respectful, another may interpret differently. When AI is trained on narrow datasets that reflect only one cultural or ethical viewpoint, it internalizes that perspective as the default. This can lead to:

Misjudging what is considered harmful or acceptable
Applying one moral framework to all users
Failing to recognize culturally specific sensitivities

The model’s ethical 'lens' becomes monocultural, even when serving a global audience.

2. Narrow Data Misses Nuanced Moral Reasoning

Ethical diversity isn’t just about different values - it’s about different ways of reasoning. Some cultures emphasize individual autonomy, others prioritize collective well‑being. Some focus on intent, others on consequences. Narrow data limits exposure to these variations, causing AI to:

Oversimplify complex moral situations
Misinterpret user intent
Apply rigid rules where nuance is needed

Without diverse examples, the model cannot learn how ethical reasoning shifts across contexts.

3. Narrow Data Reinforces Dominant Narratives

When datasets are dominated by one demographic or cultural group, AI learns the ethical assumptions embedded in that group’s narratives. This can lead to:

Marginalizing minority perspectives
Treating dominant values as universal truths
Misrepresenting or ignoring alternative viewpoints

AI becomes a mirror of the majority rather than a tool that respects the full spectrum of human experience.

4. Narrow Data Reduces Sensitivity to Ethical Risk

AI systems rely on training data to recognize harmful or sensitive situations. If the data includes only a narrow range of ethical dilemmas, the model may fail to detect:

Subtle forms of discrimination
Culturally specific slurs or microaggressions
Indirect threats or coercive language
Ethical issues unique to certain communities

The model’s ability to identify risk becomes inconsistent and incomplete.

5. Narrow Data Limits Fairness Across Diverse Users

Fairness in AI requires understanding how different groups communicate, express emotion, and interpret social norms. Narrow data reduces the model’s ability to:

Respect cultural differences
Interpret diverse communication styles
Provide equitable responses across demographics

This leads to uneven performance - some users receive thoughtful, context‑aware responses, while others encounter misunderstandings or bias.

6. Narrow Data Constrains Ethical Guardrails

Even with safety mechanisms in place, AI relies on training data to know when to apply them. If the data lacks diverse examples of sensitive or high‑stakes situations, the model may:

Miss opportunities to provide supportive guidance
Apply safety rules inconsistently
Fail to recognize when a user needs extra care

Ethical guardrails are only as strong as the data that informs them.

Closing Statement

Narrow data doesn’t just limit what AI knows - it limits how ethically and socially aware it can be. Ethical diversity is essential for building AI systems that serve global, multicultural communities with respect and fairness. When training data reflects only a narrow slice of human values, the model’s ethical understanding becomes shallow, biased, and incomplete. By investing in diverse, representative datasets and thoughtful design practices, we can help AI navigate ethical complexity with greater sensitivity and responsibility - ensuring it supports, rather than undermines, the rich diversity of human moral experience.

15 March 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 138: Understanding Gender Disparities in AI Models and How to Minimize Them)

Prompt Engineering Series

Prompt: "write a post of 600 words on the causes of gender disparities in AI models and how can their occurrence be minimized"

Introduction

Artificial Intelligence (AI) has become deeply embedded in modern life, shaping decisions in hiring, healthcare, education, finance, and everyday digital interactions. Yet as AI systems grow more influential, concerns about gender disparities in their outputs have become increasingly urgent. These disparities are not the result of malicious intent within the technology itself - AI has no intentions - but rather the reflection of human choices, historical inequalities, and structural biases embedded in data and design. Understanding the causes of gender disparities in AI models is essential for building systems that are fair, trustworthy, and inclusive. Equally important is identifying strategies to minimize these disparities so that AI contributes to a more equitable future.

Where Gender Disparities Come From

1. Biased or Unbalanced Training Data

AI models learn from examples. If the data used to train them reflects gender imbalances, stereotypes, or historical discrimination, the model will absorb and reproduce those patterns. For example:

Datasets dominated by male subjects can lead to poorer performance on female subjects.
Text corpora containing gender‑stereotypical language can cause models to associate certain professions or traits with one gender.
Historical hiring or lending data may encode discriminatory practices.

When the data is skewed, the model’s behavior becomes skewed as well.

2. Underrepresentation in Data Collection

Some groups are simply less represented in the data. This can happen unintentionally - for example, medical datasets that include fewer women, or voice recognition systems trained primarily on male voices. Underrepresentation leads to poorer accuracy and reliability for those groups, reinforcing inequality.

3. Lack of Diversity in Development Teams

AI systems reflect the perspectives of the people who build them. When development teams lack gender diversity, blind spots can emerge. Certain use cases may be overlooked, certain harms underestimated, and certain assumptions left unchallenged. Diversity is not just a social value - it is a technical necessity for robust design.

4. Ambiguous or Biased Labeling Practices

Human annotators label data, and their judgments can introduce bias. For example, labeling images, categorizing emotions, or classifying behaviors can be influenced by cultural or gendered assumptions. If labeling guidelines are unclear or inconsistent, bias becomes baked into the dataset.

5. Reinforcement of Societal Patterns

AI models often mirror the world as it is, not as it should be. If society exhibits gender disparities in pay, leadership roles, or representation, AI systems trained on real‑world data may reinforce those disparities. Without intervention, AI becomes a feedback loop that amplifies inequality.

How Gender Disparities Can Be Minimized

1. Improve Data Quality and Representation

Balanced, diverse, and carefully curated datasets are essential. This includes:

Ensuring representation across genders
Auditing datasets for skewed distributions
Removing or mitigating harmful stereotypes

Better data leads to better outcomes.
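
As a minimal sketch of what auditing a dataset for skewed distributions could look like, the Python snippet below computes each group's share of the records and flags the groups that fall below a chosen threshold. The `gender` field, the toy records and the 30% threshold are illustrative assumptions, not values from a real dataset or a recommended cutoff.

```python
from collections import Counter

def group_shares(records, field):
    """Return each group's share of the records (0..1) for the given field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def underrepresented(shares, min_share):
    """List the groups whose share falls below min_share."""
    return sorted(g for g, s in shares.items() if s < min_share)

# Toy data (hypothetical): 8 records labeled 'm', 2 labeled 'f'
records = [{"gender": "m"}] * 8 + [{"gender": "f"}] * 2
shares = group_shares(records, "gender")

print(shares)
print(underrepresented(shares, 0.30))  # threshold is an assumption
```

A check like this only surfaces imbalance in the recorded attribute; whether a given share is acceptable depends on the population the model is meant to serve.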

2. Use Bias Detection and Fairness Tools

Modern AI development includes tools that can:

Detect gender‑based performance gaps
Flag biased associations
Evaluate fairness across demographic groups

Regular auditing helps identify problems early.
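
One simple way to detect a group-based performance gap is to compute the same metric per group and compare the results. The sketch below does this for accuracy; the `(group, y_true, y_pred)` record layout and the toy predictions are assumptions made for illustration and don't come from any particular fairness toolkit.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}

def largest_gap(accuracies):
    """Difference between the best and the worst group accuracy."""
    return max(accuracies.values()) - min(accuracies.values())

# Toy predictions (hypothetical): 4 records per group
records = [
    ("f", 1, 1), ("f", 0, 1), ("f", 1, 0), ("f", 0, 0),
    ("m", 1, 1), ("m", 0, 0), ("m", 1, 1), ("m", 0, 1),
]
acc = accuracy_by_group(records)

print(acc)               # per-group accuracy
print(largest_gap(acc))  # gap between best and worst group
```

Accuracy is only one possible metric; the same grouping pattern could be applied to error rates or other measures, depending on what fairness means in the given context.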

3. Increase Diversity in AI Teams

A broader range of perspectives leads to more thoughtful design. Gender‑diverse teams are more likely to:

Spot potential harms
Challenge assumptions
Build inclusive systems

Diversity strengthens both ethics and performance.

4. Establish Clear Ethical Guidelines

Organizations should adopt frameworks that prioritize fairness, transparency, and accountability. Ethical guidelines help ensure that gender considerations are not an afterthought but a core design principle.

5. Encourage Human Oversight

AI should support - not replace - human judgment. Human review can catch issues that automated systems miss, especially in sensitive areas like hiring, healthcare, or legal decisions.

Closing Statement

Gender disparities in AI models are not inevitable; they are the result of human choices, historical patterns, and structural imbalances. By understanding their causes and taking deliberate steps to address them, we can build AI systems that are more fair, more accurate, and more aligned with the values of an inclusive society. The path forward requires better data, diverse teams, ethical design, and ongoing vigilance. With thoughtful effort, AI can become not a mirror of past inequalities, but a tool for a more equitable future.

01 January 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 123: Why AI Is Not Bulletproof in Automation Scenarios)

Prompt Engineering Series

Prompt: "write a blogpost of 600 words on why AI is not bulletproof is automation scenarios"

Introduction

Artificial Intelligence (AI) has become the poster child of modern automation. From supply‑chain optimization to customer service chatbots to predictive maintenance in factories, it’s easy to assume that AI systems are precise, tireless, and - at least in theory - nearly infallible. But that assumption is exactly where organizations get into trouble. AI is powerful, yes, but bulletproof? Not even close. And understanding why it isn’t bulletproof is essential for anyone deploying automation in the next decade.

Let’s unpack the cracks beneath the shiny surface.

AI Learns From Data - And Data Is Messy

AI systems don’t understand the world; they understand patterns in data. And real‑world data is full of noise, bias, gaps, and contradictions.

A model trained on historical hiring data may inherit past discrimination.
A predictive maintenance system may fail if sensors degrade or environmental conditions shift.
A customer‑service bot may misinterpret a request simply because the phrasing wasn’t in its training set.

When the data is imperfect, the automation built on top of it inherits those imperfections. AI doesn’t magically 'fix' flawed data - it amplifies the flaws.

Automation Assumes Stability, but the Real World Is Dynamic

Traditional automation works best in stable, predictable environments. AI‑driven automation is more flexible, but it still struggles when the world changes faster than the model can adapt.

Consider:

Sudden market shifts
New regulations
Unexpected supply‑chain disruptions
Novel user behaviors
Rare edge‑case events

AI models trained on yesterday’s patterns can’t automatically understand tomorrow’s anomalies. Without continuous monitoring and retraining, automation becomes brittle.
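
As a rough illustration of what continuous monitoring might involve, the sketch below compares the distribution of a categorical feature in the training data against recent live data by using the total variation distance (half the sum of the absolute differences between the shares). The feature values, the sample data and the `DRIFT_THRESHOLD` of 0.2 are hypothetical; in practice the threshold would have to be tuned per feature.

```python
from collections import Counter

def category_shares(values):
    """Share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {k: n / total for k, n in counts.items()}

def total_variation_distance(reference, current):
    """0.0 means identical distributions, 1.0 means no overlap."""
    p = category_shares(reference)
    q = category_shares(current)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

DRIFT_THRESHOLD = 0.2  # hypothetical cutoff, not a standard value

# Toy data (hypothetical): payment method at training time vs. today
training = ["card"] * 5 + ["cash"] * 5
live = ["card"] * 8 + ["cash"] * 2

tvd = total_variation_distance(training, live)
needs_review = tvd > DRIFT_THRESHOLD
print(round(tvd, 3), needs_review)
```

A flagged shift doesn't prove that the model has degraded; it only indicates that the inputs no longer resemble the training data and that a closer look or retraining may be warranted.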

AI Doesn’t 'Understand' - It Correlates

Even the most advanced AI systems don’t possess human‑level reasoning or contextual awareness. They operate on statistical correlations, not comprehension.

This leads to automation failures like:

Misclassifying harmless anomalies as threats
Failing to detect subtle but critical changes
Producing confident but incorrect outputs
Following rules literally when nuance is required

In high‑stakes environments - healthcare, finance, transportation - this lack of true understanding becomes a serious limitation.

Edge Cases Are the Achilles’ Heel

AI performs impressively on common scenarios but struggles with rare events. Unfortunately, automation systems often encounter exactly those rare events.

Examples include:

A self‑driving car encountering an unusual road layout
A fraud‑detection model missing a novel attack pattern
A warehouse robot misinterpreting an unexpected obstacle

Humans excel at improvisation; AI does not. Automation breaks down when reality refuses to fit the training distribution.

Security Vulnerabilities Undermine Reliability

AI systems introduce new attack surfaces:

Adversarial inputs can trick models with tiny, invisible perturbations.
Data poisoning can corrupt training sets.
Model inversion can leak sensitive information.
Prompt manipulation can cause unintended behavior in language models.

Automation built on AI can be manipulated in ways traditional systems never could.

Ethical and Governance Gaps Create Operational Risk

Even when AI works “correctly,” it may still cause harm if governance is weak.

Automation can:

Reinforce bias
Reduce transparency
Remove human oversight
Make decisions that lack accountability

Organizations often underestimate the social and regulatory risks of automated decision‑making. AI isn’t just a technical system - it’s a socio‑technical one.

Humans Are Still Part of the System

The biggest misconception about automation is that it removes humans. In reality, it changes the role of humans.

People must:

Monitor AI outputs
Intervene during failures
Interpret ambiguous results
Maintain and retrain models
Handle exceptions and edge cases

If humans aren’t properly trained or workflows aren’t redesigned, automation becomes fragile.

The Bottom Line: AI Is Powerful, but Not Invincible

AI can supercharge automation, but it’s not a magic wand. It’s a tool - one that requires careful design, continuous oversight, and a deep understanding of its limitations. Organizations that treat AI as bulletproof will face costly failures. Those that treat it as a dynamic, fallible component of a broader ecosystem will unlock its real value.

09 August 2025

🧭Business Intelligence: Perspectives (Part 33: Data Lifecycle for Analytics)

Business Intelligence Series

In the context of BI, Analytics and other data-related topics, the various parties usually talk about data ingestion, preparation, storage, analysis and visualization, often ignoring processes like data generation, collection, and interpretation. It’s also true that a broader discussion may divert attention unnecessarily, though it’s important to increase people’s awareness of data’s full lifecycle. Otherwise, many data solutions become a mix of castles built in the air and houses of cards waiting to be blown away by the next gust.

Data is generated continuously by organizations, their customers, vendors, and third parties, as part of a complex network of processes, systems and integrations that extend beyond their intended boundaries. Independently of their type, scope and various other characteristics, all processes consume and generate data at a rapid pace that steadily exceeds organizations’ capabilities to make good use of it.

There are also scenarios in which the data must be collected via surveys, interviews, forms, measurements or direct observations, or whatever other processes are used to elicit the aspects of importance. The volume and other characteristics of data generated in this way may depend on the goals and objectives in scope, as well as on the methods, procedures and even the methodologies used.

Data ingestion is the process of importing data from the various sources into a central or intermediary repository for storage, processing, analysis and visualization. The repository can be a data mart, warehouse, lakehouse, data lake or any other repository intended as an intermediary or final destination of the data. Moreover, data can have different levels of quality with respect to its intended usage.

Data storage refers to the systems and approaches used to securely retain, organize, and access data throughout its journey within the various layers of the infrastructure. It focuses on where and how data is stored, independently of whether that’s done on-premises, in the cloud or across hybrid environments.

Data preparation is the process of transforming the data into a form close to what is intended for analysis and visualization. It may involve data aggregation, enrichment, transposition and other operations that facilitate further steps. It’s probably the most important step in a data project given that the final outcome can have an important impact on data analysis and visualization, facilitating or impeding the respective processes.

Data analysis consists of a multitude of processes that attempt to harness value from data in its various forms of aggregation. The ultimate purpose is to infer meaningful information and knowledge from the data, in the form of insights. The road from raw data to these targeted outcomes is a tedious one, where recipes can both help and impede. Expecting value from any pile of data can easily become a costly illusion when data, processes and their usage are poorly understood and harnessed.

Data visualization is the means of presenting data and its characteristics in the form of figures, diagrams and other forms of representation that facilitate data’s navigation, perception and understanding for various purposes. Usually, the final purpose is fact-checking, decision-making, problem-solving, etc., though there is a multitude of steps in between. Especially in these areas, good and poor practices are mixed together.

Data interpretation is the attempt to draw meaningful conclusions from the data, information and knowledge gained mainly from data analysis and visualization. It is often subjective, as it depends on people’s understanding of the facts under consideration. The inferences made in the process can be a matter of gut feeling or of mature analysis. It’s about sense-making, contextualization, critical thinking, pattern recognition, internalization and externalization, and other similar cognitive processes.

08 March 2025

#️⃣Software Engineering: Programming (Part XVI: The Software Quality Perspective and AI)

Software Engineering Series

Organizations tend to complain about the poor quality of software developed in-house, by consultancy companies or third parties, without doing much about it. Unfortunately, this agrees with the bigger picture reflected by the quality standards adopted by organizations - people talk and complain about them, though they aren’t that eager to include them in the various strategies, and even when they are considered, they are seldom enforced adequately!

Moreover, even if quality standards are adopted, and a lot of effort may be spent in this direction (as everybody has strong opinions and there are many exceptions), as projects progress, all the good intentions come to an end, the rules fading on the way either because they are too strict or too general, aren’t adequately prioritized or communicated, or there’s no time to implement (all of) them. This applies in general to programming and to the domains that revolve around data: Business Intelligence, Data Analytics or Data Science.

The volume of good quality code and deliverables is not only a reflection of an organization’s maturity in dealing with best practices but also of its maturity in handling technical debt, Project Management, software and data quality challenges. All these aspects are strongly related to each other and therefore require a systemic approach rather than focusing on the issues locally. The systemic approach allows organizations to bridge the gaps between business areas, teams, projects and any other areas of focus.

There are many questionable studies on the effect of methodologies on software quality and data issues, proclaiming that one methodology is better than the other in addressing the multifold aspects of software quality. Besides methodologies, some studies attempt to correlate quality with organizations’ size, management or programmers’ experience, the size of software, or whatever characteristic might seem to affect quality.

Bad code is written independently of companies’ size, programmers’ experience, or management’s and the organization’s maturity. It doesn’t necessarily appear all at once; it can build up with circumstances and with repeated team, requirements and code changes. There are decisions and actions that sooner or later can affect the overall outcome negatively.

Rewriting the code from scratch might look like an approachable measure though it’s seldom the cost-effective solution. Allocating resources for refactoring is usually a better approach, though this tends to considerably increase the cost of projects, and organizations might be tempted to accept the risks, whatever they might be. Independently of the approaches used, sooner or later the complexity of projects, requirements or code tends to kick back.

There are many voices arguing that AI will help in addressing the problems of software development, quality assurance and probably other areas. It’s questionable how much AI will help to address the gaps, inconsistencies and other mistakes in requirements, and how it will develop quality code when it has basic "understanding" issues. Even if all current issues revolving around AI are fixed step by step, it will take time and multiple iterations until meaningful progress is made.

At least for now, AI tools like Copilot or ChatGPT can be used for learning a programming language or framework through predefined or ad-hoc prompts. They can probably also be used to identify deviations from best practices or other norms in scope. This doesn’t mean that AI will, for now, replace code reviews, testing and other practices used in assuring the quality of software, but it can serve as an additional method to check for what may have been missed by the other methods.

AI may also have hidden gems that when discovered, polished and sized, may have a qualitative impact on software development and software. Only time will tell what’s possible and achievable.

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

{def} features of data that can be measured or assessed against defined standards to determine the quality of data

captures a specific aspect of general data quality

can refer to data values or to their schema

{type} hard dimensions

dimensions that can be measured

{type} soft dimensions

dimensions that can be measured only indirectly

⇐ through interviews with data users or through any other kind of communication with users

dimensions whose measurement depends on the perception of the users of the data

{dimension} uniqueness [post]

the degree to which a value or set of values is unique within a dataset

can be determined based on a set of values supposed to be unique across the whole dataset

some systems have an artificial or a natural unique identifier

measured in terms of either

the percentage of unique values available in a dataset
the percentage of duplicate values available in a dataset

the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
it can have broader implications

aggregated information is not shown correctly

⇐ split across different entities

can lead to further duplicates in other areas

{recommendation} enforce uniqueness by design, if possible
{recommendation} check the data regularly for duplicates and disable or delete the duplicated records

⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
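
A minimal sketch of the two uniqueness measures, assuming the values supposed to be unique are available as a simple list. Here the percentage of unique values is interpreted as distinct values divided by total values, which is only one possible reading; the `customer_ids` sample is made up for illustration.

```python
from collections import Counter

def uniqueness_metrics(values):
    """Return (pct_distinct, pct_duplicates) for a list of key values."""
    total = len(values)
    if total == 0:
        return 100.0, 0.0
    distinct = len(set(values))
    pct_distinct = 100.0 * distinct / total
    return pct_distinct, 100.0 - pct_distinct

def find_duplicates(values):
    """Return the values that occur more than once, with their counts."""
    counts = Counter(values)
    return {v: c for v, c in counts.items() if c > 1}

# Hypothetical identifiers: C002 occurs twice, C003 three times
customer_ids = ["C001", "C002", "C002", "C003", "C003", "C003", "C004", "C005"]

print(uniqueness_metrics(customer_ids))  # 5 distinct out of 8 values
print(find_duplicates(customer_ids))
```

The list returned by `find_duplicates` is a starting point for the regular check recommended above; deciding which record survives remains a business decision.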

{dimension} completeness [post]

the extent to which there are missing data in a dataset

⇐ reflected in the number of the missing values

measured as percentage of the missing values compared to the total

determined by the presence of NULL values

{type} attribute completeness

the number of NULLs in a specific attribute

{type} tuple completeness

the number of unknown values of the attributes in a tuple

{type} relation completeness

the number of tuples with unknown attribute values in the relation

{type} value completeness

makes sense for complex, semi-structured columns such as XML data type columns

e.g. a complete element or attribute can be missing

considered in relation to

mandatory attributes

attributes that need a non-NULL value for each record

optional attributes

attributes that do not necessarily need to be provided

inapplicable attributes

attributes not applicable (relevant) for certain scenarios by design
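
The attribute, tuple and relation completeness types can be sketched as simple counts, assuming a relation is represented as a list of dictionaries and NULL is represented by `None`. The `rows` sample and the function names are hypothetical.

```python
# Toy relation (hypothetical); None stands for NULL
rows = [
    {"id": 1, "name": "Ana", "email": "ana@example.com", "phone": None},
    {"id": 2, "name": "Bob", "email": None, "phone": None},
    {"id": 3, "name": "Cy", "email": "cy@example.com", "phone": "555-0100"},
    {"id": 4, "name": None, "email": "dee@example.com", "phone": "555-0101"},
]

def attribute_nulls(rows, attribute):
    """Attribute completeness: number of NULLs in one attribute."""
    return sum(1 for r in rows if r[attribute] is None)

def tuple_nulls(row):
    """Tuple completeness: number of unknown values in one tuple."""
    return sum(1 for v in row.values() if v is None)

def incomplete_tuples(rows):
    """Relation completeness: number of tuples with unknown values."""
    return sum(1 for r in rows if tuple_nulls(r) > 0)

def missing_percentage(rows):
    """Missing values as a percentage of all values in the relation."""
    total = sum(len(r) for r in rows)
    missing = sum(tuple_nulls(r) for r in rows)
    return 100.0 * missing / total if total else 0.0

print(attribute_nulls(rows, "phone"))  # NULLs in the phone attribute
print(tuple_nulls(rows[1]))            # unknown values in the second tuple
print(incomplete_tuples(rows))         # tuples with at least one NULL
print(missing_percentage(rows))        # share of missing values overall
```

The sketch treats all attributes alike; distinguishing mandatory, optional and inapplicable attributes would require additional metadata per attribute.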

{dimension} conformity (aka format compliance) [post]

{def} the extent to which data are in the expected format

dependent on the data type and its definition

can be associated with a set of metadata

data type

e.g. text, numeric, alphanumeric, positive, date

length
precision
scale
formatting patterns

e.g. phone number, decimal and digit grouping symbols
different formatting might apply based on various business rules
can use delimiters

{recommendation} define the data type and further constraints to enforce the various characteristics of the element
{recommendation} make sure that the delimiters don't overlap with other uses
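
Format compliance can be checked, for example, with regular expressions and parsing attempts. The sketch below is illustrative only: `PHONE_PATTERN` is a hypothetical formatting rule, not a real phone number standard, and the date check assumes a year-month-day format.

```python
import re
from datetime import datetime

# Hypothetical rule: '+' country code, space, 3 digits, '-', 4 digits
PHONE_PATTERN = re.compile(r"\+\d{1,3} \d{3}-\d{4}")

def conforms_to_pattern(value, pattern):
    """True if the whole value matches the formatting pattern."""
    return pattern.fullmatch(value) is not None

def is_iso_date(value):
    """True if the value parses as a year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def conformity_pct(values, check):
    """Percentage of values that pass the given check."""
    if not values:
        return 100.0
    return 100.0 * sum(1 for v in values if check(v)) / len(values)

phones = ["+1 555-0100", "+49 555-0101", "555 0102", "+1 5550103"]
pct = conformity_pct(phones, lambda v: conforms_to_pattern(v, PHONE_PATTERN))

print(pct)                         # share of conforming phone numbers
print(is_iso_date("2025-01-27"))   # valid calendar date
print(is_iso_date("2025-02-30"))   # right shape, impossible date
```

Checks like these verify only the format; a value can conform to the pattern and still be inaccurate.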

{dimension} accuracy [post]

{def} the extent to which data is correct, i.e. matches reality within an acceptable level of approximation
stricter than just conforming to business rules
can be measured at column and table level

[discrete data values]

use frequency distribution of values

a value with very low frequency is probably incorrect

[alphanumeric values]

use string length distribution

a string with a very atypical length is potentially incorrect

try to find patterns and then create a pattern distribution

patterns with low frequency probably denote wrong values

[continuous attributes]

use descriptive statistics

just by looking at minimal and maximal values, you can easily spot potentially problematic data
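
The profiling heuristics above can be sketched in a few lines of Python. The 5% rarity cutoff, the pattern notation ('A' for letters, '9' for digits) and the sample values are assumptions chosen for illustration; low frequency only makes a value suspicious, it doesn't prove it wrong.

```python
from collections import Counter

def rare_values(values, max_share=0.05):
    """Discrete values whose relative frequency is at most max_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, c in counts.items() if c / total <= max_share)

def to_pattern(text):
    """Map letters to 'A' and digits to '9', keep other characters."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def pattern_distribution(values):
    """Frequency of each pattern across alphanumeric values."""
    return Counter(to_pattern(v) for v in values)

# Toy data (hypothetical)
countries = ["DE"] * 12 + ["US"] * 10 + ["FR"] * 7 + ["D3"]
codes = ["AB-1234", "CD-5678", "EF9012"]
ages = [34, 29, 41, 38, -1, 27, 230]

print(rare_values(countries))        # candidates for incorrect values
print(pattern_distribution(codes))   # the low-frequency pattern stands out
print(min(ages), max(ages))          # extremes hint at problematic data
```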

{dimension} consistency [post]

{def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component

{type} notational consistency

the extent to which (data) values are consistent in notation

{type} semantic consistency

the degree to which data has unique meaning
is more restrictive than the notational consistency

measures the equivalence of information stored in various repositories
involves comparing values with a predefined set of possible values

from the same or from different systems

can be measured at column and table level
can have different scopes

cross-system consistency

among systems or data repositories

cross-record consistency

within the same repository

temporal consistency

within the same record at different points in time
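
A small sketch of two of the checks mentioned above: comparing values with a predefined set of possible values, and comparing the same attribute across two systems. The `crm` and `erp` dictionaries, their keys and the allowed country codes are hypothetical.

```python
def outside_domain(values, allowed):
    """Distinct values that are not in the predefined set."""
    return sorted(set(values) - set(allowed))

def inconsistent_keys(system_a, system_b):
    """Keys present in both systems whose values differ."""
    shared = system_a.keys() & system_b.keys()
    return sorted(k for k in shared if system_a[k] != system_b[k])

# Toy data (hypothetical): country per customer in two repositories
crm = {"C001": "DE", "C002": "US", "C003": "FR"}
erp = {"C001": "DE", "C002": "USA", "C004": "IT"}
allowed_countries = {"DE", "US", "FR", "IT"}

print(outside_domain(erp.values(), allowed_countries))
print(inconsistent_keys(crm, erp))
```

Keys that exist in only one system (here C003 and C004) are ignored by `inconsistent_keys`; whether they count as a consistency or a completeness issue depends on the definitions in use.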

{dimension} timeliness [post]

tells the degree to which data is current and available when needed

there is always some delay between change in the real world and the moment when this change is entered into a system

stale data/obsolete data

{dimension} structuredness [post]

the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
allows the categorization of data as

structured data [def]

refers to structures that can be easily perceived or known, and that raise no doubt about the structure’s delimitations

unstructured data [def]

refers to textual data and media content (video, sound, images), in which the structural patterns, even if they exist, are hard to discover or not predefined

semi-structured data [def]

refers to islands of structured data stored with unstructured data, or vice versa

⇐ the more structured the data, the easier it is to process

{dimension} referential integrity [post]

{def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
it's an architectural concept of the database
{recommendation} keep the referential integrity of a system by design

some systems build logic for assuring the referential integrity in the applications and not in the database
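When the database does not enforce referential integrity, the degree of matching can still be measured. The sketch below looks for orphaned key values and computes a match ratio; the helper names and the customers/orders sample are assumptions made for the example.

```python
def orphaned_keys(child_rows, parent_rows, fk, pk):
    """Return the distinct foreign key values that have no matching parent key.

    NULL (None) foreign keys are skipped, as they reference nothing.
    """
    parent_keys = {row[pk] for row in parent_rows}
    return sorted(
        {row[fk] for row in child_rows
         if row[fk] is not None and row[fk] not in parent_keys}
    )


def integrity_ratio(child_rows, parent_rows, fk, pk):
    """Return the share of non-NULL foreign key values that match a parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    checked = [row[fk] for row in child_rows if row[fk] is not None]
    if not checked:
        return 1.0
    return sum(1 for key in checked if key in parent_keys) / len(checked)


# Hypothetical tables represented as lists of dictionaries.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 7},  # no such customer
    {"order_id": 13, "customer_id": 1},
]
orphans = orphaned_keys(orders, customers, "customer_id", "customer_id")
ratio = integrity_ratio(orders, customers, "customer_id", "customer_id")
print(orphans, ratio)
```

In a relational database the same check is typically an outer join filtered on missing parents; declaring a foreign key constraint prevents such rows from being created in the first place.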

{dimension} currency (aka actuality)

the extent to which data is current (up to date)
can be considered as a special type of accuracy

⇐ when the data is not current, it no longer reflects reality

{dimension} ease of use

the extent to which data can be used for a given purpose

usually it refers to whether the data can be processed as needed
depends on the application or on the user interface

{dimension} fitness for use

the degree to which the data is fit for use

the data may have good quality for a given purpose but

not be usable for other purposes
can be used as a substitute for other data

e.g. use phone area codes instead of ZIP codes to locate customers approximately

{dimension} trustfulness [post]

the degree to which the data can be trusted

is a matter of perception

ask users whether they trust the data and what their reasons are

if the users don’t trust the data

they will create their own solutions
they will not use the applications

{dimension} entropy

{def} the average amount of information conveyed

⇐ quantification of information in a system
⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
⇐ can tell whether your data is suitable for analysis or not

can be measured at column and table level
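A small sketch of how the entropy of a discrete column can be computed, assuming that Shannon entropy in bits, H = -Σ p·log2(p), is the measure meant here. The function names and the normalization step are assumptions added for the example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy (in bits) of a discrete column."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(values):
    """Return entropy divided by its maximum, log2(number of distinct values).

    The result lies between 0 (a single value) and 1 (evenly spread values),
    which makes columns with different cardinalities comparable.
    """
    distinct = len(set(values))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(distinct)


even = ["a", "b", "c", "d"]    # equally spread values
skewed = ["a"] * 9 + ["b"]     # one dominant value
constant = ["a"] * 4           # no information at all
print(shannon_entropy(even), shannon_entropy(skewed), shannon_entropy(constant))
```

A column in which almost all rows share one value has entropy close to zero and contributes little to an analysis, which is the sense in which entropy indicates suitability.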

{dimension} presentation quality

applicable to applications that present data

format and appearance should support the appropriate use of data
depends on the UI used

{recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed

the data should be exclusively managed through the management system
{anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system


References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

11 September 2024

🗄️Data Management: Data Culture (Part IV: Quo vadis? [Where are you going?])

Data Management Series

People who have worked for many years in the fields of BI/Data Analytics, Data and Process Management have probably met many reactions that at first sight seem funny, though they reflect bigger issues existing in organizations: people don’t always understand the data they work with, how data are brought together as part of the processes they support, respectively how data can be used to manage and optimize the respective processes. Moreover, occasionally people torture the data until it confesses something that doesn’t necessarily reflect reality. It’s even more deplorable when the conclusions are used for decision-making, managing or optimizing the process. In extremis, the result is an iterative process that creates more and bigger issues than those it was supposed to solve!

Behind each blunder there are probably bigger understanding issues that need to be addressed. Many of the issues revolve around understanding how data are created, how they are brought together, how the processes work and what data they need, use and generate. Moreover, few business and IT people look at the full lifecycle of data and try to optimize it, or they optimize it in the wrong direction. Data Management is supposed to help, and it does this occasionally, though a methodology, its processes and practices are only as good as people’s understanding of data and its use! No matter how good a data methodology is, it’s as weak as the weakest link in its use, and typically the issues revolving around data and data understanding are the weakest link.

Besides technical people, few businesspeople understand the full extent of managing data and its lifecycle. Unfortunately, even if some of the topics are treated in books, they are too dry and need hands-on experience and some thought in corroborating practices with theories. Without this, people will do things mechanically, processes being only as good as the people using them, their value becoming suboptimal and hindering the business. That’s why training on Data Management is not enough without some hands-on experience!

The most important impact is, however, in the BI/Data Analytics areas - how the various artifacts are created and used as support in decision-making, process optimization and other activities rooted in data. Ideally, some KPIs and other metrics should be enough for managing and directing a business. However, when decisions are based just on a set of KPIs, without understanding the bigger picture and without having a feeling for the data and their quality, the whole architecture, no matter how splendid, can break down like a sandcastle on a shore meeting the first powerful wave!

Sometimes it feels like organizations do things from inertia, driven by the forces of the moment, initiatives and business issues for which temporary and later permanent solutions are needed. The best chance for solving many of the issues would have been a long time ago, when the issues were still too small to create any powerful waves within the organizations. Therefore, a lot of effort is sometimes spent in solving the consequences of decisions not made at the right time, and that can be painful and costly!

For building a good business one also needs a solid foundation. In the past it was enough to have a good set of products that are profitable. However, during the past decade(s) the rules of the game changed, driven by the fierce competition across geographies, with inefficiencies, especially in the data and process areas, costing organizations in the short and long term. Data Management in general and Data Quality in particular, even if they’re challenging to quantify, have the power to address by design many of the issues existing in organizations, if given the right chance!


28 March 2024

🗄️🗒️Data Management: Master Data Management [MDM] [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 28-Mar-2024

Master Data Management (MDM)

{definition} the technologies, processes, policies, standards and guiding principles that enable the management of master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities [2],[3]
{goal} enable sharing of information assets across business domains and applications within an organization [4]
{goal} provide authoritative source of reconciled and quality-assessed master (and reference) data [4]
{goal} lower cost and complexity through use of standards, common data models, and integration patterns [4]
{driver} meeting organizational data requirements
{driver} improving data quality
{driver} reducing the costs for data integration
{driver} reducing risks
{type} operational MDM

involves solutions for managing transactional data in operational applications [1]
relies heavily on data integration technologies

{type} analytical MDM

involves solutions for managing analytical master data
centered on providing high quality dimensions with multiple hierarchies [1]
cannot influence operational systems

any data cleansing made within the analytical MDM solution isn’t recognized by transactional applications [1]

⇒ inconsistencies to the main operational data [1]

transactional application knowledge isn’t available to the cleansing process

{type} enterprise MDM

involves solutions for managing both transactional and analytical master data

manages all master data entities
delivers maximum business value

operational data cleansing

improves the operational efficiencies of the applications and the business processes that use the applications

cross-application data need

consolidation
standardization
cleansing
distribution

needs to support high volume of transactions

⇒ master data must be contained in data models designed for OLTP

⇐ ODS don’t fulfill this requirement

{enabler} high-quality data
{enabler} data governance
{benefit} single source of truth

used to support both operational and analytical applications in a consistent manner [1]

{benefit} consistent reporting

reduces the inconsistencies experienced previously
influenced by complex transformations

{benefit} improved competitiveness

MDM reduces the complexity of integrating new data and systems into the organization

⇒ increases flexibility and improves competitiveness

ability to react to new business opportunities quickly with limited resources

{benefit} improved risk management

more reliable and consistent data improves the business’s ability to manage enterprise risk [1]

{benefit} improved operational efficiency and reduced costs

helps identify the business’s pain points

by developing a strategy for managing master data

{benefit} improved decision making

reducing data inconsistency diminishes organizational data mistrust and facilitates clearer (and faster) business decisions [1]

{benefit} more reliable spend analysis and planning

better data integration helps planners come up with better decisions

improves the ability to

aggregate purchasing activities
coordinate competitive sourcing
be more predictable about future spending
generally improve vendor and supplier management

{benefit} regulatory compliance

reduces compliance risk

helps satisfy governance, regulatory and compliance requirements

simplifies compliance auditing

enables more effective information controls that facilitate compliance with regulations

{benefit} increased information quality

enables organizations to monitor conformance more effectively

via metadata collection
it can track whether data meets information quality expectations across vertical applications, which reduces information scrap and rework

{benefit} quicker results

reduces the delays associated with extraction and transformation of data [1]

⇒ it speeds up the implementation of application migrations, modernization projects, and data warehouse/data mart construction [1]

{benefit} improved business productivity

gives enterprise architects the chance to explore how effective the organization is in automating its business processes by exploiting the information asset [1]

⇐ master data helps organizations realize how the same data entities are represented, manipulated, or exchanged across applications within the enterprise and how those objects relate to business process workflows [1]

{benefit} simplified application development

provides the opportunity to consolidate the application functionality associated with the data lifecycle [1]

⇐ consolidation in MDM is not limited to the data
⇒ provides a single functional service to which different applications can subscribe

⇐ introducing a technical service layer for data lifecycle functionality provides the type of abstraction needed for deploying SOA or similar architectures

factors to consider for implementing an MDM:

effective technical infrastructure for collaboration [1]
organizational preparedness

for making a quick transition from a loosely combined confederation of vertical silos to a more tightly coupled collaborative framework
{recommendation} evaluate the kinds of training sessions and individual incentives required to create a smooth transition [1]

metadata management

via a metadata registry

{recommendation} sets up a mechanism for unifying a master data view when possible [1]
determines when that unification should be carried out [1]

technology integration

{recommendation} diagnose what technology needs to be integrated to support the process instead of developing the process around the technology [1]

anticipating/managing change

proper preparation and organization will subtly introduce change to the way people think and act as shown in any shift in pattern [1]
changes in reporting structures and needs are unavoidable

creating a partnership between Business and IT

IT roles

plays a major role in executing the MDM program [1]

business roles

identifying and standardizing master data [1]
facilitating change management within the MDM program [1]
establishing data ownership

measurably high data quality
overseeing processes via policies and procedures for data governance [1]

{challenge} establishing enterprise-wide data governance

{recommendation} define and distribute the policies and procedures governing the oversight of master data

seeking feedback from across the different application teams provides a chance to develop the stewardship framework agreed upon by the majority while preparing the organization for the transition [1]

{challenge} isolated islands of information

caused by vertical alignment of IT

makes it difficult to fix the dissimilarities in roles and responsibilities in relation to the isolated data sets because they are integrated into a master view [1]

caused by data ownership

the politics of information ownership and management have created artificial exclusive domains supervised by individuals who have no desire to centralize information [1]

{challenge} consolidating master data into a centrally managed data asset [1]

transfers the responsibility and accountability for information management from the lines of business to the organization [1]

{challenge} managing MDM

MDM should be considered a program and not a project or an application [1]

{challenge} achieving timely and accurate synchronization across disparate systems [1]
{challenge} different definitions of master metadata
different coding schemes, data types, collations, and more
{challenge} data conflicts

{recommendation} resolve data conflicts during the project [5]
{recommendation} replicate the resolved data issues back to the source systems [5]

{challenge} domain knowledge

{recommendation} involve domain experts in an MDM project [5]

{challenge} documentation

{recommendation} properly document your master data and metadata [5]

approaches

{architecture} no central MDM

isn’t a real MDM approach
used when any kind of cross-system interaction is required [5]

e.g. performing analysis on data from multiple systems, ad-hoc merging and cleansing

{drawback} very inexpensive at the beginning; however, it turns out to be the most expensive over time [5]

{architecture} central metadata storage

provides unified, centrally maintained definitions for master data [5]

followed and implemented by all systems

ad-hoc merging and cleansing becomes somewhat simpler [5]
does not use a specialized solution for the central metadata storage [5]

⇐ the central storage of metadata is probably in an unstructured form

e.g. documents, worksheets, paper

{architecture} central metadata storage with identity mapping

stores keys that map tables in the MDM solution

only has keys from the systems in the MDM database; it does not have any other attributes [5]

{benefit} data integration applications can be developed much more quickly and easily [5]
{drawback} raises problems in regard to maintaining master data over time [5]

there is no versioning or auditing in place to follow the changes [5]

⇒ viable for a limited time only

e.g. during upgrading, testing, and the initial usage of a new ERP system to provide mapping back to the old ERP system

{architecture} central metadata storage and central data that is continuously merged

stores metadata as well as master data in a dedicated MDM system
master data is not inserted or updated in the MDM system [5]
the merging (and cleansing) of master data from source systems occurs continuously, regularly [5]
{drawback} continuous merging can become expensive [5]
the only viable use for this approach is for finding out what has changed in source systems from the last merge [5]

enables merging only the delta (new and updated data)

frequently used for analytical systems
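Finding what has changed in a source system since the last merge can be sketched by comparing row fingerprints. Hashing the rows is only one possible technique and is an assumption of this example, as are the function names and the customer sample; the source does not prescribe a method.

```python
import hashlib


def row_fingerprint(row):
    """Return a stable hash over all attributes of a row (a dictionary)."""
    payload = "|".join(f"{name}={row[name]!r}" for name in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_delta(source_rows, previous_fingerprints, key):
    """Split source rows into new and updated ones since the last merge.

    `previous_fingerprints` maps each key value to the fingerprint stored
    at the last merge. Unchanged rows are skipped; deletions are not
    detected by this sketch.
    """
    new_rows, updated_rows = [], []
    for row in source_rows:
        previous = previous_fingerprints.get(row[key])
        if previous is None:
            new_rows.append(row)
        elif previous != row_fingerprint(row):
            updated_rows.append(row)
    return new_rows, updated_rows


# Hypothetical state at the last merge.
previous_load = [
    {"customer_id": 1, "name": "Contoso", "city": "Berlin"},
    {"customer_id": 2, "name": "Fabrikam", "city": "Paris"},
]
previous_fingerprints = {r["customer_id"]: row_fingerprint(r) for r in previous_load}

# Hypothetical current state of the source system.
current_source = [
    {"customer_id": 1, "name": "Contoso", "city": "Berlin"},   # unchanged
    {"customer_id": 2, "name": "Fabrikam", "city": "Lyon"},    # updated
    {"customer_id": 3, "name": "Northwind", "city": "Oslo"},   # new
]
new_rows, updated_rows = detect_delta(current_source, previous_fingerprints, "customer_id")
print(new_rows, updated_rows)
```

Only the new and updated rows then need to go through merging and cleansing, which keeps the cost of continuous merging lower than reprocessing the full data set.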

{architecture} central MDM, single copy

involves a specialized MDM application

master data, together with its metadata, is maintained in a central location [5]
⇒ all existing applications are consumers of the master data

{drawback} upgrade all existing applications to consume master data from central storage instead of maintaining their own copies [5]

⇒ can be expensive
⇒ can be impossible (e.g. for older systems)

{drawback} needs to consolidate all metadata from all source systems [5]
{drawback} the process of creating and updating master data could simply be too slow [5]

because of the processes in place

{architecture} central MDM, multiple copies

uses central storage of master data and its metadata

⇐ the metadata here includes only an intersection of common metadata from source systems [5]
each source system maintains its own copy of master data, with additional attributes that pertain to that system only [5]

after master data is inserted into the central MDM system, it is replicated (preferably automatically) to source systems, where the source-specific attributes are updated [5]
{benefit} good compromise between cost, data quality, and the effectiveness of the CRUD process [5]
{drawback} update conflicts

different systems can also update the common data [5]

⇒ involves continuous merges as well [5]

{drawback} uses a special MDM application


Acronyms:

MDM - Master Data Management

ODS - Operational Data Store

OLAP - online analytical processing

OLTP - online transactional processing

SOA - Service Oriented Architecture

References:
[1] The Art of Service (2017) Master Data Management Course
[2] DAMA International (2009) "The DAMA Guide to the Data Management Body of Knowledge" 1st Ed.
[3] Tony Fisher (2009) "The Data Asset"
[4] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed.
[5] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

20 March 2024

🗄️Data Management: Master Data Management (Part I: Understanding Integration Challenges) [Answer]

Data Management Series

An answer to Piethein Strengholt’s post [1] on Master Data Management’s (MDM) integration challenges; Strengholt is the author of "Data Management at Scale".

Master data can be managed within individual domains, though the boundaries must be clearly defined, and some coordination is needed. Attempting to partition the entities based on domains doesn’t always work. The partition needs to be performed at attribute level, though even then there might be some exceptions involved (e.g. some Products are only for Finance to use). One can then identify attributes inside the system to create the boundaries.

MDM is simple if you have the right systems, processes, procedures, roles, and data culture in place. Unfortunately, people make it too complicated: oh, we need a nice shiny system for managing the data before they are entered in ERP or other systems, we need a system for storing and maintaining the metadata, and another system for managing the policies, and the story goes on. The lack of systems is given as the reason why people make no progress. Moreover, people will want to integrate the systems, increasing the overall complexity of the ecosystem.

The data should be cleaned in the source systems and assessed against the same. If that's not possible, then you have the wrong system! A set of well-built reports can make data assessment possible.

The metadata and policies can be maintained in Excel (and stored in SharePoint), in SharePoint itself, or in a similar system that supports versioning. Pragmatic solutions can be found for other topics as well.

ERP systems allow us to define workflows and enable a master data record to be published only when the information is complete, though there will always be exceptions (e.g., a Purchase Order must be sent today). Such exceptions make people circumvent the MDM systems with all the issues deriving from this.

Adding an MDM system within an architecture tends to increase the complexity of the overall infrastructure and create more bottlenecks. Occasionally, it just replicates the structures existing in the target system(s).

Integrations are supposed to reduce the effort, though in the past 20 years I have never seen an integration work without issues, and MDM is no exception. One of the main issues is that the solutions just synchronized the data without considering the process dependencies, and sometimes also the referential dependencies. The time needed for troubleshooting the integrations can easily exceed the time for importing the data manually over an upload mechanism.

To make the integration work, the MDM will end up duplicating all the validation available in the target system(s). This can make sense when a considerable volume of master data is created daily or weekly. Native connectors simplify the integrations, especially when they can handle the errors transparently and allow the records to be modified manually, though the issues start as soon as the target system is extended with more attributes or other structures.

If an organization has an MDM system, then all the master data should come from the MDM. As soon as a bidirectional synchronization is used (and other integrations might require this), Pandora’s box is open. One can define hard rules, though again, there are always exceptions in which manual interference is needed.

Attempting an integration of reference data is not recommended. ERP systems can have hundreds of such entities. Some organizations tend to have a golden system (a copy of production) with all the reference data. It works for some time, until people realize that the solution is expensive and time-consuming.

MDM systems do make sense in certain scenarios, though getting the integrations right can involve considerable effort, and certain assumptions and requirements must be met.


References:
[1] Piethein Strengholt (2023) Understanding Master Data Management’s Integration Challenges (link)

12 February 2024

🧭Business Intelligence: A One-Man Show (Part I: Some Personal Background and a Big Thanks!)

Business Intelligence Series

Over the past 24 years, I often found myself in the position of a "one man show" doing almost everything in the data space from requirements gathering to development, testing, deployment, maintenance/support (including troubleshooting and optimization), and Project Management, respectively from operations to strategic management, when that was the case. Of course, different tasks of varying complexity are involved! Developing an SSRS or Power BI report is less complex than also developing all or parts of the Data Warehouse, or Lakehouse nowadays, respectively building the whole infrastructure needed for reporting. All I can say is that "I've been there, I've done that!".

Before SSRS became popular, I even built for a customer a whole reporting solution based on SQL Server, HTML & XML, respectively COM+ objects for database access. The UI’s look-and-feel was like SSRS, though there was no wizardry involved besides the creative use of programming and optimization techniques. Once I wrote an SQL query, the volume of work needed to build a report was comparable to the one in SSRS. It was a great opportunity to use my skillset, as I had previously worked as a web developer and VB/VBA programmer. I worked for many years as a Software Engineer, applying the knowledge acquired in the field whenever it made sense to do so, working alone or in a team, as the projects required.

During this time, I was involved in other types of projects and activities that had less to do with the building of reports and warehouses. Besides the development of various desktop, web, and data-processing solutions, I was also involved in 6-8 ERP implementations, being responsible for the migration of data, building the architectures needed in the process, supporting key users in various areas like Data Quality or Data Management. I also did Project Management, Application Management, Release and Change Management, and even IT Management. Thus, there were at times at least two components involved - one component was data-related, while the other component had more diversity. It was a good experience, because the second component often needed knowledge of the first, and vice versa.

For example, coming to understand the data model and business processes behind an ERP system by building ad-hoc and standardized reports allowed me to get a good understanding of what data is needed for a Data Migration, what the dependencies are, and the level of quality needed. Similarly, the knowledge acquired by building ETL-based pipelines and data warehouses allowed me to design and build flexible Data Migration solutions, both architectures being quite similar from many perspectives. Knowledge of the data models and architectures involved can facilitate the overall process and is a prerequisite for building reliable, performant solutions.

Similar examples can also be given in Data Management, Data Operations, Data Governance, ERP support during and after implementation, etc. Reports and data are also needed in the Management areas - it starts from knowing what data are needed in the supporting processes for providing transparency, getting insights and bringing the processes under control, if needed.

Working alone, being able to build a solution from the beginning to the end was often a job requirement. This doesn't imply that I was a "lone wolf". The nature of a data professional or software engineer’s job requires you to interact with various businesspeople from report requesters to key users, internal and external consultants, intermediary managers, and even upper management. There was also the knowledge of many data professionals involved indirectly – the resources I used to learn from - books, tutorials, blogs, webcasts, code, and training material. I'm thankful for their help over all these years!


21 August 2022

🧮ERP: Implementations (Part VI: It’s all about Partnership II - Closing the Gap)

ERP Implementations Series

When starting an ERP implementation project an organization needs to fill the existing knowledge gaps in respect to whatever it takes to achieve the goals associated with the respective project. Therefore, it makes sense to work with an implementer that can help cover the gaps directly or indirectly. Moreover, it makes sense to establish a long-term relationship that would allow harnessing the ERP system’s capabilities after the project’s end, increasing the ROI and, why not, finding other areas of cooperation. In theory, that’s what a partner does, and what a strategic technology partnership is about: providing any kind of technological expertise the customer doesn't have in-house.

Unfortunately, going from being a ‘service provider’ to becoming a ‘partner’ is a challenging road for many organizations, especially when this type of relationship is not understood and managed accordingly. Managing a partnership may come down to defining common goals, principles, values and processes, establishing a communication strategy and a common understanding of the challenges and the steps ahead, providing visibility into the cost estimates, billing, resources’ availability and utilization. Addressing these aspects would offer a framework on which the partnerships can flourish. Without considering these topics, the implementer remains just a 'service provider', no matter the names used to characterize the relationship.

Now, the use of the word ‘partner’ would make someone think that only one partner is considered, typically a big to middle-sized organization that would have this kind of resources. The main reason behind this reasoning is that the number of functional areas and the range of skills required for fulfilling the requirements of an implementation are high compared with other projects, the resources needing to be available on-demand without affecting the other constraints: costs, quality, time. This can be challenging, therefore one can encounter scenarios in which two or more external organizations are involved in the partnership, ideally organizations that complement each other.

It is common in ERP implementations to also turn to individual consultants for specific areas or the whole project. The principles and values of a partnership, as well as the framework behind it, can be applied to individual consultants as well. Regardless of the resources’ provenance, what matters more is the partnership ‘mindset’ - being together in the same boat, working together on a shared and understood strategy, with clear goals and objectives.

Moreover, the people participating in the project must have a ‘partner's mindset’ as well. Without this, the project will likely get different impulses in the wrong direction(s), as a group’s interests will take priority over the ones of the organization. Ideally, this mindset should extend to the whole organization as topics like Data Quality and Process Improvement must be an organization’s effort, deeply imprinted in organization’s culture.

More than ever, it’s important for the business to see and treat the IT department as a ‘partner’ and not as a ‘service provider’ by providing the needed level of transparency in requirements, issues, practices and processes, by treating the IT department as an equal party in the decision-making and addressing its current and future strategic requirements. Ideally, this partnership should happen long before the implementation starts, given that it takes time for mentalities and practices to change, for knowledge to be acquired and used appropriately.

Building a partnership takes time, effort and strategic thinking, this on top of the actual implementation, increasing thus the overall complexity, at least at the beginning. Does it pay off? Like in a marriage, it’s useful to have somebody you can trust, who knows you, whom you can rely upon, and talk with to find solutions. However, only time will tell whether such expectations are met and kept till the end.

