SQL Troubles: Data Management

12 July 2026

🎯Fadi Maali - Collected Quotes

"A common mistake when implementing a data catalog is to focus only on technical metadata. This limits its use and the potential value. It also excludes business users who have valuable related input or need to use the catalog. A catalog should in fact function as a two-way translation layer between technical and business users." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"A core premise of data mesh is federating data ownership among domain data owners who are responsible for their data as a product. Offering the data as a product requires the data to be discoverable and to have explicitly stated quality characteristics and a clearly defined access method. Such requirements are at the core of what data catalogs support. With support for data labeling, curation, and crowdsourced feedback, data catalogs are well positioned to offer data as a product. Furthermore, data catalogs support the enforcement of compliant data usage, which becomes more important when data ownership is not managed centrally." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Active governance guides users as they find and use data. A data catalog with active governance will surface compliance information about sensitive data at point of use, so as to encourage users to use canonical and high-quality data assets; it will also provide a way to ask domain experts for help. They actively help users to ensure compliant usage of data with features such as masking, which anonymizes PII for given user personas who are restricted from viewing it per the GDPR." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Build a community around the catalog. Make sure data producers, stewards, and consumers are all involved and empowered to enrich the content of the catalog. Establish a leader or a team to have clear ownership of the data catalog." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Data catalog platforms take a more holistic view to focus not only on data assets within an enterprise, but also on the surrounding ecosystem (including business and people elements). They are typically characterized by an extensible data model that can grow to define various assets and concepts, such as metrics, charts, AI features, and users. Data catalog platforms typically augment their data with a focus on business and users to support collaborative governance and enrichment of metadata and to interlink data with business glossaries and dictionaries. Moreover, they are architected to make them easily integrable with other systems." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Data catalogs that focus on governance are concerned mainly with controlling data access and ensuring that data is used according to defined policies; this includes external policies such as data privacy laws as well as policies defined with an enterprise. Those catalogs apply techniques to identify data assets with sensitive information and to monitor data flow and access." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Data catalogs that focus on search bring techniques and methods from information retrieval and web search engines to the data domain within enterprises. Some of those catalogs, such as Facebook Nemo, use advanced machine learning and NLP tools to provide personalized search of data within an enterprise. The search can also use data-specific signals such as usage, popularity, and freshness to rank data assets by usefulness." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Enterprises typically become interested in data catalogs when they have a specific use case or need in mind. Data governance, self-service analytics, and cloud data migration are common examples. Having a specific need or use case helps focus efforts and measure impact. However, as with other technical efforts within enterprises, it is essential to prepare for long-term sustainable success and to have a plan to maximize successful adoption." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Historically, for their analytics needs, enterprises relied upon a set of tightly coupled tools, typically provided by a single vendor. Nowadays, nearly all of the components of a traditional data warehouse are independent and interchangeable. Those independent tools can be flexibly combined to provide a modern data stack. It is common for current enterprises to have separate tools for data ingestion, data pipelines, data storage and querying, data visualization and business intelligence, and data quality. Furthermore, data can flow in the opposite direction out of the data warehouse in what is referred to as reverse extract, transform, and load (ETL)." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"In a self-service environment with multiple publishers, it’s impossible to completely avoid data redundancy and overlapping. Multiple data assets with similar content, but possibly with varying quality, will exist. A data catalog can guide users to trusted data that comes from a reliable source and is frequently used. A data catalog can also use various explicit and implicit quality signals when ranking datasets for recommendation. Some of those signals are discussed next. Furthermore, a data catalog can recommend domain experts who are automatically identified based on actual data usage." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"It is often said that data scientists and data analysts spend only 20% of their time doing data analysis work, with 80% consumed by data 'issues'. The bulk of their time is spent finding, evaluating, understanding, and preparing data before analysis can begin. A data catalog inverts this principle by enabling data analysts and data scientists to spend 20% of their time looking for data and 80% performing analysis." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

"Self-service BI initiatives help organizations become more data-driven and democratize access to data. But data can’t be used if it can’t be found. Search and discovery of trustworthy data is a core value of enterprise data catalogs, and the value extends well beyond business users." (Fadi Maali & Jason Lim, "Implementing a Modern Data Catalog to Power Data Intelligence: Make Trustworthy Data Central to Your Organization", 2022)

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

{def} features of data that can be measured or assessed against defined standards to determine the quality of data

captures a specific aspect of general data quality

can refer to data values or to their schema

{type} hard dimensions

dimensions that can be measured

{type} soft dimensions

dimensions that can be measured only indirectly

⇐ through interviews with data users or through any other kind of communication with users

dimensions whose measurement depends on the perception of the users of the data

{dimension} uniqueness [post]

the degree to which a value or set of values is unique within a dataset

can be determined based on a set of values supposed to be unique across the whole dataset

some systems have an artificial, respectively a natural, unique identifier

measured in terms of either

the percentage of unique values available in a dataset
the percentage of duplicate values available in a dataset

the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
it can have broader implications

aggregated information is not shown correctly

⇐ split across different entities

can lead to further duplicates in other areas

{recommendation} enforce uniqueness by design, if possible
{recommendation} check the data regularly for duplicates and disable or delete the duplicated records

⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
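
The two uniqueness measures above can be sketched as follows. This is a minimal illustration, assuming that "unique" means distinct values relative to the total number of records; the function name and the sample customer numbers are hypothetical.

```python
from collections import Counter

def uniqueness_ratios(values):
    """Return (unique_pct, duplicate_pct) for a list of key values.

    Assumption: the unique percentage is the number of distinct values
    divided by the number of records; the rest counts as duplicates.
    """
    total = len(values)
    if total == 0:
        return (100.0, 0.0)
    distinct = len(Counter(values))
    unique_pct = 100.0 * distinct / total
    return (unique_pct, 100.0 - unique_pct)

# Hypothetical customer numbers; "C002" was entered twice.
customer_numbers = ["C001", "C002", "C002", "C003"]
unique_pct, duplicate_pct = uniqueness_ratios(customer_numbers)
```

Other definitions are possible, e.g. counting only the values that occur exactly once as unique; which one fits depends on how the duplicates are handled afterwards.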

{dimension} completeness [post]

the extent to which there are missing data in a dataset

⇐ reflected in the number of the missing values

measured as the percentage of missing values compared to the total number of values

determined by the presence of NULL values

{type} attribute completeness

the number of NULLs in a specific attribute

{type} tuple completeness

the number of unknown values of the attributes in a tuple

{type} relation completeness

the number of tuples with unknown attribute values in the relation

{type} value completeness

makes sense for complex, semi-structured columns such as XML data type columns

e.g. a complete element or attribute can be missing

considered in relation to

mandatory attributes

attributes that need a non-NULL value for each record

optional attributes

attributes that don't necessarily need to be provided

inapplicable attributes

attributes not applicable (relevant) for certain scenarios by design
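
The attribute, tuple and relation completeness counts can be sketched as below. This is only an illustration: the records are hypothetical, rows are modeled as dictionaries, and None stands in for NULL.

```python
# Hypothetical records; None plays the role of NULL.
rows = [
    {"id": 1, "name": "Anna", "phone": "555-0100"},
    {"id": 2, "name": None, "phone": None},
    {"id": 3, "name": "Carl", "phone": None},
]

def attribute_nulls(rows, attribute):
    """Attribute completeness: number of NULLs in one attribute."""
    return sum(1 for row in rows if row[attribute] is None)

def tuple_nulls(row):
    """Tuple completeness: number of unknown values in one tuple."""
    return sum(1 for value in row.values() if value is None)

def relation_incomplete_tuples(rows):
    """Relation completeness: number of tuples with at least one unknown value."""
    return sum(1 for row in rows if tuple_nulls(row) > 0)

def missing_percentage(rows):
    """Share of missing values compared to the total number of values."""
    total = sum(len(row) for row in rows)
    missing = sum(tuple_nulls(row) for row in rows)
    return 100.0 * missing / total if total else 0.0
```

In practice the counts would probably be restricted to mandatory attributes, because NULLs in optional or inapplicable attributes don't necessarily indicate a quality problem.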

{dimension} conformity (aka format compliance) [post]

{def} the extent to which data are in the expected format

dependent on the data type and its definition

can be associated with a set of metadata

data type

e.g. text, numeric, alphanumeric, positive, date

length
precision
scale
formatting patterns

e.g. phone number, decimal and digit grouping symbols
different formatting might apply based on various business rules
can use delimiters

{recommendation} define the data type and further constraints to enforce the various characteristics of the element
{recommendation} make sure that the delimiters don't overlap with other uses
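
A formatting pattern can be checked with a regular expression. The sketch below assumes, purely for illustration, that dates are expected in the YYYY-MM-DD format; the pattern and the sample values are not taken from any particular system.

```python
import re

# Assumed expected format: four digits, dash, two digits, dash, two digits.
ISO_DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"

def conformity_pct(values, pattern):
    """Percentage of values that fully match the expected formatting pattern."""
    if not values:
        return 100.0
    matching = sum(1 for value in values if re.fullmatch(pattern, value))
    return 100.0 * matching / len(values)

# Hypothetical date strings entered in different formats.
dates = ["2025-01-27", "27.01.2025", "2025-1-7", "2024-10-15"]
date_conformity = conformity_pct(dates, ISO_DATE_PATTERN)
```

The check covers only the format; a value like 2025-13-45 conforms to the pattern without being a valid date, which is a matter of accuracy rather than conformity.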

{dimension} accuracy [post]

{def} the extent to which data is correct, respectively matches reality with an acceptable level of approximation
stricter than just conforming to business rules
can be measured at column and table level

[discrete data values]

use frequency distribution of values

a value with very low frequency is probably incorrect

[alphanumeric values]

use string length distribution

a string with a very atypical length is potentially incorrect

try to find patterns and then create a pattern distribution

patterns with low frequency probably denote wrong values

[continuous attributes]

use descriptive statistics

just by looking at minimal and maximal values, you can easily spot potentially problematic data
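
The pattern distribution for alphanumeric values can be sketched as below. The encoding of letters as "A" and digits as "9", the 30% threshold and the sample codes are assumptions made for the example.

```python
from collections import Counter

def value_pattern(value):
    """Reduce a string to its pattern: letters become 'A', digits become '9'."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # keep delimiters and other symbols as they are
    return "".join(out)

def rare_pattern_values(values, max_share=0.3):
    """Return the values whose pattern covers less than max_share of all values."""
    patterns = [value_pattern(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total < max_share]

# Hypothetical product codes; the last one has letters and digits swapped.
codes = ["AB-1234", "CD-5678", "EF-9012", "12-ABCD"]
suspicious = rare_pattern_values(codes)
```

A low-frequency pattern only marks a value as potentially incorrect; the values still need to be reviewed against the business rules before they are corrected.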

{dimension} consistency [post]

{def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component

{type} notational consistency

the extent to which (data) values are consistent in notation

{type} semantic consistency

the degree to which data has unique meaning
is more restrictive than the notational consistency

measures the equivalence of information stored in various repositories
involves comparing values with a predefined set of possible values

from the same or from different systems

can be measured at column and table level
can have different scopes

cross-system consistency

among systems or data repositories

cross-record consistency

within the same repository

temporal consistency

within the same record at different points in time

{dimension} timeliness [post]

tells the degree to which data is current and available when needed

there is always some delay between change in the real world and the moment when this change is entered into a system

stale data/obsolete data

{dimension} structuredness [post]

the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
allows the categorization of data as

structured data [def]

refers to structures that can be easily perceived or known and that raise no doubt about the structure’s delimitations

unstructured data [def]

refers to textual data and media content (video, sound, images), in which the structural patterns, even if they exist, are hard to discover or not predefined

semi-structured data [def]

refers to islands of structured data stored with unstructured data, or vice versa

⇐ the more structured the data, the easier it is to be processed

{dimension} referential integrity [post]

{def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
it's an architectural concept of the database
{recommendation} keep the referential integrity of a system by design

some systems build logic for assuring the referential integrity in the applications and not in the database
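
When the referential integrity is not enforced by the database, orphaned references can be searched for with an outer join. The sketch below uses an in-memory SQLite database; the tables, columns and rows are hypothetical, and no foreign key constraint is declared on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Anna'), (2, 'Carl');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);
""")

# Orders whose customer_id has no match in the referenced table.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NOT NULL
      AND c.customer_id IS NULL
    ORDER BY o.order_id
""").fetchall()

total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
# Share of orders that reference an existing customer.
integrity_pct = 100.0 * (total - len(orphans)) / total
```

Declaring the foreign key in the schema would prevent such records from being saved in the first place, which is what keeping the referential integrity by design amounts to.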

{dimension} currency (aka actuality)

the extent to which data is actual
can be considered as a special type of accuracy

⇐ when the data is not actual then it doesn’t reflect reality

{dimension} ease of use

the extent to which data can be used for a given purpose

usually it refers to whether the data can be processed as needed
depends on the application or on the user interface

{dimension} fitness of use

the degree to which the data is fit for use

the data may have good quality for a given purpose but

not usable for other purposes
can be used as substitute for other data

e.g. use phone area codes instead of ZIP codes to locate customers approximately

{dimension} trustfulness [post]

the degree to which the data can be trusted

is a matter of perception

ask users whether they trust the data and what their reasons are

if the users don’t trust the data

they will create their own solutions
they will not use applications

{dimension} entropy

{def} the average amount of information conveyed

⇐ quantification of information in a system
⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
⇐ can tell whether your data is suitable for analysis or not

can be measured at column and table level
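
For a discrete column, the entropy can be computed from the frequency distribution of its values. The sketch below assumes Shannon's definition with base-2 logarithms (the result is expressed in bits); the function name is illustrative.

```python
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy (in bits) of the value distribution of a discrete column."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    # Sum of p * log2(1 / p) over the relative frequency p of each distinct value.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

evenly_spread = column_entropy(["a", "b", "c", "d"])   # four equally likely values
constant = column_entropy(["a", "a", "a", "a"])        # one value only
```

Four equally frequent values give 2 bits, while a constant column gives 0, which matches the observation that evenly spread values carry more information.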

{dimension} presentation quality

applicable to applications that present data

format and appearance should support the appropriate use of data
depends on the UI used

{recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed

the data should be managed exclusively through the management system
{anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)

{def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)

includes

the transportation of data
overlaps between stages
changes in flow

due to

new technologies
changing requirements

changes in scope
troubleshooting

due to data mismatches

{step} extraction

data is extracted directly from the source systems or intermediate repositories

data may be made available in intermediate repositories, when the direct access to the source system is not possible

⇐ this approach can add a complexity layer

{substep} data validation

an automated process that validates whether data pulled from sources has the expected values
relies on a validation engine

rejects data if it falls outside the validation rules
analyzes rejected records on an ongoing basis to

identify what went wrong
correct the source data
modify the extraction to resolve the problem in the next batches
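
A validation engine can be approximated by a set of named rules applied to each extracted record. This is a minimal sketch: the rule names, the fields and the accepted currencies are hypothetical.

```python
def validate_batch(records, rules):
    """Split a batch into accepted records and rejected ones with the failed rules."""
    accepted, rejected = [], []
    for record in records:
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            # Keep the reasons so that rejected records can be analyzed later.
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected

# Hypothetical validation rules for a batch of payments.
rules = {
    "amount_is_non_negative": lambda r: r.get("amount") is not None and r["amount"] >= 0,
    "currency_is_known": lambda r: r.get("currency") in {"EUR", "USD"},
}

batch = [
    {"id": 1, "amount": 10.5, "currency": "EUR"},
    {"id": 2, "amount": -3.0, "currency": "EUR"},
    {"id": 3, "amount": 7.0, "currency": "XXX"},
]
accepted, rejected = validate_batch(batch, rules)
```

Storing the names of the failed rules together with the rejected records makes it easier to decide whether the source data or the extraction logic needs to be corrected.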

{step} transform

transforms the data, removing extraneous or erroneous data
applies business rules
checks data integrity

ensures that the data is not corrupted in the source or corrupted by ETL
may ensure no data was dropped in previous stages

aggregates the data if necessary

{step} load

{substep} store the data into a staging layer

transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
{advantage} makes it easier to roll back, if something went wrong
{advantage} allows developing the logic iteratively and publishing the data only when needed
{advantage} can be used to generate audit reports for

regulatory compliance
diagnosis and repair of data problems

modern ETL processes perform transformations in place, instead of in staging areas

{substep} publish the data to the target

loads the data into the target table(s)
{scenario} the existing data are overridden every time the ETL pipeline loads a new batch

this might happen daily, weekly, or monthly

{scenario} add new data without overriding

the timestamp can indicate the data is new

{recommendation} prevent the loading process from erroring out due to disk space and performance limitations
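
The staging and publishing substeps, together with the two load scenarios, can be sketched as below. The example uses an in-memory SQLite database; the table names (stg_sales, dw_sales), the columns and the publish function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_sales (sale_id INTEGER, amount REAL);
    CREATE TABLE dw_sales (sale_id INTEGER, amount REAL, loaded_at TEXT);
""")

def publish(conn, batch, loaded_at, overwrite):
    """Stage a transformed batch, then publish it to the target in one transaction."""
    # Substep 1: store the batch in the staging table.
    conn.execute("DELETE FROM stg_sales")
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", batch)
    conn.commit()
    # Substep 2: publish; the context manager commits on success and
    # rolls back if an error occurs, leaving the target untouched.
    with conn:
        if overwrite:
            conn.execute("DELETE FROM dw_sales")  # scenario: override existing data
        conn.execute(
            "INSERT INTO dw_sales SELECT sale_id, amount, ? FROM stg_sales",
            (loaded_at,),  # the timestamp marks the newly added data
        )

publish(conn, [(1, 10.0), (2, 20.0)], "2025-01-20", overwrite=True)
publish(conn, [(3, 30.0)], "2025-01-21", overwrite=False)
count = conn.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
```

After the two calls the target holds the three records, the last one carrying the later timestamp; a further call with overwrite=True would replace them all.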

{approach} building an ETL infrastructure

involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly

recurring process

e.g. data used for reporting

one-time process

e.g. data migration

may involve

multiple source or destination systems
different types of data

e.g. reference, master and transactional data
⇐ may have complex dependencies

different levels of structuredness

e.g. structured, semi-structured, unstructured

different data formats
data of different quality
different ownership

{recommendation} consider ETL best practices

{best practice} define requirements upfront in a consolidated and consistent manner

allows setting clear expectations, consolidating the requirements, estimating the effort and costs, respectively getting the sign-off
the requirements may involve all the aspects of the process

e.g. data extraction, data transformation, standard formatting, etc.

{best practice} define a high level strategy

allows defining the road ahead, the risks and other aspects
provides transparency
this may be part of a broader strategy that can be referenced

{best practice} align the requirements and the various aspects with the strategies existing in the organization

allows consolidating the various efforts and making sure that the objectives, goals and requirements are aligned
e.g. IT, business, Information Security, Data Management strategies

{best practice} define the scope upfront

allows better estimating the effort and validating the outcomes
even if the scope may change in time, this provides transparency and can be used as a basis for the time and cost estimations

{best practice} manage the effort as a project and use a suitable Project Management methodology

allows applying structured, well-established PM practices
it might be appropriate to adapt the methodology to the project's particularities

{best practice} convert data to standard formats to standardize data processing

allows reducing the volume of issues resulting from data type mismatches
applies mainly to dates, numeric values or other values for which standard formats can be defined

{best practice} clean the data in the source systems, when cost-effective

allows reducing the overall effort, especially when this is done in advance
this should be based ideally on the scope

{best practice} define and enforce data ownership

allows enforcing clear responsibilities across the various processes
allows reducing the overall effort

{best practice} document data dependencies

document the dependencies existing in the data at the various levels

{best practice} log data movement from source(s) to destination(s) in terms of data volume

provides transparency into the data movement process
allows identifying gaps in the data or broader issues
can be used for troubleshooting and understanding the overall data growth
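
The data volumes can be logged per step and reconciled afterwards. The sketch below assumes that each step records how many rows it read, rejected and wrote; the step names and the numbers are made up for illustration.

```python
def reconcile_volumes(movement_log):
    """Return (step, missing_rows) for every step that lost or gained rows."""
    gaps = []
    for entry in movement_log:
        expected = entry["rows_read"] - entry["rows_rejected"]
        if entry["rows_written"] != expected:
            gaps.append((entry["step"], expected - entry["rows_written"]))
    return gaps

# Hypothetical volumes logged by one run of a pipeline.
movement_log = [
    {"step": "extract", "rows_read": 1000, "rows_rejected": 0, "rows_written": 1000},
    {"step": "transform", "rows_read": 1000, "rows_rejected": 25, "rows_written": 975},
    {"step": "load", "rows_read": 975, "rows_rejected": 0, "rows_written": 970},
]
gaps = reconcile_volumes(movement_log)
```

Here the load step wrote 5 rows fewer than it received, which is the kind of gap the log is supposed to reveal; keeping the entries over time also shows how the data volume grows.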

{recommendation} consider proven systems, architectures and methodologies

allows minimizing the overall effort and costs associated with the process

15 October 2024

🗄️Data Management: Data Governance (Part III: Taming the Complexity)

Data Management Series

The Chief Data Officer (CDO) or the “Head of the Data Team” is one of the most challenging jobs because it is more of a "political" than a technical role. It requires the ideal candidate to be able to throw and catch curveballs almost all the time, and one must be able to play ball with all the parties having an interest in data (aka stakeholders). It’s a full-time job that requires a combination of management and technical skillsets, and both are important! The focus will change occasionally in one direction more than in the other, with important fluctuations.

Moreover, even if one masters the technical and managerial aspects, the combination of the two gives birth to situations that require further expertise, applied systems thinking being probably the most important. This is also because there are so many points of failure that it's challenging to address all the important causes. Therefore, it’s critical to be a systems thinker, to have an experienced team and to make adequate use of its experience!

In a complex world, even the smallest constraint or opportunity can have an important impact, especially when it’s involved in the early stages of the processes taking place in organizations. The outcome depends on the manager’s and team’s skillset, their inspiration, the way the business reacts to the tasks involved and probably many other aspects that make things work. It takes considerable effort until the whole mechanism works, and even more time to make things work efficiently. The best metaphor is probably the one of a small combat team in which everybody has their place and skillset in the mechanism, regardless of whether one talks about strategy, tactics or operations.

Unfortunately, building such teams takes time, and the more people are involved, the more complex this endeavor becomes. The manager and the team must meet somewhere in the middle in what concerns the philosophy, the execution of the various endeavors, the way of working together to achieve the same goals. There are multiple forces pulling in all directions and it takes time until one can align the goals, respectively the effort.

The most challenging forces are the ones between the business and the data team, respectively between the business and data requirements, forces that don’t necessarily converge. In small organizations, the two parties have in theory more chances to overcome the challenges, and a team’s experience can weigh a lot in the process, though as soon as the scale changes, the number of challenges to be overcome grows exponentially (there are however different exponential functions, in which the base and the exponent determine how rapid the growth is).

In big organizations, other parties can appear that have the same force to pull the weight in one direction or another. Thus, the political aspects become more complex, to the degree that the technologies must follow the political decisions, with all the positive and negative implications deriving from this. As a comparison, think about the challenges of moving from two to three or more bodies orbiting each other, which results in a chaotic dynamical system for most initial conditions.

Of course, a business’ context doesn’t have to create such complexity, though when things are unchecked, when delays in decision-making as well as other typical events occur, when there’s no structure, strategy, coordinated effort, or any other important components, the chances for chaotic behavior are quite high with the passing of time. This is just a model to explain real-life situations that seem similar on the surface but prove to be quite complex when diving deeper. That’s probably why a CDO’s role as tamer of complexity is important and challenging!

17 September 2024

#️⃣Software Engineering: Mea Culpa (Part V: All-Knowing Developers are Back in Demand?)

Software Engineering Series

I’ve been reading many job descriptions lately related to my experience and curiously or not I observed that many organizations look for developers with Microsoft Dynamics experience in the CRM, respectively Finance and Operations (F&O) and Business Central (BC) areas. It’s a good sign that the adoption of Microsoft solutions for CRM and ERP increases, especially when one considers the progress made in the BI and AI areas with the introduction of Microsoft Fabric, which gives Microsoft a considerable boost. Conversely, it seems that the "developers are good for everything" syntagma is back, at least from what one reads in job descriptions.

Of course, it’s useful to have an inhouse developer who can address all the aspects of an implementation, though that’s a lot to ask considering the different non-programming areas that need to be addressed. It’s true that a developer with experience can handle Requirements, Data and Process Management, respectively Data Migrations and Business Intelligence topics, though each of the topics can easily become a full-time job before, during and after project implementations. I’ve been there and I (hopefully) know what the jobs imply. Even if an experienced programmer can easily handle the different aspects, there will also be times when all the topics combined will be too much for one person!

It's not a novelty that job descriptions are treated like Christmas lists, but it’s difficult to differentiate between essential and nonessential skills. I read many job descriptions lately in which, among a huge list of demands, one of the requirements is to program in the F&O framework, a sign that D365 programmers are in high demand. I worked for many years as a programmer and Software Engineer, respectively in the BI area, where SQL and non-SQL code is needed. Even if I can understand the code in F&O, does it make sense to learn now to program in X++ and the whole framework?

It's never too late to learn new tricks, respectively another programming language and/or framework. It even helps to provide better solutions in usual areas, though frankly I would invest my time in other areas, and AI-related topics like AI prompting or Data Science seem to be more interesting in the long run, especially when they are already in demand!

There seems to be a tendency for Data Science professionals to do everything, building their own solutions, ignoring the experience accumulated respectively the data models built in BI and Data Analytics areas, as if the topics and data models are unrelated! It’s also true that AI-modeling comes with its own requirements in what concerns data modeling (e.g. translating non-numeric to numeric values), though I believe that common ground can be found!

Similarly, notebook-based programming seems to replicate logic in each solution, which occasionally makes sense, though personally I wouldn’t recommend it as a practice! The other day, I was looking at code developed in Python to mimic the joining of tables, when a view with the same logic could be more easily (re)used, maintained and read, and would probably be more efficient, even if different engines are used. It will be interesting to see how the mix of spaghetti solutions will evolve over time. There are developers already complaining about the number of objects used in the process by building logic for each layer of the medallion architecture! Even if it makes sense from architectural considerations, it will become a nightmare in time.

One can wonder also about the nomenclature used, Data Engineering or Prompt Engineering, for the simple manipulation of data between structures in data transformations, respectively for structuring the prompts for AI. I believe that engineering involves more than this, no matter the context!

14 September 2024

🗄️Data Management: Data Governance (Part II: Heroes Die Young)

Data Management Series

In the call for action there are tendencies in some organizations to idealize and overcharge the main actors' purpose and image when talking about data governance by calling them heroes. Heroes are those people who fight for a goal they believe in with all their being and occasionally they pay the ultimate price. Of course, the image of heroes is idealized and many other aspects are ignored, though such images sell ideas and ideals. Organizations might need heroes and heroic deeds to change the status quo, but the heroism doesn't necessarily pay off for the "heroes"!

Sometimes, organizations need a considerable effort to change the status quo. It can be people's resistance to the new, to the demands, to the ideas propagated, especially when they are not clearly explained and executed. It can be the incommensurable distance between the "AS IS" and the "TO BE" perspectives, especially when clear paths aren't in sight. It can be the lack of resources (e.g., time, money, people, tools), knowledge, understanding or skillset that makes the effort difficult.

Unfortunately, such initiatives favor action over adequate strategies, planning and understanding of the overall context. The call to do something creates waves of actions and reactions which in the organizational context can lead to storms and even extreme behavior that ranges from resistance to the new to heroic deeds. Finding a few messages that support the call for action can help, though they can't replace the various critical success factors.

Leading organizations on a new path requires a well-defined realistic strategy, respectively adequate tactical and operational planning that reflects organizations' specific needs, knowledge and capabilities. Just demanding from people to do their best is not enough, and heroism has chances to appear especially in this context. Unfortunately, the whole weight falls on the shoulders of the people chosen as actors in the fight. Ideally, it should be possible to spread the whole weight on a broader basis which should be considered the foundation for the new.

The "heroes" metaphor is idealized and the negative outcome probably exaggerated, though extreme situations do occur in organizations when decisions, planning, execution and expectations are far from ideal. Ideal situations are met only in books and less in practice!

The management demands and the people execute, much like in the army, though by contrast people need to understand the reasoning behind what they are doing. Proper execution requires skillset, understanding, training, support, tools and the right resources for the right job. Just relying on people's professionalism and effort is not enough and is suboptimal, but this is what many organizations seem to do!

Organizations tend to respond to the various barriers or challenges with more resources or pressure instead of analyzing and depicting the situation adequately, and eventually changing the strategy, tactics or operations accordingly. It's also difficult to do this as long as an organization doesn't have the capabilities and practices of self-check, self-introspection, self-reflection, etc. Even if it sounds a bit exaggerated, an organization must know itself to overcome the various challenges. Regular meetings, KPIs and other metrics give the illusion of control when self-control is needed.

Things don't have to be that complex even if managing data governance is a complex endeavor. Small or midsized organizations are in theory more capable of handling complexity because they can be more agile, have a robust structure, and the flow of information and knowledge has fewer barriers, respectively a shorter distance to overcome. One can probably appeal to the laws and characteristics of networks to understand more about the deeper implications of how solutions can be implemented in more complex setups.


🗄️Data Management: Data Culture (Part V: Quid nunc? [What now?])

Data Management Series

Despite the detailed planning, the concentrated and well-directed effort with which the various aspects of data culture are addressed, things don't necessarily turn into what we want them to be. There's seldom only one cause but a mix of various factors that create a network of cause and effect relationships that tend to diminish or increase the effect of certain events or decisions, and it can be just a butterfly's flutter that stirs a set of chained reactions. The butterfly effect is usually an exaggeration until the proper conditions for the chaotic behavior appear!

The butterfly effect is made possible by the exponential divergence of two paths. Conversely, success probably needs multiple trajectories to converge toward a final point or intermediary points or areas from which things move on the "right" path. Success doesn't necessarily mean reaching a point but reaching a favorable zone for future behavior to follow a positive trend. For example, a sink or a cone-like structure allows water to accumulate and flow toward an area. A similar structure is needed for success to converge, and the structure results from what is built in the process.

Data culture needs a similar structure for the various points of interest to converge. Things don't happen by themselves unless the force of the overall structure is so strong that it allows things to move toward the intended path(s). Even then the paths can be far from optimal, but they can be favorable. Probably, that's what the general effort must do - bring the various aspects into the zone that allows things to unfold. It might still be a long road, though the basis is there!

A consequence of this metaphor is that one must identify the important aspects, respectively factors that influence an organization's culture and drive them in the right direction(s) – the paths that converge toward the defined goal(s). (Depending on the area of focus one can consider that there are successions of more refined goals.)

The structure that allows things to converge is based on the alignment of the various paths and implicitly forces. Misalignment can make a force move in another direction with all the consequences deriving from this behavior. If the force is weak, it probably will not have an impact on the overall structure, though that's relative and can change in time.

One may ask why this whole construct is needed, especially if it doesn’t reflect reality. Sometimes, even a not entirely correct model can allow us to navigate the unknown. The model's intent is to depict what's needed for an initiative to be successful. Moreover, success doesn’t mean hitting the bull's-eye but being first in the zone until one's skillset enables performance.

Conversely, it's important to understand that things don't happen by themselves. At least this seems to be the feeling some initiatives leave. One needs to build and pull the whole structure in the right direction, and the alignment of the various forces can reduce the overall effort and increase the chances for success. Attempting to build something just because it’s written in documentation without understanding the whole picture (or something close to it) can easily lead to failure.

This doesn’t mean that all attempts that don’t follow a set of patterns are doomed to failure, but that the road will be more challenging and will probably take longer. Conversely, maybe these deviations from the optimal paths are what an organization needs to grow, to solidify the foundation on which something else can be built. The whole path is an exploration that doesn’t necessarily match what is written in books, respectively the expectations!


11 September 2024

🗄️Data Management: Data Culture (Part IV: Quo vadis? [Where are you going?])

Data Management Series

People who have worked for many years in the fields of BI/Data Analytics, Data and Process Management have probably met many reactions that at first sight seem funny, though they reflect bigger issues existing in organizations: people don’t always understand the data they work with, how data are brought together as part of the processes they support, respectively how data can be used to manage and optimize the respective processes. Moreover, occasionally people torture the data until it confesses something that doesn’t necessarily reflect the reality. It’s even more deplorable when the conclusions are used for decision-making, managing or optimizing the process. In extremis, the result is an iterative process that creates more and bigger issues than those it was supposed to solve!

Behind each blunder there are probably bigger understanding issues that need to be addressed. Many of the issues revolve around understanding how data are created, how they are brought together, how the processes work and what data they need, use and generate. Moreover, few business and IT people look at the full lifecycle of data and try to optimize it, or they optimize it in the wrong direction. Data Management is supposed to help, and it does this occasionally, though a methodology, its processes and practices are only as good as people’s understanding about data and its use! No matter how good a data methodology is, it’s as weak as the weakest link in its use, and typically the issues revolving around data and data understanding are the weakest link.

Besides technical people, few businesspeople understand the full extent of managing data and its lifecycle. Unfortunately, even if some of the topics are treated in books, they are too dry and need hands-on experience and some thought in corroborating practices with theories. Without this, people will do things mechanically, the processes being only as good as the people using them, their value becoming suboptimal and hindering the business. That’s why training on Data Management is not enough without some hands-on experience!

The most important impact is however in the BI/Data Analytics areas - how the various artifacts are created and used as support in decision-making, process optimization and other activities rooted in data. Ideally, some KPIs and other metrics should be enough for managing and directing a business. However, when decisions are based just on a set of KPIs, without understanding the bigger picture and without having a feeling for the data and their quality, the whole architecture, no matter how splendid, can break down like a sandcastle on a shore meeting the first powerful wave!

Sometimes it feels like organizations do things from inertia, driven by the forces of the moment, initiatives and business issues for which temporary and later permanent solutions are needed. The best chance for solving many of the issues would have been a long time ago, when the issues were still too small to create any powerful waves within the organizations. Therefore, a lot of effort is sometimes spent in solving the consequences of decisions not made at the right time, and that can be painful and costly!

For building a good business one also needs a solid foundation. In the past it was enough to have a good set of products that are profitable. However, during the past decade(s) the rules of the game changed, driven by fierce competition across geographies, with inefficiencies, especially in the data and process areas, costing organizations in the short and long term. Data Management in general and Data Quality in particular, even if they’re challenging to quantify, have the power to address by design many of the issues existing in organizations, if given the right chance!


02 September 2024

🗄️Data Management: Data Culture (Part III: A Tale of Two Cities I)

One of the curious things is that, as part of their change of culture, organizations try to adopt a new language, to give new names to things, and to make a distinction between the "AS IS" and "TO BE" states, insisting on how the new image will replace the previous one. Occasionally, they even stress how bad things were in the past and how great they will be in the future, trying to depict the future in vivid images.

Even if this might work occasionally, it tends to confuse people, and this not necessarily because of the language and the metaphors used, or the fact that the same people are in the same positions, but because of the lack of belief or conviction, respectively the half-hearted enthusiasm displayed by the parties. To "convert" people to new philosophies one needs to believe in them or mimic that in similar terms. The lack of conviction can easily have a false effect that spreads within the organization.

Dissociation from the past, from what an organization was, tends to increase the resistance against the new because two different images are involved. On one side there’s the attachment to the past, and even if there were mistakes made, or things didn’t go optimally, the experiences and decisions made are part of the organization, of the people who made them. People as individuals and as an organization should embrace their mistakes and good deeds altogether, learn from them, improve what is to improve and move forward. Conversely, there’s the resistance to the new, to the change, words they don’t believe in yet, the bigger picture is still fuzzy in their minds, and there can be many other reasons that don’t agree with one’s understanding.

There are images, memories, views, decisions, objectives of the past, and people need to recognize the road from what it was to what should be. One can hypothesize that embracing one’s mistakes and understanding the chain of reasoning from then and from now will help an organization transition towards the new. Awareness of one’s situation most probably will help in the transition process. Unfortunately, leaders and technology gurus tend to depict the past as negative, creating thus more negative emotions, respectively reactions in the process. The past is still part of the people, of the organization and will continue to be.

Conversely, the disassociation from the past can create more resistance to the new, and probably more unnecessary barriers. Probably, it would be easier for the gurus to build the new if the past weren’t there! Forgetting the past would be an error because there are many lessons that can still be useful. All the experience needs to be redirected in new directions. It’s more important to help people see the vision of the future, understand their missions, the paths to be followed and the challenges ahead.

It sounds more of a rambling from a psychology course, though organizations do have an image they want to change, to bring forth to cope with the various challenges, an image they want to reflect when needed. There are also organizations that want to change but keep their image intact, which leads to deeper conflicts. Unfortunately, changes of image involve conflicts that can become complex from what they bring forth.

A data culture should increase people’s awareness of the present, respectively of the future, of what it takes to bridge the gap, the challenges ahead, how to embrace change, how to keep a realistic perspective, how to do a reality check, etc. Methodologies can increase people’s awareness and provide the theoretical basis, though walking the path will be a different story for everyone.


01 September 2024

🗄️Data Management: Data Governance (Part I: No Guild of Heroes)

Data Management Series

Data governance appeared as a topic around the 1980s, though it gained popularity in the early 2000s [1]. Twenty years later, organizations still miss the mark, respectively fail to understand and implement it in a consistent manner. As usual, the reasons for failure are multiple and they vary from misunderstanding what governance is all about to poor implementation of methodologies and inadequate management or leadership.

Moreover, methodologies tend to idealize the various aspects, while what organizations need is pragmatism. For example, data governance is not about heroes and heroism [2]; such framing can give the impression that heroic actions are involved, which is not the case! Actions for the sake of action don’t necessarily lead to change by themselves. Organizations are in general good at creating meaningless action without results, especially when people preoccupy themselves, miss or ignore the mark. Big organizations are very good at generating actions without effects.

People do talk to each other, though they try to solve their own problems and optimize their own areas without necessarily thinking about the bigger picture. The problem is not necessarily communication or the lack of depth into business issues: people do communicate and know the issues even without a business impact assessment. The challenge is usually in convincing the upper management that the effort needs to be consolidated, supported, respectively the needed resources made available.

Probably, one of the issues with data governance is the attempt of creating another structure in the organization focused on quality, which has the chances to fail, and unfortunately does fail. Many issues appear when the structure gains weight and it becomes a separate entity instead of being the backbone of organizations.

As soon as organizations separate the data governance from the key users, management and the other important decisional people in the organization, it takes a life of its own that has the chances to diverge from the initial construct. Then, organizations need "alignment" and probably other big words to coordinate the effort. Such constructs can also work, but they are suboptimal because the forces will always pull in different directions.

Making each manager and the upper management responsible for governance is probably the way to go, though they’ll need the time for it. In theory, this can be achieved when many of the issues are solved at the lower level, when automation and further aspects allow them to supervise things, rather than hiding behind every issue.

When too much micromanagement is involved, people tend to busy themselves with topics rather than solve the issues they are confronted with. The actual actors need to be empowered to take decisions and optimize their work when needed. Kaizen, the philosophy of continuous improvement, has proven to work when applied correctly. They’ll need the knowledge, skills, time and support to do it though. One of the dangers is however that this becomes a full-time responsibility, which tends to create a separate entity again.

The challenge for organizations lies probably in the friction between where they are and what they must do to move forward toward the various objectives. Moving in small rapid steps is probably the way to go, though each person must be aware when something doesn’t work as expected and react. That’s probably the most important aspect.

So, the more functions are created that diverge from the actual organization, the higher the chances for failure. Unfortunately, failure becomes visible only in the later phases, and thus self-awareness, self-control and other similar “qualities” are needed, like small actors that keep the system in check and react whenever needed. Ideally, the employees are the best resources to react whenever something doesn’t work as per design.


Resources:
[1] Wikipedia (2023) Data Management [link]
[2] Tiankai Feng (2023) How to Turn Your Data Team Into Governance Heroes [link]

29 March 2024

🗄️🗒️Data Management: Data [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 29-Mar-2024

[Data Management] Data

{def} raw, unrelated numbers or entries that represent facts, concepts, events, and/or associations
categorized by

domain

{type} transactional data
{type} master data
{type} configuration data

{subtype} hierarchical data
{subtype} reference data
{subtype} setup data
{subtype} policy

{type} analytical data

{subtype} measurements
{subtype} metrics
{subtype}

structuredness

{type} structured data
{type} semi-structured data
{type} unstructured data

statistical usage as variable

{type} categorical data (aka qualitative data)

{subtype} nominal data
{subtype} ordinal data
{subtype} binary data

{type} numerical data (aka quantitative data)

{subtype} discrete data
{subtype} continuous data

size

{type} small data
{type} big data
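
To make the statistical categories above more tangible, here is a minimal Python sketch that guesses the subtype of a column from its values. The `StatType` enum, the `guess_stat_type` function and its rules are assumptions made for illustration only; in practice the type of a variable is decided by its meaning, not only by how it is stored.

```python
from enum import Enum


class StatType(Enum):
    """Subtypes from the 'statistical usage as variable' categorization."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    BINARY = "binary"
    DISCRETE = "discrete"
    CONTINUOUS = "continuous"


def _is_number(value):
    # bool is a subclass of int in Python, so it is excluded explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def guess_stat_type(values, ordered_levels=None):
    """Heuristically guess the statistical subtype of a column (illustrative rules)."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot classify an empty column")
    # An explicit ordering of the levels marks the column as ordinal
    if ordered_levels is not None and distinct <= set(ordered_levels):
        return StatType.ORDINAL
    if all(isinstance(v, bool) for v in distinct):
        return StatType.BINARY
    if all(_is_number(v) for v in distinct):
        # Whole numbers are treated as counts, anything else as measurements
        if all(isinstance(v, int) for v in distinct):
            return StatType.DISCRETE
        return StatType.CONTINUOUS
    # Non-numeric values: two distinct labels are binary, more are nominal
    return StatType.BINARY if len(distinct) == 2 else StatType.NOMINAL
```

For example, `guess_stat_type(["low", "high"], ordered_levels=["low", "medium", "high"])` is classified as ordinal only because the caller supplied the order; without it the same labels would be treated as binary.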

{concept} transactional data

{def} data that describe business transactions and/or events
supports the daily operations of an organization
commonly refers to data created and updated within operational systems
supports applications that automate key business processes
usually stored in normalized tables

{concept} master data

{def} "data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]

the key business entities on which transactions are executed

the dimensions around which analysis is conducted

used to categorize, evaluate and aggregate transactional data

can be shared across more than one transactional application
there are master data common to most organizations, but also master data specific to certain industries
often appear in more than one area within the business
represent one version of the truth
can be further divided into specialized subsets
{concept} master data entity

core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies
may be classified within a hierarchy

the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways

e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization

do not change as frequently as transactional data

less volatile than transactional data
there are master data that don’t change at all

e.g. geographic locations

strategic asset of the business
needs to be managed with the same diligence as other strategic assets
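
The relationship between the two concepts can be sketched in a few lines of Python: transactional records reference master data by key, and the master data attributes serve as the dimension used to categorize and aggregate them. The customers, orders and the `total_by_segment` helper below are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Master data: relatively stable business entities (hypothetical customers)
customers = {
    "C1": {"name": "Customer A", "segment": "Retail"},
    "C2": {"name": "Customer B", "segment": "Wholesale"},
    "C3": {"name": "Customer C", "segment": "Retail"},
}

# Transactional data: business events that reference master data by key
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 100.0},
    {"order_id": 2, "customer_id": "C2", "amount": 250.0},
    {"order_id": 3, "customer_id": "C3", "amount": 50.0},
    {"order_id": 4, "customer_id": "C1", "amount": 25.0},
]


def total_by_segment(orders, customers):
    """Aggregate order amounts along the 'segment' attribute of the master data."""
    totals = defaultdict(float)
    for order in orders:
        segment = customers[order["customer_id"]]["segment"]
        totals[segment] += order["amount"]
    return dict(totals)


totals = total_by_segment(orders, customers)
```

Changing the segment of a customer in the master data changes how all its transactions are aggregated, which is one reason why master data needs to be managed carefully.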

{concept} metadata

{definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]

data about data

refers to

database schemas for OLAP & OLTP systems
XML document schemas
report definitions
additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
application configuration data
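
As a small illustration of "data about data", the following Python sketch reads the schema of a table back from the database itself. It uses SQLite from the standard library as a stand-in (the notes above refer to SQL Server, whose extended properties work differently); the `customer` table and the `describe_table` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT)"
)


def describe_table(conn, table):
    """Return technical metadata for the columns of a table.

    PRAGMA table_info yields one row per column:
    (cid, name, type, notnull, dflt_value, pk).
    The table name is interpolated directly, so this helper is only
    suitable for trusted, hard-coded names.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {
            "column": row[1],
            "type": row[2],
            "nullable": not row[3],
            "primary_key": bool(row[5]),
        }
        for row in rows
    ]


metadata = describe_table(conn, "customer")
```

The result describes the structure of the data (column names, types, constraints) without containing any of the customer records themselves.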

{concept} analytical data

{definition} data that supports analytical activities

e.g. decision making, reporting queries and analysis

comprises

numerical values
metrics
measurements

stored in OLAP repositories

optimized for decision support
enterprise data warehouses
departmental data marts
within table structures designed to support aggregation, queries and data mining

{concept} hierarchical data
{definition} data that reflects a hierarchy
typically appears in analytical applications
{concept} hierarchy

{concept} structured data

{definition} "data that has a strict metadata defined"

{concept} unstructured data

{definition} data that doesn't follow predefined metadata
involves all kinds of documents
can appear in a database, in a file, or even in printed material

{concept} semi-structured data

{definition} structured data stored within unstructured data
data typically in XML form

XML is widely used for data exchange

can appear in stand-alone files or as part of a database (as a column in a table)
useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema
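
A minimal Python sketch of semi-structured data stored as a column in a table: each row carries an XML document whose elements may differ from row to row, so no detailed relational schema is needed for the attributes. The rows and the `extract_attributes` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Rows of a hypothetical table: (id, XML document stored in a column)
rows = [
    (1, "<product><name>Bike</name><color>red</color></product>"),
    (2, "<product><name>Helmet</name><size>M</size><color>blue</color></product>"),
]


def extract_attributes(xml_text):
    """Parse one XML document into a dict of element name -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


parsed = {row_id: extract_attributes(document) for row_id, document in rows}
```

The second document has a `size` element that the first one lacks; the parsing code doesn't need to change when such attributes are added or removed.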


References:
[1] The Art of Service (2017) Master Data Management Course

[2] DAMA International (2011) "The DAMA Dictionary of Data Management"

28 March 2024

🗄️🗒️Data Management: Master Data Management [MDM] [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 28-Mar-2024

Master Data Management (MDM)

{definition} the technologies, processes, policies, standards and guiding principles that enable the management of master data values to enable consistent, shared, contextual use across systems, of the most accurate, timely, and relevant version of truth about essential business entities [2],[3]
{goal} enable sharing of information assets across business domains and applications within an organization [4]
{goal} provide authoritative source of reconciled and quality-assessed master (and reference) data [4]
{goal} lower cost and complexity through use of standards, common data models, and integration patterns [4]
{driver} meeting organizational data requirements
{driver} improving data quality
{driver} reducing the costs for data integration
{driver} reducing risks
{type} operational MDM

involves solutions for managing transactional data in operational applications [1]
rely heavily on data integration technologies

{type} analytical MDM

involves solutions for managing analytical master data
centered on providing high quality dimensions with multiple hierarchies [1]
cannot influence operational systems

any data cleansing made within operational application isn’t recognized by transactional applications [1]

⇒ inconsistencies to the main operational data [1]

transactional application knowledge isn’t available to the cleansing process

{type} enterprise MDM

involves solutions for managing both transactional and analytical master data

manages all master data entities
deliver maximum business value

operational data cleansing

improves the operational efficiencies of the applications and the business processes that use the applications

cross-application data need

consolidation
standardization
cleansing
distribution

needs to support high volume of transactions

⇒ master data must be contained in data models designed for OLTP

⇐ ODS don’t fulfill this requirement

{enabler} high-quality data
{enabler} data governance
{benefit} single source of truth

used to support both operational and analytical applications in a consistent manner [1]

{benefit} consistent reporting

reduces the inconsistencies experienced previously
influenced by complex transformations

{benefit} improved competitiveness

MDM reduces the complexity of integrating new data and systems into the organization

⇒ increases flexibility and improves competitiveness

ability to react to new business opportunities quickly with limited resources

{benefit} improved risk management

more reliable and consistent data improves the business’s ability to manage enterprise risk [1]

{benefit} improved operational efficiency and reduced costs

helps identify the business’s pain points

by developing a strategy for managing master data

{benefit} improved decision making

reducing data inconsistency diminishes organizational data mistrust and facilitates clearer (and faster) business decisions [1]

{benefit} more reliable spend analysis and planning

better data integration helps planners come up with better decisions

improves the ability to

aggregate purchasing activities
coordinate competitive sourcing
be more predictable about future spending
generally improve vendor and supplier management

{benefit} regulatory compliance

reduces compliance risk

helps satisfy governance, regulatory and compliance requirements

simplifies compliance auditing

enables more effective information controls that facilitate compliance with regulations

{benefit} increased information quality

enables organizations to monitor conformance more effectively

via metadata collection
it can track whether data meets information quality expectations across vertical applications, which reduces information scrap and rework

{benefit} quicker results

reduces the delays associated with extraction and transformation of data [1]

⇒ it speeds up the implementation of application migrations, modernization projects, and data warehouse/data mart construction [1]

{benefit} improved business productivity

gives enterprise architects the chance to explore how effective the organization is in automating its business processes by exploiting the information asset [1]

⇐ master data helps organizations realize how the same data entities are represented, manipulated, or exchanged across applications within the enterprise and how those objects relate to business process workflows [1]

{benefit} simplified application development

provides the opportunity to consolidate the application functionality associated with the data lifecycle [1]

⇐ consolidation in MDM is not limited to the data
⇒ provides a single functional service to which different applications can subscribe

⇐ introducing a technical service layer for data lifecycle functionality provides the type of abstraction needed for deploying SOA or similar architectures

factors to consider for implementing an MDM:

effective technical infrastructure for collaboration [1]
organizational preparedness

for making a quick transition from a loosely combined confederation of vertical silos to a more tightly coupled collaborative framework
{recommendation} evaluate the kinds of training sessions and individual incentives required to create a smooth transition [1]

metadata management

via a metadata registry

{recommendation} sets up a mechanism for unifying a master data view when possible [1]
determines when that unification should be carried out [1]

technology integration

{recommendation} diagnose what technology needs to be integrated to support the process instead of developing the process around the technology [1]

anticipating/managing change

proper preparation and organization will subtly introduce change to the way people think and act as shown in any shift in pattern [1]
changes in reporting structures and needs are unavoidable

creating a partnership between Business and IT

IT roles

plays a major role in executing the MDM program [1]

business roles

identifying and standardizing master data [1]
facilitating change management within the MDM program [1]
establishing data ownership

measurably high data quality
overseeing processes via policies and procedures for data governance [1]

{challenge} establishing enterprise-wide data governance

{recommendation} define and distribute the policies and procedures governing the oversight of master data

seeking feedback from across the different application teams provides a chance to develop the stewardship framework agreed upon by the majority while preparing the organization for the transition [1]

{challenge} isolated islands of information

caused by vertical alignment of IT

makes it difficult to fix the dissimilarities in roles and responsibilities in relation to the isolated data sets because they are integrated into a master view [1]

caused by data ownership

the politics of information ownership and management have created artificial exclusive domains supervised by individuals who have no desire to centralize information [1]

{challenge} consolidating master data into a centrally managed data asset [1]

transfers the responsibility and accountability for information management from the lines of business to the organization [1]

{challenge} managing MDM

MDM should be considered a program and not a project or an application [1]

{challenge} achieving timely and accurate synchronization across disparate systems [1]
{challenge} different definitions of master metadata

different coding schemes, data types, collations, and more

{challenge} data conflicts

{recommendation} resolve data conflicts during the project [5]
{recommendation} replicate the resolved data issues back to the source systems [5]

{challenge} domain knowledge

{recommendation} involve domain experts in an MDM project [5]

{challenge} documentation

{recommendation} properly document your master data and metadata [5]

approaches

{architecture} no central MDM

isn’t a real MDM approach
used when any kind of cross-system interaction is required [5]

e.g. performing analysis on data from multiple systems, ad-hoc merging and cleansing

{drawback} very inexpensive at the beginning; however, it turns out to be the most expensive over time [5]

{architecture} central metadata storage

provides unified, centrally maintained definitions for master data [5]

followed and implemented by all systems

ad-hoc merging and cleansing becomes somewhat simpler [5]
does not use a specialized solution for the central metadata storage [5]

⇐ the central storage of metadata is probably in an unstructured form

e.g. documents, worksheets, paper

{architecture} central metadata storage with identity mapping

stores keys that map tables in the MDM solution

only has keys from the systems in the MDM database; it does not have any other attributes [5]

{benefit} data integration applications can be developed much more quickly and easily [5]
{drawback} raises problems in regard to maintaining master data over time [5]

there is no versioning or auditing in place to follow the changes [5]

⇒ viable for a limited time only

e.g. during upgrading, testing, and the initial usage of a new ERP system to provide mapping back to the old ERP system
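
The identity mapping approach can be sketched as a lookup structure that holds only keys and no other attributes. The system names, keys and helper functions below are hypothetical; a real solution would keep the mapping in a database table rather than in memory.

```python
# (source system, local key) -> master key; keys only, no other attributes
key_map = {
    ("ERP_OLD", "10045"): "M-001",
    ("ERP_NEW", "C-778"): "M-001",
    ("CRM", "ACC-12"): "M-001",
    ("ERP_OLD", "10046"): "M-002",
}


def to_master_key(system, local_key):
    """Return the master key for a local key, or None if it is unmapped."""
    return key_map.get((system, local_key))


def translate_key(from_system, local_key, to_system):
    """Map a key from one system to the matching key in another system."""
    master_key = to_master_key(from_system, local_key)
    if master_key is None:
        return None
    for (system, key), mapped in key_map.items():
        if system == to_system and mapped == master_key:
            return key
    return None
```

For example, `translate_key("ERP_NEW", "C-778", "ERP_OLD")` maps a customer in the new ERP system back to its key in the old one. Since nothing records when or why a mapping changed, the structure illustrates the lack of versioning and auditing mentioned above.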

{architecture} central metadata storage and central data that is continuously merged

stores metadata as well as master data in a dedicated MDM system
master data is not inserted or updated in the MDM system [5]
the merging (and cleansing) of master data from source systems occurs continuously, regularly [5]
{drawback} continuous merging can become expensive [5]
the only viable use for this approach is for finding out what has changed in source systems from the last merge [5]

enables merging only the delta (new and updated data)

frequently used for analytical systems
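
Merging only the delta can be sketched as a comparison between the state at the last merge and the current state of a source system. The function names and sample records below are assumptions for illustration; deletions in the source are deliberately ignored to keep the sketch short.

```python
def compute_delta(previous, current):
    """Return (new, updated) records relative to the state at the last merge."""
    new = {key: rec for key, rec in current.items() if key not in previous}
    updated = {
        key: rec
        for key, rec in current.items()
        if key in previous and previous[key] != rec
    }
    return new, updated


def merge_delta(master, new, updated):
    """Apply new and updated records to a copy of the central master data."""
    merged = dict(master)
    merged.update(new)
    merged.update(updated)
    return merged


# Hypothetical state at the last merge versus the source system now
last_merge = {
    "C1": {"name": "Customer A", "city": "Berlin"},
    "C2": {"name": "Customer B", "city": "Vienna"},
}
source_now = {
    "C1": {"name": "Customer A", "city": "Munich"},  # updated
    "C2": {"name": "Customer B", "city": "Vienna"},  # unchanged
    "C3": {"name": "Customer C", "city": "Zurich"},  # new
}

new, updated = compute_delta(last_merge, source_now)
merged = merge_delta(last_merge, new, updated)
```

Only `C1` and `C3` need to be processed in this run, which is what keeps the cost of a continuous merge under control.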

{architecture} central MDM, single copy

involves a specialized MDM application

master data, together with its metadata, is maintained in a central location [5]
⇒ all existing applications are consumers of the master data

{drawback} upgrade all existing applications to consume master data from central storage instead of maintaining their own copies [5]

⇒ can be expensive
⇒ can be impossible (e.g. for older systems)

{drawback} needs to consolidate all metadata from all source systems [5]
{drawback} the process of creating and updating master data could simply be too slow [5]

because of the processes in place

{architecture} central MDM, multiple copies

uses central storage of master data and its metadata

⇐ the metadata here includes only an intersection of common metadata from source systems [5]
each source system maintains its own copy of master data, with additional attributes that pertain to that system only [5]

after master data is inserted into the central MDM system, it is replicated (preferably automatically) to source systems, where the source-specific attributes are updated [5]
{benefit} good compromise between cost, data quality, and the effectiveness of the CRUD process [5]
{drawback} update conflicts

different systems can also update the common data [5]

⇒ involves continuous merges as well [5]

{drawback} uses a special MDM application
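
The update conflicts mentioned for this architecture can be illustrated with a minimal Python sketch: the central record holds the common attributes, each source system keeps its own copy with additional attributes, and a conflict arises when two systems change the same common attribute to different values. All names and the conflict rule are assumptions made for the example.

```python
def find_conflicts(central, copies, common_attributes):
    """Report common attributes that source systems changed to different values."""
    conflicts = {}
    for attribute in common_attributes:
        changed = {
            system: record[attribute]
            for system, record in copies.items()
            if record[attribute] != central[attribute]
        }
        # A single changed copy can simply be merged; diverging values cannot
        if len(set(changed.values())) > 1:
            conflicts[attribute] = changed
    return conflicts


# Hypothetical central master record and its copies in two source systems
central = {"name": "Customer A Ltd", "country": "DE"}
copies = {
    "ERP": {"name": "Customer A Ltd.", "country": "DE", "credit_limit": 5000},
    "CRM": {"name": "Customer A Limited", "country": "DE", "owner": "jdoe"},
}

conflicts = find_conflicts(central, copies, ("name", "country"))
```

Here both systems changed `name` in different ways, so the attribute needs a resolution rule or a data steward's decision before the central record can be updated.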


Acronyms:

MDM - Master Data Management

ODS - Operational Data Store

OLAP - online analytical processing

OLTP - online transactional processing

SOA - Service Oriented Architecture

References:
[1] The Art of Service (2017) Master Data Management Course
[2] DAMA International (2009) "The DAMA Guide to the Data Management Body of Knowledge" 1st Ed.

[3] Tony Fisher (2009) "The Data Asset"

[4] DAMA International (2017) "The DAMA Guide to the Data Management Body of Knowledge" 2nd Ed.

[5] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
