
10 October 2023

Data Warehousing and Dynamics 365 for Finance and Operations - A Few Issues to Consider I

Data Warehousing
Data Warehousing Series

Introduction

Besides the fact that data professionals don't have direct access to D365 F&O production environments (direct access is available only for sandboxes), which has been an important constraint imposed by the architecture from the beginning, there are a few more challenges that need to be addressed when working with the data.

Case Sensitivity

SQL Server is not case sensitive by default, therefore, depending on the channel through which the data came, values appear either in upper case, in lower case, or in a mixture of both. Even if this isn't an issue in D365, it can become an issue once the data leave the environment. E.g., Power Query is case sensitive (while DAX is case insensitive), thus, if a field containing a mix of values participates in a join or aggregation, this will result in unexpected behavior (e.g., duplicates, records ignored). It's primarily the case of the Company (aka DataAreaId) field available in most of the important tables.

The ideal solution would be to make sure that the values are correct by design, however this can't always be enforced. Otherwise, when using the data outside of D365 F&O, the solution would be to transform all the values to upper case (or lower case). However, even this step might come too late, e.g. when the data were already exported to the Azure Data Lake in parquet file format.
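A minimal sketch of such a transformation in the DWH layer, assuming the standard InventTable was made available there (the view name and the TDM schema are only illustrative):

-- exposing the data with normalized Company values
CREATE VIEW TDM.vInventTable
AS
SELECT ITM.ItemId
, Upper(ITM.DataAreaId) DataAreaId
, ITM.RecId
FROM dbo.InventTable ITM

All downstream joins and aggregations would then be based on this view instead of the raw table, so the casing is normalized in a single place.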

Unique Keys

A unique record in D365 F&O was usually identified in earlier versions by the RecId and DataAreaId, while later the Partition field was added. This means that most of the joins will need to consider all 3 columns, which adds some overhead. In some environments there's only one Partition defined (and thus the field can be ignored), however this is not a guarantee.

As long as developers use SQL there's no issue with using multiple fields in JOINs, though in Power Query a unique key must be created based on the respective fields so that the JOINs are possible. Actually, SQL-based JOINs would also benefit if each record were identified by a single field.
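A minimal sketch for building such a single-field key in the DWH layer (the concatenation pattern and the delimiter are a matter of choice):

-- building a composite unique key from Partition, DataAreaId and RecId
SELECT Cast(ITM.Partition as nvarchar(20)) + '|' + Upper(ITM.DataAreaId) + '|' + Cast(ITM.RecId as nvarchar(20)) UniqueKey
, ITM.ItemId
FROM dbo.InventTable ITM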

Audit Metadata

Not all tables have fields that designate the date when a record was created or last modified, respectively the user who performed the action. The fields can be enabled manually when setting up the system, however that's seldom done. This makes it difficult to audit the records and it's sometimes a challenge for reporting as well, respectively for troubleshooting the differences between the DWH and the source system. Fortunately, the Export to Data Lake adds a timestamp reflecting the time when the record was synchronized, though it can be used only for the records synchronized after the initial load.
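A minimal sketch for identifying the records synchronized since a given point in time (the name of the synchronization timestamp column is an assumption and may differ depending on the mechanism and version used):

-- counting the records synchronized since the last DWH refresh
-- (the timestamp column name is an assumption)
SELECT count(*) NoRecords
FROM dbo.InventTable ITM
WHERE ITM.LastProcessedChange_DateTime >= '2023-10-01'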

Tables vs. Entities

Data are modified in D365 F&O via a collection of entities, which are nothing but views that encapsulate the business logic, being based on the base tables, on other views, or on a combination of both. The Export to Data Lake (*) is based on the tables, while Link to Data Lake is based on data entities.

Using the base tables means that the developer must reengineer the logic from the views. In some cases it might work to recreate the entities as views in the DWH environment, though some features might not be supported. It's the case of serverless and dedicated SQL pools, which support only a subset of the features available under standard Azure SQL Server.

The developer can try to replicate the logic from the entities, considering only the logic needed by the business, especially when only a subset of the functionality available in the entity is used. The newly created views can thus become more readable and maintainable. On the other side, if the logic in the entity changes, the changes need to be reflected also in the DWH views.
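As a minimal sketch, a simplified product view over the base tables could replicate just the subset of an entity's logic needed by the business (only a few columns are considered; the view name is illustrative):

-- simplified view over the base tables, replicating a subset of an entity's logic
CREATE VIEW TDM.vProducts
AS
SELECT ITM.ItemId
, Upper(ITM.DataAreaId) DataAreaId
, PRD.DisplayProductNumber ProductNumber
FROM dbo.InventTable ITM
     JOIN dbo.EcoResProduct PRD
       ON ITM.Product = PRD.RecId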

Using the entity-based data makes sure that the data are consistent between environments. Unfortunately, Microsoft found out that it isn't so easy to synchronize the data at entity level. Moreover, there are multiple entities based on the same table that reflect only a subset of the columns or rows. Thus, to cover all the fields from a base table, one might be forced to synchronize multiple entities, leading thus to data duplication.

In theory, both mechanisms can be used within the same environment, even if this approach goes against the single source of truth principle when data are duplicated.

Data Validation in the Data Lake

One scenario in which both sources are useful is when validating whether the synchronization mechanism worked as expected. Thus, one can compare the number of records and check whether there are differences that can't be mitigated. However, does it make sense to "duplicate" database objects only for this purpose?

Ideally, the validation of whether a record was synchronized should be done in the source environment (e.g. via a timestamp). That's difficult to achieve, especially when there's no direct access to the source database (as is the case for Production databases). Fortunately, Dataverse provides this functionality, even if it might not be bulletproof.

In extremis, the most reliable approach is to copy the production environment to a sandbox and do a count of records for each table, using as baseline for comparison the time when the refresh occurred.
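A minimal sketch of such a validation: run the same count per company against the sandbox copy and against the data lake copy, then compare the results (the table is used only for illustration):

-- record counts per company, to be run against both sources and compared
SELECT Upper(ITM.DataAreaId) Company
, count(*) NoRecords
FROM dbo.InventTable ITM
GROUP BY Upper(ITM.DataAreaId)
ORDER BY Company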

Base Enum Values

The lists of values that don't have their own tables are managed within the application as Base Enums and, naturally, only the numeric values are saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from the database, as the codes need to be converted to meaningful values. Some of the mappings between the codes and values are documented in two system tables, and even in old language-based documentation, though both sources are far from complete. As an alternative, one can try to discover the values in the system.

Unfortunately, the mappings need to be repeated wherever the Enum-based attribute is used. One can reduce the mapping's duplication by encapsulating the logic into a view (aka "base view") and reusing it accordingly (see the logic for TDM.vEcoResProduct).

Even if the values for many of the Enums are stored in the EnumValueTable table, the Enum's name being available in the EnumIdTable table, it's not a good idea to retrieve the values via a JOIN in the business logic. This would complicate the business logic unnecessarily. A CASE is more efficient, even if occasionally more difficult to maintain. Unfortunately, there's no timestamp to identify which values were added lately.
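A minimal sketch of such a CASE-based mapping (the enum and the code-to-value pairs are illustrative and should be validated against the system):

-- translating a Base Enum into meaningful values via CASE (codes are illustrative)
SELECT ITM.ItemId
, CASE ITM.ABCRevenue
    WHEN 0 THEN 'None'
    WHEN 1 THEN 'A'
    WHEN 2 THEN 'B'
    WHEN 3 THEN 'C'
  END ABCRevenueName
FROM dbo.InventTable ITM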

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.

31 January 2010

Business Intelligence: Reports' Design II (Choosing Report’s Attributes)

Business Intelligence
Business Intelligence Series

Introduction

How are the attributes of a report chosen? Attributes are added primarily based on users' specifications, however these can often be too high level, or the user ignored certain aspects, willingly or by mistake. In general, a report needs to show the attributes of high relevance to a certain topic, for example Document information (Document Number, Type, Dates, Statuses, etc.), Product main information (Product Number, Description, Type, Status, etc.), Quantities, Prices, Amounts, Responsible Users (e.g. Buyers, Preparers, Managers, etc.) or Responsible Third Parties (e.g. Customers, Vendors, Carriers).

When choosing the attributes for a report, there are several important sets of attributes which need to be considered:

Unique identifiers

Together with the various Names (e.g. Vendor Name, Customer Name) associated with entities, a report should also include the “unique identifier” (UID) for each entity, even if it is formed from one or more attributes. The UID allows identifying, for example, whether duplicate records appear in the report, and it can be used to match/join the data from the report with other datasets in order to pull details or for further analysis of the data. For example, in a PO report over PO Shipments a unique (natural) key could be built from the PO Number, Line Number and Shipment Number; for a Vendor one could use the Vendor Name or the GSL (Global Supplier Location) Number, though the latter is more adequate because it's more general and accurate, making the Vendor's identification easier. In theory, the database (surrogate) unique identifier from the PO Shipments table, the element dictating the report's level of detail, respectively the Vendor ID, could be used for the same purpose; though even if surrogate UIDs are easier to use in joins, they could create confusion and overload the reports, given that surrogate UIDs would need to be provided also for the other elements.

Documents like Invoices include an external and an internal unique identifier: the Invoice Number together with the Vendor, typically unique in a system, form the external UID, while the Document or Voucher Number is used as the internal UID. The external UID is easier to use for external-facing considerations, while the internal UID is easier to use for internal needs, so it makes sense to include both types of unique identifiers.
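As a minimal sketch, a join on such a natural key could look as follows (the table and column names are hypothetical, as they depend on the system considered):

-- joining PO Shipments to Invoice Lines on the natural key (names are hypothetical)
SELECT POS.PONumber
, POS.LineNumber
, POS.ShipmentNumber
, INL.InvoiceNumber
FROM dbo.POShipments POS
     LEFT JOIN dbo.InvoiceLines INL
       ON POS.PONumber = INL.PONumber
      AND POS.LineNumber = INL.POLineNumber
      AND POS.ShipmentNumber = INL.POShipmentNumber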

Quantities & Related Attributes

In Item-related reports, most of the time it makes sense to include also the quantities (e.g. Transaction, Ordered, Delivered, Invoiced, On-Hand Quantities) together with the Unit of Measure (UOM) in which they are represented. A distinction has to be made between the Primary UOM, the UOM in which the item is stored, and the Transactional UOM, the UOM in which the Item is transacted; for example the Purchasing UOM, Sales UOM or Transaction UOM could be different from the Primary UOM in which the item is stored in the Warehouse. In such cases, together with the Transactional UOM should be provided also the Primary UOM and eventually the UOM Conversion Rate, when applicable.
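A minimal sketch of deriving the quantity in the Primary UOM from the transactional quantity (the column names are hypothetical and the direction of the conversion rate depends on the system):

-- converting the transactional quantity to the Primary UOM (names are hypothetical)
SELECT POS.PONumber
, POS.Quantity TransactionQty
, POS.PurchasingUOM
, POS.PrimaryUOM
, POS.UOMConversionRate
, POS.Quantity * POS.UOMConversionRate PrimaryQty
FROM dbo.POShipments POS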

Prices/Amounts & Related Attributes

For Item-related reports, and not only, include the various Prices (e.g. Sales, Purchase, Standard Price) together with the Currency Code used, even if only one Currency is used, the same rule applying also to the amounts stored (e.g. Invoice, Sales Order, Purchase amounts). For financial reports it's advisable to show both the functional amounts, the amounts in the Currency used by the GL (General Ledger), and the transactional amounts, the amounts in the Currency used in the transaction. When the level of detail allows it, show also the Quantity and Price Unit used to calculate the amounts, respectively the eventual Exchange Rate or UOM Conversion Rate used. When available, include also the Period in which the Amount was booked in the system.
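A minimal sketch of deriving the functional amount from the transactional amount (the column names are hypothetical; the exchange rate direction and rate type depend on the system):

-- deriving the functional amount from the transactional amount (names are hypothetical)
SELECT INL.InvoiceNumber
, INL.Amount TransactionalAmount
, INL.CurrencyCode
, INL.ExchangeRate
, INL.Amount * INL.ExchangeRate FunctionalAmount
FROM dbo.InvoiceLines INL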

Dates

Typically, the Document Date (e.g. Invoice Date, Order Date) and the Document Creation Date should be included, together with the other Dates important for the business or for data analysis (e.g. Need By Date, GL Date, Value Date). In general, the Document Date or Document Creation Date, respectively the GL Date for financial reports, should be mandatory attributes because they can be used to segment (partition) a dataset into time units (e.g. days, weeks, months, periods, years, etc.).
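For example, segmenting a dataset by month based on the Document Date could look as follows (the table and column names are hypothetical):

-- segmenting the Invoices by month based on the Document Date (names are hypothetical)
SELECT Year(INV.DocumentDate) [Year]
, Month(INV.DocumentDate) [Month]
, Sum(INV.Amount) TotalAmount
FROM dbo.Invoices INV
GROUP BY Year(INV.DocumentDate)
, Month(INV.DocumentDate)
ORDER BY [Year], [Month]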

Statuses

The various record statuses and document statuses should again be mandatory attributes in reports. Record statuses show whether a record is active, was cancelled or marked as deleted, while document statuses show the documents' processing status, often being associated with a workflow (e.g. approval or processing workflows). The record statuses could be synchronized and even merged with the document statuses.

Whether expressed as flags or as lists of values, statuses are essential in delimiting the dataset that needs to be considered for further calculations, because unapproved documents or cancelled records often have low or no relevance for the business. Unapproved documents are typically not considered for the various calculations until they are approved, while cancelled records are associated with mistakes or the lack of need. Not being able to identify the active records can mess things up pretty badly, because, for example, some reports show only the active records, while others show all the data available in a system. Therefore, showing the statuses in reports can be important for mitigating the differences between reports, especially when dealing with calculations.

It’s advisable to have the possibility to see also the cancelled records, for example in order to analyze the amount of waste expressed as overwork or for identifying the records that were cancelled by mistake.

In reports with multiple levels of detail, it can be useful to show the statuses from all levels, as the statuses might not be in sync or might have different meanings. In theory, when the statuses are in sync, and especially when considering cancellations, it should be enough to consider the status from the lowest level of detail of each logical entity (e.g. the PO Shipment Status when considering POs, the Invoice Line Status when considering Invoices, both statuses when considering POs together with Invoices), though reality can prove to be a tough world for statuses, as programming errors and other business scenarios need to be considered.
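A minimal sketch of delimiting the active, approved records via statuses (the status values and column names are hypothetical):

-- restricting the dataset to active, approved records (names and values are hypothetical)
SELECT POS.PONumber
, POS.LineNumber
, POS.ShipmentNumber
, POS.Amount
FROM dbo.POShipments POS
WHERE POS.ShipmentStatus <> 'Cancelled'
  AND POS.ApprovalStatus = 'Approved'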

Action Owners

Include Requestors, Document Preparers, Buyers, Managers or any other type of action owners, so a user can track the direct or indirect issues back to them.

Note:
Such attributes can be used as a basis to calculate/reflect the action owner's performance, a fact that can infringe country or organization regulations, so you need to check whether there are any constraints in this direction and which set of attributes might be impacted. For example, it might be no problem to show the Buyer, though it might be a problem to show information about who created/modified the record. Eventually, if the performance needs to be calculated at action owner level, substitute any attribute that can be used to identify a person with a random value; however, if the mapping between the action owner and the value used as substitute is known (in case unique identifiers are used) or easy to get (by checking the records in the system), the data might be misused.

13 January 2010

Data Management: Data Quality Dimensions (Part I: Uniqueness)

Data Management
Data Management Series

Uniqueness refers to "requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures" [1]. An alternative name is that of duplicates, which stresses the existence of duplicate records within a dataset, the non-uniqueness being a better indicator of nonconformance, especially when considering datasets.

Why is it required to enforce the uniqueness of entities? An entity is defined using a number of attributes representing the entity's characteristics; in case the attributes of two entities have the same values, then most likely the two representations refer to the same entity. This holds in most of the cases, though there are situations in which the attribute(s) that make(s) it possible to differentiate between two distinct entities is/are not adequately maintained or not considered at all. The impossibility of identifying an entity uniquely increases the chances of using one of the respective entities wrongly, for example booking an Invoice against the wrong Vendor, with all the implications derived from it.

For each type of entity there can be one or more attributes that allow identifying it uniquely, for example in a Vendor's case the Vendor's name and address. The more such attributes there are, the more difficult the identification of a Vendor becomes; therefore, even if such a set of attributes exists, like in the Vendor's case, it's preferable to use instead a unique identifier, a numeric or alphanumeric key that identifies an entity uniquely.

A Vendor could be uniquely identified by the Vendor ID, though this allows the unique identification of a Vendor only within a data repository, the chances being quite high to have another Vendor ID for the same entity in another data repository. Therefore, in order to guarantee the uniqueness of entities, it's preferable to use instead an attribute that has the same value regardless of the data repository the entity is stored in (e.g. Vendor Number, Item Number, Customer Number, Asset Number, Sales Order Number, etc.). Such attributes could be enforced to be unique across a set of data repositories, though one of them must function as 'master'.
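A minimal sketch for checking whether such an attribute is indeed unique within a repository (the table and column names are hypothetical):

-- identifying duplicate Vendor Numbers within a repository (names are hypothetical)
SELECT VND.VendorNumber
, count(*) NoRecords
FROM dbo.Vendors VND
GROUP BY VND.VendorNumber
HAVING count(*) > 1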

Multiple identifiers for the same entity may exist, though this can easily create confusion, especially when this happens within the same system and people or machines are not aware that the respective identifiers refer to the same entity, and the more identifiers we have for the same entity the higher the chances of creating confusion. Imagine that in the same system you book some of the Invoices against one of the identifiers, and the remaining Invoices against another identifier of the same entity.

Especially in reports this might be quite a problem, as amounts that should appear for the same entity are split between two references, and even if they refer to the same entity, the report's users might not be aware of it, the calculations based on such numbers not reflecting reality. Imagine that you have booked the Invoices against two references to the same Vendor, and you want to consider the top 10 Vendors with the highest spend volume; it might happen that the aggregate amounts for each of the respective references don't make it into the top 10, though when considered together they could even take the first position.

There are even situations in which the use of multiple identifiers for the same entity is inevitable; important here are the cases in which several different systems refer to the same entity in their own unique way, often not being designed to use a global unique identifier for an entity. Under such circumstances, what an organization could do is either to extend the system(s) to store the global identifier or to have a mapping in place between the references used to indicate the same entity.

The more systems within an enterprise use/represent the same entity, the more complex it is to manage the correct referencing; the logical thing to do is to have in place a master system that stores the global identifier and integrate it with the other systems in place, or at least to store the global identifier in each of the systems. Such identifiers, used for master or even transactional data, allow creating consolidated reports based on several systems and mitigating possible issues existing in the data.
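A minimal sketch of such a mapping between the local identifiers and a global identifier (the table and column names are hypothetical):

-- cross-reference between local Vendor identifiers and the global identifier (names are hypothetical)
CREATE TABLE dbo.VendorXRef (
  GlobalVendorNumber nvarchar(20) NOT NULL
, SourceSystem nvarchar(20) NOT NULL
, LocalVendorId nvarchar(20) NOT NULL
, CONSTRAINT PK_VendorXRef PRIMARY KEY (SourceSystem, LocalVendorId)
)

A consolidated report would then join each system's data to this table on SourceSystem and LocalVendorId and aggregate by GlobalVendorNumber.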

Even if such keys are designed to enforce uniqueness, this doesn't mean it necessarily happens; the reason is simple - behind the assignment of a key to an entity there is a process defined, and no matter how robust the process, if it doesn't cover all the scenarios that might occur or if it can't deal (adequately) with various constraints, then at some point in time a flow might lead to a duplicate, a non-unique record. Take for example a Product Number's creation - supposing that a Product Number XYZ is created in an ERP system and one or more of the respective Product's attributes are incorrect. In many cases such errors can be corrected, though each system comes with its own constraints that don't allow changing certain attributes (e.g. the Unit of Measure once Inventory was booked). Because an entity with the wrong attributes is inadequate to be used, a new Product Number is created for the same entity, the two identifiers coexisting for some time. (Some systems allow building a relationship between the Products, showing that one Product replaces another going forward.)

Some systems could allow the deletion of the first entered Product, though that's not always possible. Even if the two Product Numbers are different, they point to the same Product entity, and even if the first entered Product is marked as obsolete, transactions might already have been made with it before this happens; again, it might be possible to revert the transactions, though that's not always possible. And that's not all: once a Product is marked as obsolete, this doesn't necessarily imply that a User won't reactivate the Product in the near or far future. As can be seen, the scenarios are quite complex, and not all the systems and processes are designed to handle them. If such deviations from uniqueness can't be corrected in the legacy systems, they should at least be corrected during conversion and migration.


Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management"
