
10 October 2023

💫Data Warehousing and Dynamics 365 for Finance and Operations - A Few Issues to Consider I

Data Warehousing
Data Warehousing Series

Introduction

Besides the fact that data professionals don't have direct access to D365 F&O production environments (direct access is available only to sandboxes), which was from the beginning an important constraint imposed by the architecture, there are a few more challenges that need to be addressed when working with the data.

Case Sensitiveness

SQL Server is by default not case sensitive, therefore, depending on the channel through which the data came, values appear in upper case, lower case, or a mixture of both. Even if this isn't an issue in D365, it can become one when the data leave the environment. E.g., PowerQuery is case sensitive (while DAX is case insensitive), thus, if a field containing such a mix of values participates in a join or aggregation, this results in unexpected behavior (e.g., duplicates, ignored records). It's primarily the case of the Company (aka DataAreaId) field available in most of the important tables.

The ideal solution would be to make sure that the values are correct by design, however this can't always be enforced. Otherwise, when using the data outside of D365 F&O, the solution is to transform all the values to upper case (or lower case). However, even this step might come too late, e.g. when the data were already exported to the Azure Data Lake in parquet file format.
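A hedged sketch of the workaround in T-SQL, assuming the upper-casing is done in a view on top of the exported data (the table and columns follow typical F&O naming, though the example is illustrative):

-- normalize the company code before it participates in joins or aggregations
SELECT UPPER(SLT.DataAreaId) AS DataAreaId
, SLT.SalesId
, SLT.SalesName
FROM dbo.SalesTable SLT;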

Unique Keys

A unique record in D365 F&O was in earlier versions usually identified by the RecId and DataAreaId fields, to which the Partition field was later added. This means that most of the joins need to consider all 3 columns, which adds some overhead. In some environments there's only one Partition defined (and thus the field can be ignored), however this is not a guarantee.

As long as developers use SQL, there's no issue with using multiple fields in JOINs, though in PowerQuery a unique key must be created from the respective fields so that the JOINs are possible. Actually, SQL-based JOINs would also benefit if each record were identified by a single field.
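A minimal T-SQL sketch of both options, assuming the standard F&O system columns (the tables used are illustrative):

-- expose a single concatenated key that PowerQuery can use in merges
SELECT CONCAT(SLT.Partition, '|', SLT.DataAreaId, '|', SLT.RecId) AS UniqueKey
, SLT.SalesId
FROM dbo.SalesTable SLT;

-- in SQL, the JOIN considers the business key plus DataAreaId and Partition
SELECT SLT.SalesId
, SLL.LineNum
FROM dbo.SalesTable SLT
     JOIN dbo.SalesLine SLL
       ON SLL.SalesId = SLT.SalesId
      AND SLL.DataAreaId = SLT.DataAreaId
      AND SLL.Partition = SLT.Partition;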

Audit Metadata

Not all tables have fields that designate the date when a record was created or last modified, respectively the user who performed the action. The fields can be added manually when setting up the system, however that's seldom done. This makes it difficult to audit the records and is sometimes a challenge for reporting as well, respectively for troubleshooting the differences between the DWH and the source system. Fortunately, the Export to Data Lake adds a timestamp reflecting the time when the record was synchronized, though it can be used only for the records synchronized after the initial load.
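Assuming the synchronization timestamp is exposed as a column such as SinkModifiedOn (the actual column name depends on the export mechanism and its configuration), the recently synchronized records can be identified with a query along these lines (illustrative sketch):

-- records synchronized within the last day, based on the export's timestamp
SELECT ITM.ItemId
, ITM.SinkModifiedOn
FROM dbo.InventTable ITM
WHERE ITM.SinkModifiedOn >= DATEADD(d, -1, GETUTCDATE());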

Tables vs. Entities

Data are modified in D365 F&O via a collection of entities, which are nothing but views that encapsulate the business logic and are based on the base tables, on other views, or on a combination of both. The Export to Data Lake (*) is based on the tables, while the Link to Data Lake is based on data entities.

Using the base tables means that the developer must reengineer the logic from the views. In some cases it might work to create the entities as views in the DWH environment, though some features might not be supported. It's the case of serverless and dedicated SQL pools, which support only a subset of the features available in standard Azure SQL Server.

The developer can try to replicate the logic from the entities, considering only the logic needed by the business, especially when only a subset of the functionality available in the entity is used. The newly created views can thus become more readable and maintainable. On the other hand, if the logic in the entity changes, the changes need to be reflected in the DWH views as well.
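For example, a simplified customer view might replicate only the fraction of an entity's logic that the business actually needs. The sketch below is illustrative; the joins and columns are a simplified subset loosely following F&O naming, not the actual entity definition:

CREATE VIEW dbo.vCustomers
AS
-- only the columns and joins actually needed by the business
SELECT CST.AccountNum
, CST.CustGroup
, DPT.Name AS CustomerName
, UPPER(CST.DataAreaId) AS DataAreaId
FROM dbo.CustTable CST
     JOIN dbo.DirPartyTable DPT
       ON DPT.RecId = CST.Party
      AND DPT.Partition = CST.Partition;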

Using the entity-based data makes sure that the data are consistent between environments. Unfortunately, Microsoft found out that it isn't so easy to synchronize the data at entity level. Moreover, there are multiple entities based on the same table that reflect only a subset of the columns or rows. Thus, to cover all the fields from a base table, one might be forced to synchronize multiple views, leading to data duplication.

In theory, both mechanisms can be used within the same environment, even if this approach goes against the single source of truth principle, as data end up duplicated.

Data Validation in the Data Lake

One scenario in which both sources are useful is when validating whether the synchronization mechanism worked as expected. Thus, one can compare the number of records and check whether there are differences that can't be mitigated. However, does it make sense to "duplicate" database objects only for this purpose?

Ideally, validating whether a record was synchronized should be done in the source environment (e.g. via a timestamp). That's difficult to achieve, especially when there's no direct access to the source database (as is the case for production databases). Fortunately, Dataverse provides this functionality, even if it might not be bulletproof.

In extremis, the most reliable approach is to copy the production environment to a sandbox and do a count of records for each table, using the time when the refresh occurred as the baseline for comparison.
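A hedged sketch of such a comparison, assuming the lake copy and the sandbox copy are reachable from the same environment (the schema names are hypothetical):

-- compare record counts between the lake copy and the sandbox baseline
SELECT 'CustTable' AS TableName
, (SELECT COUNT(*) FROM lake.CustTable) AS LakeRecords
, (SELECT COUNT(*) FROM sandbox.CustTable) AS SandboxRecords;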

Base Enum Values

The lists of values that don't have their own tables are managed within the application as Base Enums, with only the numeric values being saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from the database, as the codes need to be converted to meaningful values. Some of the mappings between the codes and values are documented in two system tables, and even in old language-based documentation, though both sources are far from complete. As an alternative, one can try to discover the values in the system.

Unfortunately, the mappings need to be repeated whenever the Enum-based attribute is used in multiple places. One can reduce the mappings' duplication by encapsulating the logic into a view (aka "base view") that is reused accordingly (see the logic for TDM.vEcoResProduct).

Even if the values for many of the Enums are stored in the EnumValueTable table, with the Enum's name available in the EnumIdTable table, it's not a good idea to retrieve the values via a JOIN in the business logic, as this would complicate the business logic unnecessarily. A CASE is more efficient, even if occasionally more difficult to maintain. Unfortunately, there's no timestamp to identify which values were added recently.
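A minimal sketch of such a CASE-based mapping (the numeric codes shown are illustrative and should be validated against EnumValueTable or directly in the system):

SELECT SLT.SalesId
, CASE SLT.SalesStatus
    WHEN 1 THEN 'Open order'
    WHEN 2 THEN 'Delivered'
    WHEN 3 THEN 'Invoiced'
    WHEN 4 THEN 'Canceled'
    ELSE 'Unknown'
  END AS SalesStatusName
FROM dbo.SalesTable SLT;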

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

Data Warehousing

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). In theory it covers the final steps of the data pipeline, and in most cases it matches the definition of the query used for data extraction, though this depends also on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table in the staging area, a table with no constraints that matches only the data types from the source. Once the data are loaded, they are further copied into the production table. This approach allows minimizing the unavailability of the production table, as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center, however loading the data first into a staging table facilitates troubleshooting and testing. This approach also allows dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving data loads' efficiency.
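A hedged T-SQL sketch of the staging-then-copy pattern described above (all object names are illustrative):

-- 1) load the external data into the staging table (no constraints, source data types)
TRUNCATE TABLE stg.Customers;

INSERT INTO stg.Customers (CustomerId, CustomerName, Country)
SELECT CustomerId, CustomerName, Country
FROM externalsource.Customers;   -- the slow step: load from the external source

-- 2) optionally disable the nonclustered indexes on the production table
ALTER INDEX IX_Customers_Country ON prod.Customers DISABLE;

-- 3) copy the data within the same database (the fast step)
TRUNCATE TABLE prod.Customers;

INSERT INTO prod.Customers (CustomerId, CustomerName, Country)
SELECT CustomerId, CustomerName, Country
FROM stg.Customers;

-- 4) rebuild the indexes afterwards
ALTER INDEX ALL ON prod.Customers REBUILD;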

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and, where appropriate, additional metadata (e.g. an identifier for the source system, a unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except for the unique identifier, these metadata can however also be saved in a separate table, either with the same granularity as the load table (1:1) or with one record per load per table and system, storing a reference to the respective record in the load table. There are seldom logical arguments for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).
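A hedged sketch of the "one record per load per table and system" variant, with the load table storing only a reference (all names are illustrative):

CREATE TABLE dbo.LoadRuns (
  LoadRunId int IDENTITY(1,1) PRIMARY KEY
, SourceSystem nvarchar(50) NOT NULL
, TableName nvarchar(128) NOT NULL
, LoadTimestamp datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- the load table carries only the reference to the load run
ALTER TABLE stg.Customers
  ADD LoadRunId int NULL;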

A special approach is taken by the Data Vault methodology for Data Warehousing, which seems to gain increasing acceptance, especially to address the various compliance requirements for tracking the changes in records at the most granular level. To achieve this, the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, the satellite tables store the descriptions of the business keys (the other attributes except the business key), and the reference tables store the dropdown values. Besides the tables' decomposition, there are several other constraints that apply. The decomposition of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables – the format is already given – one thus needing to accommodate the data into the existing tables or to extend the respective tables to accommodate more attributes. It's the case of load tables storing data from multiple systems with a similar purpose (e.g. financial data from different ERP systems needed for consolidation).

27 December 2020

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing
Data Warehousing Series

One of the interesting concepts that seems to gain followers in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, which evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes that occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, a series of standards, methods and best practices supposed to facilitate its use.
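For illustration, a minimal hub, link and satellite for a Customer business key might look as below – a sketch following general Data Vault conventions rather than the full 2.0 specification, with all names being illustrative:

CREATE TABLE dv.HubCustomer (
  CustomerHashKey char(32) NOT NULL PRIMARY KEY  -- hash of the business key
, CustomerBusinessKey nvarchar(20) NOT NULL      -- e.g. the account number
, LoadDate datetime2 NOT NULL
, RecordSource nvarchar(50) NOT NULL
);

CREATE TABLE dv.LinkCustomerOrder (
  CustomerOrderHashKey char(32) NOT NULL PRIMARY KEY
, CustomerHashKey char(32) NOT NULL              -- relationship between business keys
, OrderHashKey char(32) NOT NULL
, LoadDate datetime2 NOT NULL
, RecordSource nvarchar(50) NOT NULL
);

CREATE TABLE dv.SatCustomer (
  CustomerHashKey char(32) NOT NULL
, LoadDate datetime2 NOT NULL
, RecordSource nvarchar(50) NOT NULL
, CustomerName nvarchar(100) NULL                -- descriptive attributes
, Country nvarchar(30) NULL
, CONSTRAINT PK_SatCustomer PRIMARY KEY (CustomerHashKey, LoadDate)
);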

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, thus allowing one to address the compliance requirements of Sarbanes-Oxley, HIPAA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce and NoSQL can be integrated within its architecture, though the same can be said about other approaches as long as there's compatibility between the considered technologies. In fact, except for business entities' decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against them, as more joins are required to gather the data from the various tables. The further decomposition of tables can also lead to higher data storage needs, though this can be negligible compared with the volume of additional objects that need to be created in the DWH. For an ERP system with a few hundred meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault are, as usual, tied to a poor understanding of the methodology, a poor skillset, or the attempt to implement the methodology without allowing some flexibility when required. Unless one knows what one is doing, bringing more complexity into a field which is already complex can easily impact projects' outcomes negatively.


References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

06 March 2017

⛏️Data Management: Audit Trail (Definitions)

"Audit records stored in the sybsecurity database." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"A record of what happened to data from its inception to its current state. Audit trails help verify the integrity of data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"Data maintained to trace activity, such as a transaction log, for purposes of recovery or audit." (Craig S Mullins, "Database Administration", 2012)

"A chronological record of activities on information resources that enables the reconstruction and examination of sequences of activities on those information resources for later review." (Mark Rhodes-Ousley, "Information Security: The Complete Reference, Second Edition" 2nd Ed., 2013)

"A trace of a sequence of events in a clerical or computer system. This audit usually identifies the creation or modification of any element in the system, who did it, and (possibly) why it was done." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A chronological record of events or transactions. An audit trail is used for examining or reconstructing a sequence of events or transactions, managing security, and recovering lost transactions." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

 "A path by which the original input to a process (e.g. data) can be traced back through the process, taking the process output as a starting point. This facilitates result checking and allows a process audit to be carried out [after TMap]." (Software Quality Assurance)

09 March 2007

🌁Software Engineering: Audit (Definitions)

"In CMMI process improvement work, an independent examination of a work product or set of work products to determine whether requirements are being met." (Sandy Shrum et al, "CMMI: Guidelines for Process Integration and Product Improvement", 2003)

"An independent examination of work products or work processes to assess compliance with defined processes, procedures, standards, specifications, or other criteria." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"An inspection of the plans, procedures, or records of a part of a business to determine whether or not a plan was followed and if a desired outcome was achieved. In this book, an audit looks into various aspects of a product launch or a bidding situation (win/loss)." (Steven Haines, "The Product Manager's Desk Reference", 2008)

"In the context of security, a review of a system in order to validate the security of the system. Generally, this either refers to code auditing or reviewing audit logs." (Mark S Merkow & Lakshmikanth Raghavan, "Secure and Resilient Software Development", 2010)

"Review of a company’s financial and accounting records and supporting documents by a professional, such as a certified public accountant. This also refers to an examination of an individual’s or a corporation’s tax returns to verify accuracy." (Sue Johnson & Gwen Moran, "The Complete Idiot's Guide® To Business Plans", 2010)

"An independent evaluation of software products or processes to ascertain compliance to standards, guidelines, specifications, and/or procedures based on objective criteria, including documents that specify the following: 
- The form or content of the products to be produced 
- The process by which the products shall be produced
- How compliance to standards or guidelines shall be measured." (Tilo Linz et al, "Software Testing Foundations" 4th Ed, 2014)

"The systematic, independent and documented process for obtaining audit evidence and evaluating it objectively to determine the extent to which the audit criteria are fulfilled" (David Sutton, "Information Risk Management: A practitioner’s guide", 2014)

"An independent assessment that takes a well-defined approach to examining an organization’s internal policies, controls, and activities." (Weiss, "Auditing IT Infrastructures for Compliance" 2nd Ed, 2015)

"A systematic assessment of significant importance to the organization that determines whether the system or process being audited satisfies some external standards." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed, 2018)

 "An independent evaluation of software products or processes to ascertain compliance to standards, guidelines, specifications, and/or procedures based on objective criteria, including documents that specify: (1) The form or content of the products to be produced (2) The process by which the products shall be produced (3) How compliance to standards or guidelines shall be measured [IEEE 1028]." (Software Quality Assurance)

"Formal inspection and verification to check whether a standard or set of guidelines is being followed, that records are accurate, or that efficiency and effectiveness targets are being met" (ITIL)

