SQL Troubles: compliance

Showing posts with label compliance. Show all posts

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part II: Quality Acceptance Criteria II)

Data Migrations Series

Auditability

Auditability is the degree to which the solution allows checking the data for their accuracy, or for their quality in general, respectively the degree to which the DM solution and processes allow to be audited regarding compliance, security and other types of requirements. All these aspects are important in case an external sign-off from an auditor is mandatory.

Automation

Automation is the degree to which the activities within a DM can be automated. Ideally all the processes or activities should be automated, though other requirements might be impacted negatively. Ideally, one needs to find the right balance between the various requirements.

Cohesion

Cohesion is the degree to which the tasks performed by the solution, respectively during the migration, are related to each other. Given the dependencies existing between data, their processing and further project-related activities, DM imply a high degree of cohesion that need to be addressed by design.

Complexity

Complexity is the degree to which a solution is difficult to understand given the various processing layers and dependencies existing within the data. The complexity of DM revolve mainly around the data structures and the transformations needed to translate the data between the various data models.

Compliance

Compliance is the degree to which a solution is compliant with internal or external regulations that apply. There should be differentiated between mandatory requirements, respectively recommendations and other requirements.

Consistency

Consistency is the degree to which data conform to an equivalent set of data, in this case the entities considered for the DM need to be consistent to each other. A record referenced in any entity of the migration need to be considered, respectively made available in the target system(s) either by parametrization or migration.

During each iteration, the data need to remain consistent, so it can facilitate the troubleshooting. The data are usually reimported between iterations or during same iteration, typically to reflect the changes occurred in the source systems or other purposes.

Coupling

Data coupling is the degree to which different processing areas within a DM share the same data, typically a reflection of the dependencies existing between the data. Ideally, the areas should be decoupled as much as possible.

Extensibility

Extensibility is the degree to which the solution or parts of the logic can be extended to accommodate further requirements. Typically, this involves changes that deviate from the standard functionality. Extensibility impacts positively the flexibility.

Flexibility

Flexibility is the degree to which a solution can handle new requirements or ad-hoc changes to the logic. No matter how good everything was planned there’s always something forgotten or new information is identified. Having the flexibility to change code or data on the fly can make an important difference.

Integrity

Integrity is the degree to which a solution prevents the changes to data besides the ones considered by design. Users and processes should not be able modifying the data besides the agreed procedures. This means that the data need to be processed in the sequence agreed. All aspects related to data integrity need to be documented accordingly.

Interoperability

Interoperability is the degree to which a solution’s components can exchange data and use the respective data. The various layers of a DM’s solutions must be able to process the data and this should be possible by design.

Maintainability

Maintainability is the degree to which a solution can be modified to or add minor features, change existing code, corrects issues, refactor code, improve performance or address changes in environment. The data required and the transformation rules are seldom known in advance. The data requirements are definitized during the various iterations, the changes needing to be implemented as the iterations progress. Thus, maintainability is a critical requirement.

Previous Post - Next Post

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers in theory the final steps from the data pipeline and in most of the cases it matches the definition of the query used for data extraction, though this depends also on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table from the staging area, table with no constraints that matches only the data types from the source. Once the data loaded, they are further copied into the production table. This approach allows minimizing the unavailability of the production table as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center, however loading the data first in a staging table facilitates troubleshooting and testing. This approach allows also dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving data loads’ efficiency.

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and upon case also additional metadata (e.g. identifier for source system, unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except the unique identifier, these metadata can however be saved also in a separate table, with the same granularity as the table (1:1) or one record for each load per table and system, storing a reference to the respective record into the load table. There are seldom logical argumentations for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).

A special approach is considered by the Data Vault methodology for Data Warehousing which seems to gain increasing acceptance, especially to address the various compliance requirements for tracking the change in records at most granular level. To achieve this the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, while satellite tables store the descriptions of the business keys (the other attributes except the business key) and reference tables store the dropdown values. Besides table’s denormalization there are several other constraints that apply. The denormalization of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables – the format is already given - one needing thus to accommodate the data into the existing tables or extended the respective tables to accommodate more attributes. It’s the case for load tables storing data from multiple systems with similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing Series

One of the interesting concepts that seems to gain adepts in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, and evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, a series of standards, methods and best practices supposed to facilitate its use.

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, allowing thus to address the compliance requirements of Sarabanes-Oxley, HIPPA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce, NoSQL can be integrated within its architecture, though the same can be said about other approaches as long there’s compatibility between the considered technologies. In fact, except business entities’ decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables as more joins are required to gather the data from the various tables. The further denormalization of tables can lead to higher data storage needs, though this can be neglectable compared with the volume of additional objects that need to be created in DWH. For an ERP system with a few hundred of meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault can be corroborated as usual with the poor understanding of the methodology, poor level of skillset or the attempt of implementing the methodology without allowing some flexibility when required. Unless one knows what he is doing, bringing more complexity in a field which is already complex, can easily impact negatively projects’ outcomes.

Previous Post <<||>> Next Post

References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

30 January 2019

🤝Governance: Compliance (Definitions)

"(1) Conforming or acquiescing to requirements from a third party. (2) A subset of data retention policies and procedures that must adhere to more rigid and rigorous conditions." (David G Hill, "Data Protection: Governance, Risk Management, and Compliance", 2009)

"The successful fulfillment of regulations, usually set by a financial institution (for borrowing purposes) or industry standards." (Annetta Cortez & Bob Yehling, "The Complete Idiot's Guide® To Risk Management", 2010)

"The process of conforming, completing, performing, or adapting actions to meet the rules, demands, or wishes of another party. Commonly used when discussing conformance to external government or industry regulations." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures 2nd Ed", 2012)

"The ability to operate in the way defined by a regulation. Many organizations are introduced to governance concepts as they begin the process of complying with business regulations, such as Sarbanes|Oxley or Basel II. These regulations are enforced by audits that determine whether business decisions were made by the appropriate staff according to appropriate policies. To pass these audits, organizations must document their decision rights, policies, and records, specifically that each of the decisions was in fact made by the appropriate person according to policy." (Paul C Dinsmore et al, "Enterprise Project Governance", 2012)

"A general concept of conforming to a rule, standard, law, or requirement such that the assessment of compliance results in a binomial result stated as 'compliant' or 'noncompliant'." (For Dummies, "PMP Certification All-in-One For Dummies, 2nd Ed.", 2013)

"Business rules enforced by legislation or some other governing body" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Compliance refers to a strategy and a set of activities and artifacts that allow teams to apply Lean-Agile development methods to build systems that have the highest possible quality, while simultaneously assuring they meet any regulatory, industry, or other relevant standards." (Dean Leffingwell, "SAFe 4.5 Reference Guide: Scaled Agile Framework for Lean Enterprises 2nd Ed", 2018)

"Ensuring that a standard or set of guidelines is followed, or that proper, consistent accounting or other practices are being employed." (ITIL)

"The capability of the software product to adhere to standards, conventions or regulations in laws and similar prescriptions." [ISO 9126]

SQL Troubles

Pages