Showing posts with label Synchronization. Show all posts
Showing posts with label Synchronization. Show all posts

23 January 2010

🧊🎡Data Warehousing: ETL (Part I: It's All about Synchronization)


Data Warehousing
Data Warehousing Series

In a previous post on ETL: SSIS packages vs. SQL code I mentioned that there are three types of synchronization - of scope, business logic and data. The first type targets synchronizing the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data.

The synchronization of scope is achieved by enforcing the same set of constraints to related data elements with the purpose of keeping the referential integrity between the various data elements. They involve parent-child relations like Purchase orders (PO) Headers/Lines or reference/referenced relations like PO Lines and Products or Vendors. Such relationships can involve datasets coming from distinct systems, usually involving different architecture - e.g., Products or Bill of Materials master data that could come from a Product Management Information (PMI) system.

The synchronization of business logic is usually a consequence of the synchronization of scope and concerns the set of business rules that must be applied in the logic used. For example, is need to consider Products or Vendors only for the open POs. Therefore, that would involve the duplication of logic for open POs across multiple database objects, logic that needs to be synchronized accordingly. One can reference directly the object encapsulating the logic for open POs, though that would create a recursive reference as the logic references the Products and/or Vendors as well. With one or two tricks this can be avoided.

Synchronizations can involve local logic, e.g. using exchange rates conversions and other transformations where a unique user-defined function could allow achieve this. No matter of the techniques used, one must make sure that all the logic is kept in synch.

The synchronization of data requires working with the same unaltered data in the various threads of processing, either SSIS packages or SQL code; that's quite simple as concept, though not always straightforward to achieve and occasionally ignored. For this purpose, it makes sense to load the data into an intermediary database or staging area, even for slowly changes data sources like data warehouses, and base the business logic on the respective local dataset(s) rather than loading the data repeatedly for each package from the source, which can also involve considerable additional network traffic. Why the alternative is not the best approach?

Supposing that two packages A and B are scheduled to run at the same time, even if the requests are sent simultaneously to the database, the simultaneously is relative, because most of the time the requests are queued, in general following to be processed in the order they came, though this depends on legacy system’s architecture and settings. Supposing that there is some time elapsed between the times the two requests are processed, there are good chances that one record was created, deleted, or updated in between.

The impact of changes in data could be minimum though strange situations might result with unpredictable impact. There are chances to find the issue when loading the data in the destination system, this if adequate validation is performed, though there are good chances for the issues to remain undiscovered until later, with all consequences resulting from this. Therefore, one should also build validation logic separately when feasible.

It should be targeted to cover all three types of synchronization even if it complicates the design of the solutions, as it's recommended in general to apply defensive architecting/programming. The increase in complexity is relative if one considers the effort needed for troubleshooting, plumbing or even redesign that needs to be done to fix the issues.


Created: Jan-2010, Last Reviewed: Mar-2024

06 February 2007

🌁Software Engineering: Synchronization (Definitions)

"Enforcing constraints on the ordering of events occurring in different UEs. This is primarily used to ensure that shared resources are accessed by a collection of UEs in such a way that the program is correct regardless of how the UEs are scheduled." (Beverly A Sanders, "Patterns for Parallel Programming", 2004)

"Coordination in the execution of multiple threads. The most common cases of synchronization occur when you provide mutually exclusive access to shared resources or gather all threads at a point in the code before they are allowed to proceed." (Clay Breshears, "The Art of Concurrency", 2009)

"Timekeeping which requires the coordination of events to operate a system in unison." (Chen Liu et al, "Simultaneous MultiThreading Microarchitecture", 2010)

"The coordination, of tasks or threads, in order to obtain the desired runtime order. Commonly used to avoid undesired race conditions." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A technique for coordinating threads or processes to have appropriate execution order." (Masoud Hemmatpour et al, "Cost Evaluation of Synchronization Algorithms for Multicore Architectures", 2018)

"In multiprocessing systems, joining multiple independent processes in order to reach an agreement or commit to a certain sequence of actions." (Gen'ichi Yasuda, "A Formal Approach to the Distributed Software Control for Automated Multi-Axis Manufacturing Machines", 2018)

"Process of coordinating two or more activities in time." (Dharmendra T Patel, "Distributed Computing for Internet of Things (IoT)", 2019)


Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.