23 January 2010

ETL and Synchronization

    In a previous post, ETL: SSIS packages vs. SQL code, I mentioned that there are three types of synchronization - of scope, of business logic and of data: the first targets the filters used to select the data, the second the logic used in data processing, where applicable, and the third working with the same unaltered copy of the data.

    The synchronization of scope is achieved by enforcing the same set of constraints on related data elements, usually involving parent-child relations like Invoice Headers/Lines or PO (Purchase Order) Headers/Lines/Shipments/Distributions, the more general purpose being to keep the referential integrity between the various data elements - assuring, for example, that all Items appearing in Invoices or POs are in scope, or that all Invoice Lines have corresponding Invoice Headers, and vice versa.
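
    As a rough illustration, the scope of the child records and of the referenced entities can be derived from the same filters applied to the parent records rather than being filtered independently; the sketch below assumes invented object names (InvoiceHeaders, InvoiceLines, POHeaders, POLines, Items) and filters, which would need to be mapped to the actual source schema:

-- keep only the Invoice Lines whose Header is in scope, reusing the Header filters
SELECT IL.*
FROM InvoiceLines IL
     JOIN InvoiceHeaders IH
       ON IL.InvoiceID = IH.InvoiceID
WHERE IH.InvoiceDate >= '2009-01-01'   -- the same filter used when extracting the Headers
  AND IH.OrgID = 101;

-- derive the Item scope from the documents already in scope
SELECT I.*
FROM Items I
WHERE I.ItemID IN (SELECT IL.ItemID
                   FROM InvoiceLines IL
                        JOIN InvoiceHeaders IH
                          ON IL.InvoiceID = IH.InvoiceID
                   WHERE IH.InvoiceDate >= '2009-01-01'
                     AND IH.OrgID = 101)
   OR I.ItemID IN (SELECT PL.ItemID
                   FROM POLines PL
                        JOIN POHeaders PH
                          ON PL.POHeaderID = PH.POHeaderID
                   WHERE PH.CreationDate >= '2009-01-01'
                     AND PH.OrgID = 101);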

    The synchronization of business logic could be enforced, for example, for data sets coming from distinct systems, usually involving different architectures, or for data sets requiring similar logic, considering the cases when data need to be duplicated with slight differences; the simplest example I can recall from the latter category is the creation of Item Master Data for the existing Inventory Organizations, the respective data sets needing to be created before the data are actually loaded.
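
    For the Item Master example, a minimal sketch of the duplication could look as follows, the master-level Item records being multiplied for the Inventory Organizations in scope before the actual load; again, all object names (stg_Items, InventoryOrgs, stg_ItemOrgAssignments) are invented for the example:

-- duplicate each master-level Item for every active Inventory Organization in scope
INSERT INTO stg_ItemOrgAssignments (ItemNumber, OrganizationCode, ItemStatus)
SELECT I.ItemNumber
     , O.OrganizationCode
     , I.ItemStatus
FROM stg_Items I
     CROSS JOIN InventoryOrgs O
WHERE O.IsActive = 1;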

    The synchronization of data requires working with the same unaltered data in the various threads of processing, either packages or SQL code; it’s quite simple as a concept, though not always straightforward to achieve. For this purpose it makes sense to load the data into the local database, even for slowly changing data sources like data warehouses, and to base the business logic on the respective local data set(s) rather than loading the data over and over again for each package, the latter also involving considerable additional network traffic.

    Why is the alternative not the best approach? Supposing two packages A and B are scheduled to run at the same time, even if the requests are sent simultaneously to the database, the simultaneity is relative, because most of the time the requests are queued and generally processed in the order they arrived, though this depends on the legacy system’s architecture and settings. Supposing that some time elapses between the moments the two requests are processed, there are good chances that a record was created, deleted or updated in between. The impact of such changes could be minimal, though unpredictable situations with unpredictable impact might result. There are chances to catch the issue when loading the data into the destination system, provided adequate validation is performed by it, though there are also good chances to load the data without noticing anything, the issues being discovered only later.
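
    A minimal sketch of the staging approach, assuming a SQL Server destination and a linked server to the source system (the linked server, database and object names being invented for the example): the data are loaded once into a local table and both packages work against the same unaltered snapshot:

-- load the source data once into a local staging table (the snapshot)
SELECT POH.POHeaderID
     , POH.PONumber
     , POH.Status
     , POH.LastUpdateDate
INTO stg_POHeaders
FROM LinkedServer.SourceDB.dbo.POHeaders POH
WHERE POH.LastUpdateDate >= '2010-01-01';

-- base both processing threads on the same local snapshot
SELECT * FROM stg_POHeaders WHERE Status = 'APPROVED';   -- e.g. package A
SELECT * FROM stg_POHeaders WHERE Status = 'OPEN';       -- e.g. package B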

    All three types of synchronization should be targeted, even if this complicates the design more than expected; I think it’s better to do defensive architecting/programming, the increase in complexity being relative if we’re considering the plumbing/redesign that would otherwise need to be done in order to fix the issues.
