04 May 2019

Data Warehousing: Push vs. Pull

Data Warehousing

In data integrations, data migrations and data warehousing there is the need to move data between two or more systems. In the simplest scenario there are only two systems involved, a source and a target system, though there can be complex scenarios in which data from multiple sources need to be available in a common target system (as in the case of data warehouses/marts or data migrations), or data from one source (e.g. ERP systems) need to be available in other systems (e.g. Web shops, planning systems), or there can be complex cases in which there is a many-to-many relationship (e.g. data from two ERP systems are consolidated in other systems).  

The data can flow in one direction from the source systems to the target systems (aka unidirectional flow), though there can be situations in which once the data are modified in the target system they need to flow back to the source system (aka bidirectional flow), as in the case of planning or product development systems. In complex scenarios the communication may occur multiple times within same process until a final state is reached.

Independently of the number of systems and the type of communication involved, data need to flow between the systems as smooth as possible, assuring that the data are consistent between the various systems and available when needed. The architectures responsible for moving data between the sources are based on two simple mechanisms - push vs pull – or combinations of them.

A push mechanism makes data to be pushed from the source system into the target system(s), the source system being responsible for the operation. Typically the push can happen as soon as an event occurs in the source system, event that leads to or follows a change in the data. There can be also cases when is preferred to push the data at regular points in time (e.g. hourly, daily), especially when the changes aren’t needed immediately. This later scenario allows to still make changes to the data in the source until they are sent to other system(s). When the ability to make changes is critical this can be controlled over specific business rules.

A pull mechanism makes the data to be pulled from the source system into the target system, the target systems being responsible for the operation. This usually happens at regular points in time or on demand, however the target system has to check whether the data have been changed.

Hybrid scenarios may involve a middleware that sits between the systems, being responsible for pulling the data from the source systems and pushing them into the targets system. Another hybrid scenario is when the source system pushes the data to an intermediary repository, the target system(s) pulling the data on a need basis. The repository can reside on the source, target on in-between. A variation of it is when the source informs the target that a change happened at it’s up to the target to decide whether it needs the data or not.

The main differentiators between the various methods is the timeliness, completeness and consistency of the data. Timeliness refers to the urgency with which data need to be available in the target system(s), completeness refers to the degree to which the data are ready to be sent, while consistency refers to the degree the data from the source are consistent with the data from the target systems.

Based on their characteristics integrations seem to favor push methods while data migrations and data warehousing the pull methods, though which method suits the best depends entirely on the business needs under consideration.

No comments:

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.