Showing posts with label packages. Show all posts
Showing posts with label packages. Show all posts

27 November 2020

Data Warehousing: ETL - An Introduction

 


ETL (Extract, Transform, Load) processes, technologies or tools are about extracting data from one or more data sources via a set of queries, performing changes on the data via conversions, aggregations, mappings or other types of transformations, respectively loading the data into target tables or other type of repositories. Thus, an ETL process allows moving and transforming data between predefined data structures on an ad-hoc basis or as part of stable repetitive processes, which makes ETL ideal for data warehousing, data integrations, data migrations or similar scenarios. 

ETL Data Flow

Extract: The extraction of data is done typically based on SQL queries from relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files, though modern ETL tools can support other type of queries (CAML, XQuery, DAX) or even NoSQL architectures (Handoop). This allows addressing a wide range of requirements, the complexity of the logic depending on the functionality provided by the query languages, respectively the extraction functionality available.  

Transform: The transformation logic can be implemented based on the functionality provided by the ETL tool, and can involve after case any combination of aggregates, conditional splits, merges, lookups, multicasts, pivoting/unpivoting, cleansing, data conversions, sampling, mapping or any other transformations that can be performed on an in-transit dataset. On the other side, quite often the same can be achieved with the help of SQL-based manipulations directly in the extraction logic or later in the process. SQL can prove to be occasionally faster and more flexible than the transformations provided by the ETL tool, however despite the overlaps, the two approaches can complement each other when used adequately. 

Load: The load is usually just a dump of the data into one or more final or intermediary tables with predefined structures. Unless the data don’t match the data type, format or further defined constraints, the load seldom involve further challenges as long the solution was designed adequately. 

Within the logical model, extract, transform and load can be considered as process by themselves. Within the object model provided by the ETL tool, they are considered in the mentioned sequence within a data flow, which within a set of workflow constraints defines how the data move through the pipeline – the sequence of processing steps considered. The basic unit of work is the data flow and the workflow it belongs to, unit that can be encapsulated in one container for easier management or simply convenience. Several containers can be linked within a workflow to create more complex behavior. 

The data flows and workflow constraints, together with the supporting connections and containers form an ETL package, the main unit of work for encapsulating and running ETL logic. ETL packages are scheduled and run as fit for the purpose.

With the right design, these building blocks allow enough flexibility in handling ad-hoc requests or of building complex solutions. This involves decisions on how to partition the ETL packages, respectively the data flows, in which order they should be run, where and in which sequence the data should be transformed, how to handle exceptions, how to build eventually intermediary data repositories, how to handles audit requirements, and so on. Each of these choices can prove to be important. 

The knowledge of the ETL architecture and functionality is quintessential in providing the right solution for the problem considered, however once the basics were understood the challenges typically reside in understanding the source and/or target structures, the logical and physical entities available, identify the way the data can be partitioned horizontally or vertically, respectively what type of transformations are required for moving the data, as required by the solution. 

Previous Post <<||>> Next Post

15 February 2010

Data Warehousing: Control Flow (Definitions)

"A component of a SQL Server Integration Services package that controls the flow of tasks within the package. The high-level workflow of a package. Often contains one or more data flow tasks." (Victor Isakov et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (70-444) Study Guide", 2007)

"Concept to define causal dependency between process steps to enforce a specific execution order." (Christoph Bussler, "B2B and EAI with Business Process Management", 2009)

"The flow of control defines a partial order relationship between the activities of a business process model, specifying in which temporal order they will be executed." (Cesare Pautasso, "Compiling Business Process Models into Executable Code", 2009)

"A group of connected control flow elements that perform tasks." (Microsoft, "SQL Server 2012 Glossary", 2012)

"An abstract representation of all possible sequences of events (paths) during execution of a component or system. Often represented in graphical form, see control flow graph." (Tilo Linz et al, "Software Testing Foundations" 4th Ed., 2014)

"The control-flow perspective focuses on the control-flow, i.e., the ordering of activities. The goal of mining this perspective is to find a good characterization of all possible paths. Other popular perspectives are the organizational perspective and the case perspective." (Pavlos Delias et al, "Applying Process Mining to the Emergency Department", Encyclopedia of Business Analytics and Optimization, 2014)

"In DB2 data warehousing, a graphical model that sequences data flows and mining flows, integrates external commands, programs, and stored procedures, and provides conditional processing logic for a data warehouse application." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A sequence of events (paths) in the execution through a component or system." (ISTQB)

"The ordered workflow in an Integration Services package that performs tasks." (Microsoft Technet)

23 January 2010

Data Warehousing: ETL (SSIS packages vs. SQL code)

Data Warehousing

I’ve been working with DTS (Data Transformation Services) packages for the past 7-8 years, finding SQL Server’s 2000 functionality pretty useful in handling ETL (Extract, Transform, Load) tasks, especially when importing and exporting data between multiple data sources. The functionality provided by the DTS platform was basic, though it could be extended in custom packages with code developed in VB or by using ActiveX tasks in VBScript/Jscript, when the SQL-based logic was not enough. In addition, all three types of coding could make calls to .dlls, and thus operating system, vendor or in-house built libraries or simple functions could be (re)used. 

Starting with SQL Server 2005, DTS was replaced by the SSIS (SQL Server Integration Services) component, becoming with SSRS (SQL Reporting Services) and SSAS (SQL Server Analysis Services) integrant part of Microsoft's data platform. Besides providing a new architecture, SSIS extended the functionality previously available, bringing more flexibility in constructing the packages, their elements and data manipulation. SSIS became with each new version of SQL Server (2008, 2012, 2014, 2016, 2017, 2019) a powerful ETL tool that can be easily used for Data Warehousing, Data Migration or Data Integration projects. 

In what concerns ETL we could say there are two main philosophies - have as much of the business logic in the (ETL) package, or use the package mainly for loading data from the various sources and have the business logic in the database as SQL-based code. As always, each of the two philosophies has its own advantages and disadvantages, though I would consider a third philosophy – design for performance, reuse and maintainability, resulting thus a hybrid between the first two mentioned philosophies. 

There are several other factors that need to be considered when building ETL solutions - synchronization, testability, security, stability, scalability, complexity, learning curve, etc. I would say that there is no perfect receipe, no architecture matching all requirements, each solution coming with its needs and constraints, sometimes being a good idea to go one level of abstraction above the requirements, while other times is better to stick to the requirements and problem at hand. 

Performance

Designing for performance resumes in choosing the architecture/methods that provides the best performance individually and as a whole – either using package-based functionality, SQL-based functionality or a combination of both. In general SQL code is best suited for query-based data manipulation, while packages are better suited for sequential or workflow-based processing of data, though there could be exceptions and exceptions. Often it’s a good idea to test the performance of all alternative approaches via a prototype, even if in time the developer arrives to a good knowledge of the methods that suit best from a performance point of view.

Reuse

Each of the two architectures allow a lower or higher degree of reuse using parameters, variables and compartmenting of code, maybe with a plus for SQL code which has in theory a greater maturity and flexibility than the package-based functionality, allowing a wider range of reuse resulted from the compartmenting of code in the various supported objects (stored procedures, functions, views or tables).

Maintainability

Maintainability, the easiness of modifying packages or code, is an important factor because few are the cases in which the logic doesn’t change over time, many projects having to deal with change in requirements, sometimes implying a 180 degrees change of the overall perspective. Therefore it's important to be able to modify the code/package and even change the architecture with a minimum of effort.

Refactoring

Refactoring resumes in modifying the code without changing the functional behavior in order to improve the performance, readability, maintainability or extensibility, minimize the use of resources, remove redundant code, adhere to standards, best practices or naming conventions. It is said that there is always place for improvement, performance being the dimension with the most important impact in the world of databases. Refactoring is not always necessary, though it’s a good practice for achieving high quality solutions.

Synchronization

I would say there are three types of synchronization – of scope, business logic and data, the first targeting to synchronize the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data. Considering the example of Invoices – Headers and Lines, the synchronization of scope would resume to apply the same constraints for the two, assuring that there is no Invoice Header without Lines, and vice-versa; the data synchronization between the two referring to the fact that the data between the two data sets should be consistent, there should be no change in Invoice Headers not reflected also at Line level (e.g. total amount matching between Lines and Header). Business logic synchronization refers typically to the use of the same set of data for similar purposes, if several transformations were used for Invoice Headers, they should be reflected accordingly also at Invoice Line level. Synchronization it’s actually quite an important topic, therefore I will reconsider it in a further post.

Testing & Troubleshooting

I find that the business logic implemented in SQL code is much easier to test than the logic implemented in packages, because in the first situation each object and step could be in theory tested individually/progressively, being thus much easier to troubleshoot. Troubleshooting packages logic can be quite complex because is not always possible to view the input/output for each intermediary step.

Complexity

Complexity is reflected in the easiness of understanding the logic broken down to pieces and as a whole. Packages are highly visual, being in theory easier to identify and understand the steps and their flow, even by non-technical people, while SQL-code might need auxiliary representations for the same purpose (e.g. data flow diagram) and need a higher level of expertize. 

Security

Security is always a sensitive and complex topic, and in general it resumes to how secure is the code and sensitive information stored (e.g. user name & password, data), who has access to execute it and the context in which the code is run. This can become easily quite a complex topic, being highly dependent on the architecture used.

Stability

We can discuss about platform and design stability, which can be often a matter of perception and experience. Both SSIS and database engines could be considered as stable development environments, the later having in theory a greater maturity and flexibility, flexibility that could be easily brought to extreme creating bad coding monsters (e.g. loop calls to .dll libraries), impacting thus design’s stability, which is correlated to the adequate use of functionality, techniques and resources – each technology has its does and don’ts, strength and weakness.

Deployment
 
Deployment of business logic on the server can be quite easy or quite complex, depending on the overall architecture, the number of configurable items in scope, and the complexity of the dependencies existing between them. The deployment usually resumes at copying the code from one location to another, installing the eventual dependencies and configuring the objects for use.

Scalability

Scalability in this context refers mainly to the degree the business logic can cope with the increased volume of records, and not necessarily with the number of requests, though this aspect could be considered too, after case. SSIS and the database engines are designed to be highly scalable, though there are architectures and architectures, good uses and misuses of techniques. Designing for performance in theory equates with good scalability, unless the requirements makes it difficult to have a scalable solution.

Learning curve

The learning curve of technologies is always an important factor that needs to be considered in development, as it reflects how much time an average developer needs to master the basic/average/complex functionality provided by the respective technology. For ETL development is in general requested an average knowledge of both - SSIS architecture, respectively SQL-based programming - though it’s not easy to acquire both problem-solving mindsets. A SSIS developer usually attempts using as much as possible the functionality provided by SSIS, while a SQL developer the SQL-based functionality. In the end, it’s important to know how to balance between the two.

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.