About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Saturday, January 23, 2010

ETL: SSIS packages vs. SQL code

    I’ve been working with DTS packages for the past 7-8 years, finding SQL Server’s 2000 functionality pretty useful , especially the import/export feature between multiple data sources. Of course, the functionality provided by packages was basic, though it could be extended by Custom Packages code developed in VB or by ActiveX tasks in VBScript/Jscript, when the SQL Scripting based logic was not enough; in plus all three types of coding could make calls to .dlls, lot of system, vendor or in-house built functionality could be thus (re)used. SSIS architecture coming with SQL Server 2005 and further enhanced in SQL Server 2008, extended the functionality of DTS packages, bringing more flexibility in constructing the packages, their elements and data manipulation, making from SQL Server a powerful ETL tool.

    In what concerns ETL we could say there are two main philosophies - have as much of the business logic in the package, or use a package mainly for loading data from the various sources and have the business logic in the database as SQL code. As always, each of the two philosophies has its own pluses and minuses, though I would consider a third philosophy – design for performance, reuse and maintainability, often resulting a hybrid between the first two mentioned philosophies. There are several other facts that need to be considered – maintainability, synchronization, testability, security, stability, scalability, complexity, learning curve, etc. I would say that there is no receipt/architecture matching all requirements, each requirement coming with its needs and constraints, sometimes being a good idea to go one level of abstraction above the requirements, while other times is better to stick to the requirements

Performance

    Designing for performance resumes in choosing the architecture/methods that provides the best performance individually and as a whole – either using package-based functionality, SQL-based functionality or a combination of both. In general SQL code is best suited for query-based data manipulation, while packages are better suited for sequence processing of data, though there could be exceptions and exceptions. Often it’s a good idea to test the performance of all alternative approaches, even if in time the developer arrives to a good knowledge of the methods that suit best from a performance point of view.

Reuse

    Each of the two architectures allow a lower or higher degree of reuse using parameters, variables and compartmenting of code, maybe with a plus for SQL code which has in theory a greater maturity and flexibility than the package-based functionality, allowing a wider range of reuse resulted from the compartmenting of code in the various supported objects (stored procedures, functions, views or tables).

Maintainability

    Maintainability, the easiness of modifying packages or code, is an important factor because few are the cases in which the logic doesn’t change over time, many projects having to deal with change in requirements, sometimes implying a 180 degrees change of the overall perspective. Therefore is important to be able to modify the code/package and even change the architecture with a minimum of effort.

Refactoring

    Refactoring resumes in modifying the code without changing the functional behavior in order to improve the performance, readability, maintainability or extensibility, minimize the use of resources, remove redundant code, adhere to standards, best practices or naming conventions. It is said that there is always place for improvement, performance being the dimension with the most important impact in the world of databases. Refactoring is not always necessary, though it’s a good practice for achieving high quality solutions.

Synchronization

    I would say there are three types of synchronization – of scope, business logic and data, the first targeting to synchronize the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data. Considering the example of Invoices – Headers and Lines, the synchronization of scope would resume to apply the same constraints for the two, assuring that there is no Invoice Header without Lines, and vice-versa; the data synchronization between the two referring to the fact that the data between the two data sets should be consistent, there should be no change in Invoice Headers not reflected also at Line level (e.g. total amount matching between Lines and Header). Business logic synchronization refers typically to the use of the same set of data for similar purposes, if several transformations were used for Invoice Headers, they should be reflected accordingly also at Invoice Line level. Synchronization it’s actually quite an important topic, therefore I will reconsider it in a further post.

Testing & Troubleshooting

    I find that the business logic implemented in SQL code it’s much easier to test than the logic implemented in packages, because in the first situation each object and step could be in theory tested individually/progressively, being thus much easier to troubleshoot. Troubleshooting packages logic can be quite complex because is not always possible to view the input/output for each intermediary step.

Complexity

    Complexity is reflected in the easiness of understanding the logic broken down to pieces and as a whole. Packages are highly visual, being in theory easier to identify and understand the steps and their flow, while SQL-code might need auxiliary representations for the same purpose (e.g. data flow diagram).

Security

    Security is always a sensitive and complex topic, and in general it resumes to how secure is the code and sensitive information stored (e.g. user name & password, data), who has access to execute it and the context in which the code is run. This can become easily quite a complex topic, being highly dependent on the architecture used.

Stability

    We can discuss about platform and design stability, which can be often a matter of perception and experience. Both SSIS and SQL Engine could be considered as stable development environments, the later having in theory a greater maturity and flexibility, flexibility that could be easily brought to extreme creating bad coding monsters (e.g. loop calls to .dll libraries), impacting thus design’s stability, which is correlated to the adequate use of functionality, techniques and resources – each technology has its does and don’ts, strength and weakness.

Deployment
    Deployment of business logic on the server can be quite easy or quite complex, depending on the overall architecture, the number of configurable items in scope, and the complexity of the dependencies existing between them. The deployment usually resumes at copying the code from one location to another, installing the eventual dependencies and configuring the objects for use.

Scalability

    Scalability in this context refers mainly to the degree the business logic can cope with the increased volume of records, and not necessarily with the number of requests, though this aspect could be considered too, after case. SSIS and SQL Server Engine are designed to be highly scalable, though there are architectures and architectures, good uses and misuses of techniques. Designing for performance in theory equates with good scalability, unless the requirements makes it difficult to have a scalable solution.

Learning curve

    Learning curve of technologies is always an important factor that needs to be considered in development, reflecting how much time the average developer needs to master the basic/average/complex functionality provided by the respective technology. For ETL development is in general requested an average knowledge of both architectures, though in general it’s not easy to acquire both problem solving mindsets, a SSIS developer usually attempting to use as much as possible the functionality provided by SSIS, while a SQL developer the SQL-based functionality. As I already highlighted above, it’s important to know how to balance between the two.

1 comment:

naresh m said...

Nice to read this article i am expecting many more articles from you. erptree