Showing posts with label DTS.

02 March 2024

🧭Business Intelligence: Microsoft Releases for the BI Technology Stack (Timeline)

Business Intelligence
Business Intelligence Series

Some years back I started putting together a timeline of the most important events in the BI technology stack (work in progress):

2023: Microsoft announces Microsoft Fabric (>>)

  • Synapse Data Warehouse is the next generation of data warehousing in Microsoft Fabric, with native support for Delta Lake.
  • Data Engineering & Data Science workloads with support for lakehouses, notebooks, Spark Job definitions, models and experiments.
  • Real-Time Analytics is a robust platform tailored to deliver real-time data insights and observability analytics capabilities for a wide range of data types.
  • OneLake provides a single unified storage location for all your data analytics needs.

2022: Microsoft releases SQL Server 2022 (>>)

  • Synapse Link for SQL Server 2022 allows seamless replication of operational data in near real time to enable more powerful analytics.
  • Purview is a unified data governance and management service.

2019: Microsoft launches Azure Synapse Analytics service (formerly SQL Data Warehouse), a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics. (>>)

2019: Microsoft releases SQL Server 2019 (>>)

  • Big Data Clusters add-in for SQL Server allows deploying scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes (feature to be retired).

2018: Microsoft extends Power Query with ETL capabilities. (>>)

2018: Microsoft releases Azure Data Studio, a data management tool that enables working with SQL Server, Azure SQL DB and SQL DW from Windows, macOS and Linux. (>>)

2017: Microsoft releases Power BI Report Server, an on-premises server that enables Power BI Pro users to publish Power BI reports and distribute them broadly across the enterprise, without requiring report consumers to be licensed individually per user. (>>)

2017: Microsoft releases an updated SQL Server Data Tools (SSDT), which uses Power Query to import and prepare data in SSAS/AAS tabular models.

2017: Microsoft releases SQL Server 2017. (>>)

  • SSRS is no longer available to install through SQL Server setup.
  • Python support added, R Services renamed to Machine Learning Services. (>>)

2016: Microsoft releases SQL Server 2016 (What's new, >>)

  • Query Store allows monitoring and troubleshooting performance issues.
  • SQL Server R Services integrate the R programming language into SQL Server.
  • DirectQuery for SSAS.
  • PolyBase for querying the data stored in HDFS. (>>)
  • Support for HDFS in SSIS.
  • Azure SQL Data Warehouse is GA. (>>)
  • Modern reports with SSRS. (>>)
  • Real-Time Operational Analytics. (>>)

2016: SQL Server 2014 Developer Edition becomes free. (>>)

2015: Microsoft announces elastic databases SQL Data Warehouse & Azure Data Lake. (>>)

  • Elastic databases allow building SaaS applications that manage large numbers of databases with unpredictable resource demands.
  • Azure SQL Data Warehouse is an elastic data warehouse in the cloud that can dynamically grow, shrink and pause compute in seconds, independent of storage.
  • Azure Data Lake is a hyper-scale data store for big data analytic workloads.

2015: Microsoft releases Power BI to the general public.

  • Power BI Designer renamed to Power BI Desktop.
2015: Microsoft releases several Azure services:
  • SQL Server Cloud database.
  • Azure Data Factory (ADF), a fully managed service that orchestrates data and processing services into managed data pipelines for information production. (>>)
  • Azure Stream Analytics, a fully managed stream processing engine that is designed to analyze and process large volumes of streaming data with sub-millisecond latencies. (>>)

2014: Microsoft releases Power BI Designer, unifying Power Query, Power Pivot & Power View.

2013: Microsoft announces Power BI for Office 365. (>>)

2012: Microsoft releases SQL Server 2012. (>>)

  • BI Semantic Model for SSAS provides a single, scalable model for BI applications.
  • Parallel Data Warehouse with PolyBase capabilities. 
  • In-memory capabilities. (>>)
  • Windows Azure SQL Reporting service becomes available. (>>)
  • SQL Server Data Tools unifies SQL Server and cloud SQL Azure development for both professional database and application developers.

2010: Microsoft releases:

  • Power Pivot as part of SQL Server 2008 R2.
  • Azure SQL Database.

2010: Microsoft releases SQL Server 2008 R2.

  • Master Data Services.
  • Power Pivot & Self-service BI capabilities in SSAS.

2008: Microsoft releases SQL Server 2008 (>>)

  • Table compression.
  • Change Data Capture (CDC).

2005: Microsoft releases SQL Server 2005

  • a greatly enhanced version of Analysis Services.
  • SQL Server Integration Services to replace DTS.

2004: Microsoft releases SQL Server Reporting Services (SSRS) as an add-on to SQL Server 2000.

2000: Microsoft releases SQL Server Analysis Services (SSAS) with SQL Server 2000.

1998: Microsoft releases SQL Server 7.

  • OLAP services & first MDX specifications.
  • Data Transformation Services (DTS) for ETL workloads.

24 May 2020

🧊🎡☯Data Warehousing: SQL Server Integration Services (The Good, the Bad and the Ugly)

Data Warehousing

Microsoft SQL Server Integration Services (SSIS) is a platform for building (enterprise-level) data integration and data transformation solutions by using a rich set of built-in tasks and transformations, graphical tools for building packages, as well as a catalog for storing them. Formerly called Data Transformation Services (DTS), the platform was introduced with SQL Server 7.0 and rebranded as SSIS with SQL Server 2005.

The Good: Since its introduction it was adopted by DBAs and (database) programmers because it allowed importing and exporting data on the fly from and to SQL Server, flat files and other relational data sources, in fact any resource exposing an ODBC or OLEDB driver. The extract/load functionality was extended by a basic set of transformations, making DTS the ideal ETL tool for data warehousing and integrations. The data from multiple sources and targets could be processed in parallel or sequentially, the ETL logic being encapsulated in one or more packages that could be run manually or scheduled flexibly via the SQL Server Agent.

With SQL Server 2005 and later versions the SSIS framework was extended to support further data sources, including XML, CAML-based SharePoint lists, OData, Hadoop or Azure Blob storage. It allowed the storage of packages on the local file system or within the built-in catalog.

One could thus develop rich ETL functionality without writing a single line of code. In theory the packages could be run and modified also by non-IT users, which can be a plus in certain scenarios. On the other side, one could build custom packages programmatically from the beginning and thus extend the available data processing logic as seen fit, being able to use existing code and whole libraries embedded into the packages or called via DLLs.

The Bad: Despite the rich functionality, a data pipeline usually has lower performance and is more difficult to troubleshoot compared with the built-in RDBMS functionality for data processing. Most, if not all transformations can be handled more efficiently via SQL-based queries, as long as the data are available on the same SQL Server instance. In addition, SQL provides better code reuse, maintainability, chances for refactoring and scalability, and the solutions are easier to deploy. Therefore, one common practice comes down to using SSIS only for import/export, the further logic being encapsulated into stored procedures and other database objects (see the sketch below). This isn't necessarily bad, on the contrary, though specific expertise is then needed to modify the code.
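
A minimal sketch of this pattern, assuming hypothetical staging and production tables (the names and columns are illustrative, not from the original post): the DTS/SSIS package only dumps the source extract into the staging table, while the transformation logic lives in a stored procedure called after the load.

-- hypothetical staging-to-production procedure; the package only fills dbo.StagingInvoices
CREATE PROCEDURE dbo.pLoadInvoices
AS
BEGIN
    SET NOCOUNT ON;

    -- basic cleansing/transformation done in SQL, not in the package
    INSERT INTO dbo.Invoices (InvoiceNumber, InvoiceDate, Amount)
    SELECT LTRIM(RTRIM(s.InvoiceNumber))
         , CONVERT(datetime, s.InvoiceDate, 104)        -- assuming source dates come as dd.mm.yyyy
         , CAST(s.Amount AS decimal(18,2))
    FROM dbo.StagingInvoices s
    WHERE NOT EXISTS (SELECT 1                           -- skip invoices already loaded
                      FROM dbo.Invoices i
                      WHERE i.InvoiceNumber = LTRIM(RTRIM(s.InvoiceNumber)));

    TRUNCATE TABLE dbo.StagingInvoices;                  -- clear the staging area for the next run
END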

The Ugly: SSIS is in general suitable for data warehousing and integration solutions whose logic is ideally stable and well-defined. Therefore, SSIS is less suitable for ERP data migrations or similar tasks which, at least at the beginning, have an exploratory nature and an overwhelming complexity, multiple iterations being needed before the requirements are fully identified and understood. In extremis each iteration can involve a redesign, which can prove to be time-consuming. One could in theory attempt first to understand all the data, though this could mean starting the development late in the process, while the data for testing are required much earlier. One can still use SSIS for specific tasks, though implementing a whole solution could imply certain challenges that otherwise could have been avoided.

SSIS is also not suitable for complex real-time data integrations which require processing a considerable amount of data, where specific architectures like SOA, RESTful calls or other solutions could be more efficient. When not adequately implemented, a data integration can lead to more problems than it solves. The best example is the increase in execution time with the volume of data, which can easily lead to time-outs and data locking.

25 March 2010

🧊Data Warehousing: Mea Culpa (Part I: A Personal Journey)

Data Warehousing


Any discussion on data warehousing topics, even unconventional ones, can't avoid mentioning the two most widely adopted approaches in data warehousing, B. Inmon's vs. R. Kimball's methodologies. A lot of ink has already been spent on this topic and it's difficult to come up with something new, however I can insert in between my experience and personal views. From the beginning I have to state that I can't take either side because, from a philosophical viewpoint, I am an adherent of "the middle way" and, in addition, when choosing a methodology we have to consider the business' requirements and objectives, the infrastructure, the experience of the resources, and many other factors. I don't believe one method fits all purposes, therefore some flexibility is needed in this regard even from the most virulent advocates. After all, what counts in the end is the degree to which the final solution fits the purpose, and no matter how complex and refined a methodology is, no matter the precautions taken, given the complexity of software development projects there is always a risk of failure.

  

B. Inmon defines the data warehouse as a "subject-oriented, integrated, non-volatile and time-varying collection of data in support of the management's decisions" [3] - subject-oriented because it is focused on an organization's strategic subject areas, integrated because the data come from multiple legacy systems in order to provide a single overview, time-variant because the data warehouse's content is time dependent, and non-volatile because in theory the data warehouse's content is not updated but refreshed.


Within my small library and the internet articles I read on this topic, especially the ones from the Kimball University cycle, I can't say I found a similarly direct definition of the data warehouse given by R. Kimball; the closest I could get is the data warehouse as a union of data marts, where a data mart is "a process-oriented subset of the overall organization's data based on a foundation of atomic data, and that depends only on the physics of the data-measurement events, not on the anticipated user's questions" [2]. This also reflects an important difference between the two approaches: in Inmon's philosophy the data marts are updated through the data warehouse, the data in the warehouse being stored in third normal form, while the data marts are multidimensional and thus denormalized.


Even if it's a nice conceptual tool intended to simplify data manipulation, I can't say I'm a big fan of dimensional modeling, mainly because it can easily be misused to create awful (inflexible) monster models that can barely be used, sometimes being impossible to work around them without redesigning them. Relational models can be misused just as easily, though they are less complex as physical design, easier to model, and they offer greater flexibility, even if in theory data normalization can add further complexity. There is always a trade-off between flexibility, complexity, performance, scalability, usability and reusability, to mention just a few of the dimensions associated with data in general and data quality in particular.

  

In order to overcome dimensional modeling issues, R. Kimball recommends a four-step approach - first identifying the business processes corresponding to a business measurement or event, secondly declaring the grain (level of detail), and only after that defining the dimensions and facts [1]. I have to admit that starting from the business process is a plus of this framework because in theory it allows better visibility over the processes, supporting process-based data analysis, though given that a process can span multiple data elements, or that multiple processes can partition the same data elements, this increases the complexity of such models. I find that a model based directly on the data elements allows more flexibility, to the detriment of the work needed to bring the data together, though it should also cover the processes in scope.

  

Building a data warehouse is quite a complex task, especially if we take into consideration the huge percentage of software project failures, which holds also in the data warehousing area. On the other side, I'm not sure how much such statistics about software project failures can be taken ad litteram, because different project methodologies and data collection methods are used and detailed information about the particularities of each project is not always given; it would be interesting, however, to know the failure rate per methodology. Occasionally numbers are advanced to sustain the benefit of using one or another methodology, and, ignoring the subjective nature of such justifications, they often lack adequate details to support them.


My first contact with building a data warehouse was almost 8 years ago when, as part of the Asset Management System I was supposed to work on, the project also included the creation of a data warehouse. Frankly, few things are more frightening than seeing two IT professionals fighting over which approach to use in order to design a data warehouse, and needless to say the fight lasted several days: calls with the customer, nerves, management involvement, a whole arsenal of negotiations that looked like a never-ending story.


Such fights are sometimes part of the landscape and they should be avoided, the simplest alternative being to put together the advantages and disadvantages of the most important approaches and balance between them; unfortunately there are still professionals who don't know how, or are not willing, to do that. The main problem in such cases is the time which, instead of being used constructively, is wasted on futile fights. When a lot of time is wasted and a tight schedule applies, one is forced to do the whole work in less time, leading maybe to sloppy solutions.

  

A few years back I had the occasion to develop a data warehouse around the two ERP systems and the other smaller systems one of the customers I worked for had in place, SQL Server 2000 and its DTS (Data Transformation Services) functionality being of great help for this purpose. Even if I had some basic knowledge of the two data warehousing approaches, I had to build the initial data warehouse from scratch, evolving the solution over several years.


The design was quite simple: the DTS packages extracted the data from the legacy systems and dumped them into staging tables in normalized or denormalized form, loading the data after several simple transformations into the production tables, the role of the multidimensional data marts being played successfully by views that scaled pretty well to the existing demands. Many data warehouse developers would perhaps disregard such a solution, though it was quite a useful exercise and helped me later to understand more easily the literature on this topic and the issues related to it. In addition, while working on the data conversion of two ERP implementations I had to perform more complex ETL (Extract Transform Load) tasks than the ones considered in the data warehouse itself.


In what concerns software development I am an adherent of rapid evolutionary prototyping because it allows getting customers' feedback in early stages, making it possible to identify issues earlier as per customers' perceptions, in addition allowing customers to get a feeling of what's possible and of what the application looks like. The prototyping method proved to be useful most of the time, I would actually say all the time, and often it was interesting to see how customers' conceptualization of what they need changed with time, changes that looked simple leading to partial redesigns of the application. In other development approaches with long releases (e.g. waterfall) the customer gets a glimpse of the application late in the process, often making a redesign impossible, so the customer has to live with what he got. Call me "old-fashioned", but I am an adherent of rapid evolutionary prototyping also in what concerns the creation of data warehouses, and even if people might argue that a data warehousing project is totally different from a typical development project, it should not be forgotten that almost all software development projects share many particularities, from planning to deployment and further to maintenance.


Even if B. Inmon also embraces the evolutionary/iterative approach to building a data warehouse, from a philosophical standpoint I feel that rapid evolutionary prototyping applied to data warehouses is closer to R. Kimball's methodology, which comes down to choosing a key functional area and its essential business processes, building a data mart, and from there building further data marts for the other key functional areas, eventually integrating and aligning them in a common solution - the data warehouse. On the other side, when designing a software component or a module of an application you also have to consider the final goal, as the respective component or module will be part of a broader system, even if in some cases it could exist in isolation. The same can be said about the creation of data marts: even if sometimes a data mart is rooted in the needs of a department, you have to look also at the final goal and address the requirements from that perspective, or at least be aware of them.




References:

[1] Ross M. & Kimball R. (2004) "Fables and Facts: Do you know the difference between dimensional modeling truth and fiction?" [Online] Available from: http://intelligent-enterprise.informationweek.com/info_centers/data_warehousing/showArticle.jhtml;jsessionid=530A0V30XJXTDQE1GHPSKH4ATMY32JVN?articleID=49400912 (Accessed: 18 March 2010)

[2] Kimball R. & Caserta J. (2004) The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley Publishing Inc. ISBN: 0-7645-7923-1

[3] Inmon W.H. (2005) Building the Data Warehouse, 4th Ed. Wiley Publishing. ISBN: 978-0-7645-9944-6

23 March 2010

🕋Data Warehousing: Data Transformation (Definitions)

"A set of operations applied to source data before it can be stored in the destination, using Data Transformation Services (DTS). For example, DTS allows new values to be calculated from one or more source columns, or a single column to be broken into multiple values to be stored in separate destination columns. Data transformation is often associated with the process of copying data into a data warehouse." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of reformatting data based on predefined rules. Most often identified as part of ETL (extraction, transformation, and loading) but not exclusive to ETL, transformation can occur on the CDI hub, which uses one of several methods to transform the data from the source systems before matching it." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"Any change to the data, such as during parsing and standardization." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A process by which the format of data is changed so it can be used by different applications." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"Converting data from one format to another|making the data reflect the needs of the target application. Used in almost any data initiative, for instance, a data service or an ETL (extract, transform, load) process." (Tony Fisher, "The Data Asset", 2009)

"Changing the format, structure, integrity, and/or definitions of data from the source database to comply with the requirements of a target database." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The SSIS data flow component that modifies, summarizes, and cleans data." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Data transformation is the process of making the selected data compatible with the structure of the target database. Examples include: format changes, structure changes, semantic or context changes, deduping, and reordering." (Piethein Strengholt, "Data Management at Scale", 2020)

"1. In data warehousing, the process of changing data extracted from source data systems into arrangements and formats consistent with the schema of the data warehouse. 2. In Integration Services, a data flow component that aggregates, merges, distributes, and modifies column data and rowsets." (Microsoft Technet)

19 February 2010

🎡SSIS: Visual Studio 2010 RC doesn't support SSIS Projects

I've installed Visual Studio 2010 RC (Release Candidate) today without any problems - nice look and feel, new types of projects targeting several new Microsoft solutions (SharePoint 2010, MS Office 2010, Windows Azure, ASP.NET MVC 2, Silverlight 3). Everything looked nice until I wanted to create an SSIS project - unfortunately there is no support for SSIS 2005/2008 within VS 2010. It seems I'm not the only person looking for that functionality, several discussion forums already approaching this topic on MSDN, SQL Server Developer Center or Microsoft Connect.

The next version of SQL Server is not 2010, as many would expect; instead Microsoft is working on SQL Server 2008 R2, somehow avoiding synchronizing the SQL Server solutions with VS 2010. I'm not sure if this strategic decision was taken for the sake of profit, but there is something fishy about it.

So, if I want to buy VS 2010 and SQL Server 2008, then I also need to install VS 2008, having thus two IDEs on my system. They might work together without problems, though that means more space, more effort in managing the patches for the two IDEs, and I would expect synchronization differences between them, solutions migrated back and forth between frameworks, more issues to troubleshoot. Is this what Microsoft understands by "Integration"?! In the end Microsoft loses, because I would expect there will be customers that avoid moving to VS 2010 just because of the lack of support for SSIS 2005/2008. I wonder how much this lack of SSIS solution integration between VS 2010 and SQL Server 2008 will hurt customers?! Quoting an unofficial comment made on one of the forums, it seems that "there will be a separate add-on that will give limited functionality" [1] in working with SSIS, though is this the right approach?

Being used to the DTS package functionality available in SQL Server 2000, I found it somewhat awkward that Microsoft decided to separate SSIS from SQL Server Management Studio. On one side I understand that the complexity of SSIS projects requires an IDE much like the one provided by Visual Studio, though I'm not sure this approach was welcomed by developers used to SQL Server 2000. Of course, data export/import functionality is available using the 'SQL Server Import and Export Wizard', though it's not the same thing, and even if I find the 'Execute Package Utility' quite easy to use, I can't say I'm a fan of it. I wonder what Microsoft's plans for the future are…

It seems there are even more surprises from Microsoft that could come with SQL Server 2008 R2, for example "using SSIS package as a data source for SSRS report datasets" [2]; even if the respective feature is a non-production feature, what happens to the customers that already built their solutions on it?! Of course, Microsoft doesn't recommend the use of non-standard features as they might not be supported in later versions, but you know developers: why reinvent the wheel when there is already some functionality for that purpose!

I would expect that more such issues will be discovered once developers start to play with VS 2010 and the coming SQL Server 2008 R2.

References:
[1] Microsoft Connect. (2010). SSIS VS2010 project type. [Online] Available from: https://connect.microsoft.com/SQLServer/feedback/details/508552/ssis-vs2010-project-type (Accessed: 19 February 2010)
[2] Siddhumehta. (2010). SSIS Package not supported as a data source in SQL Server Reporting Services ( SSRS ) 2008 R2. [Online] Available from: http://siddhumehta.blogspot.com/2010/02/ssis-package-not-supported-as-data.html (Accessed: 19 February 2010)

23 January 2010

🧊🎡Data Warehousing: ETL (SSIS packages vs. SQL code)

Data Warehousing

I've been working with DTS (Data Transformation Services) packages for the past 7-8 years, finding SQL Server 2000's functionality pretty useful in handling ETL (Extract, Transform, Load) tasks, especially when importing and exporting data between multiple data sources. The functionality provided by the DTS platform was basic, though it could be extended in custom packages with code developed in VB or by using ActiveX tasks in VBScript/JScript, when the SQL-based logic was not enough. In addition, all three types of coding could make calls to .dlls, and thus operating system, vendor or in-house built libraries or simple functions could be (re)used.

Starting with SQL Server 2005, DTS was replaced by the SSIS (SQL Server Integration Services) component, becoming, together with SSRS (SQL Server Reporting Services) and SSAS (SQL Server Analysis Services), an integral part of Microsoft's data platform. Besides providing a new architecture, SSIS extended the functionality previously available, bringing more flexibility in constructing the packages, their elements and data manipulation. With each new version of SQL Server (2008, 2012, 2014, 2016, 2017, 2019), SSIS became a powerful ETL tool that can easily be used for Data Warehousing, Data Migration or Data Integration projects.

As far as ETL is concerned, we could say there are two main philosophies: have as much of the business logic as possible in the (ETL) package, or use the package mainly for loading data from the various sources and keep the business logic in the database as SQL-based code. As always, each of the two philosophies has its own advantages and disadvantages, though I would consider a third philosophy - design for performance, reuse and maintainability - resulting thus in a hybrid of the first two.

There are several other factors that need to be considered when building ETL solutions: synchronization, testability, security, stability, scalability, complexity, learning curve, etc. I would say there is no perfect recipe, no architecture matching all requirements, each solution coming with its own needs and constraints; sometimes it's a good idea to go one level of abstraction above the requirements, while other times it's better to stick to the requirements and the problem at hand.

Performance

Designing for performance comes down to choosing the architecture/methods that provide the best performance individually and as a whole - either using package-based functionality, SQL-based functionality or a combination of both. In general, SQL code is best suited for set-based, query-based data manipulation, while packages are better suited for sequential or workflow-based processing of data, though there can be exceptions. Often it's a good idea to test the performance of the alternative approaches via a prototype, even if in time the developer arrives at a good knowledge of the methods that suit best from a performance point of view.
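
As a small illustration of why set-based SQL often wins for query-based manipulation, the following sketch (with a hypothetical dbo.InvoiceLines table, not from the original post) contrasts a single set-based UPDATE with the equivalent row-by-row cursor loop that a per-row pipeline would resemble:

-- set-based version: one statement, processed by the engine as a whole
UPDATE dbo.InvoiceLines
SET LineAmount = Quantity * UnitPrice
WHERE LineAmount IS NULL;

-- row-by-row version: same result, usually much slower on large volumes
DECLARE @ID int;
DECLARE LineCursor CURSOR FOR
    SELECT InvoiceLineID FROM dbo.InvoiceLines WHERE LineAmount IS NULL;
OPEN LineCursor;
FETCH NEXT FROM LineCursor INTO @ID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.InvoiceLines
    SET LineAmount = Quantity * UnitPrice
    WHERE InvoiceLineID = @ID;

    FETCH NEXT FROM LineCursor INTO @ID;
END
CLOSE LineCursor;
DEALLOCATE LineCursor;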

Reuse

Each of the two architectures allows a lower or higher degree of reuse using parameters, variables and the compartmenting of code, maybe with a plus for SQL code, which has in theory a greater maturity and flexibility than the package-based functionality, allowing a wider range of reuse resulting from the compartmenting of code in the various supported objects (stored procedures, functions, views or tables).
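
As a minimal sketch of such compartmenting (the view and table names are hypothetical, used only for illustration), a piece of filtering logic placed in a view can be reused by several packages, reports or procedures instead of being duplicated:

-- reusable filtering logic encapsulated in a view
CREATE VIEW dbo.vActiveCustomers
AS
SELECT CustomerID
     , CustomerName
     , CountryCode
FROM dbo.Customers
WHERE IsActive = 1;
GO

-- any consumer can then build on the same definition
SELECT CountryCode
     , COUNT(*) AS ActiveCustomers
FROM dbo.vActiveCustomers
GROUP BY CountryCode;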

Maintainability

Maintainability, the ease of modifying packages or code, is an important factor because few are the cases in which the logic doesn't change over time, many projects having to deal with changes in requirements, sometimes implying a 180-degree change of the overall perspective. Therefore it's important to be able to modify the code/package and even change the architecture with a minimum of effort.

Refactoring

Refactoring comes down to modifying the code without changing its functional behavior in order to improve performance, readability, maintainability or extensibility, minimize the use of resources, remove redundant code, and adhere to standards, best practices or naming conventions. It is said that there is always room for improvement, performance being the dimension with the most important impact in the world of databases. Refactoring is not always necessary, though it's a good practice for achieving high-quality solutions.

Synchronization

I would say there are three types of synchronization - of scope, business logic and data - the first targeting to synchronize the filters used to select the data, the second to synchronize the logic used in data processing, when it is the case, and the third to work with the same unaltered copy of the data. Considering the example of Invoices - Headers and Lines - the synchronization of scope would come down to applying the same constraints to the two, assuring that there is no Invoice Header without Lines, and vice versa; the data synchronization between the two refers to the fact that the data between the two data sets should be consistent, with no change in Invoice Headers not reflected also at Line level (e.g. total amount matching between Lines and Header). Business logic synchronization refers typically to the use of the same set of data for similar purposes: if several transformations were applied to Invoice Headers, they should be reflected accordingly also at Invoice Line level (a few checks along these lines are sketched below). Synchronization is actually quite an important topic, therefore I will reconsider it in a further post.
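
A minimal sketch of such consistency checks, assuming hypothetical dbo.InvoiceHeaders and dbo.InvoiceLines tables (names and columns are illustrative only):

-- headers without any lines (scope synchronization)
SELECT h.InvoiceID
FROM dbo.InvoiceHeaders h
     LEFT JOIN dbo.InvoiceLines l
       ON h.InvoiceID = l.InvoiceID
WHERE l.InvoiceID IS NULL;

-- headers whose total doesn't match the sum of their lines (data synchronization)
SELECT h.InvoiceID
     , h.TotalAmount
     , SUM(l.LineAmount) AS LinesAmount
FROM dbo.InvoiceHeaders h
     JOIN dbo.InvoiceLines l
       ON h.InvoiceID = l.InvoiceID
GROUP BY h.InvoiceID
       , h.TotalAmount
HAVING h.TotalAmount <> SUM(l.LineAmount);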

Testing & Troubleshooting

I find that the business logic implemented in SQL code is much easier to test than the logic implemented in packages, because in the first situation each object and step can in theory be tested individually and progressively, being thus much easier to troubleshoot. Troubleshooting package logic can be quite complex because it is not always possible to view the input/output of each intermediary step.

Complexity

Complexity is reflected in the ease of understanding the logic broken down into pieces and as a whole. Packages are highly visual, making it in theory easier to identify and understand the steps and their flow, even for non-technical people, while SQL code might need auxiliary representations for the same purpose (e.g. a data flow diagram) and a higher level of expertise.

Security

Security is always a sensitive and complex topic, and in general it comes down to how securely the code and sensitive information (e.g. user name & password, data) are stored, who has access to execute it, and the context in which the code is run. This can easily become quite a complex topic, being highly dependent on the architecture used.

Stability

We can discuss platform and design stability, which is often a matter of perception and experience. Both SSIS and the database engines can be considered stable development environments, the latter having in theory a greater maturity and flexibility, flexibility that can easily be taken to the extreme, creating bad coding monsters (e.g. looped calls to .dll libraries) and impacting thus the design's stability, which is correlated to the adequate use of functionality, techniques and resources - each technology has its dos and don'ts, strengths and weaknesses.

Deployment
 
Deployment of the business logic on the server can be quite easy or quite complex, depending on the overall architecture, the number of configurable items in scope, and the complexity of the dependencies existing between them. The deployment usually comes down to copying the code from one location to another, installing the eventual dependencies and configuring the objects for use.

Scalability

Scalability in this context refers mainly to the degree to which the business logic can cope with an increased volume of records, and not necessarily with the number of requests, though this aspect could be considered too, case by case. SSIS and the database engines are designed to be highly scalable, though there are good and bad architectures, good uses and misuses of techniques. Designing for performance in theory equates with good scalability, unless the requirements make it difficult to have a scalable solution.

Learning curve

The learning curve of technologies is always an important factor that needs to be considered in development, as it reflects how much time an average developer needs to master the basic/average/complex functionality provided by the respective technology. ETL development in general requires an average knowledge of both the SSIS architecture and SQL-based programming, though it's not easy to acquire both problem-solving mindsets. An SSIS developer usually attempts to use as much as possible of the functionality provided by SSIS, while a SQL developer relies on the SQL-based functionality. In the end, it's important to know how to balance between the two.

19 November 2005

💠🛠️SQL Server: Administration (Part I: Troubleshooting Problems in SQL Server 2000 DTS Packages)

There are cases in which a problem cannot be solved with a single query, additional data processing being required. In such cases it can be useful to use a temporary table or the table data type. Once the logic is built, you may want to export the data using a simple DTS package having the built query as Source.

While playing with SQL Server 2005 (Beta 2) I observed that the DTS packages have problems in handling temporary tables as well as table data types.

1. Temporary table Sample:

CREATE PROCEDURE dbo.pTemporaryTable
AS
CREATE TABLE #Temp(ID int, Country varchar(10)) -- create a temporary table

-- insert a few records
INSERT #Temp VALUES (1, 'US')
INSERT #Temp VALUES (2, 'UK')
INSERT #Temp VALUES (3, 'Germany')
INSERT #Temp VALUES (4, 'France')   

SELECT * FROM #Temp -- select records

DROP TABLE #Temp -- drop the temporary table
 
-- testing the stored procedure
EXEC dbo.pTemporaryTable
Output:
ID Country
----------- ----------
1 US
2 UK
3 Germany
4 France
(4 row(s) affected)

Trying to use the stored procedure as Source in a DTS package brought the following error message: Error Source: Microsoft OLE DB Provider for SQL Server Error Description: Invalid object name ‘#Temp’   

I hoped this had been fixed in the SQL Server 2005 (Beta 3) version, but when I tried to use the stored procedure as a Source I got the error:
Error at Data Flow Task [OLE DB Source [1]]: An OLE DB error has occurred. Error code: 0x80004005 An OLE DB record is available. Source: "Microsoft OLE DB Provider for SQL Server" Hresult: 0x80004005 Description: "Invalid object name '#Temp'."
Error at Data Flow Task [OLE DB Source [1]]: Unable to retrieve column information from the data source. Make sure your target table in the database is available. ADDITIONAL INFORMATION: Exception from HRESULT: 0xC020204A (Microsoft.SqlServer.DTSPipelineWrap)

2. Table data type sample:

CREATE PROCEDURE dbo.pTableDataType
AS
DECLARE @Temp TABLE (ID int, Country varchar(10)) -- create table

-- insert a few records
INSERT @Temp VALUES (1, 'US')
INSERT @Temp VALUES (2, 'UK')
INSERT @Temp VALUES (3, 'Germany')
INSERT @Temp VALUES (4, 'France')   

SELECT * FROM @Temp -- select records

-- testing the stored procedure
EXEC dbo.pTableDataType

Output:
ID Country
----------- ----------
1 US
2 UK
3 Germany
4 France
(4 row(s) affected)    

This time the error didn't appear when the Source was provided, but when the package was run: Error message: Invalid Pointer    

In SQL Server 2005 (Beta 3) it raised a similar error:
SSIS package "Package6.dtsx" starting. Information: 0x4004300A at Data Flow Task, DTS.Pipeline: Validation phase is beginning Information: 0x4004300A at Data Flow Task, DTS.Pipeline: Validation phase is beginning Information: 0x40043006 at Data Flow Task, DTS.Pipeline: Prepare for Execute phase is beginning Information: 0x40043007 at Data Flow Task, DTS.Pipeline: Pre-Execute phase is beginning SSIS package "Package6.dtsx" finished: Canceled.    

I could see there would be problems because no data were returned in the Flat File Connection Manager's Preview. I hope these problems have been solved in the latest SQL Server release; if not, we are in trouble!

Possible Solutions:
If the stored procedures provided above are run in SQL Query Analyzer or using ADO, they will work without problems. This behavior is frustrating, especially when the logic is really complicated and you put a lot of effort into it; however there are a few possible solutions, each involving affordable drawbacks:
1. Create a physical table manually or at runtime, save the data in it and remove the table afterwards (see the sketch after this list). This will not work for concurrent use (it might work if the name of the table is time-stamped), but for ad-hoc reports it could be acceptable.
2. Build an ADO-based component which exports the data to a file in the format you want. The component is not difficult to implement, but additional coding might be required for each destination type. The component can be run from a UI or from a DTS package.
3. In SQL Server 2005 (code-named Yukon) it is possible to use the CLR and implement the logic in a managed stored procedure or table-valued function.
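
A minimal sketch of the first workaround, rewriting the sample procedure to use a permanent staging table instead of a temporary one (the table name dbo.StagingCountries is illustrative; a time-stamped name could be generated dynamically for concurrent use):

CREATE PROCEDURE dbo.pPhysicalStagingTable
AS
-- recreate the physical staging table
IF OBJECT_ID('dbo.StagingCountries') IS NOT NULL
   DROP TABLE dbo.StagingCountries

CREATE TABLE dbo.StagingCountries(ID int, Country varchar(10))

-- insert a few records
INSERT dbo.StagingCountries VALUES (1, 'US')
INSERT dbo.StagingCountries VALUES (2, 'UK')
INSERT dbo.StagingCountries VALUES (3, 'Germany')
INSERT dbo.StagingCountries VALUES (4, 'France')

-- the DTS package can now read from the physical table without metadata issues
SELECT * FROM dbo.StagingCountries

The table can then be dropped afterwards by the package or by a cleanup job.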

Happy Coding!

14 November 2005

💎SQL Reloaded: Out of Space

Some time ago I found a post in which somebody was annoyed at having to write code like this:

DECLARE @zipcode char(5)

SET @zipcode = '90'

SELECT CASE len(@zipcode)
 WHEN 1 THEN '0000' + @zipcode
 WHEN 2 THEN '000' + @zipcode
 WHEN 3 THEN '00' + @zipcode
 WHEN 4 THEN '0' + @zipcode
 WHEN 5 THEN @zipcode
END

I remembered meeting the same situation, but instead of writing a line of code for each case, I preferred to write a simple function, which uses the Space and Replace built-in functions to first add a number of spaces and then replace them with a defined character. Maybe somebody will argue about the performance of such a function, but in many cases I prefer reusability over performance, when the difference is not that big. Additionally, this enhances the code's readability.

So, here is the function:

CREATE FUNCTION dbo.FillBefore(
   @target varchar(50)
 , @length int
 , @filler char(1))
/*
   Purpose: enlarges a string to a predefined length by putting the same character in front of it
   Parameters: @target varchar(50) - the string to be transformed
  , @length int - the length of the output string
  , @filler char(1) - the character which will be used as filler
   Notes: the function works if @length is greater than or equal to the length of @target, otherwise it will return NULL
*/
RETURNS varchar(50)
AS
BEGIN
   RETURN Replace(Space(@length-len(IsNull(@target, ''))), ' ', @filler) 
       + IsNull(@target, '')
END

-- testing the function
SELECT dbo.FillBefore('1234', 5, '0')
SELECT dbo.FillBefore('1234', 10, ' ')
SELECT dbo.FillBefore(1234, 10, ' ')
SELECT dbo.FillBefore(Cast(1234 as varchar(10)), 10, '0')

Another scenario in which the Space function was useful was when I had to save the result of a query to a text file in fixed-width format. It comes down to the same logic used in the FillBefore function, except that the spaces must be added at the end of the string. For this I just needed to change the order of concatenation and create a new function:

CREATE FUNCTION dbo.FillAfter(
   @target varchar(50)
 , @length int
 , @filler char(1))
/*
   Purpose: enlarges a string to a predefined length by adding the same character at the end
   Parameters: @target varchar(50) - the string to be transformed
  , @length int - the length of the output string
  , @filler char(1) - the character which will be used as filler
   Notes: the function works if @length is greater than or equal to the length of @target, otherwise it will return NULL
*/
RETURNS varchar(50)
AS
BEGIN
   RETURN IsNull(@target, '') 
        + Replace(Space(@length-len(IsNull(@target,''))), ' ', @filler)
END

  -- testing the function 
  SELECT dbo.FillAfter('1234', 5, '0')
  SELECT dbo.FillAfter('1234', 10, ' ')
  SELECT dbo.FillAfter(1234, 10, '0')
  SELECT dbo.FillAfter(Cast(1234 as varchar(10)), 10, '0')
  SELECT dbo.FillAfter(NULL, 10, '0')

Notes:
1. In SQL Server 2005, the output of a query can be saved with a DTS package directly to a text file in fixed format, using either 'fixed width' (the columns are defined by fixed widths) or 'ragged right' (the columns are defined by fixed widths, except the last one, which is delimited by the new line character) as the format of the destination file.
2. To document or not to document? I know that many times we don't have the time to document the database objects we create, but I found that a simple comment saved more time later, whether it was my own or another developer's time.
3. If it is really needed to use the CASE function, it is good to do an analysis of the data and put as the first evaluation the case with the highest probability to occur, as the second evaluation the case with the second highest probability, and so on. The reason for this is that the options are evaluated until a comparison evaluates to TRUE. For example, if the zip codes consist in most cases of 5 characters, then it makes sense to put that case in the first WHEN position.
