Showing posts with label ADO. Show all posts
Showing posts with label ADO. Show all posts

03 February 2021

Data Migrations (DM): Conceptualization V (Data Extraction Layer)

Data Migration

ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage legacy systems’ data access layers or own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box in the detriment of performance loss, and thus should be used with caution.

Besides that, ETL tools provide also rich visual functionality that allow users building complex pipelines with transformations that process the data as data go through the pipeline. Further features like data profiling or cleansing bring additional benefits.

As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope as filtering and other logic can be used in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches are usually recommended.

As alternative to an ETL tool is building own extraction layer based for example on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative especially when ETL capabilities aren’t needed. This depends also on the overall architectural approach. Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach especially when dealing with a considerable volume of data. Moreover, it would be needed to build features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).

When the volume of data exceeds the capabilities of ETL tools one can consider ELT tools which load first the data before applying any transformations on them. Such tools are designed for the processing of what is known as big data (data having high volume, high velocity, high variety and different veracity).

When considering the best data extraction approach, it’s important to know where the data will be stored for processing. Given that DMs are data processing intensive the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency, failover mechanism some databases offer the possibility to connect directly to other servers via server links functionality. Despite this latter feature an ETL tool can still have considerable advantages for data extraction.

On the other side the DM logic can be in theory built entirely in the ETL tool without storing the data within a database, though this adds a high overhead on the server resources on which the solution runs as all the data needed for processing need to be loaded in memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make challenging implementing any optimization techniques directly into the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and change as the requirements change.

To address the high resources’ consumption of the ETL tools one can store the intermediary results into database tables on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step in the detriment of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic the ETL tool can be used only for data extraction.

Previous Post <<||>> Next Post

Data Migrations (DM): Conceptualization IV (Data Access)

Data Migration
Data Migrations Series

Once the data sources for a Data Migration (DM) were identified the first question is how the data can be accessed. The legacy systems relying on ODBC-based databases are in theory relatively easy to access as long they allow the direct access to their data, which would enable thus a pull strategy. Despite this, there are organizations that don’t allow the direct access to the data even for read-only operations, being preferred to push the data directly to the consumers (aka push strategy) or push the data to a given location from where the consumer can use the data as needed (aka hybrid strategy). 

The direct access to the data allows in theory the best flexibility as the solution can extract the data when needed and this especially important during the initial phases of the project when the data need to be pulled more frequently until the requirements and logic is stabilized. A push strategy tends to add additional overhead as usually somebody else oversees the data exports, respectively the data need to be prepared in the expected format. On the other side, it would make sense to make an exception for a DM and allow the direct access to the data. 

 Hybrid strategies tend to be more complex and require additional resources or overhead as the data are stored temporarily at a separate location. Unfortunately, in certain scenarios this is the only approach can be used. Are preferred data files that allow keeping the integrity of the data and facilitate data consumption. Therefore, tabular text files or JSON files are preferred in the detriment of XML or Excel files. It’s preferable to export one data structure individually then storing parent-child solutions even if the latter can prove to be useful in certain scenarios. When there’s no other solution one can use also the standard reports available in the legacy systems.

When storing data outside the legacy systems for further processing it’s recommended to follow organization’s best practices, respectively to address the data security and privacy requirements. ETL tools allows accessing data from password protected areas like FTP, OneDrive or SharePoint. The fewer security layers in between the lower is in theory the overhead. Therefore, given its stability and simplicity FTP might prove to be a better storage solution than OneDrive, SharePoint or other similar technologies.

Ideally the extraction/export mechanisms should use the database objects that encapsulate already the logic in the legacy systems otherwise the team will need to reengineer the logic – for master data this can prove to be easy, though the logic of transactional data like on-hand or open invoices can be relatively complex to reengineer. Otherwise, the logic can be implemented directly in the extraction/export mechanisms or sometimes is more advisable to create database objects (usually on a different schema) on the legacy systems and just call the respective objects. 

When connecting directly to the data source it’s advisable using the data provider which allows the best performance and flexibility, however several tests might be needed to determine the best fit. It would be useful to check the limitations of each provider and find a stable driver version.  OLEDB and ADO.Net data providers provide in general a good performance, though native drivers of the legacy systems can be a better option upon case. 

Some legacy systems allow the access to their data only via service-based technologies like OData. OData tends to have poor performance for large data exports than standard access methods and therefore not indicated in such scenarios. In such cases might be a good idea to export the data directly from the legacy system. 

Previous Post <<||>> Next Post

01 August 2010

Resources: SQL Server Troubleshooting I

    Last week the SQL Server troubleshooting-related topics came again on my table, and it’s actually quite an important topic. As I have disparate links I thought it makes sense to bring the links together in a post.

    When searching for information it’s always a good idea to start with the documentation or the support site, in what concerns SQL Server troubleshooting Microsoft has several valuable resources on what concerns performance issues, application performance, ad-hoc queries, blocking, stored procedure recompilation, clusters, etc.  Even if outdated, of interest could be also the Improving .Net Application Performance and Scalability resource,  with a chapter on SQL Server Performance and ADO.NET. Other resources could be found in Technet, for example Troubleshooting Performance Problems in SQL Server 2005, a similar document being available on SQL Server 2008 from MSDN. As a way to avoid the need for troubleshooting, it makes sense to check also the SQL Server Best Practices.

    There are several good books on this topic I heartily recommend, the first on my list is the book of C. Bolton et al, Professional SQL Server 2008 Internals and Troubleshooting. What’s interesting to note is that the book is available to browse and read on Scribd as part of a eBook Deal with Wiley, deal that includes several other programming and non-programming books (See Wiley’s profile on Scribd). The code for several chapters from the C. Bolton's book is available on Wiley’s website. Of interest are especially the chapters on DMV (dynamic management views) because performance troubleshooting often resumes in searching for hints in SQL Server internal tables. Another book, actually booklet, on DMV comes from Redgate, the SQL Server DMV Starter Pack, the pack being available also with a list of 30 sample scripts downloadable together as zip.

   Talking about scripts, there are many scripts available on DMV from Technet’s Script Repository. If you want to get an overview of your SQL Server configuration and health, you could check Diagnostic Information Queries available from Glen Berry’s blog for SQL Server 2005, respectively SQL Server 2008 and 2008R2.

19 November 2005

SQL Server Troubleshooting: Problems in SQL Server 2000 DTS Packages

    There are cases in which a problem can not be solved with a query, additional data processing being requested. In this category of cases can be useful to use a temporary table or the table data type. Once the logic built you want to export the data using a simple DTS project having as Source the built query.

    While playing with SQL Server 2005 (Beta 2) I observed that the DTS Packages have problem in handling temporary tables as well table data types.

1. Temporary table Sample:

CREATE PROCEDURE dbo.pTemporaryTable
AS
CREATE TABLE #Temp(ID int, Country varchar(10)) -- create a temporary table

-- insert a few records
INSERT #Temp VALUES (1, 'US')
INSERT #Temp VALUES (2, 'UK')
INSERT #Temp VALUES (3, 'Germany')
INSERT #Temp VALUES (4, 'France')   

SELECT * FROM #Temp -- select records

DROP TABLE #Temp  drop the temporary table
 
-- testing the stored procedure
EXEC dbo.pTemporaryTable
Output:
ID Country
    ----------- ----------
1 US
2 UK
3 Germany
4 France
(4 row(s) affected)

Trying to use the stored procedure as Source in a DTS package brought the following error message: Error Source: Microsoft OLE DB Provider for SQL Server Error Description: Invalid object name ‘#Temp’   

 I hoped this has been fixed in SQL Server 2005 (Beta 3) version, but when I tried to use the stored procedure as a Source I got the error:
Error at Data Flow Task [OLE DB Source [1]]: An OLE DB error has occurred. Error code: 0x80004005 An OLE DB record is available. Source: "Microsoft OLE DB Provider for SQL Server" Hresult: 0x80004005 Description: "Invalid object name '#Temp'."
Error at Data Flow Task [OLE DB Source [1]]: Unable to retrieve column information from the data source. Make sure your target table in the database is available. ADDITIONAL INFORMATION: Exception from HRESULT: 0xC020204A (Microsoft.SqlServer.DTSPipelineWrap)

2. Table data type sample:

CREATE PROCEDURE dbo.pTableDataType
AS
DECLARE @Temp TABLE (ID int, Country varchar(10)) -- create table

-- insert a few records
INSERT @Temp VALUES (1, 'US')
INSERT @Temp VALUES (2, 'UK')
INSERT @Temp VALUES (3, 'Germany')
INSERT @Temp VALUES (4, 'France')   

SELECT * FROM @Temp -- select records

-- testing the stored procedure
EXEC dbo.pTableDataType

Output:
ID Country
   ----------- ----------
1 US
2 UK
3 Germany
4 France
(4 row(s) affected)    

This time the error didn't appear when the Source was provided, but when the package was run: Error message: Invalid Pointer    

In Server 2005 (Beta 3) it raised a similar error:
SSIS package "Package6.dtsx" starting. Information: 0x4004300A at Data Flow Task, DTS.Pipeline: Validation phase is beginning Information: 0x4004300A at Data Flow Task, DTS.Pipeline: Validation phase is beginning Information: 0x40043006 at Data Flow Task, DTS.Pipeline: Prepare for Execute phase is beginning Information: 0x40043007 at Data Flow Task, DTS.Pipeline: Pre-Execute phase is beginning SSIS package "Package6.dtsx" finished: Canceled.    

 I saw there will be problems because in Flat File Connection Manager’s Preview no data were returned. I hope these problems were solved in the last SQL Server Release, if not we are in troubles!

Possible Solutions:    
If the Stored Procedures provided above are run in SQL Query Analyzer or using ADO, they will work without problems. This behavior is frustrating, especially when the logic is really complicated and you put lot of effort in it; however there are two possible solutions, both involving affordable drawbacks:
1. Create a physical table manually or at runtime, save the data in it and later remove the table. This will not work for concurrent use (it works maybe if the name of the table will be time stamped), but for ad-hoc reports might be acceptable.
2. Built an ADO based component which exports the data to a file in the format you want. The component is not difficult to implement, but additional coding might be requested for each destination type. The component can be run from a UI or from a DTS package.
3. In SQL Server Yukon is possible to use the CLR and implement the logic in a managed stored procedure or table valued function.

Happy Coding!
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.