SQL Troubles: data transformation

Showing posts with label data transformation. Show all posts

17 March 2024

🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence Series

Creating data products within a data mesh resumes in "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is improperly used as there can be overlapping in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

If the complexity of inputs and outputs can be neglected, even if their number could amount to a big number, not the same can be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built in the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

When designing a data warehouse or data mart one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum to minimize the unnecessary duplication of code and data. Code duplication becomes usually an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when they are not in synch, fact derived from code not synchronized or with different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts for providing non-SQL operators for the same (in SSIS, Power Query) though the solutions built based on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data mashes, the complexity increases also with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges for developing data marts supposed to be domain-specific).

To growing complexity organizations answer with complexity. On one side the teams of developers, business users and other members of the governance teams who together with the solution create an ecosystem. On the other side, the inherent coordination and organization meetings, managing proposals, the negotiation of scope for data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior of the parties involved. Ecosystems are challenging to monitor and manage.

The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail in addressing basic analytical requirements, why would one think that they are capable of handling an increased complexity? Even if one breaks the complexity of a data warehouse to more manageable units, the complexity is just moved at other levels that are more difficult to manage in ensemble.

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can be easily get lost in the bigger picture. Thus, is needed a global observability framework that allows to monitor the performance and health of each data product in aggregate. Besides that, there are needed event brokers and other mechanisms to handle failure, availability, security, etc.

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.

Previous Post <<||>> Next Post

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

Business Intelligence Series

One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms one can think of a data product like a box (or Lego piece) which takes data as inputs, performs several transformations on the data from which result several output data (or even data visualizations or a hybrid between data, visualizations and other content).

At high-level each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply.

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs.

Data Products with a Data Mesh

Finding such a partition is possible though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to built something cohesive, performant, secure or well-structured. Is like building a house from modules, the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort.

Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerable high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible.

It's challenging to define the inputs and outputs for a Lego piece. Now imagine in doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable in their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon the Lego pieces expand beyond their initial design. It's like the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, change the gravity center of the edifice in other directions. There will be thus limits to grow that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for this magic boxes involves a change of all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report, how many iterations and how much effort was involved. This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter of the effort involved in design, there will be always changes and further iterations involved.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

10 March 2024

🏭📑Microsoft Fabric: Medallion Architecture [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 10-Mar-2024

Medallion Architecture in Microsoft Fabric [1]

Medallion architecture

a recommended data design pattern used to organize data in a lakehouse logically [2]

compatible with the concept of data mesh

{goal} incrementally and progressively improve the structure and quality of data as it progresses through each stage [1]

brings structure and efficiency to a lakehouse environment [2]
ensures that data is reliable and consistent as it goes through various checks and changes [2]
complements other data organization methods, rather than replacing them [2]

consists of three distinct layers (or zones)

{layer} bronze (aka raw zone)

stores source data in its original format [1]
the data in this layer is typically append-only and immutable [1]
{recommendation} store the data in its original format, or use Parquet or Delta Lake [1]
{recommendation} create a shortcut in the bronze zone instead of copying the data across [1]

works with OneLake, ADLS Gen2, Amazon S3, Google

{operation} ingest data

{characteristic} maintains the raw state of the data source [3]
{characteristic} is appended incrementally and grows over time [3]
{characteristic} can be any combination of streaming and batch transactions [3]
⇒ retaining the full, unprocessed history

⇒ provides the ability to recreate any state of a given data system [3]

additional metadata may be added to data on ingest

e.g. source file names, recording the time data was processed

{goal} enhanced discoverability [3]
{goal} description of the state of the source dataset [3]
{goal} optimized performance in downstream applications [3]

{layer} silver (aka enriched zone)

stores data sourced from the bronze layer
the raw data has been

cleansed
standardized
structured as tables (rows and columns)
integrated with other data to provide an enterprise view of all business entities

{recommendation} use Delta tables

provide extra capabilities and performance enhancements [1]

{default} every engine in Fabric writes data in the delta format and use V-Order write-time optimization to the Parquet file format [1]

{operation} validate and deduplicate data
for any data pipeline, the silver layer may contain more than one table [3]

{layer} gold (aka curated zone)

stores data sourced from the silver layer [1]
the data is refined to meet specific downstream business and analytics requirements [1]
tables typically conform to star schema design

supports the development of data models that are optimized for performance and usability [1]

use lakehouses (one for each zone), a data warehouse, or combination of both

the decision should be based on team's preference and expertise of your team.
different analytic engines can be used [1]

⇐ schemas and tables within each layer can take on a variety of forms and degrees of normalization [3]

depends on the frequency and nature of data updates and the downstream use cases for the data [3]

{pattern} create each zone as a lakehouse

business users access data by using the SQL analytics endpoint [1]

{pattern} create the bronze and silver zones as lakehouses, and the gold zone as data warehouse

business users access data by using the data warehouse endpoint [1]

{pattern} create all lakehouses in a single Fabric workspace

{recommendation} create each lakehouse in its own workspace [1]
provides more control and better governance at the zone level [1]

{concept} data transformation

involves altering the structure or content of data to meet specific requirements [2]

via Dataflows (Gen2), notebooks

{concept} data orchestration

refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome [2]

via data pipelines

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn: Fabric (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Organize a Fabric lakehouse using medallion architecture design (link)
[3] Microsoft Learn: Azure (2023) What is the medallion lakehouse architecture? (link)

Resources:
[R1] Serverless.SQL (2023) Data Loading Options With Fabric Workspaces, by Andy Cutler (link)
[R2] Microsoft Learn: Fabric (2023) Lakehouse end-to-end scenario: overview and architecture (link)
[R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADLS - Azure Data Lake Store Gen2

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on the data. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies use to handle analytical and data integration workloads, which is probably recommendable for the mentioned cloud-based contexts. However, ELT technologies are especially relevant when is needed to handle data with high velocity, variance, validity or different value of truth (aka big data). This because they allow processing the workloads over architectures that can be scaled with workloads’ demands.

This is probably the most important aspect, even if there can be further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. The ETL (Extract, Transform, Load) tools have the same capabilities, maybe reduced to certain data sources, though their newer versions seem to bridge the gap.

One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though these are not technological advantages. The same can be obtained via ETL tools, even if this might involve upon case a bigger effort, effort depending on the functionality existing in each tool. It’s true that ETL solutions have a narrower scope by loading a subset of the available data, or that transformations are made before loading the data, though this depends on the scope considered while building the data warehouse or data mart, respectively the design of ETL packages, and both are a matter of choice, choices that can be traced back to business requirements or technical best practices.

Some of the advantages seen are context-dependent – the context in which the technologies are put, respectively the problems are solved. It is often imputed to ETL solutions that the available data are already prepared (aggregated, converted) and new requirements will drive additional effort. On the other side, in ELT-based solutions all the data are made available and eventually further transformed, but also here the level of transformations made depends on specific requirements. Independently of the approach used, the data are still available if needed, respectively involve certain effort for further processing.

Building usable and reliable data models is dependent on good design, and in the design process reside the most important challenges. In theory, some think that in ETL scenarios the design is done beforehand though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.

Data conversion and cleaning is needed under both approaches. In some scenarios is ideal to do this upfront, minimizing the effect these processes have on data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should be ideally addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. enforcing best practices).

Advancing that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth data has depends on the way data reflects business’ processes and the way data are maintained, while their quality is judged entirely on their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing can be neglected only under the consequences that follow from this.

Looking at the analytics and data integration cloud-based technologies, they seem to allow both approaches, thus building optimal solutions relying on professionals’ wisdom of making appropriate choices.

Previous Post <<||>>Next Post

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Microsoft Azure: Azure Data Factory (ADF)

{definition} pay-per-use serverless cloud-based data integration service that orchestrates and automates the movement and transformation of both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads found at its second generation [11]
- allows creating data-driven flows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities
{benefit} cost-effective
- pay-as-you-go model against the Azure subscription with no up-front costs
- low price-to-performance ratio
  - ⇐ cost effective and performant at the same time
- fully managed serverless cloud service that scales on demand [2]
  - ⇒requires zero hardware maintenance [1]
  - ⇒can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
- provides additional cost-saving functionality [11]
  - {feature} it takes care of the provisioning and teardown of the cluster once the job has executed [11]
{benefit} powerful
- allows ingesting on-premise and cloud-based data sources
- high-performance hybrid connectivity
  - over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
- orchestrate at scale
  - on-demand compute
  - Big Data workloads are scaled over multiple nodes to chunk data in parallel [11]
- {feature} [ADFv2] monitoring
  - richer and natively integrating it with Azure Monitor and OMS [11]
    - includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
- {feature} [ADFv2] control flow functionality
  - lets define complex workflows using programmatic or UI mechanisms
    - allows defining parameters at pipeline level [11]
    - includes custom state passing and looping containers [11]
    - pipelines can be authored via additional tools
      - e.g. PowerShell, .NET, Python, REST APIs
      - ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
{benefit} intelligent
- autonomous ETL allows unlocking operational efficiencies and enable citizen integrators [2]
{benefit} enterprise-grade security:
- provides same security standards as any other Microsoft service [11]
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
{benefit} backwards compatibility
- {feature} [ADFv2] allows rehosting SSIS solutions [2]
  - ⇒ helpful for modernizing data warehouse solutions
{prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
{limitation} availability
- the service isn’t available in all regions
  - an instance can be made available in other region to trigger the job on customer’s computer environment [1]
    - ⇐ the time for executing the job on the compute environment doesn’t change [1]
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
- activity types
  - data movement activities
  - data transformation activities
  - control activities
    - control how the pipeline works and interacts with the data [10]
    - allow executing pipelines [10]
    - allow running a foreach statement or Lookup activities [10]
{concept] pipeline
- logical grouping of activities that together perform a task [1]
  - the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
  - two activities can be chained by setting the output data set of one activity as the input dataset of the other activity
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
- data in a pipeline is referred to by different names
  - ⇐ based on the amount of modification that has been performed
  - raw data
    - data with no processing applied [10]
      - ⇒does not yet have a schema applied
    - stored in the message encoding format used to send tracking events such as JSON.
    - can be organized into meaningful data stores and data lakes [10]
      - ⇐ further used in decision-making
    - it's common to send all tracking events as raw events
      - ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
  - processed data
    - raw data that has been decoded in the event-specific formats with the schema applied
      - e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
    - usually stored in different event tables and destination in a data pipeline [10]
  - cooked data
    - processed data that has been aggregated or summarized [10]
- {concept} pipeline parameters
  - similar to SSIS package parameters
    - ⇐ need to be set from outside packages
  - can be passed from the parent pipeline
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
- identifies data structures within different (linked) data stores [1]
  - ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
  - once created, it can be used with activities in a pipeline [10]
    - e.g. a dataset can be an input or output dataset of a copy activity
{concept} linked service
- defines the information needed by ADF to connect to external resources at runtime
  - much like connection strings which define the connection information [10]
- used to represent
  - {concept} data store
    - holds the input-output data to the ADF
    - e.g. tables, files, folders, and documents
  - {concept} compute resource
    - can host the execution of an activity [1]
{concept} scheduler triggers
- allow pipelines to be triggered on a wall-clock schedule [10]
  - pipelines and triggers have an n-m relationship
    - multiple triggers can kick off a single pipeline
    - the same trigger can kick off multiple pipelines
  - manual triggers trigger pipelines on demand [10]
- once defined, it must be started to begin triggering the pipeline [10]
- comes into effect only after publishing the solution to ADF [10]
  - ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in trigger definition [10]
- there is a cost associated with each pipeline run
  - {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
  - {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]

Previous Post <<||>> Next Post

Acronyms:

Azure Data Factory (ADF)

Continuous Integration/Continuous Deployment (CI/CD)

Extract Load Transform (ELT)

Extract Transform Load (ETL)

Independent Software Vendors (ISVs)

Operations Management Suite (OMS)

pay-as-you-go (PAYG)

SQL Server Integration Services (SSIS)

Resources:

[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College

[2] Microsoft (2021) Azure Data Factory [source]

[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]

[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]

[10] Coursera (2021) Data Processing with Azure [source]

[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

03 February 2021

📦Data Migrations (DM): Conceptualization (Part V: Data Extraction Layer)

ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage legacy systems’ data access layers or own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box in the detriment of performance loss, and thus should be used with caution.

Besides that, ETL tools provide also rich visual functionality that allow users building complex pipelines with transformations that process the data as data go through the pipeline. Further features like data profiling or cleansing bring additional benefits.

As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope as filtering and other logic can be used in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches are usually recommended.

As alternative to an ETL tool is building own extraction layer based for example on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative especially when ETL capabilities aren’t needed. This depends also on the overall architectural approach. Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach especially when dealing with a considerable volume of data. Moreover, it would be needed to build features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).

When the volume of data exceeds the capabilities of ETL tools one can consider ELT tools which load first the data before applying any transformations on them. Such tools are designed for the processing of what is known as big data (data having high volume, high velocity, high variety and different veracity).

When considering the best data extraction approach, it’s important to know where the data will be stored for processing. Given that DMs are data processing intensive the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency, failover mechanism some databases offer the possibility to connect directly to other servers via server links functionality. Despite this latter feature an ETL tool can still have considerable advantages for data extraction.

On the other side the DM logic can be in theory built entirely in the ETL tool without storing the data within a database, though this adds a high overhead on the server resources on which the solution runs as all the data needed for processing need to be loaded in memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make challenging implementing any optimization techniques directly into the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and change as the requirements change.

To address the high resources’ consumption of the ETL tools one can store the intermediary results into database tables on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step in the detriment of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic the ETL tool can be used only for data extraction.

Previous Post <<||>> Next Post

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed.

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, taking thus advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.

Unfortunately, SQL can replace the transformations that address sequential logic and not workflow-related logic (e.g. conditional splitd, merges, multicasts, slowly changing dimensions) or logic that includes certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic or simple use COTS solutions, when they prove capable of filling the gap.

Copying the data 1:1 at table or entity-level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The casual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings).

Other sources for concerns are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handing, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, upon case one needs to consider the parameter-based or hardcoded metadata created in the process.

Independently of the purpose of the ETL packages it is usually required to document the data flow associated with them and the rules applied in transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes. Even if it’s almost impossible to document everything, at minimum needs to be provided the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always turn back to the SQL queries, when needed.

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic requested to move the data could be handled in theory with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic in several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weight the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best.

Previous Post <<||>> Next Post

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers in theory the final steps from the data pipeline and in most of the cases it matches the definition of the query used for data extraction, though this depends also on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table from the staging area, table with no constraints that matches only the data types from the source. Once the data loaded, they are further copied into the production table. This approach allows minimizing the unavailability of the production table as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center, however loading the data first in a staging table facilitates troubleshooting and testing. This approach allows also dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving data loads’ efficiency.

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and upon case also additional metadata (e.g. identifier for source system, unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except the unique identifier, these metadata can however be saved also in a separate table, with the same granularity as the table (1:1) or one record for each load per table and system, storing a reference to the respective record into the load table. There are seldom logical argumentations for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).

A special approach is considered by the Data Vault methodology for Data Warehousing which seems to gain increasing acceptance, especially to address the various compliance requirements for tracking the change in records at most granular level. To achieve this the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, while satellite tables store the descriptions of the business keys (the other attributes except the business key) and reference tables store the dropdown values. Besides table’s denormalization there are several other constraints that apply. The denormalization of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables – the format is already given - one needing thus to accommodate the data into the existing tables or extended the respective tables to accommodate more attributes. It’s the case for load tables storing data from multiple systems with similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

Data Warehousing Series

As part of the ETL process with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is done typically based on SQL queries as long one deals with relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files.

One can consider the preparation of the extraction logic as separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process are identified the source of the data in terms of system, tables and attributes to be imported, as well the joins, business and transformation rules that need to be applied on the data. This can involve reengineering the logic from the source system(s) as well data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, eventually by considering only the needed attributes to minimize the necessary space, loading time and content’s complexity, even if this would add more effort into the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is neglectable, copying the table 1:1 could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables from design or other purposes. Thus, the extraction query will attempt modeling the entity. This approach reduces to some degree the number of tables from the targeted solution, as well the number of ETL packages involved, while providing a clear depiction of the entities involved.

A third approach is to extract the data as needed by the target system, eventually as a mix between master and transaction data, fact which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability in the detriment of design.

Unfortunately, there can be design constraints or choice considerations that could lead to a mix between these approaches. If the impact caused by the mix between the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long the considered data are disconnected from other data.

To reduce the redundancy of data, it’s recommended to consider as goal creating a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other sides the overlaps can be acceptable when the same data are available in more systems and the solution requires all the data to be available.

If the above approaches consider the vertical partitioning of the data, there can be also horizontal partitioning needs especially when a subset of the data is needed or when is needed to partition the data based on a set of values. In addition, one might be forced to include also transformation rules directly into the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.

Previous Post <<||>> Next Post

23 March 2010

🕋Data Warehousing: Data Transformation (Definitions)

"A set of operations applied to source data before it can be stored in the destination, using Data Transformation Services (DTS). For example, DTS allows new values to be calculated from one or more source columns, or a single column to be broken into multiple values to be stored in separate destination columns. Data transformation is often associated with the process of copying data into a data warehouse." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of reformatting data based on predefined rules. Most often identified as part of ETL (extraction, transformation, and loading) but not exclusive to ETL, transformation can occur on the CDI hub, which uses one of several methods to transform the data from the source systems before matching it." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"Any change to the data, such as during parsing and standardization." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A process by which the format of data is changed so it can be used by different applications." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"Converting data from one format to another|making the data reflect the needs of the target application. Used in almost any data initiative, for instance, a data service or an ETL (extract, transform, load) process." (Tony Fisher, "The Data Asset", 2009)

"Changing the format, structure, integrity, and/or definitions of data from the source database to comply with the requirements of a target database." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The SSIS data flow component that modifies, summarizes, and cleans data." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Data transformation is the process of making the selected data compatible with the structure of the target database. Examples include: format changes, structure changes, semantic or context changes, deduping, and reordering." (Piethein Strengholt, "Data Management at Scale", 2020)

"1. In data warehousing, the process of changing data extracted from source data systems into arrangements and formats consistent with the schema of the data warehouse. 2. In Integration Services, a data flow component that aggregates, merges, distributes, and modifies column data and rowsets." (Microsoft Technet)

12 March 2010

🕋Data Warehousing: Pivot/Unpivot (Definitions)

"To rotate rows to columns, and columns to rows, in a cross-tabular data browser. Also refers to choosing dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of rotating the view of data. For example, viewing what was the x-axis in the y-axis’s position and vice versa." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The UNPIVOT operator is used within a SELECT statement to create a normalized data report from data that is stored as a spreadsheet." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The PIVOT operator is used within a SELECT statement. It is used to create cross-tab reports (similar to a spreadsheet) from normalized data." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The act of rotating rows to columns, and columns to rows." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

[unpivot:] "To expand values from multiple columns in a single record into multiple records with the same values in a single column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"To rotate a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output, and perform aggregations where they are required on any remaining column values that are wanted in the final output." (Microsoft, "SQL Server 2012 Glossary", 2012)

"In the Lean start-up world, a pivot is a structured, often rapid, course correction on the basis of new market, customer, and development information." (Pamela Schure & Brian Lawley, "Product Management For Dummies", 2017)

"1. To rotate rows to columns, and columns to rows, in a cross-tabular data browser. 2. To choose dimensions from the set of available dimensions in a multidimensional data structure for display in the rows and columns of a cross-tabular structure." (Microsoft Technet)

[unpivot:] "In Integration Services, the process of creating a more normalized dataset by expanding data columns in a single record into multiple records." (Microsoft Technet)

07 February 2010

🎡⌛SSIS: SQL Server to Oracle Data Export (Second Magic Class)

In a previous post I described the steps needed to be followed in order to export the data from SQL Server to Excel, this time I will describe the steps needed in order to export the data from SQL Server to Oracle. For this is requested to have Oracle database product installed on your computer, or in case you are having already an Oracle database server on the network then you’ll need only the Oracle Client installed locally, and a SQL Development Tool like SQL Developer or TOAD, with a plus for the later.

Like in the previous tutorial we’ll use the Production.Product table from AdventureWorks database coming with SQL Server.

Step 1: Create an Oracle User

If you already have an Oracle User created then you could skip this step, unless you want to create a User only for the current tutorial or for loading other AdventureWorks tables in Oracle. For this we will create first a permanent tablespace (e.g. AdventureWorks), an allocation of space in the database, then create the actual User (e.g. SQLServer) on the just created tablespace, action the will create a schema with the same name as the User, and finally grant the User ‘create session’, ‘create materialized view’ and ‘create table’ privileges.

Open your SQL Developer tool of choice and connect to the Oracle database, then type the following statements and run them one by one, do not forget to provide a strong-typed password in IDENTIFIED BY clause, then create a new connection using the just created User.

-- Create TableSpace
CREATE BIGFILE TABLESPACE AdventureWorks
DATAFILE 'AdventureWorks.dat'
SIZE 20M AUTOEXTEND ON; 

-- Create User CREATE USER SQLServer 
IDENTIFIED BY <your_password>
DEFAULT TABLESPACE AdventureWorks 
QUOTA 20M ON AdventureWorks; 

-- Grant privileges
GRANT CREATE SESSION
, CREATE MATERIALIZED VIEW
, CREATE TABLE TO SQLServer;

Step 2: Start SQL Server Import and Export Wizard

From SQL Server Management Studio choose the database from which you want to export the data (e.g. AdventureWorks), right click on it and from the floating menu choose Tasks/Export Data

Step 3: Choose a Data Source

In ‘Choose a Data Source’ step select the ‘Data Source’, SQL Server Native Client 10.0 for exporting data from SQL Server, choose ‘Server name’ from the list of SQL Server available, select the Authentication mode and the Database (e. g. AdventureWorks), then proceed to the next step by clicking ‘Next’.

Step 4: Choose a Destination

In ‘Choose a Destination’ step select the Destination, in this case ‘Oracle Provider for OLE DB’ and then click on ‘Properties’ in order to provide the connectivity details, in ‘Data Link Properties’ dialog just opened, enter the Data Source (the SID of your Oracle database), the ‘User name’ (e.g. SQLServer), the ‘Password’ and check ‘Allow saving password’ checkbox, without this last step being not possible to connect to the Oracle database. Then test the connection by clicking the ‘Test Connection’, and if the ‘Test connection succeeded’ proceed to the next step.

Note:
Excepting ‘Oracle Provider for OLE DB’ there are three other drivers that allows you to connect to an Oracle database: ‘Microsoft OLE DB Provider for Oracle’, ‘.Net Framework Data Provider for Oracle’, respectively ‘Oracle Data Provider for .Net’, each of them coming with their own downsizes and benefits.

Step 5: Specify Table Copy Or Query

In this step choose: ‘Copy data from one or more tables or views’ option.

Step 6: Select Source Tables and Views

In ‘Select Source Tables and Views’ step select the database objects (e.g. Production.vProducts) from which you’ll export the data. You could go with the provided Destination table, though if you are using the current settings Oracle table’s name will be “Product” (including quotes) and not Product as we’d expected. There are two options, go with the current settings and change table’s name and columns, they having also this problem, in Oracle, or create the table manually in Oracle. I prefer the second option because I could use the SQL script automatically generated by Oracle, for this with the just chosen table selected click on ‘Edit Mappings’ button that will bring the ‘Column Mappings’ dialog and click the ‘Edit SQL’ button, the query appearing as editable in ‘Create Table SQL Statement’ dialog.

Copy the query and copy paste it in the Oracle SQL Developer, then do a replace of the quotes and delete the rowguid column as we don’t needed, in plus when attempting to dump the data, given the incompatibility of data types, an error will be raised. There is a second change we’ll have to make, to rename the Size column to ProductSize, this because Size is a reserved keyword.

CREATE TABLE sqlserver.Product (
ProductID INTEGER NOT NULL,
Name NVARCHAR2(50) NOT NULL,
ProductNumber NVARCHAR2(25) NOT NULL,
MakeFlag NUMBER NOT NULL,
FinishedGoodsFlag NUMBER NOT NULL,
Color NVARCHAR2(15),
SafetyStockLevel INTEGER NOT NULL,
ReorderPoint INTEGER NOT NULL,
StandardCost NUMBER NOT NULL,
ListPrice NUMBER NOT NULL,
ProductSize NVARCHAR2(5),
SizeUnitMeasureCode NCHAR(3),
WeightUnitMeasureCode NCHAR(3),
Weight NUMBER,
DaysToManufacture INTEGER NOT NULL,
ProductLine NCHAR(2),
Class NCHAR(2),
Style NCHAR(2),
ProductSubcategoryID INTEGER,
ProductModelID INTEGER,
SellStartDate TIMESTAMP NOT NULL,
SellEndDate TIMESTAMP,
DiscontinuedDate TIMESTAMP,
ModifiedDate TIMESTAMP NOT NULL)

Unfortunately this approach has a downside, we have to go two steps back in Choose a Destination, then click on ‘Properties’ again and without doing any change test the connection by clicking the ‘Test Connection’, then proceed forward to the current step, and in ‘Select Source Tables and Views’ step select again the database object (e.g. Production.vProducts) from which you’ll export the data but this time in Destination dropdown control search for the table just created (e.g. sqlserver.Product).

Because we have removed the rowguid column from the table created in Oracle, we’ll have to change the mappings, for this click on ‘Edit Mappings’ and in ‘Mapping’ list box from ‘Column Mappings’, in the Destination dropdown next to rowguid column select ‘’, thus the respective column being ignored from the mapping. Accept the change and proceed to the next step.

Note:
You could choose to export more than one table, though for this you’ll have to create manually each table in Oracle and then map the created tables in ‘Select Source Tables and Views’ step.

Step 7: Review Data Type Mapping

Step 8: Save and Run Package

In ‘Save and Run Package’ check the ‘Run immediately’, respectively the ‘Save SSIS Package’ and ‘File System’ option, then proceed to the next step by clicking ‘Next’.

Step 9: Save SSIS Package

In ‘Save SSIS Package’ step provide the intended Name (e.g. Export Products To Oracle) or Description of the Package, choose the location where the package will be saved, then proceed to the next step by clicking ‘Next’.

Step 10: Complete the Wizard

Step 11: Executing the Package

Step 12: Checking the Data

If the previous step completed successfully, then you can copy the following statement in Oracle SQL Developer and execute it, the table should be already populated.

SELECT * 
FROM sqlserver.Product

Note:
If you used other schema to connect to, then you’ll have to replace the above schema (e.g. SQLServer) with your schema.

23 January 2010

🧊🎡Data Warehousing: ETL (SSIS packages vs. SQL code)

I’ve been working with DTS (Data Transformation Services) packages for the past 7-8 years, finding SQL Server’s 2000 functionality pretty useful in handling ETL (Extract, Transform, Load) tasks, especially when importing and exporting data between multiple data sources. The functionality provided by the DTS platform was basic, though it could be extended in custom packages with code developed in VB or by using ActiveX tasks in VBScript/Jscript, when the SQL-based logic was not enough. In addition, all three types of coding could make calls to .dlls, and thus operating system, vendor or in-house built libraries or simple functions could be (re)used.

Starting with SQL Server 2005, DTS was replaced by the SSIS (SQL Server Integration Services) component, becoming with SSRS (SQL Reporting Services) and SSAS (SQL Server Analysis Services) integrant part of Microsoft's data platform. Besides providing a new architecture, SSIS extended the functionality previously available, bringing more flexibility in constructing the packages, their elements and data manipulation. SSIS became with each new version of SQL Server (2008, 2012, 2014, 2016, 2017, 2019) a powerful ETL tool that can be easily used for Data Warehousing, Data Migration or Data Integration projects.

In what concerns ETL we could say there are two main philosophies - have as much of the business logic in the (ETL) package, or use the package mainly for loading data from the various sources and have the business logic in the database as SQL-based code. As always, each of the two philosophies has its own advantages and disadvantages, though I would consider a third philosophy – design for performance, reuse and maintainability, resulting thus a hybrid between the first two mentioned philosophies.

There are several other factors that need to be considered when building ETL solutions - synchronization, testability, security, stability, scalability, complexity, learning curve, etc. I would say that there is no perfect receipe, no architecture matching all requirements, each solution coming with its needs and constraints, sometimes being a good idea to go one level of abstraction above the requirements, while other times is better to stick to the requirements and problem at hand.

Performance

Designing for performance resumes in choosing the architecture/methods that provides the best performance individually and as a whole – either using package-based functionality, SQL-based functionality or a combination of both. In general SQL code is best suited for query-based data manipulation, while packages are better suited for sequential or workflow-based processing of data, though there could be exceptions and exceptions. Often it’s a good idea to test the performance of all alternative approaches via a prototype, even if in time the developer arrives to a good knowledge of the methods that suit best from a performance point of view.

Reuse

Each of the two architectures allow a lower or higher degree of reuse using parameters, variables and compartmenting of code, maybe with a plus for SQL code which has in theory a greater maturity and flexibility than the package-based functionality, allowing a wider range of reuse resulted from the compartmenting of code in the various supported objects (stored procedures, functions, views or tables).

Maintainability

Maintainability, the easiness of modifying packages or code, is an important factor because few are the cases in which the logic doesn’t change over time, many projects having to deal with change in requirements, sometimes implying a 180 degrees change of the overall perspective. Therefore it's important to be able to modify the code/package and even change the architecture with a minimum of effort.

Refactoring

Refactoring resumes in modifying the code without changing the functional behavior in order to improve the performance, readability, maintainability or extensibility, minimize the use of resources, remove redundant code, adhere to standards, best practices or naming conventions. It is said that there is always place for improvement, performance being the dimension with the most important impact in the world of databases. Refactoring is not always necessary, though it’s a good practice for achieving high quality solutions.

Synchronization

I would say there are three types of synchronization – of scope, business logic and data, the first targeting to synchronize the filters used to select the data, the second to synchronize the logic used in data processing, when is the case, and third to work with the same unaltered copy of the data. Considering the example of Invoices – Headers and Lines, the synchronization of scope would resume to apply the same constraints for the two, assuring that there is no Invoice Header without Lines, and vice-versa; the data synchronization between the two referring to the fact that the data between the two data sets should be consistent, there should be no change in Invoice Headers not reflected also at Line level (e.g. total amount matching between Lines and Header). Business logic synchronization refers typically to the use of the same set of data for similar purposes, if several transformations were used for Invoice Headers, they should be reflected accordingly also at Invoice Line level. Synchronization it’s actually quite an important topic, therefore I will reconsider it in a further post.

Testing & Troubleshooting

I find that the business logic implemented in SQL code is much easier to test than the logic implemented in packages, because in the first situation each object and step could be in theory tested individually/progressively, being thus much easier to troubleshoot. Troubleshooting packages logic can be quite complex because is not always possible to view the input/output for each intermediary step.

Complexity

Complexity is reflected in the easiness of understanding the logic broken down to pieces and as a whole. Packages are highly visual, being in theory easier to identify and understand the steps and their flow, even by non-technical people, while SQL-code might need auxiliary representations for the same purpose (e.g. data flow diagram) and need a higher level of expertize.

Security

Security is always a sensitive and complex topic, and in general it resumes to how secure is the code and sensitive information stored (e.g. user name & password, data), who has access to execute it and the context in which the code is run. This can become easily quite a complex topic, being highly dependent on the architecture used.

Stability

We can discuss about platform and design stability, which can be often a matter of perception and experience. Both SSIS and database engines could be considered as stable development environments, the later having in theory a greater maturity and flexibility, flexibility that could be easily brought to extreme creating bad coding monsters (e.g. loop calls to .dll libraries), impacting thus design’s stability, which is correlated to the adequate use of functionality, techniques and resources – each technology has its does and don’ts, strength and weakness.

Deployment

Deployment of business logic on the server can be quite easy or quite complex, depending on the overall architecture, the number of configurable items in scope, and the complexity of the dependencies existing between them. The deployment usually resumes at copying the code from one location to another, installing the eventual dependencies and configuring the objects for use.

Scalability

Scalability in this context refers mainly to the degree the business logic can cope with the increased volume of records, and not necessarily with the number of requests, though this aspect could be considered too, after case. SSIS and the database engines are designed to be highly scalable, though there are architectures and architectures, good uses and misuses of techniques. Designing for performance in theory equates with good scalability, unless the requirements makes it difficult to have a scalable solution.

Learning curve

The learning curve of technologies is always an important factor that needs to be considered in development, as it reflects how much time an average developer needs to master the basic/average/complex functionality provided by the respective technology. For ETL development is in general requested an average knowledge of both - SSIS architecture, respectively SQL-based programming - though it’s not easy to acquire both problem-solving mindsets. A SSIS developer usually attempts using as much as possible the functionality provided by SSIS, while a SQL developer the SQL-based functionality. In the end, it’s important to know how to balance between the two.

Previous Post <<||>> Next Post

SQL Troubles

Pages

17 March 2024

🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

10 March 2024

🏭📑Microsoft Fabric: Medallion Architecture [Notes]

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

03 February 2021

📦Data Migrations (DM): Conceptualization (Part V: Data Extraction Layer)

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

23 March 2010

🕋Data Warehousing: Data Transformation (Definitions)

12 March 2010

🕋Data Warehousing: Pivot/Unpivot (Definitions)

07 February 2010

🎡⌛SSIS: SQL Server to Oracle Data Export (Second Magic Class)

23 January 2010

🧊🎡Data Warehousing: ETL (SSIS packages vs. SQL code)

About Me