Showing posts with label DWH. Show all posts

21 October 2023

🧊💫Data Warehousing: Architecture (Part VI: Building a Data Lakehouse for Dynamics 365 Environments with Serverless SQL Pool)

Data Warehousing
Data Warehousing Series

One of the major limitations of Microsoft Dynamics 365 is the lack of direct access to the production databases for reporting purposes via standard reporting or ETL/ELT tools. Of course, one can attempt to use OData-based solutions though they don't scale with the data volume and imply further challenges. 

At the beginning, Microsoft attempted to address this limitation by allowing the export of data entities into the customer's Azure SQL database, a feature known as bring your own database (BYOD). Highly dependent on batch jobs, the feature doesn't support real-time synchronization or composite entities, and it depends on the capacity of the BYOD database, whose scalability becomes a bottleneck after a certain point.

Then Microsoft started to work on two solutions for synchronizing the Dynamics 365 data in near-real time (cca. 10-30 minutes) to the Data Lake: the Export to Data Lake add-in (*) and the Azure Synapse Link for Dataverse with Azure Data Lake. The former allows the synchronization of the tables from Finance & Operations (it doesn't work for CRM) to files that reflect the database model from the source. The latter, in turn, allows the synchronization of data entities to similar structures, and will probably support tables as well. Because the service works via Dataverse, it also supports the synchronization of CRM data.

The diagram below depicts the flow of data from the D365 environments to the Data Lake, the arrows indicating the direction of the flow. An arrow could also be drawn between Dynamics 365 Finance & Operations and the Azure Synapse Link for Dataverse service, though one may choose to use only the Export to Data Lake add-in, given that a data model based on the tables offers more flexibility (at the cost of more effort, though). Data from other systems can be exported via pipelines to the Data Lake to provide an integrated and complete view of the business.

Serverless Data Lakehouse

Once the data is available in the Data Lake, it can be consumed directly by standard Power BI and paginated reports; however, building the data model in the reports involves considerable effort and duplicates logic between reports. Therefore, it makes sense to prepare the data upfront in the Data Lake, when possible, and here the serverless SQL pool can help with building an enterprise data model. The former approach can still be used for rapid prototyping or data discovery.

The serverless SQL pool is a stateless, SQL-based distributed query service over Azure Data Lake for large-scale data and computational functions. Even if it doesn't support standard tables, it makes the data from the Data Lake's files available for processing via external tables, a mechanism that maps the files' structure to an entity that can be queried like a normal view (though it supports only read operations).

Further on, the enterprise data model can be built like in a normal Data Warehouse via the supported objects (views, stored procedures and table-valued functions). These objects can be called from standard Power BI and paginated reports, the queries being processed anew at runtime, which might occasionally result in poor performance. However, the architecture is supposed to scale automatically as needed.

If further performance is needed, parts of the logic or the end result can be exported to the Data Lake, and here the Medallion Architecture should be considered when appropriate. Depending on the case, further processing might be needed to handle the limitations of the serverless SQL pool (e.g. flattening hierarchies, handling data quality issues).

One can work around the lack of standard tables, needed especially for value mappings, by storing the respective data as files and/or occasionally by misusing views, or by generating Spark tables via the Spark pool.
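The "misused view" workaround for value mappings can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for the serverless SQL pool so that the example runs anywhere, and the table `orders`, the view `vw_item_type_map` and the codes are hypothetical. The exact syntax in a serverless SQL pool may differ in details.

```python
import sqlite3


def build_mapping_demo() -> sqlite3.Connection:
    """Create an in-memory database with a hard-coded mapping view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical source data (in the lakehouse: an external table over files).
        CREATE TABLE orders (order_id INTEGER, item_type_code INTEGER);
        INSERT INTO orders VALUES (1, 0), (2, 1), (3, 2);

        -- "Misused" view: the mapping rows are hard-coded in the view
        -- definition instead of being stored in a standard table.
        CREATE VIEW vw_item_type_map AS
        SELECT 0 AS code, 'Item' AS label
        UNION ALL SELECT 1, 'BOM'
        UNION ALL SELECT 2, 'Service';
        """
    )
    return conn


def orders_with_labels(conn: sqlite3.Connection) -> list:
    """Join the data to the mapping view to translate codes into labels."""
    return conn.execute(
        """
        SELECT o.order_id, m.label
        FROM orders o
        LEFT JOIN vw_item_type_map m ON m.code = o.item_type_code
        ORDER BY o.order_id
        """
    ).fetchall()
```

The approach is practical only for small, rarely changing mappings, as each change requires altering the view.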

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.


🧊Data Warehousing: Architecture V (Dynamics 365, the Data Lakehouse and the Medallion Architecture)

Data Warehousing
Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from architecture’s components. Usually, if we want flexibility or to change something in one area, this might have an impact in another area. This rule applies to the usage of the medallion architecture as well! 

In Data Warehousing, the medallion architecture takes a multilayered approach to building a single source of truth, each layer denoting the quality of the data stored in the lakehouse [1]. For the moment, three layers are defined: bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that need to be validated and prepared for reporting or other purposes.
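The three layers can be illustrated with a minimal sketch in plain Python. The function names `to_silver` and `to_gold`, the field names and the validation rules are assumptions made for illustration only; in practice each layer would be persisted as files or tables in the Data Lake.

```python
from typing import Dict, List

RawRow = Dict[str, str]


def to_silver(bronze_rows: List[RawRow]) -> List[dict]:
    """Validate raw rows: keep those with a customer and a numeric amount."""
    silver = []
    for row in bronze_rows:
        customer = row.get("customer", "").strip()
        try:
            amount = float(row.get("amount", ""))
        except ValueError:
            continue  # invalid amount: the row remains available only in bronze
        if customer:
            silver.append({"customer": customer, "amount": amount})
    return silver


def to_gold(silver_rows: List[dict]) -> Dict[str, float]:
    """Enrich/aggregate validated rows: total amount per customer."""
    totals: Dict[str, float] = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals
```

The bronze rows are never modified, so the rejected records remain available for troubleshooting.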

On the other side there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind are relational and in theory the data should be of proper quality as needed by business.

The greatest benefit of the serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake, and the mechanism is quite simple. On top of such files, external tables are built in the serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called like any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, an enterprise data model with dimensions, fact-like and mart-like entities can be built on top of the synchronized files from the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high performance, data cleaning), this approach allows reporting the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy.

Now, if one uses the medallion approach and relies on pipelines to process the data, then unless one is able to process the data in near-real-time or something comparable, a considerable delay will be introduced, a delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably compared to processing the logic at runtime. There are advantages and disadvantages to both approaches.

Probably the most important scenario that needs to be handled is that of integrating data from different sources. If unique mappings between values exist, if one system holds unique references to the records from the other system, or if an unambiguous logic can be identified, the data integration can be handled in the serverless SQL pool.

Unfortunately, when compared to on-premise or Azure SQL functionality, the serverless SQL pool has important constraints: it's not possible to use scalar UDFs, standard tables, recursive CTEs, etc. One therefore needs to work around these limitations and, in some cases, use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting, a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!
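As an illustration of such a workaround, a hierarchy that would normally be flattened with a recursive CTE can be flattened iteratively outside the serverless SQL pool, for example in a notebook or pipeline step. The sketch below uses plain Python; the function name `flatten_hierarchy` and the sample data are hypothetical.

```python
from typing import Dict, List, Optional


def flatten_hierarchy(parent_of: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Return, for each node, the path from the root down to the node.

    parent_of maps a node to its parent (None for a root). A cycle raises
    ValueError instead of looping forever.
    """
    paths: Dict[str, List[str]] = {}
    for node in parent_of:
        path = [node]
        seen = {node}
        parent = parent_of[node]
        while parent is not None:
            if parent in seen:
                raise ValueError(f"cycle detected at {parent!r}")
            seen.add(parent)
            path.append(parent)
            # A parent missing from the mapping is treated as a root.
            parent = parent_of.get(parent)
        paths[node] = list(reversed(path))
    return paths
```

The resulting paths can be written back to the Data Lake as a file, on top of which an external table exposes one column per hierarchy level.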

Bottom line: consider the architecture against your requirements!


[1] What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

03 March 2023

🧊Data Warehousing: Architecture (Part IV: Building a Modern Data Warehouse with Azure Synapse)

Data Warehousing

Introduction

When building a data warehouse (DWH), several key words or derivatives of them appear in the requirements: secure, flexible, simple, scalable, reliable, performant, non-redundant, modern, automated, real-time, etc. As practice shows, all these requirements are sometimes challenging to address given the increased complexity of the architecture chosen. There are so many technologies on the DWH market promising all these at low cost, low effort and high ROI, though DWH projects continue to fail to address the business and technical requirements.

On a basic level, building a DWH requires a data storage layer and an ETL (Extract, Transform, Load) tool responsible for the data movement between the various source systems and the DWH, and eventually within the DWH itself. After that, each technology added to the landscape tends to increase the overall complexity (and should be regarded with a critical eye as far as its advantages and disadvantages are concerned).

Data Warehouse Architecture (on-premise)

A Reference Architecture

When building a DWH or a data migration solution, which has many of the characteristics of a DWH, I prefer, from the many possible designs, to keep things as simple as possible. An approach based on a performant database engine like SQL Server as storage layer and SSIS (SQL Server Integration Services) as ETL tool has proved to be the best choice so far, allowing most of the technical requirements to be addressed by design. Then come the choices on how and where to import and transform the data, at what level of granularity, on how the semantic layer is built, how the data are accessed, etc.

Being able to pull (see extract subprocess) the data from the data sources on an as-needed basis offers the most flexible approach; however, there are cases in which direct access to the source data is not possible, and one has to rely on a push approach, where data are dumped regularly to a given location (e.g. FTP folder structure), to be picked up as needed. It's actually a hybrid between push and pull, because a full push approach would mean pushing the data directly to the DWH, which can also be acceptable, though it might offer less control over the data's movement and involve a few other challenges (e.g. permissions, concurrency).

Data can be prepared for the DWH in the source systems (e.g. exposed via data objects or API calls), anywhere in between via ETL-based transformations (see transform subprocess) or directly in the DWH. I prefer importing the data (see load subprocess) 1:1 without any transformations from the various sources via SSIS (or similar technologies) into a set of tables that constitute the staging area. It's true that in this way the ETL technology is used to a minimum, though unless there's a major benefit in using it for data transformations, using the DWH's capabilities and SQL for data processing can provide better performance and flexibility.

Besides the selection of the columns in scope (typically columns with meaningful values), it's important not to do any transformations in the extraction layer because the data is imported faster (eventually using fast load options as in SSIS) and it assures a basis for troubleshooting (as the data don't change between loads). Some filters can be applied only when the volume of data is high, and the subset of the data could be identified clearly (e.g. when data are partitioned based on a key like business unit, legal entity or creation date).

For better traceability, the staging schemas can reflect the systems they come from, while the tables and columns should have the same names and data types as in the source. No constraints are applied on such tables and no indexes are needed. They can, however, be defined on the production tables (aka base tables), the copies of the tables from production.

Some DWH architects try replicating the constraints from the source systems and/or add more constraints on top to define the various business rules. Rigor is good in some scenarios, though it can involve considerable effort and might be challenging to maintain over time, especially when considering the impact of big data on DWH architectures. Instead of using constraints, building a set of SQL scripts that pinpoint the issues as reports allows more flexibility, with the risk of having inconsistencies running wild through the reports. The data should be cleaned in the source system and, when that's not possible, properly addressed in the DWH. Applying constraints makes the data unavailable for reporting until they are corrected, while being more permissive allows dirty data into the reports. Thus, both approaches have advantages and disadvantages, though the latter seems to be more appropriate.
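A set of scripts that pinpoint the issues as reports could look like the following minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the DWH engine; the tables, the check names and the rules are hypothetical.

```python
import sqlite3

# Each check is a query that returns the keys of the offending rows.
# Table, column and check names are illustrative.
QUALITY_CHECKS = {
    "invoice_without_customer": """
        SELECT i.invoice_id
        FROM invoices i
        LEFT JOIN customers c ON c.customer_id = i.customer_id
        WHERE c.customer_id IS NULL
        ORDER BY i.invoice_id
    """,
    "negative_amount": """
        SELECT invoice_id FROM invoices WHERE amount < 0 ORDER BY invoice_id
    """,
}


def create_quality_demo() -> sqlite3.Connection:
    """Base tables without constraints, so dirty data can still be loaded."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER, name TEXT);
        CREATE TABLE invoices (invoice_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'A'), (2, 'B');
        INSERT INTO invoices VALUES (1, 1, 100), (2, 2, -5), (3, 99, 20);
        """
    )
    return conn


def run_quality_checks(conn: sqlite3.Connection) -> dict:
    """Return {check name: offending keys}; nothing is blocked or deleted."""
    return {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in QUALITY_CHECKS.items()
    }
```

The data remain available for reporting, while the output of the checks can be sent to the data owners for correction in the source system.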

Indexes on the production schema should reflect the characteristics of the queries run on the data and shouldn't replicate the indexes from the source environments, even if some overlaps might exist. In practice, dropping the non-clustered indexes on the production tables before loading the data from staging, and recreating them afterwards proves to provide faster loading (see load optimization techniques). 
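The drop-load-recreate sequence can be sketched as below, again with sqlite3 as a stand-in for SQL Server. The sketch shows only the order of the steps, not the performance gain, which depends on the engine and the data volume; the table and index names are hypothetical.

```python
import sqlite3


def create_sales_db() -> sqlite3.Connection:
    """Staging table without indexes, production table with one index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stg_sales (sale_id INTEGER, customer TEXT, amount REAL);
        CREATE TABLE sales (sale_id INTEGER, customer TEXT, amount REAL);
        CREATE INDEX ix_sales_customer ON sales (customer);
        INSERT INTO stg_sales VALUES (1, 'A', 10.0), (2, 'B', 5.0), (3, 'A', 2.5);
        """
    )
    return conn


def reload_production(conn: sqlite3.Connection) -> None:
    """Drop the index, reload the production table from staging, recreate the index."""
    conn.execute("DROP INDEX IF EXISTS ix_sales_customer")
    conn.execute("DELETE FROM sales")
    conn.execute("INSERT INTO sales SELECT sale_id, customer, amount FROM stg_sales")
    conn.execute("CREATE INDEX ix_sales_customer ON sales (customer)")
    conn.commit()


def index_names(conn: sqlite3.Connection, table: str) -> list:
    """List the indexes currently defined on a table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? ORDER BY name",
        (table,),
    )
    return [row[0] for row in rows]
```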

The production tables are used for building a "semantic" data model or something similar. Several levels of views, table-valued functions and/or indexed/materialized views allow building the dimension and fact tables, the latter incorporating the business logic needed by the reports. Depending on the case, stored procedures, physical or temporary tables, or table variables can be used to prepare the data, though they tend to break the "free" flow of data as the steps in-between need to be run. On the other hand, in certain scenarios their use is unavoidable.

The first level of views (aka base views) is based on the base tables without any joins, though the views include only the fields in use (needed by the business), ordered and "grouped" together based on their importance or certain characteristics. The views can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. Based on these "base" views the second level is built, which attempts to define the dimension and fact tables at the lowest granularity. These views include joins between tables coming from the same or different systems, mappings of values defined in tables, and whatever else it takes to build such entities. However, transformations on individual fields are pushed, when possible, to the lower level to minimize logic redundancy. For similar reasons, the logic could be broken down over two or more "helper" views when visible benefits could be obtained from it (e.g. troubleshooting, reuse, maintenance). It's important to strike a balance between creating too many helper views and encapsulating too much logic in a view.
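The two levels of views can be sketched as follows, with sqlite3 standing in for the DWH engine. All object names (`vw_base_customer`, `vw_base_sales`, `vw_fact_sales`) and the sample data are assumptions made for illustration.

```python
import sqlite3


def build_layered_model() -> sqlite3.Connection:
    """Create base tables, base views (level 1) and a fact-like view (level 2)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Base tables as imported 1:1 from the (hypothetical) source system.
        CREATE TABLE customers (id INTEGER, name TEXT, blocked INTEGER, legacy_col TEXT);
        CREATE TABLE sales (id INTEGER, customer_id INTEGER, amount_text TEXT);
        INSERT INTO customers VALUES (1, 'Contoso', 0, 'x'), (2, 'Fabrikam', 1, 'y');
        INSERT INTO sales VALUES (10, 1, '100.50'), (11, 2, '20');

        -- Level 1 (base views): no joins, only the fields in use,
        -- data type conversions and translations of codes.
        CREATE VIEW vw_base_customer AS
        SELECT id AS customer_id,
               name AS customer_name,
               CASE blocked WHEN 1 THEN 'Blocked' ELSE 'Active' END AS status
        FROM customers;

        CREATE VIEW vw_base_sales AS
        SELECT id AS sale_id,
               customer_id,
               CAST(amount_text AS REAL) AS amount
        FROM sales;

        -- Level 2: fact-like entity at the lowest granularity; it contains
        -- only joins, the field-level logic stays in the base views.
        CREATE VIEW vw_fact_sales AS
        SELECT s.sale_id, c.customer_name, c.status, s.amount
        FROM vw_base_sales s
        JOIN vw_base_customer c ON c.customer_id = s.customer_id;
        """
    )
    return conn
```

Because the conversion of `amount_text` and the translation of `blocked` live only in the base views, any other entity built on top of them reuses the same logic.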

One of the design principles used in building the entities is to minimize the redundancy of the fields used, ideally without having columns duplicated between entities at this level. This facilitates the traceability of columns to the source tables within the "semantic" layer (typically at the cost of a few more joins). In practice, one is forced to replicate some columns to simplify some parts of the logic.

Further views can be built based on the dimension and fact entities to define the logic needed by the reports. Only these objects are used and no direct reference to the "base" tables or views is made. Moreover, to offer better performance, the views can be materialized or, when there's an important benefit, physically saved as tables (e.g. having multiple indexes for different scenarios). It's the case of entities with considerable data volume that are called over and over.
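Saving a view physically as a table can be sketched as below, with sqlite3 as a stand-in (it has no materialized views, so the result of the view is copied into a table). The helper `materialize` and all object names are hypothetical; the identifier check is a basic safeguard, since object names can't be passed as query parameters.

```python
import sqlite3
from typing import Iterable


def create_materialize_demo() -> sqlite3.Connection:
    """Source table plus an aggregating view that is expensive to recompute."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales (customer TEXT, amount REAL);
        INSERT INTO sales VALUES ('A', 10), ('A', 5), ('B', 7);
        CREATE VIEW vw_sales_per_customer AS
        SELECT customer, SUM(amount) AS total FROM sales GROUP BY customer;
        """
    )
    return conn


def materialize(conn: sqlite3.Connection, view: str, table: str,
                index_columns: Iterable[str] = ()) -> None:
    """Save the current result of a view as a physical, indexed table."""
    index_columns = list(index_columns)
    for name in [view, table, *index_columns]:
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")
    conn.execute(f"DROP TABLE IF EXISTS {table}")  # also drops its indexes
    conn.execute(f"CREATE TABLE {table} AS SELECT * FROM {view}")
    for column in index_columns:
        conn.execute(f"CREATE INDEX ix_{table}_{column} ON {table} ({column})")
    conn.commit()
```

The trade-off is visible in the usage: the table stays unchanged until `materialize` is run again, while the view always reflects the current data.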

This approach of building the entities is usually flexible enough to address most of the reporting requirements, independently of whether the technical solution has the characteristics of a DWH, data mart or data migration layer. Moreover, the overall architectural approach can be used on-premise as well as in cloud architectures, where Azure SQL and ADF (Azure Data Factory) provide similar capabilities. Compared with standard SQL Server, some features might not be available, while other features might bring further benefits, though the gaps should be negligible.

Data Management topics like Master Data Management (MDM), Data Quality Management (DQM) and/or Metadata Management can be addressed as well by using third-party tools or tools from the Microsoft stack - Master Data Services (MDS) and Data Quality Services (DQS) in combination with SSIS help addressing a wide range of scenarios - however these are optional. 

Moving to the Cloud

Within the context of big data, characterized by (high/variable) volume, value, variety, velocity, veracity, and further less important V's, the technical requirements mentioned before still apply; however, within a cloud environment the overall architecture becomes more complex. Each component becomes a service. There are thus various services for data ingestion, storage, processing, sharing, collaboration, etc. The way data are processed also involves several important transformations: ETL becomes ELT, FTP and local storage are replaced by Data Lakes, data packages by data pipelines, stateful by stateless, SMP (Symmetric Multi-Processing) by MPP (Massively Parallel Processing), and so on.

As file storage is less expensive than database storage, there's an increasing trend of dumping business critical data into the Data Lake via data pipelines or features like Link to Data Lake or Export to Data Lake (*), which synchronize the data between source systems and Data Lake in near real-time at table or entity level. Either saved as csv, parquet, delta lake or any other standard file format, in single files or partitions, the data can be used directly or indirectly for analytics.

Cloud-native warehouses allow addressing topics like scalability, elasticity, fault-tolerance and performance by design, though further challenges appear as compute needs to be decoupled from storage, the workloads need to be estimated for assuring the performance, data may be distributed across data centers spanning geographies, the infrastructure is exposed to attacks, etc. 

Azure Synapse

If one wants to take advantage of the MPP architecture's power, Microsoft provides an analytical architecture based on Azure Synapse, an analytics service that brings together data integration, enterprise DWH, and big data analytics. Besides two types of SQL-based data processing services (dedicated vs serverless SQL pools), it also comes with a Spark pool for in-memory cluster computing.

A DWH based on Azure Synapse is not that different from the reference architecture described above for an on-premise solution. Actually, a DWH based on a dedicated SQL pool (aka a physical data warehouse) involves the same steps mentioned above. 

Data Warehouse Architecture with Dedicated SQL Pool

The data can be imported via ETL/ELT pipelines into the DWH, though there are also mechanisms for consuming the data directly from the files stored in the Data Lake or Azure storage. External tables can be defined on top of the data files (CETAS, aka Create External Table as Select, additionally exports the result of a query to storage), the external tables acting as "staging" or "base" tables in the architecture described above. When using a dedicated SQL pool it makes sense to use the external tables as "staging" tables, the processed data being then dumped to "optimized" physical tables for consumption and refreshed periodically. However, when this happens the near real-time character of the data is lost. Using the external tables as base tables would keep this characteristic as long as the data isn't saved physically in tables or files, possibly at the cost of performance.

Using a dedicated SQL pool for direct reporting can become expensive, as the pool needs to be available at least during business hours for incoming user requests, or at least for importing the data and refreshing the datasets. When the external tables are used as base tables, a serverless (aka on-demand) SQL pool, which uses a pay-per-use billing model, could prove to be more cost-effective and flexible in many scenarios. By design, it helps to keep the near real-time character of the data. Moreover, even if the data are actually moved from the source tables into the Data Lake, this architecture has the characteristics of a logical data warehouse:

Data Warehouse Architecture with Serverless SQL Pool

Unfortunately, unless one uses Spark tables, misuses views or adds an Azure SQL database to the architecture, there are no physical tables or materialized views in a serverless SQL pool. There's still the option to use data pipelines for regularly exporting intermediary data to files (incl. over partitions or folders), even if this involves more overhead, as it's not possible to export data over SQL syntax to the same files more than once (though this might change in the future). For certain scenarios it could be useful to store data in an Azure SQL or similar database, including a dedicated SQL pool.

Choosing between serverless and dedicated SQL pool is not an exclusive choice; both or all 3 types of pools (if we also consider the Spark pool) can be used in the architecture for addressing specific challenges, especially when we consider that there are important differences between the features available in each of the pools. Moreover, one can start the PoC based on the serverless SQL pool and, when the solution becomes mature enough and is used across the enterprise, parts of the logic or all of it can be migrated to a dedicated SQL pool. This allows saving costs at the beginning at the price of further effort later.

Talking about the physical storage, data engineers recommend defining within a Data Lake several layers (aka regions, zones) labeled as bronze, silver and gold (and probably platinum will join the club anytime soon). The bronze layer refers to the raw data available in the Data Lake, including the files on which the initial external tables are defined. The silver layer refers to transformed, cleaned, enriched and integrated data, data resulting from the second layer of views described above. The gold layer refers to the data to which business logic was applied and which were prepared for consumption, data resulting from the final layer of views. Of course, data pipelines can be used to prepare the data at these stages, though a view-based approach offers more flexibility and is easier to troubleshoot, manage and reuse than data pipelines.

Ideally the gold data should involve no or minimal further transformation before reaching the users, though that's not realistic. Building a DWH takes a considerable time and the business can't usually wait until everything is in place. Therefore, reports based on DWH will continue to coexist with reports directly accessing the source data, which will lead to controversies. Enforcing a single source of truth will help to minimize the gap, though will not eliminate it completely. 

Closing Notes

These are just the outlines of a minimal reference architecture. There's more to consider, as there are several alternatives (see [1] [2] [3] [4]) for each of the steps considered here, each technology, new feature or mechanism opening new opportunities. The advantages and disadvantages should always be considered against the business needs and requirements. One approach, even if recommended, might not work for all, though unless there's an important requirement or an opportunity associated with an additional technology, deviating from reference architectures might not be such a good idea after all.

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.


Resources:
[1] Microsoft Learn (2022) Modern data warehouse for small and medium business (link)
[2] Microsoft Learn (2022) Data warehousing and analytics (link)
[3] Microsoft Learn (2022) Enterprise business intelligence (link)
[4] Microsoft Learn (2022) Serverless Modern Data Warehouse Sample using Azure Synapse Analytics and Power BI (link)
[5] Coursera (2023) Data Warehousing with Microsoft Azure Synapse Analytics (link) [course, free to audit]
[6] SQLBits (2020) Mahesh Balija's Building Modern Data Warehouse with Azure Synapse Analytics (link)
[7] Matt How (2020) The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform (Amazon)
[8] James Serra's blog (2022) Data lake architecture (link)
[9] SQL Stijn (2022) SQL Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake (link)
[10] Solliance (2022) Azure Synapse Analytics Workshop 400 (link) [GitHub repository]

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

Data Warehousing

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed. 

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, taking thus advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.

Unfortunately, SQL can replace only the transformations that address sequential logic, not workflow-related logic (e.g. conditional splits, merges, multicasts, slowly changing dimensions) or logic that involves a certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic, or simply by using COTS solutions when they prove capable of filling the gap.

Copying the data 1:1 at table or entity-level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The usual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings).
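The conversion of date and numeric values transported as strings can be sketched as follows. The list of date formats and the separator handling are assumptions that need to be adapted to the actual sources; note that a value like 01/02/2020 is ambiguous, and the sketch simply takes the first matching format.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

# Assumed source formats, tried in order (ISO, German, US).
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def parse_date(text: str) -> date:
    """Convert a date string in one of the known formats to a date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {text!r}")


def parse_decimal(text: str, decimal_sep: str = ".") -> Decimal:
    """Convert a numeric string to Decimal, given its decimal separator.

    The other separator ("," or ".") is treated as thousands separator.
    """
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {text!r}") from exc
```

Using `Decimal` instead of floating-point numbers avoids losing precision when the values are amounts or quantities.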

Other sources of concern are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handling, and the invalid values, which need to be handled accordingly. Besides the data from the source system(s) and the derived values, depending on the case one needs to consider the parameter-based or hardcoded metadata created in the process.

Independently of the purpose of the ETL packages, it is usually required to document the data flow associated with them and the rules applied in transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes. Even if it’s almost impossible to document everything, at a minimum the source and destination tables, the attributes considered in the mappings, and the most important rules the business should be aware of need to be provided. Otherwise, the technical people can always turn back to the SQL queries, when needed.

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic required to move the data could be handled in theory with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic in several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weigh the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best.


28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

Data Warehousing

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers in theory the final steps from the data pipeline and in most of the cases it matches the definition of the query used for data extraction, though this depends also on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table from the staging area, a table with no constraints that matches only the data types from the source. Once the data are loaded, they are further copied into the production table. This approach minimizes the unavailability of the production table, as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center; however, loading the data first into a staging table facilitates troubleshooting and testing. This approach also allows dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving the performance of data loads.

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and, depending on the case, also additional metadata (e.g. identifier for source system, unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except for the unique identifier, these metadata can however also be saved in a separate table, either with the same granularity as the load table (1:1) or with one record for each load per table and system, storing a reference to the respective record in the load table. There are seldom logical arguments for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).
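An incremental load with load metadata stored directly in the load table can be sketched as below, with sqlite3 as a stand-in. The table `load_orders`, its columns and the watermark logic (latest `created_at` already loaded per source system) are assumptions; the timestamps are expected as ISO 8601 strings in the same format, so that they compare correctly as text.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Iterable, Optional


def create_load_table() -> sqlite3.Connection:
    """Load table holding the data plus the load metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE load_orders (
            order_id      INTEGER,
            created_at    TEXT,   -- creation timestamp from the source
            source_system TEXT,   -- metadata: identifier of the source system
            load_ts       TEXT    -- metadata: timestamp of the load
        )
        """
    )
    return conn


def incremental_load(conn: sqlite3.Connection, source_rows: Iterable[dict],
                     source_system: str, now: Optional[datetime] = None) -> int:
    """Load only the rows created after the latest row already loaded."""
    load_ts = (now or datetime.now(timezone.utc)).isoformat()
    last_loaded = conn.execute(
        "SELECT MAX(created_at) FROM load_orders WHERE source_system = ?",
        (source_system,),
    ).fetchone()[0]
    new_rows = [
        (row["order_id"], row["created_at"], source_system, load_ts)
        for row in source_rows
        if last_loaded is None or row["created_at"] > last_loaded
    ]
    conn.executemany("INSERT INTO load_orders VALUES (?, ?, ?, ?)", new_rows)
    conn.commit()
    return len(new_rows)
```

In a real scenario the filter would be applied in the extraction query itself, so that only the new records leave the source system.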

A special approach is considered by the Data Vault methodology for Data Warehousing, which seems to gain increasing acceptance, especially to address the various compliance requirements for tracking the changes in records at the most granular level. To achieve this, the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, while satellite tables store the descriptions of the business keys (the other attributes except the business key) and reference tables store the dropdown values. Besides the decomposition of the tables there are several other constraints that apply. The decomposition of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined; however, it might be the price to pay if traceability and auditability are a must.
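The split into hub, link and satellite tables can be sketched as follows, with sqlite3 as a stand-in. This is a simplified illustration, not a complete Data Vault implementation: the table and column names are hypothetical, the hash keys are an assumption (a plain SHA-1 over the business key), and a satellite row is written on every load rather than only when the attributes change.

```python
import hashlib
import sqlite3

DATA_VAULT_DDL = """
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,      -- hash of the business key
    customer_no   TEXT NOT NULL UNIQUE,  -- business key
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_no      TEXT NOT NULL UNIQUE,
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_customer_order (       -- relationship between business keys
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk      TEXT NOT NULL REFERENCES hub_order (order_hk),
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer (              -- descriptive attributes, one row per version
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_ts       TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_hk, load_ts)
);
"""


def hash_key(*parts: str) -> str:
    """Derive a deterministic key from one or more business key parts."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def create_vault() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(DATA_VAULT_DDL)
    return conn


def load_customer(conn, customer_no, name, city, load_ts, source) -> None:
    """Register the business key once and append a new version of its attributes."""
    hk = hash_key(customer_no)
    conn.execute("INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
                 (hk, customer_no, load_ts, source))
    conn.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
                 (hk, load_ts, name, city))
    conn.commit()


def current_customers(conn) -> list:
    """Join hub and satellite to return the latest version per customer."""
    return conn.execute(
        """
        SELECT h.customer_no, s.name, s.city
        FROM hub_customer h
        JOIN sat_customer s ON s.customer_hk = h.customer_hk
        WHERE s.load_ts = (SELECT MAX(load_ts) FROM sat_customer
                           WHERE customer_hk = h.customer_hk)
        ORDER BY h.customer_no
        """
    ).fetchall()
```

Even this small example shows the trade-off mentioned above: the history of changes is preserved in the satellite, while reading the current state already requires a join and a subquery.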

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables, whose format is already given, so that one needs to accommodate the data into the existing tables or extend the respective tables to accommodate more attributes. This is the case for load tables storing data from multiple systems with a similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

 


As part of the ETL process, with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios, the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is typically done based on SQL queries as long as one deals with relational databases or any OLEDB or ODBC-based data repositories, including flat or MS Office files.

One can consider the preparation of the extraction logic as a separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process, one identifies the source of the data in terms of systems, tables and attributes to be imported, as well as the joins, business and transformation rules that need to be applied to the data. This can involve reengineering the logic from the source system(s), as well as data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, possibly considering only the needed attributes to minimize the necessary space, loading time and the content’s complexity, even if this adds more effort in the design phase for identifying the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is negligible, copying the table 1:1 with all its attributes could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables for design or other reasons. Thus, the extraction query attempts to model the entity. This approach reduces to some degree the number of tables in the targeted solution, as well as the number of ETL packages involved, while providing a clear depiction of the entities involved.
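To make the second approach more tangible, here is a small sketch in Python with SQLite in which a product entity, split in the source over a base table and a translation table, is reassembled by the extraction query. The tables, columns and the language filter are assumptions made for illustration only; a real source system will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical source model: the product entity is split over two tables.
    CREATE TABLE src_product (product_id INTEGER PRIMARY KEY, product_number TEXT);
    CREATE TABLE src_product_translation (product_id INTEGER, language_id TEXT, name TEXT);
    INSERT INTO src_product VALUES (1, 'P-001'), (2, 'P-002');
    INSERT INTO src_product_translation VALUES (1, 'en-us', 'Bicycle'), (1, 'de', 'Fahrrad');
    """
)

# The extraction query models the entity: one row per product, with its English name
# where available. The LEFT JOIN keeps the products that have no translation yet.
EXTRACT_PRODUCT = """
    SELECT p.product_id, p.product_number, t.name
    FROM src_product p
    LEFT JOIN src_product_translation t
      ON t.product_id = p.product_id
     AND t.language_id = 'en-us'
    ORDER BY p.product_id
"""

products = conn.execute(EXTRACT_PRODUCT).fetchall()
```

With the first approach the two source tables would be copied as they are, and the join would be done later in the solution.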

A third approach is to extract the data as needed by the target system, possibly as a mix of master and transaction data, which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability to the detriment of design.

Unfortunately, there can be design constraints or choice considerations that could lead to a mix of these approaches. While the impact caused by mixing the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long as the considered data are disconnected from other data.

To reduce the redundancy of data, it’s recommended to set as a goal the creation of a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other hand, overlaps can be acceptable when the same data are available in several systems and the solution requires all the data to be available.

While the above approaches consider the vertical partitioning of the data, there can also be horizontal partitioning needs, especially when only a subset of the data is needed or when the data need to be partitioned based on a set of values. In addition, one might be forced to include transformation rules directly in the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.
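As a rough sketch of horizontal partitioning combined with a conversion rule in the extraction logic, the example below (Python with SQLite) extracts only the invoices of selected companies and converts a text amount to a number on the way. The table src_invoices, its columns and the company codes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src_invoices (invoice_id INTEGER, company TEXT, amount_text TEXT);
    INSERT INTO src_invoices VALUES (1, 'DE01', '100.50'),
                                    (2, 'US01', '75.00'),
                                    (3, 'DE01', ' 20 ');
    """
)


def extract_invoices(conn, companies):
    """Extract the rows of the given companies (horizontal partition), converting
    the amount from text to a number (conversion rule in the extraction)."""
    placeholders = ", ".join("?" for _ in companies)
    query = f"""
        SELECT invoice_id, company, CAST(TRIM(amount_text) AS REAL) AS amount
        FROM src_invoices
        WHERE company IN ({placeholders})
        ORDER BY invoice_id
    """
    return conn.execute(query, list(companies)).fetchall()


rows = extract_invoices(conn, ["DE01"])
```

The set of values is passed as parameters, so the same query can be reused for each partition.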


🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)


One of the interesting concepts that seems to gain adherents in Data Warehousing is the Data Vault, a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, which evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes that occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, and a series of standards, methods and best practices supposed to facilitate its use.
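The split into hubs, links and satellites can be illustrated with a toy model. The sketch below uses Python with SQLite; the table and column names, the sample data and the use of SHA-256 for deriving the hash keys are assumptions made for the example, not something prescribed by the specification as quoted here.

```python
import hashlib
import sqlite3


def hash_key(business_key):
    """Derive a surrogate key from a business key (the hashing choice is an assumption)."""
    return hashlib.sha256(business_key.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hubs: business keys plus load metadata
    CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_no TEXT UNIQUE,
                               load_ts TEXT, record_source TEXT);
    CREATE TABLE hub_order (order_hk TEXT PRIMARY KEY, order_no TEXT UNIQUE,
                            load_ts TEXT, record_source TEXT);
    -- Link: the relationship between two business keys
    CREATE TABLE link_customer_order (link_hk TEXT PRIMARY KEY,
                                      customer_hk TEXT REFERENCES hub_customer,
                                      order_hk TEXT REFERENCES hub_order,
                                      load_ts TEXT, record_source TEXT);
    -- Satellite: descriptive attributes, one row per change
    CREATE TABLE sat_customer (customer_hk TEXT REFERENCES hub_customer, load_ts TEXT,
                               name TEXT, city TEXT,
                               PRIMARY KEY (customer_hk, load_ts));
    """
)

c1, o1 = hash_key("C1"), hash_key("O1")
conn.execute("INSERT INTO hub_customer VALUES (?, 'C1', '2020-01-01', 'ERP')", (c1,))
conn.execute("INSERT INTO hub_order VALUES (?, 'O1', '2020-01-01', 'ERP')", (o1,))
conn.execute(
    "INSERT INTO link_customer_order VALUES (?, ?, ?, '2020-01-01', 'ERP')",
    (hash_key("C1|O1"), c1, o1),
)
# The customer moved: the change is tracked as a second satellite row.
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [(c1, "2020-01-01", "Contoso", "Berlin"), (c1, "2020-06-01", "Contoso", "Munich")],
)

# Getting the current view of a customer with its orders already needs three joins.
current = conn.execute(
    """
    SELECT h.customer_no, s.city, o.order_no
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    JOIN link_customer_order l ON l.customer_hk = h.customer_hk
    JOIN hub_order o ON o.order_hk = l.order_hk
    WHERE s.load_ts = (SELECT MAX(load_ts) FROM sat_customer
                       WHERE customer_hk = h.customer_hk)
    """
).fetchall()
history_rows = conn.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0]
```

Even in this tiny model one can see both sides: the full history of the customer is preserved in the satellite, while a simple question requires joining four tables.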

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, thus allowing the compliance requirements of Sarbanes-Oxley, HIPAA and Basel II to be addressed by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce or NoSQL can be integrated within its architecture, though the same can be said about other approaches as long as there’s compatibility between the considered technologies. In fact, except for the business entities’ decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables, as more joins are required to gather the data from the various tables. The further decomposition of tables can lead to higher data storage needs, though this can be negligible compared with the volume of additional objects that need to be created in the DWH. For an ERP system with a few hundred meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have a positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault are associated, as usual, with a poor understanding of the methodology, a poor skillset or the attempt to implement the methodology without allowing some flexibility when required. Unless one knows what one is doing, bringing more complexity into a field which is already complex can easily have a negative impact on the projects’ outcomes.


References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

27 November 2020

🧊Data Warehousing: ETL (Part II: An Introduction)

 


ETL (Extract, Transform, Load) processes, technologies or tools are about extracting data from one or more data sources via a set of queries, performing changes on the data via conversions, aggregations, mappings or other types of transformations, respectively loading the data into target tables or other type of repositories. Thus, an ETL process allows moving and transforming data between predefined data structures on an ad-hoc basis or as part of stable repetitive processes, which makes ETL ideal for data warehousing, data integrations, data migrations or similar scenarios. 

ETL Data Flow

Extract: The extraction of data is typically done based on SQL queries from relational databases or any OLEDB or ODBC-based data repositories, including flat or MS Office files, though modern ETL tools can support other types of queries (CAML, XQuery, DAX) or even NoSQL architectures (Hadoop). This allows addressing a wide range of requirements, the complexity of the logic depending on the functionality provided by the query languages, respectively the extraction functionality available.

Transform: The transformation logic can be implemented based on the functionality provided by the ETL tool, and can involve, depending on the case, any combination of aggregates, conditional splits, merges, lookups, multicasts, pivoting/unpivoting, cleansing, data conversions, sampling, mapping or any other transformations that can be performed on an in-transit dataset. On the other hand, quite often the same can be achieved with the help of SQL-based manipulations directly in the extraction logic or later in the process. SQL can occasionally prove to be faster and more flexible than the transformations provided by the ETL tool; however, despite the overlaps, the two approaches can complement each other when used adequately.

Load: The load is usually just a dump of the data into one or more final or intermediary tables with predefined structures. Unless the data don’t match the data types, formats or further defined constraints, the load seldom involves further challenges as long as the solution was designed adequately.
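The three steps can be sketched in a few lines of Python with SQLite. The source and target tables, the country mapping and the sample rows are invented for the example; an ETL tool would provide the same steps as configurable components rather than functions.

```python
import sqlite3

# Hypothetical mapping used by the transformation step
COUNTRY_MAP = {"DE": "Germany", "US": "United States"}


def extract(conn):
    """Extract: a plain SQL query against the source."""
    return conn.execute("SELECT country_code, amount FROM src_sales").fetchall()


def transform(rows):
    """Transform: cleanse the codes, map them to names and aggregate the amounts."""
    totals = {}
    for code, amount in rows:
        country = COUNTRY_MAP.get(code.strip().upper(), "Unknown")
        totals[country] = totals.get(country, 0.0) + float(amount)
    return sorted(totals.items())


def load(conn, rows):
    """Load: dump the rows into a target table with a predefined structure."""
    conn.executemany(
        "INSERT INTO tgt_sales_by_country (country, total_amount) VALUES (?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src_sales (country_code TEXT, amount REAL);
    CREATE TABLE tgt_sales_by_country (country TEXT PRIMARY KEY, total_amount REAL);
    INSERT INTO src_sales VALUES ('de', 10), ('DE ', 5.5), ('US', 7), ('xx', 1);
    """
)

load(conn, transform(extract(conn)))
result = conn.execute(
    "SELECT country, total_amount FROM tgt_sales_by_country ORDER BY country"
).fetchall()
```

The same aggregation could have been written as a GROUP BY in the extraction query, which is the overlap between tool-based and SQL-based transformations mentioned above.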

Within the logical model, extract, transform and load can be considered as processes in themselves. Within the object model provided by the ETL tool, they are considered in the mentioned sequence within a data flow, which, within a set of workflow constraints, defines how the data move through the pipeline, that is the sequence of processing steps considered. The basic unit of work is the data flow and the workflow it belongs to, a unit that can be encapsulated in one container for easier management or simply convenience. Several containers can be linked within a workflow to create more complex behavior.

The data flows and workflow constraints, together with the supporting connections and containers form an ETL package, the main unit of work for encapsulating and running ETL logic. ETL packages are scheduled and run as fit for the purpose.
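As a loose analogy for how data flows and workflow constraints form a package, the sketch below models a package as a set of named data flows plus precedence constraints, and runs the flows in an order that respects the constraints. It uses Python's standard graphlib module (Python 3.9 or later); the function run_package and the flow names are hypothetical and don't mirror the object model of any particular ETL tool.

```python
from graphlib import TopologicalSorter


def run_package(data_flows, precedence):
    """Run the data flows in an order that satisfies the precedence constraints.

    data_flows: dict mapping a flow name to a callable
    precedence: dict mapping a flow name to the set of flows that must finish first
    """
    graph = {name: precedence.get(name, set()) for name in data_flows}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        data_flows[name]()
        executed.append(name)
    return executed


log = []
data_flows = {
    "build_sales": lambda: log.append("sales built"),
    "load_customers": lambda: log.append("customers loaded"),
    "load_orders": lambda: log.append("orders loaded"),
}
# build_sales may start only after both loads have finished.
precedence = {"build_sales": {"load_customers", "load_orders"}}

order = run_package(data_flows, precedence)
```

A cyclic set of constraints makes TopologicalSorter raise an error, which is a reasonable behavior for a package whose workflow can't be executed.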

With the right design, these building blocks allow enough flexibility in handling ad-hoc requests or building complex solutions. This involves decisions on how to partition the ETL packages, respectively the data flows, in which order they should be run, where and in which sequence the data should be transformed, how to handle exceptions, whether and how to build intermediary data repositories, how to handle audit requirements, and so on. Each of these choices can prove to be important.

The knowledge of the ETL architecture and functionality is essential in providing the right solution for the problem considered. However, once the basics are understood, the challenges typically reside in understanding the source and/or target structures and the logical and physical entities available, and in identifying the way the data can be partitioned horizontally or vertically, respectively what type of transformations are required for moving the data, as required by the solution.


31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)


One can consider a data lake as a repository of all of an organization’s data found in raw form, however this constraint might be too harsh as the data found at different levels of processing can be imported as well, for example the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. The objective-oriented, limited and single-purposed role of the data mart/warehouse is thus contrasted with the flow of data in nature that could be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, of which the data mart or warehouse is an integral part. Imposing further constraints seems to have no benefit.

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road, which requires specific knowledge about the techniques used, as well as about the organization’s processes and data. Without this, data lake-based solutions can lead to erroneous results, the same as mixing several ingredients without having knowledge about their usage can lead to cooking experiments far from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively degrees of quality in respect to the data quality dimensions implied and sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), this applying to data warehouses/marts as well as to data lake-based solutions.

Data found in raw form don’t necessarily represent the truth and don’t necessarily acquire a good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity at various levels and layers is maybe the most important demand upon solutions.

Snapshots, as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, often prove to be the best mechanisms for keeping data integrity when this aspect is essential to their processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader the span of the solutions over the sources, the more difficult it becomes to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means integrity is harder to obtain, especially when there’s no timestamp to allow data partitioning on a time scale, the handling of data integrity becoming thus in extremis a programmer’s task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.

Further on, maintaining data integrity can prove to be a matter of design in respect not only to the processing of data, but also in respect to the source applications and the business processes they implement. The mastery of the underlying principles, techniques, patterns and methodologies, helps in the process of designing the right solutions.

Note:
Written as an answer to a Medium post on data lakes and batch processing in data warehouses.

24 May 2020

🧊🎡☯Data Warehousing: SQL Server Integration Services (The Good, the Bad and the Ugly)


Microsoft SQL Server Integration Services (SSIS) is a platform for building (enterprise-level) data integration and data transformation solutions by using a rich set of built-in tasks and transformations, graphical tools for building packages, respectively a catalog for storing the packages. Formerly called Data Transformation Services (DTS), it was introduced with SQL Server 7.0 and was rebranded as SSIS with SQL Server 2005.

The Good: Since its introduction it was adopted by DBAs and (database) programmers because it allowed the import and export of data on the fly from and to SQL Server, flat files and other relational data sources, in fact any resource exposing a driver for the ODBC or OLEDB libraries. The extract/load functionality was extended by a basic set of transformations, making DTS an ideal ETL tool for data warehousing and integrations. The data from multiple sources and targets could be processed in parallel or sequentially, the ETL logic being encapsulated in one or more packages that could be run manually or scheduled flexibly via the SQL Server Agent.

With SQL Server 2005 and further versions the SSIS framework was extended to support further data sources including XML, CAML-based SharePoint lists, OData, Hadoop or Azure Blob storage. It allowed the storage of packages on the local storage or within the built-in catalog.

One could thus develop rich ETL functionality without writing a single line of code. In theory the packages could be run and modified also by non-IT users, which can be a plus in certain scenarios. On the other hand, one could build custom packages programmatically from the beginning, and thus extend the available data processing logic as seemed fit, being able to use existing code and whole libraries embedded into the packages or run via DLL calls.

The Bad: Despite the rich functionality, a data pipeline usually has a lower performance and is more difficult to troubleshoot compared with the built-in RDBMS functionality for data processing. Most, if not all transformations can be handled more efficiently via SQL-based queries as long as the data are available on the same SQL Server instance. In addition, SQL provides better code reuse, maintainability, chances for refactoring and scalability, and the solutions are easier to deploy. Therefore, one practice comes down to using SSIS only for import/export, the further logic being encapsulated into stored procedures and further database objects. This isn’t necessarily bad, on the contrary, though specific expertise is then needed to modify the code.

The Ugly: SSIS is in general suitable for data warehousing and integration solutions whose logic is ideally stable and well-defined. Therefore, SSIS is less suitable for ERP data migrations or similar tasks which, at least at the beginning, have an exploratory nature and an overwhelming complexity, multiple iterations being needed before the requirements are fully identified and understood. In extremis each iteration can involve a redesign, which can prove to be time-consuming. One could in theory attempt first to understand all the data, though this could mean starting the development late in the process, while the data for testing are required much earlier. One can still use SSIS for specific tasks, though implementing a whole solution could imply certain challenges that otherwise could have been avoided.

SSIS is not suitable for real-time complex data integrations which require the processing of a considerable amount of data, for which specific architectures like SOA, RESTful calls or other solutions could be more efficient. When not adequately implemented, a data integration can lead to more problems than it can solve. The best example is the increase in execution time with the volume of data, which can easily lead to time-outs and locking of data.

10 May 2019

🧊💫Data Warehousing: Architecture (Part II: Data Warehousing and Microsoft Dynamics 365)


With Dynamics 365 (D365) Online Microsoft made an important strategic move on the ERP market, however in what concerns the BI & Data Warehousing (BI/DW) area Microsoft changed the rules of the game by allowing no direct SQL access to the production environment. This primarily means that it will become challenging for organizations to use the existing DW infrastructure to access the D365 data, and for Vendors and Service Providers to provide BI/DW solutions integrated within the D365 platform.

D365 includes its own data warehouse (actually a data mart) designed for financial reporting; however, as of now, it can’t be extended to support other business areas. The solution favored by Microsoft for DW seems to be the use of an Azure SQL Database aka BYOD (Bring Your Own Database) to which entity-based data can be exported incrementally (aka incremental push) or fully (aka full push) via the Data Management Framework (DMF) packages.

Because many of the D365 tables (e.g. Inventory Transactions, Products, Customers, Vendors) were overnormalized over the years and other tables were added as part of new functionality, Microsoft introduced, to hide this complexity, a new layer of abstraction formed from data entities organized within an entity store. Data entities are view-like encapsulations of the underlying D365 table schema, the data import/export from and to D365 being performed extensively over these data entities via the DMF, which extends the Data Import/Export Framework (DIXF).

One can thus use a BYOD as a direct source for other reporting tools as long as they support a connection to Azure; otherwise the data can be further loaded into a database in the cloud, which seems to be the best option until now, as long as the organization has other data that need to be consolidated for reporting. From here on, one deals with the traditional way of reporting and the available infrastructure can be extended to use an additional data source.

The BYOD solution comes with several restrictions: a package needs to be created for each business unit, no composite data entities can be exported, data entities that don’t have a unique key can’t be exported via an incremental push, data entities can change over time (new versions being available), while during synchronization there should be no active locks on the database. In addition, organizations which followed this path also report some bugs that needed to be addressed via Microsoft support. Even if the about 1700 available data entities facilitate data consumption to some degree, they seem to be more appropriate for data migrations and data integrations than for DW workloads.
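Why an incremental push depends on a unique key can be shown with a generic sketch. The example below (Python with SQLite) is not how the DMF works internally, which isn't described here; the table byod_customers and both functions are hypothetical and only contrast a full replacement of the target with a key-based merge of the changed rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key is what allows a changed row to find its earlier version.
conn.execute("CREATE TABLE byod_customers (customer_id TEXT PRIMARY KEY, name TEXT)")


def full_push(conn, rows):
    """Replace the whole content of the target table."""
    conn.execute("DELETE FROM byod_customers")
    conn.executemany("INSERT INTO byod_customers (customer_id, name) VALUES (?, ?)", rows)
    conn.commit()


def incremental_push(conn, changed_rows):
    """Merge only the changed rows: existing keys are overwritten, new keys are added."""
    conn.executemany(
        "INSERT OR REPLACE INTO byod_customers (customer_id, name) VALUES (?, ?)",
        changed_rows,
    )
    conn.commit()


full_push(conn, [("C1", "Contoso"), ("C2", "Fabrikam")])
incremental_push(conn, [("C2", "Fabrikam Inc."), ("C3", "Adventure Works")])

customers = conn.execute(
    "SELECT customer_id, name FROM byod_customers ORDER BY customer_id"
).fetchall()
```

Without the key the second call could only append rows, leaving both versions of C2 in the target, which is presumably the reason behind the restriction.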

In the absence of direct SQL connectivity, organizations can in theory still use SSIS or similar integration tools to connect to D365 production environments and consume data entities via the Open Data Protocol (OData), a standard that defines a set of best practices for building and consuming RESTful APIs. Besides some architectural challenges, loading big tables with transactional data is reportedly slow and impractical for loading a data warehouse. Therefore, the usability of such an architecture becomes limited in time.

Microsoft imposed a hard limitation upon its D365 architecture by making its production database inaccessible. Of course, there’s still time for Microsoft to do some magic and pull new solutions from the technology stack hat. Unfortunately, the constraints imposed to the production environments limit organizations’ choices of building a modern and flexible data warehouse. For the future it would be great if the DMF could be used directly with standard SQL Server databases, avoiding thus the need for the intermediary Azure database, or if a real-time operational solution could be provided out-of-the-box. We’ll see what the future brings...

28 February 2017

🧊Data Warehousing: Data Load Optimization (Part I: A Success Story)


Introduction

This topic has been waiting in the queue for almost two years already - since I finished optimizing an already existing relational data warehouse within a SQL Server 2012 Enterprise Edition environment. Through various simple techniques I managed then to reduce the running time for the load process by more than 65%, from 9 to 3 hours. It’s a considerable performance gain, considering that I didn’t have to refactor any business logic implemented in queries.

The ETL (Extract, Transform, Load) solution was making use of SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from staging further into base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging tables into the base tables. It’s the simplest and a relatively effective ETL design, which I also used with small alterations for other data warehouse solutions. For months the data load worked smoothly, until data growth and possibly other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

Using SSIS to bulk load data into SQL Server provides a good balance of performance and flexibility. Within a Data Flow, when the “Table Lock” property on the destination is checked, the inserted records can be minimally logged, which can speed up the load by about a factor of two. The TABLOCK hint can also be used for other insert operations performed outside of SSIS packages. In this case the movement of data from staging into base tables was performed in plain T-SQL, outside of SSIS packages, and further data processing benefited from the change as well. This optimization step alone provided a 30-40% performance gain.
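For the inserts done in plain T-SQL, the hint is placed right after the target table name. The small Python helper below only builds such a statement as text, since running it requires a SQL Server instance; the function name and the table and column names are hypothetical. Whether the insert is actually minimally logged depends on further conditions as well, such as the database's recovery model and the state of the target table.

```python
def build_tablock_insert(target, source, columns):
    """Build a T-SQL INSERT ... SELECT statement that uses the TABLOCK hint on the target."""
    column_list = ", ".join(columns)
    return (
        f"INSERT INTO {target} WITH (TABLOCK) ({column_list})\n"
        f"SELECT {column_list}\n"
        f"FROM {source};"
    )


# Hypothetical staging and base tables
statement = build_tablock_insert(
    "dbo.BaseSales", "stg.Sales", ["SaleId", "CustomerId", "Amount"]
)
print(statement)
```

The generated statement can be run from a stored procedure or from an Execute SQL task, outside of the data flow.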

Dropping/Recreating the Indexes on Big Tables

As the base tables had several indexes each, it proved beneficial to drop the indexes for the big tables (e.g. with more than 1,000,000 records) before loading the data into the base tables, and to recreate the indexes afterwards. This was done within SSIS, and provided an additional 20-30% performance gain compared with the previous step.
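The pattern itself is independent of the database engine and can be sketched with SQLite in Python. The table and index names below are made up, and the sketch says nothing about the actual gain, which depends on the engine, the number of indexes and the data volume.

```python
import sqlite3


def load_with_index_rebuild(conn, rows):
    """Drop the index, bulk load the rows, then recreate the index in one pass."""
    conn.execute("DROP INDEX IF EXISTS ix_base_sales_customer")
    conn.executemany(
        "INSERT INTO base_sales (sale_id, customer_id, amount) VALUES (?, ?, ?)", rows
    )
    conn.execute("CREATE INDEX ix_base_sales_customer ON base_sales (customer_id)")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE base_sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE INDEX ix_base_sales_customer ON base_sales (customer_id);
    """
)

rows = [(i, i % 100, float(i)) for i in range(1, 10001)]
load_with_index_rebuild(conn, rows)

index_names = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'base_sales'"
    )
]
row_count = conn.execute("SELECT COUNT(*) FROM base_sales").fetchone()[0]
```

During the load the index doesn't need to be maintained row by row, at the price of the table being without it until the load finishes.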

Consolidating the Indexes

Adding missing indexes, removing or consolidating (overlapping) indexes are typical index maintenance tasks, apparently occasionally ignored. It doesn’t always bring much performance as compared with the previous methods, though dropping and consolidating some indexes proved to be beneficial as fewer data were maintained. Data processing logic benefited from the creation of new indexes as well.

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load was hardly taking advantage of the processing power available on the server. Even if queries could use parallelism, the benefit was minimal. Enabling the packages to run in parallel brought an additional performance gain, however this reduced the availability of processing resources for other tasks. When the data load is performed overnight, this causes minimal overhead, however it should be avoided when the data are loaded during business hours.
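The idea of running independent packages in parallel while capping the number of concurrent ones can be sketched with Python's standard thread pool. The packages below are stand-ins that only sleep and return a row count; run_packages and make_package are hypothetical helpers, not part of SSIS.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_package(seconds, rows):
    """Create a stand-in for a package: it waits a while and returns a row count."""
    def run():
        time.sleep(seconds)  # stands in for the actual extract and load work
        return rows
    return run


def run_packages(packages, max_workers):
    """Run independent packages in parallel, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(func) for name, func in packages.items()}
        return {name: future.result() for name, future in futures.items()}


packages = {
    "load_customers": make_package(0.1, 100),
    "load_products": make_package(0.1, 250),
    "load_orders": make_package(0.1, 5000),
}

results = run_packages(packages, max_workers=2)
```

The max_workers parameter plays the role of a throttle: a lower value leaves more processing resources for other tasks, which matters when the load overlaps with business hours.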

Using Clustered Indexes

In my analysis I found out that many tables, especially the ones storing prepared data, were lacking a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don’t need clustered indexes in SQL Server. Of course clustered indexes have downsides (e.g. fragmentation, excessive key-lookups), though their benefits exceed the downsides by far. Besides the missing clustered indexes, there were cases in which the tables would have benefited from having a narrow clustered index instead of a wide multicolumn clustered index. Such cases were addressed as well, where appropriate.

Removing the Staging Tables

Given the fact that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations and conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from the source tables into the staging tables, and the load from the staging tables into the base tables (with the TABLOCK hint) were taking almost the same amount of time. This means that the base tables would be unavailable for about the same amount of time if the data were loaded directly from the sources into the base tables. Therefore, one could in theory remove the staging tables from the architecture. Frankly, one should think twice before making such a change, as there can be further implications in time. Even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

Reducing the data volume was identified as a possible further technique to reduce the amount of time needed for data loading. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen for example that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can have up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to waste of resources and bigger load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.
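Limiting the transactional data to a rolling timeframe usually comes down to computing a cutoff date and using it as a filter in the extraction query. A minimal sketch in Python with SQLite follows; the table src_transactions, the sample dates and the choice of cutting at the beginning of a calendar year are assumptions made for the example.

```python
import sqlite3
from datetime import date


def cutoff_date(today, years_back):
    """Return the first day of the year lying years_back years before today."""
    return date(today.year - years_back, 1, 1)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src_transactions (trans_id INTEGER, trans_date TEXT);
    INSERT INTO src_transactions VALUES (1, '2010-05-01'), (2, '2014-12-31'),
                                        (3, '2015-01-01'), (4, '2016-07-15');
    """
)

# With a load run on 28 February 2017 and two years back, the cutoff is 1 January 2015.
cutoff = cutoff_date(date(2017, 2, 28), years_back=2)
rows = conn.execute(
    "SELECT trans_id FROM src_transactions WHERE trans_date >= ? ORDER BY trans_id",
    (cutoff.isoformat(),),
).fetchall()
```

Keeping the number of years as a parameter makes it easy to adjust the timeframe when the reporting requirements change.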

There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete, yet they continue to be loaded into the data warehouse. Such cases occur seldom, though they occur. Also, some unused or redundant columns could have been removed from the packages.

Further Thoughts

There are further techniques to optimize the data load within a data warehouse, like partitioning large tables, using columnstore indexes or optimizing the storage; however, my target was to obtain a sufficient performance gain with a minimum of effort and design changes. Therefore, I stopped when I considered that the amount of effort was considerably higher than the performance gain.

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer (link)
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson (link)
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers (link)
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns (link)
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan (link)
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse (link)

25 March 2010

🧊Data Warehousing: Mea Culpa (Part I: A Personal Journey)



Any discussion on data warehousing topics, even an unconventional one, can’t avoid mentioning the two most widely adopted concepts in data warehousing, B. Inmon’s vs. R. Kimball’s methodologies. A lot of ink has already been consumed on this topic and it is difficult to come up with something new, however I can add my experience and personal views on the topic. From the beginning I have to state that I can’t take either of the two sides because from a philosophical viewpoint I am an adept of “the middle way” and, in addition, when choosing a methodology we have to consider the business’ requirements and objectives, the infrastructure, the experience of the resources, and many other factors. I don’t believe one method fits all purposes, therefore some flexibility is needed in this regard even from the most virulent advocates. After all, what counts in the end is the degree to which the final solution fits the purpose, and no matter how complex and perfect a methodology is, no matter the precautions taken, given the complexity of software development projects there is always the risk of failure.

  

B. Inmon defines the data warehouse as a “subject-oriented, integrated, non-volatile and time-variant collection of data in support of the management’s decisions” [3]: subject-oriented because it is focused on an organization’s strategic subject areas, integrated because the data come from multiple legacy systems in order to provide a single overview, time-variant because the data warehouse’s content is time dependent, and non-volatile because in theory the data warehouse’s content is not updated but refreshed.


Within my small library and the internet articles I read on this topic, especially the ones from the Kimball University cycle, I can’t say I found a similarly direct definition of the data warehouse given by R. Kimball. The closest I could get to something in this direction is the data warehouse as a union of data marts, where in his definition a data mart is “a process-oriented subset of the overall organization’s data based on a foundation of atomic data, and that depends only on the physics of the data-measurement events, not on the anticipated user’s questions” [2]. This also reflects an important difference between the two approaches: in Inmon’s philosophy the data marts are updated through the data warehouse, the data in the warehouse being stored in 3rd normal form, while the data in the data marts are multidimensional and thus denormalized.


Even if it’s a nice conceptual tool intended to simplify data manipulation, I can’t say I’m a big fan of dimensional modeling, mainly because it can be easily misused to create awful (inflexible) monster models that can barely be used, and sometimes it is impossible to work around them without a redesign. Relational models can also be easily misused, though they are less complex as physical design, easier to model, and they offer greater flexibility, even if in theory the normalization of data could add further complexity. However, there is always a trade-off between flexibility, complexity, performance, scalability, usability and reusability, to mention just a few of the dimensions associated with data in general and data quality in particular.

  

In order to overcome dimensional modeling issues, R. Kimball recommends a four-step approach: first identifying the business processes corresponding to a business measurement or event, secondly declaring the grain (level of detail), and only after that defining the dimensions and the facts [1]. I have to admit that starting from the business process adds a plus to this framework because in theory it allows better visibility over the processes, supporting process-based data analysis. However, given that a process could span multiple data elements, or that multiple processes could partition the same data elements, this increases the complexity of such models. I find that a model based directly on the data elements allows more flexibility at the expense of the work needed to bring the data together, though such a model should also cover the processes in scope.
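As an illustration only, the four steps could be sketched for a small sales scenario. The business process (sales order fulfilment), the grain (one row per sales order line), and all table and column names (`DimDate`, `DimProduct`, `FactSalesOrderLine`) are assumptions made up for this example, and SQLite is used just to keep the sketch self-contained.

```python
import sqlite3

# Step 1 (assumed): business process = sales order fulfilment.
# Step 2 (assumed): grain = one row per sales order line.
con = sqlite3.connect(":memory:")
con.executescript("""
-- Step 3: dimensions describing the context of each order line
CREATE TABLE DimDate (
    DateKey INTEGER PRIMARY KEY,
    CalendarDate TEXT,
    CalendarYear INTEGER
);
CREATE TABLE DimProduct (
    ProductKey INTEGER PRIMARY KEY,
    ProductName TEXT,
    Category TEXT
);
-- Step 4: facts measured at the declared grain
CREATE TABLE FactSalesOrderLine (
    OrderNumber TEXT,
    LineNumber INTEGER,
    DateKey INTEGER REFERENCES DimDate (DateKey),
    ProductKey INTEGER REFERENCES DimProduct (ProductKey),
    Quantity INTEGER,
    Amount REAL,
    PRIMARY KEY (OrderNumber, LineNumber)
);
-- hypothetical sample data
INSERT INTO DimDate VALUES (20100318, '2010-03-18', 2010), (20100325, '2010-03-25', 2010);
INSERT INTO DimProduct VALUES (1, 'Bike', 'Vehicles'), (2, 'Helmet', 'Accessories');
INSERT INTO FactSalesOrderLine VALUES
    ('SO1', 1, 20100318, 1, 2, 1000.0),
    ('SO1', 2, 20100318, 2, 2, 80.0),
    ('SO2', 1, 20100325, 2, 1, 40.0);
""")


def sales_by_category(connection):
    """Aggregate the facts along the product dimension."""
    return connection.execute("""
        SELECT p.Category, SUM(f.Quantity) AS Quantity, SUM(f.Amount) AS Amount
        FROM FactSalesOrderLine f
             JOIN DimProduct p ON f.ProductKey = p.ProductKey
        GROUP BY p.Category
        ORDER BY p.Category
    """).fetchall()


for row in sales_by_category(con):
    print(row)
```

Declaring the grain before the facts is what keeps the measures additive: every row of the fact table describes exactly one order line, so quantities and amounts can be summed along any dimension without double counting.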

  

Building a data warehouse is quite a complex task, especially if we take into consideration the huge percentage of software project failures, which holds also in the data warehousing area. On the other side, I’m not sure how much such statistics about software project failures can be taken ad litteram, because different project methodologies and data collection methods are used, and detailed information about the particularities of each project is not always given. It would be interesting, however, to know the failure rate per methodology. Occasionally some numbers are advanced to support the benefit of using one or another methodology; even ignoring the subjective character of such justifications, they often lack adequate details to support them.


My first contact with building a data warehouse was almost 8 years ago, when the Asset Management System project I was supposed to work on also included the creation of a data warehouse. Frankly, few things are scarier than seeing two IT professionals fighting over what approach to use in order to design a data warehouse, and it is needless to say that the fight lasted for several days, with calls with the customer, nerves, management involvement, a whole arsenal of negotiations that looked like a never-ending story.


Such fights are sometimes part of the landscape and they should be avoided, the simplest alternative being to put together the advantages and disadvantages of the most important approaches and balance between them. Unfortunately, there are still professionals who don’t know how or are not willing to do that. The main problem in such cases is the time, which instead of being used constructively is wasted on futile fights. When a lot of time is wasted and a tight schedule applies, one is forced to do the whole work in less time, which may lead to sloppy solutions.

  

A few years back I had the occasion to develop a data warehouse around the two ERP systems and the other smaller systems that one of the customers I worked for had in place, SQL Server 2000 and its DTS (Data Transformation Services) functionality being of great help for this purpose. Even if I had some basic knowledge of the two data warehousing approaches, I had to build the initial data warehouse from scratch, evolving the initial solution over several years.


The design was quite simple: the DTS packages extracted the data from the legacy systems and dumped them into staging tables in normalized or denormalized form and, after several simple transformations, loaded the data into the production tables. The role of the multidimensional data marts was played successfully by views that scaled pretty well to the existing demands. Maybe many data warehouse developers would disregard such a solution, though it was quite a useful exercise and helped me later to understand more easily the literature on this topic and the issues related to it. In addition, while working on the data conversion for two ERP implementations I had to perform more complex ETL (Extract, Transform, Load) tasks than the ones considered in the data warehouse itself.
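The staging-to-production flow described above can be sketched roughly as follows. This is not the original DTS solution, only a minimal stand-in built on SQLite; the tables `StagingOrders` and `Orders`, the view `vSalesByCustomer`, the sample rows and the chosen transformations (trimming, upper-casing, type conversion) are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- staging table: raw data as extracted from the source, everything as text
CREATE TABLE StagingOrders (OrderNumber TEXT, CustomerName TEXT, OrderDate TEXT, Amount TEXT);
-- production table: cleaned and typed data
CREATE TABLE Orders (OrderNumber TEXT PRIMARY KEY, CustomerName TEXT, OrderDate TEXT, Amount REAL);
-- view playing the role of a simple data mart
CREATE VIEW vSalesByCustomer AS
SELECT CustomerName, COUNT(*) AS NumberOrders, SUM(Amount) AS TotalAmount
FROM Orders
GROUP BY CustomerName;
""")


def load_orders(connection, extracted_rows):
    """Dump the extracted rows into staging, then transform and load them into production."""
    with connection:  # commit on success, roll back on error
        connection.execute("DELETE FROM StagingOrders")
        connection.executemany(
            "INSERT INTO StagingOrders VALUES (?, ?, ?, ?)", extracted_rows)
        # simple transformations: trim keys, normalize names, convert amounts,
        # skip rows without an order number
        connection.execute("""
            INSERT OR REPLACE INTO Orders (OrderNumber, CustomerName, OrderDate, Amount)
            SELECT TRIM(OrderNumber), UPPER(TRIM(CustomerName)), OrderDate, CAST(Amount AS REAL)
            FROM StagingOrders
            WHERE TRIM(OrderNumber) <> ''
        """)


def sales_by_customer(connection):
    """Read the 'data mart' view."""
    return connection.execute(
        "SELECT CustomerName, NumberOrders, TotalAmount "
        "FROM vSalesByCustomer ORDER BY CustomerName").fetchall()


# hypothetical rows as they might come from a legacy system
extracted_rows = [
    (" SO1 ", " contoso ", "2010-03-01", "100.50"),
    ("SO2", "Contoso", "2010-03-02", "50"),
    ("", "Orphan", "2010-03-03", "10"),
    ("SO3", "fabrikam", "2010-03-04", "20"),
]
load_orders(con, extracted_rows)
print(sales_by_customer(con))
```

Because the load uses `INSERT OR REPLACE` on the order number, running it again with the same extract does not duplicate rows, which is convenient for a load that is scheduled repeatedly.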


When it comes to software development I am an adept of rapid evolutional prototyping because it allows getting customers’ feedback in early stages, which makes it possible to identify earlier the issues as perceived by the customers, and in addition allows customers to get a feeling of what’s possible and what the application looks like. The prototyping method proved to be useful most of the time, I would actually say every time, and it was often interesting to see how customers’ conceptualization of what they need changed with time, changes that looked simple leading to a partial redesign of the application. In other development approaches with long release cycles (e.g. waterfall) the customer gets a glimpse of the application late in the process, when it is often impossible to redesign the application, so the customer has to live with what he got. Call me “old-fashioned”, but I am an adept of rapid evolutional prototyping also when it comes to the creation of data warehouses, and even if people might argue that a data warehousing project is totally different from a typical development project, it should not be forgotten that almost all software development projects share many particularities, from planning to deployment and further to maintenance.


Even if B. Inmon also embraces the evolutional/iterative approach in building a data warehouse, from a philosophical standpoint I feel that rapid evolutional prototyping applied to data warehouses is closer to R. Kimball’s methodology, which comes down to choosing a functional key area and its essential business processes, building a data mart and, starting from there, building other data marts for the other functional key areas, eventually integrating and aligning them into a common solution, the data warehouse. On the other side, when designing a software component or a module of an application you also have to consider the final goal, as the respective component or module will be part of a broader system, even if in some cases it could exist in isolation. The same can be said about the creation of data marts: even if sometimes a data mart is rooted in the needs of a department, you also have to look at the final goal and address the requirements from that perspective, or at least be aware of them.




References:

[1] M. Ross, R. Kimball (2004) Fables and Facts: Do you know the difference between dimensional modeling truth and fiction? [Online] Available from: http://intelligent-enterprise.informationweek.com/info_centers/data_warehousing/showArticle.jhtml;jsessionid=530A0V30XJXTDQE1GHPSKH4ATMY32JVN?articleID=49400912 (Accessed: 18 March 2010)

[2] R. Kimball, J. Caserta (2004) The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley Publishing Inc. ISBN: 0-7645-7923-1

[3] W.H. Inmon (2005) Building the Data Warehouse, 4th Ed. Wiley Publishing. ISBN: 978-0-7645-9944-6


About Me

Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.