SQL Troubles: December 2020

31 December 2020

📊Graphical Representation: Graphics We Live by (Part V: Pie Charts in MS Excel)

Graphical Representation Series

From business dashboards to newspapers and other forms of content that capture the attention of average readers, pie charts seem to be one of the most used forms of graphical representation. Unfortunately, their characteristics make them inappropriate for displaying certain types of data, and of being misused. Therefore, there are many voices who advice against using them for any form of display.

It’s hard to agree with radical statements like ‘avoid (using) pie charts’ or ’pie charts are bad’. Each form of graphical representation (aka graphical tool, graphic) has advantages and disadvantages, which makes it appropriate or inappropriate for displaying data having certain characteristics. In addition, each tool can be easily misused, especially when basic representational practices are ignored. Avoiding one representational tool doesn’t mean that the use of another tool will be correct. Therefore, it’s important to make people aware of these aspects and let them decide which tools they should use.

From a graphical tool is expected to represent and describe a dataset in a small area without distorting the reality, while encouraging the reader to compare the different pieces of information, when possible at different levels of details [1] or how they change over time. As form of communication, they encode information and meaning; the reader needs to be able to read, understand and think critically about graphics and data – what is known as graphical/data literacy.

A pie chart consists of a circle split into wedge-shaped slices (aka edges, segments), each slice representing a group or category (aka component). It resembles with the spokes of a wheel, however with a few exceptions they are seldom equidistant. The size of each slice is proportional to the percentage of the component when compared to the whole. Therefore, pie charts are ideal when displaying percentages or values that can be converted into percentages. Thus, the percentages must sum up to 100% (at least that’s readers’ expectation).

Within or besides the slices are displayed components’ name and sometimes the percentages or other numeric or textual information associated with them (Fig. 1-4). The percentages become important when the slices seem to be of equal sizes. As long the slices have the same radius, comparison of the different components resumes in comparing arcs of circles or the chords defined by them, thing not always straightforward. 3-dimensional displays can upon case make the comparison more difficult.

The comparison increases in difficulty with the number of slices increases beyond a certain number. Therefore, it’s not recommended displaying more than 5-10 components within the same chart. If the components exceed this limit, the exceeding components can be summed up within an “other” component.

Within a graphic one needs a reference point that can be used as starting point for exploration. Typically for categorical data this reference point is the biggest or the smallest value, the other values being sorted in ascending, respectively descending order, fact that facilitates comparing the values. For pie charts, this would mean sorting the slices based on their sizes, except the slice for “others” which is typically considered last.

The slices can be filled optionally with meaningful colors or (hashing) patterns. When the same color pallet is used, the size can be reflected in colors’ hue, however this can generate confusion when not applied adequately. It’s recommended to provide further (textual) information when the graphical elements can lead to misinterpretations.

Pie charts can be used occasionally for comparing the changes of the same components between different points in time, geographies (Fig. 5-6) or other types of segmentation. Having the charts displayed besides each other and marking each component with a characteristic color or pattern facilitate the comparison.

Previous Post <<||>> Next Post

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed.

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, taking thus advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.

Unfortunately, SQL can replace the transformations that address sequential logic and not workflow-related logic (e.g. conditional splitd, merges, multicasts, slowly changing dimensions) or logic that includes certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic or simple use COTS solutions, when they prove capable of filling the gap.

Copying the data 1:1 at table or entity-level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The casual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings).

Other sources for concerns are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handing, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, upon case one needs to consider the parameter-based or hardcoded metadata created in the process.

Independently of the purpose of the ETL packages it is usually required to document the data flow associated with them and the rules applied in transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes. Even if it’s almost impossible to document everything, at minimum needs to be provided the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always turn back to the SQL queries, when needed.

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic requested to move the data could be handled in theory with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic in several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weight the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best.

Previous Post <<||>> Next Post

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

As part of the ETL process, the Load subprocess is responsible for loading the data into the destination table(s). It covers in theory the final steps from the data pipeline and in most of the cases it matches the definition of the query used for data extraction, though this depends also on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table from the staging area, table with no constraints that matches only the data types from the source. Once the data loaded, they are further copied into the production table. This approach allows minimizing the unavailability of the production table as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center, however loading the data first in a staging table facilitates troubleshooting and testing. This approach allows also dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving data loads’ efficiency.

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and upon case also additional metadata (e.g. identifier for source system, unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except the unique identifier, these metadata can however be saved also in a separate table, with the same granularity as the table (1:1) or one record for each load per table and system, storing a reference to the respective record into the load table. There are seldom logical argumentations for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).

A special approach is considered by the Data Vault methodology for Data Warehousing which seems to gain increasing acceptance, especially to address the various compliance requirements for tracking the change in records at most granular level. To achieve this the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, while satellite tables store the descriptions of the business keys (the other attributes except the business key) and reference tables store the dropdown values. Besides table’s denormalization there are several other constraints that apply. The denormalization of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables – the format is already given - one needing thus to accommodate the data into the existing tables or extended the respective tables to accommodate more attributes. It’s the case for load tables storing data from multiple systems with similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

Data Warehousing Series

As part of the ETL process with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is done typically based on SQL queries as long one deals with relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files.

One can consider the preparation of the extraction logic as separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process are identified the source of the data in terms of system, tables and attributes to be imported, as well the joins, business and transformation rules that need to be applied on the data. This can involve reengineering the logic from the source system(s) as well data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, eventually by considering only the needed attributes to minimize the necessary space, loading time and content’s complexity, even if this would add more effort into the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is neglectable, copying the table 1:1 could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables from design or other purposes. Thus, the extraction query will attempt modeling the entity. This approach reduces to some degree the number of tables from the targeted solution, as well the number of ETL packages involved, while providing a clear depiction of the entities involved.

A third approach is to extract the data as needed by the target system, eventually as a mix between master and transaction data, fact which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability in the detriment of design.

Unfortunately, there can be design constraints or choice considerations that could lead to a mix between these approaches. If the impact caused by the mix between the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long the considered data are disconnected from other data.

To reduce the redundancy of data, it’s recommended to consider as goal creating a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other sides the overlaps can be acceptable when the same data are available in more systems and the solution requires all the data to be available.

If the above approaches consider the vertical partitioning of the data, there can be also horizontal partitioning needs especially when a subset of the data is needed or when is needed to partition the data based on a set of values. In addition, one might be forced to include also transformation rules directly into the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.

Previous Post <<||>> Next Post

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing Series

One of the interesting concepts that seems to gain adepts in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, and evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, a series of standards, methods and best practices supposed to facilitate its use.

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, allowing thus to address the compliance requirements of Sarabanes-Oxley, HIPPA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce, NoSQL can be integrated within its architecture, though the same can be said about other approaches as long there’s compatibility between the considered technologies. In fact, except business entities’ decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables as more joins are required to gather the data from the various tables. The further denormalization of tables can lead to higher data storage needs, though this can be neglectable compared with the volume of additional objects that need to be created in DWH. For an ERP system with a few hundred of meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault can be corroborated as usual with the poor understanding of the methodology, poor level of skillset or the attempt of implementing the methodology without allowing some flexibility when required. Unless one knows what he is doing, bringing more complexity in a field which is already complex, can easily impact negatively projects’ outcomes.

Previous Post <<||>> Next Post

References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

SQL Troubles

Pages

31 December 2020

📊Graphical Representation: Graphics We Live by (Part V: Pie Charts in MS Excel)

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

28 December 2020

🧊Data Warehousing: ETL (Part IV: The Load Subprocess)

27 December 2020

🧊Data Warehousing: ETL (Part III: The Extract Subprocess)

🧊☯Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

About Me