
03 April 2024

Business Intelligence: The Top 5 Pains of a BI/Analytics Manager

Business Intelligence Series

1) Business Strategy

A business strategy is supposed to define an organization's mission, vision, values, direction, purpose, goals and objectives, as well as the roadmap, alternatives and capabilities considered to achieve them. All this information is needed by the BI manager to sketch the BI strategy required to support the business strategy. 

Without them, the BI manager must extrapolate: it is one thing to base one's decisions on a clearly stated and communicated business strategy, and another to work with vague declarations full of uncertainty. The latter is like attempting to build castles in thin air and expecting them to have a solid foundation. It may work, as many BI requirements are common across organizations, but it can also become a disaster. 

2) BI/Data Strategy

Organizations usually differentiate between the BI and the data strategy because different driving forces and needs are involved, even if there are common goals, needs and opportunities that must be considered from both perspectives. When no data strategy is available, the BI manager is forced either to address many data-related topics (e.g. data culture, data quality, metadata management, data governance) or to ignore them, with all the consequences deriving from this. 

A BI strategy is an extension of the business, data and IT strategies into the BI knowledge areas. Unfortunately, few organizations give it the required attention. Besides the fact that the BI strategy breaks down the business strategy from its own perspective, it also adds its own goals and objectives, which are ideally aligned with those of the other strategies. 

3) Data Culture

Data culture is "the collective beliefs, values, behaviors, and practices of an organization’s employees in harnessing the value of data for decision-making, operations, or insight". Therefore, data culture is an enabler which, when the many aspects are addressed adequately, can have a multiplier effect for the BI strategy and its execution. Conversely, when basic data culture assumptions and requirements aren't addressed, the interrelated issues resulting from this can prove to be a barrier for the BI projects, operations and strategy. 

As mentioned before, an organization’s (data) culture is created, managed, nourished, and destroyed through leadership. If the other leaders aren't playing along, each challenge related to data culture and BI will become a concern for the BI manager.

4) Managing Expectations 

A business has great expectations from the investment in its BI infrastructure, especially when the vendors promise competitive advantage, real-time access to data and insights, self-service capabilities, etc. Even if these promises are achievable, they represent a potential that needs to be harnessed and there are several premises that need to be addressed continuously. 

Some BI strategies and/or projects address these expectations from the beginning, though many organizations ignore them or don't give them the required importance. Unfortunately, these expectations (re)surface when people start using the infrastructure, and this can easily become an acceptance issue. It's the BI manager's responsibility to ensure that expectations are managed accordingly.

5) Building the Right BI Architecture

For the BI architecture, the main driving forces are the shifts in technology from single servers to distributed environments, and from relational tables and data warehouses to delta tables and delta lakes built with the data mesh's principles and product orientation in mind, which increase the overall complexity considerably. Vendors' and data professionals' vision of what the architectures of the future will look like still has major milestones and challenges to surpass. 

Therefore, organizations are forced to explore the new architectures and the opportunities they bring; however, this involves considerable effort, skilled resources, and more iterations. Conversely, ignoring these trends might prove to be a lost opportunity and, eventually, duplicated effort in the long term.

31 March 2024

Microsoft Fabric: Polaris (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 31-Mar-2024

Polaris

  • {definition} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]
    • the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
    • petabyte-scale execution [1]
    • highly-available micro-service architecture
      • data and query processing is packaged into units (aka tasks) [1]
        • can be readily moved across compute nodes and re-started at the task level [1]
    • can run directly over data in HDFS and in managed transactional stores [1]
  • [Azure Synapse] designed initially to execute read-only queries [1]
    • ⇐ the architecture behind serverless SQL pool
    • uses a completely new scale-out framework based on a distributed SQL Server query engine [1]
        • fully compatible with T-SQL
        • leverages SQL Server single-node runtime and QO [1]
  • [Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]
    • incl. updates, deletes and bulk loads [2]
    • based on [delta tables] and [delta lake]
      • the delta lake currently supports only transactions within a single table [4]
    • ⇐ the architecture behind lakehouses
  • {goal} converge DWH and big data workloads [1]
    • the query engine scales out for relational data and heterogeneous datasets stored in DFSs [1]
      • needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing
  • {goal} separate compute and state for cloud-native execution [1]
    • all services within a pool are stateless
      • data is stored durably in remote storage and is abstracted via data cells [1]
        • ⇐ data is naturally decoupled from compute nodes
    • the metadata and transactional log state is off-loaded to centralized services [1]
    • multiple compute pools can transactionally access the same logical database [1]
  • {goal} cloud-first [2]
    • {benefit} leverages elasticity
    • transactions need to be resilient to node failures on dynamically changing topologies [2]
      •  ⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]
    • must ensure disaggregation of metadata and transactional state from compute nodes [2]
      • ⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]
        • ⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]
  • {goal} use optimized native columnar, immutable and open storage format [2]
    • uses delta format 
      • ⇐ optimized to handle read-heavy workloads with low contention [2] 
  • {goal} leverage the full potential of vectorized query processing for SQL [2]
  • {goal} support zero-copy data sharing with other services in the lake [2]
  • {goal} support read-heavy workloads with low contention [2]
  • {goal} support lineage-based features [2]
    • by taking advantage of delta table capabilities 
  • {goal} provide full SQL SI transactional support [2]
    • {benefit} all traditional DWH requirements are met [2]
      • incl. multi-table and multi-statement transactions [2]
        • ⇐ Polaris is the only system that supports this [2]
        • the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
        • mixes of transactions are supported as well
  • {objective} no cross-component state sharing [2] 
    • {principle} encapsulation of state within each component to avoid sharing state across nodes [2]
    • SI and the isolation of state across components allows to execute transactions as if they were queries [2]
      • ⇒ makes read and write transactions indistinguishable [2]
        • ⇒ allows to fully leverage its optimized distributed execution framework [2]
  • {objective} support Snapshot Isolation (SI) semantics [2]
    • implemented over versioned data
    • allows reads (R) and writes (W) to proceed concurrently over their own data snapshot 
      • R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2] 
      • ⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]
        • follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4]
      • ⇐  R transactions at the snapshot isolation level create no contention
        •  ⇒  any number of R transactions can run concurrently [4]
    • the immutable data representation in LSTs allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]
      • similar to how temporary tables are discarded during query processing failures [2]
  • {feature} resize live workloads [1]
    • scales resources with the workloads automatically
  • {feature} deliver predictable performance at scale [1]
    • scales computational resources based on workloads' needs
  • {feature} efficiently handle both relational and unstructured data [1]
  • {feature} flexible, fine-grained task monitoring
    • a task is the finest grain of execution 
  • {feature} global resource-aware scheduling
    • enables much better resource utilization and concurrency than traditional DWHs
      • capable of handling partial query restarts
      • maintains a global view of multiple queries
    • it is planned to build on this a global view with autonomous workload management features
  • {feature} multi-layered data caching model
    • leverages 
      • SQL Server buffer pools for caching columnar data
      • SSD caching
    • because the delta table and its log are immutable, they can be safely cached on cluster nodes [4]
  • {feature} tracks data lineage natively
    • the transaction log can also be used for audit logging based on the commit info records [4]
  • {feature} versioning
    • maintain all versions as data is updated [1]
  • {feature} time-travel
    • {benefit} allows users to query point-in-time snapshots
    • {benefit} allows rolling back erroneous updates to the data
  • {feature} table cloning
    • {benefit} allows to create a point-in-time snapshot of the data based on its metadata
  • {concept} state 
    • allows to drive the end-to-end life cycle of a SQL statement with transactional guarantees and top tier performance [1]
    • comprised of 
      • cache
      • metadata
      • transaction logs
      • data
    • [on-premises architecture] all state is in the compute layer
      • relies on small, highly stable and homogenous clusters with dedicated hardware for Tier-1 performance
      • {downside} expensive
      • {downside} hard to maintain
      • {downside} limited scalability
        • cluster capacity is bounded by machine sizes because of the fixed topology
  • {concept} stateful architecture
    • the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]
      • ⇒ when a compute node fails, the state of non-committed transactions is lost [1] 
        •  ⇒ the in-flight transactions fail as well [1]
    • often also couples metadata describing data distributions and mappings to compute nodes [1] 
      • ⇒ a compute node effectively owns responsibility for processing a subset of the data [1] 
        • its ownership cannot be transferred without a cluster restart [1]
    • {downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]
  • {concept} stateless compute architecture
    • requires that compute nodes hold no state information [1]
      • ⇒ all data, transactional logs and metadata need to be externalized [1]
    • {benefit} allows applications to 
      • partially restart the execution of queries in the event of compute node failures [1] 
      • adapt to online changes of the cluster topology without failing in-flight transactions [1] 
    • caches need to be as close to the compute as possible [1] 
      • since they can be lazily reconstructed from persisted data they don’t necessarily need to be decoupled from compute [1] 
        • the coupling of caches and compute does not make the architecture stateful [1] 
  • {concept} [cloud] decoupling of compute and storage
    • provides more flexible resource scaling
      • the 2 layers can scale up and down independently adapting to user needs [1] 
      • customers pay for the compute needed to query a working subset of the data [1] 
    • is not the same as decoupling compute and state [1] 
      • if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1] 
Acronyms:
ADLS - Azure Data Lake Storage
CRUD - Create, Read, Update, Delete
DCP - distributed computation platform 
DFS - Distributed File System
DWH - data warehouse
HDFS - Hadoop DFS
SI - Snapshot Isolation
SSD - Solid-State Drive

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12)  (link)
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions (link)
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive (link)
[4] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)

14 March 2024

Business Intelligence: Monolithic vs. Distributed Architecture (Part I: Zhamak Dehghani's Data Mesh - Debunked)

Business Intelligence
Business Intelligence Series

In [1] the author categorizes data warehouses (DWHs) and lakes as monolithic architectures, as opposed to the data mesh's distributed architecture, which makes me circumspect about the term's use. There are two general definitions of what monolithic means: (1) formed of a single large block; (2) large, indivisible, and slow to change.

In software architecture one can differentiate between monolithic applications, where the whole application is one block of code; multi-tier applications, where the logic is split over several components with different functions that may reside on the same machine or be split non-redundantly between multiple machines; and distributed applications, where the application or its components run on multiple machines in parallel.

Distributed multi-tier applications are a natural evolution of the two types of applications, allowing components to be distributed redundantly across multiple machines. Much later came the cloud, where components are mostly entirely distributed within the same or across distinct geo-locations, respectively cloud providers.

Data Warehouse vs. Data Lake vs. Lakehouse [2]

For licensing and maintenance convenience, a DWH typically resides on one powerful machine with many cores, though components can be moved to other machines and even distributed, the ETL functionality being probably the best candidate for this. As concerns the overall schema, there can be two or more data stores with different purposes (operational/transactional data stores, data marts), each of them with its own schema. Each such data store could be moved to its own machine, though that's not feasible.

DWHs tend to be large because they need to accommodate a considerable number of tables into which data is extracted, transformed, and maybe dumped for the various needs. With the proper design, DWHs too can be partitioned into domains (e.g. by defining one schema for each domain) and model domain-based perspectives, at least from a data consumer's perspective. The advantage a DWH offers is that one can create general dimensions and fact tables and build the domain-based perspectives on top of them, thus minimizing code redundancy and reducing the costs (see the sketch below).  
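As a minimal sketch of this idea (the schema, table and column names below are purely illustrative), a shared dimension can be maintained once and exposed through domain-specific schemas:

-- one schema per domain (hypothetical names)
CREATE SCHEMA Sales;
GO
CREATE SCHEMA Procurement;
GO

-- shared (general) dimension maintained once in the DWH
CREATE TABLE dbo.DimProduct (
  ProductKey int NOT NULL
, ProductNumber nvarchar(20)
, ProductName nvarchar(100)
, ProductGroup nvarchar(50)
);
GO

-- domain-based perspectives built on top of the shared dimension
CREATE VIEW Sales.Product
AS
SELECT ProductKey
, ProductNumber
, ProductName
FROM dbo.DimProduct;
GO

CREATE VIEW Procurement.Product
AS
SELECT ProductKey
, ProductNumber
, ProductGroup
FROM dbo.DimProduct;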

With this type of design, the DWH can be changed when needed; however, there are several aspects to consider. First, it takes time until the development team can process the request, and this depends on the workload and the priorities set. Second, implementing the changes takes a fair amount of time regardless of the overall architecture used, given that the transformations that need to be done on the data are largely the same. Therefore, one should not confuse the speed with which a team can start working on a change with the actual implementation of the change. Third, the possibility of reusing existing objects can speed up the implementation of changes. 

Data lakes are distributed data repositories in which structured, unstructured and semi-structured data are dumped in raw form in standard file formats from the various sources and further prepared for consumption in other data files via data pipelines, notebooks and similar means. One can use the medallion architecture with a folder structure and adequate permissions for domains and build reports and other data artefacts on top. 

A data lake's value increases when it is combined with the capabilities of a DWH (see the dedicated SQL pool) and/or an analytics engine (see the serverless SQL pool) that allow(s) building an enterprise semantic model on top of the data lake. The result is a data lakehouse that, from the data consumer's perspective and the other aspects mentioned above, is not much different from a DWH. The resulting architecture is distributed too. 

Especially in the context of cloud computing, referring metaphorically (for advocacy purposes) to today's applications as monolithic or distributed is at most a matter of degree and not of distinction. Therefore, the reader should be careful!


References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Databricks (2022) Data Lakehouse (link)

13 March 2024

Book Review: Zhamak Dehghani's Data Mesh: Delivering Data-Driven Value at Scale (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021) is a must read book for the data professional. So, here I am, finally managing to read it and give it some thought, even if it will probably take more time and a few more reads for the ideas to grow. Working in the fields of Business Intelligence and Software Engineering for almost a quarter-century, I think I can understand the historical background and the direction of the ideas presented in the book. There are many good ideas but also formulations that make me circumspect about the applicability of some assumptions and requirements considered. 

So, after data marts, warehouses, lakes and lakehouses, the data mesh paradigm seems to be the new shiny thing that will bring organizations beyond the inflection point, the tipping point from which an organization's growth will have an exponential effect. At least this seems to be the first impression when reading the first chapters. 

The book follows to some degree the advocacy tone of promoting that "our shiny thing is much better than the previous thing", or "how bad the previous architectures or paradigms were and how good the new ones are" (see [2]). Architectures and paradigms evolve with the available technologies and our perception of what is important for businesses. Old and new have their place in the order of things, and the old will continue to exist, at least until the new proves its feasibility.  

The definition of the data mesh as "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1] is too abstract, even if it reflects at a high level what the concept is about. Compared to other material I have read on the topic, the book succeeds in explaining the related concepts as well as the goals (called definitions) and benefits (called motivations) associated with the principles behind the data mesh, making the book approachable also for non-professionals. 

Built around four principles - "data as a product", "domain-oriented ownership", "self-serve data platform" and "federated governance" - the data mesh is the paradigm on which data products are developed, the products being "the smallest unit of architecture that can be independently deployed and managed", providing by design the information necessary to be discovered, understood, debugged, and audited.

It's possible to create Lego-like data products, data contracts and/or manifests that address a product's usability characteristics, though unless the latter are generated automatically, put in the context of ERP and other complex systems, everything becomes quite an endeavor that requires time and adequate testing, increasing the overall timeframe until a data product becomes available. 

The data mesh describes data products in terms of microservices, which structure architectures as collections of services that are independently deployable and loosely coupled. Asking data products to behave in this way is probably too hard a constraint, given the complexity and interdependency of the data models behind business processes and their needs. Does all the effort make sense? Is this the "agility" the data mesh solutions are looking for?

Many pioneering organizations are still struggling with the concept of data mesh, as it proves challenging to implement. At a high level everything makes sense, but the way data products are expected to function makes the concept challenging to implement to the full extent. Moreover, as occasionally implied, the data mesh is about scaling data analytics solutions with the size and complexity of organizations. The effort makes sense when the organizations have a certain size and the departments have a certain autonomy; therefore, it might not apply to small and medium businesses.


References:
[1] Zhamak Dehghani (2021) "Data Mesh: Delivering Data-Driven Value at Scale" (link)
[2] SQL-troubles (2024) Zhamak Dehghani's Data Mesh - Monolithic Warehouses and Lakes (link)

21 October 2023

Data Warehousing: Building a Data Lakehouse for Dynamics 365 Environments with Serverless SQL Pool

Data Warehousing
Data Warehousing Series

One of the major limitations of Microsoft Dynamics 365 is the lack of direct access to the production databases for reporting purposes via standard reporting or ETL/ELT tools. Of course, one can attempt to use OData-based solutions though they don't scale with the data volume and imply further challenges. 

At the beginning, Microsoft attempted to address this limitation by allowing the export of data entities into the customer's Azure SQL database, a feature known as bring your own database (BYOD). Highly dependent on batch jobs, the feature doesn't support real-time synchronization or composite entities, and it depends on the BYOD database's capacity, scalability becoming a bottleneck after a certain point.

Then Microsoft started to work on two solutions for synchronizing the Dynamics 365 data in near-real time (ca. 10-30 minutes) to the Data Lake: the Export to Data Lake add-in (*), respectively Azure Synapse Link for Dataverse with Azure Data Lake. The former allows the synchronization of the tables from Finance & Operations (it doesn't work for CRM) to files that reflect the database model from the source. In contrast, the latter allows the synchronization of data entities to similar structures, and will probably support tables as well. Because the service works via Dataverse, it also supports the synchronization of CRM data. 

The diagram below depicts the flow of data from the D365 environments to the Data Lake, the arrows indicating the direction of the flow. One more arrow could be drawn between Dynamics 365 Finance & Operations and the Azure Synapse Link for Dataverse service, though one may choose to use only the Export to Data Lake add-in, given that a data model based on the tables offers more flexibility (at the expense of more effort though). Data from other systems can be exported via pipelines to the Data Lake to provide an integrated and complete view of the business. 

Serverless Data Lakehouse

Once the data is available in the Data Lake, it can be consumed directly by standard and Power BI paginated reports; however, building the data model will involve considerable effort and logic duplication between reports. Therefore, it makes sense to prepare the data upfront in the Data Lake, when possible, and here the serverless SQL pool can help in building an enterprise data model. The former approach can still be used for rapid prototyping or data discovery. 

The serverless SQL pool is a stateless, SQL-based, distributed data processing query service over the Azure data lake for large-scale data and computational functions. Even if it doesn't support standard tables, it allows making the data from the Data Lake's files available for processing via external tables, a mechanism that maps the files' structure to an entity that can be queried like a normal view (though it supports only read operations), as sketched below. 
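A minimal sketch of such an external table follows; the storage account, container, paths, table and column names are placeholders, and the security setup (e.g. credentials) is omitted:

-- one-time setup: external data source and file format (hypothetical names)
CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (LOCATION = 'https://<storage account>.dfs.core.windows.net/<container>');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
GO

-- external table mapping the files' structure to a queryable entity
CREATE EXTERNAL TABLE dbo.SalesOrderHeader (
  SalesId nvarchar(20)
, CustAccount nvarchar(20)
, OrderDate date
, DataAreaId nvarchar(4)
)
WITH (
  LOCATION = '/dynamics365/SalesOrderHeader/*.parquet'
, DATA_SOURCE = DataLakeSource
, FILE_FORMAT = ParquetFormat
);
GO

-- the external table can be queried like a view (read-only)
SELECT TOP 10 *
FROM dbo.SalesOrderHeader;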

Further on, the enterprise data model can be built like in a normal Data Warehouse via the supported objects (views, stored procedures and table-valued functions). These objects can be called from standard and Power BI paginated reports, the queries being processed anew at runtime, which might occasionally result in poor performance. However, the architecture is supposed to scale automatically as needed.

If further performance is needed, parts of the logic or the end result can be exported to the Data Lake, and here the Medallion Architecture should be considered when appropriate. Depending on the case, further processing might be needed to handle the limitations of the serverless SQL pool (e.g. flattening hierarchies, handling data quality issues).

One can work around the lack of standard table support, needed especially for value mappings, by storing the respective data as files and/or occasionally by misusing views (see the sketch below), respectively by generating Spark tables via the Spark pool. 
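For small value mappings, the "misused view" workaround can look like the following minimal sketch (the codes and descriptions are hypothetical):

-- hypothetical mapping exposed as a view, compensating for the lack of standard tables
CREATE OR ALTER VIEW dbo.vOrderStatusMapping
AS
SELECT map.StatusCode
, map.StatusName
FROM (VALUES
   (1, 'Open order')
 , (2, 'Delivered')
 , (3, 'Invoiced')
 , (4, 'Canceled')
) map(StatusCode, StatusName);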

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.


Data Warehousing: Dynamics 365, the Data Lakehouse and the Medallion Architecture

Data Warehousing
Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from the architecture's components. Usually, if we want flexibility or to change something in one area, this might have an impact on another area. This rule applies to the usage of the medallion architecture as well! 

In Data Warehousing, the medallion architecture considers a multilayered approach to building a single source of truth, each layer denoting the quality of the data stored in the lakehouse [1]. For the moment, three layers are defined - bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that need to be validated and prepared for reporting or other purposes.

On the other hand, there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind them are relational and, in theory, the data should be of the proper quality needed by the business.

The greatest benefit of the serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake, and the mechanism is quite simple. On top of such files, external tables are built in the serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called like any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, an enterprise data model with dimensions, fact-like and mart-like entities can be built on top of the files synchronized from the Data Lake (see the sketch below). The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.
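As a minimal sketch, a dimension-like entity can be built as a view over the external tables; the table and column names below only illustrate the pattern and are not guaranteed to match a concrete environment:

-- hypothetical dimension-like view built on top of external tables
CREATE OR ALTER VIEW dbo.vDimCustomer
AS
SELECT cst.AccountNum
, cst.DataAreaId
, prt.Name CustomerName
, cst.CustGroup
FROM dbo.CustTable cst
     JOIN dbo.DirPartyTable prt
       ON cst.Party = prt.RecId;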

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high-performance, data cleaning) this approach allows to report the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy. 

Now, if one uses the medallion approach and relies on pipelines to process the data, unless one is able to process the data in near-real-time or something comparable to it, a considerable delay will be introduced, a delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase performance considerably compared to processing the logic at runtime. There are advantages and disadvantages to both approaches. 

Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between the values exist, if one system holds unique references to the records from the other system, respectively if a unique logic can be identified, the data integration can be handled in the serverless SQL pool.

Unfortunately, when compared to on-premises or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, standard tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. Therefore, at least for exceptions and maybe for strategic reporting, a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!

Bottom line: consider the architecture against your requirements!


References:
[1] Microsoft Learn: What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

10 October 2023

Data Warehousing and Dynamics 365 for Finance and Operation - A Few Issues to Consider I

Data Warehousing
Data Warehousing Series

Introduction

Besides the fact that data professionals don't have direct access to D365 F&O production environments (direct access is available only to sandboxes), which was from the beginning an important constraint imposed by the architecture, there are a few more challenges that need to be addressed when working with the data.

Case Sensitiveness

SQL Server is not case sensitive by default; therefore, depending on the channel through which the data came, values appear either in upper or lower case, respectively a mixture of both. Even if this isn't an issue in D365, it can become an issue when the data leaves the environment. E.g., Power Query is case sensitive (while DAX is case insensitive); thus, if a field containing a mix of values participates in a join or aggregation, this will result in unexpected behavior (e.g. duplicates, records ignored). It's primarily the case of the Company (aka DataAreaId) field available in most of the important tables.

The ideal solution would be to make sure that the values are correct by design, however this can't always be enforced. Otherwise, when using the data outside of D365 F&O, the solution would be to transform all the values to upper case (or lower case), as sketched below. However, this step too might occur too late, e.g. when the data is exported to the Azure Data Lake in parquet file format.
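A minimal sketch of this normalization in a base view over a hypothetical external table (the names are illustrative only):

-- hypothetical base view that normalizes the company code to upper case
CREATE OR ALTER VIEW dbo.vSalesOrderHeader
AS
SELECT Upper(soh.DataAreaId) DataAreaId -- normalized before the data is joined or aggregated
, soh.SalesId
, soh.CustAccount
, soh.OrderDate
FROM dbo.SalesOrderHeader soh;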

Unique Keys

A unique record in D365 F&O was in earlier versions usually identified by RecId and DataAreaId, while later the Partition field was added. This means that most of the joins will need to consider all 3 columns, which adds some overhead. In some environments only one Partition is defined (and thus the field can be ignored), however this is not guaranteed. 

As long as developers use SQL, there's no issue with using multiple fields in JOINs, though in Power Query a unique key must be created based on the respective columns so that the JOINs are possible (see the sketch below). Actually, SQL-based JOINs would also benefit if each record were identified by a single field.
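A minimal sketch of both options, with hypothetical table and column names:

-- join considering all three identifying columns
SELECT sol.SalesId
, sol.ItemId
, soh.CustAccount
FROM dbo.SalesOrderLine sol
     JOIN dbo.SalesOrderHeader soh
       ON sol.SalesId = soh.SalesId
      AND sol.DataAreaId = soh.DataAreaId
      AND sol.Partition = soh.Partition;

-- alternative: expose a single concatenated key (e.g. in a base view) for tools like Power Query
SELECT Cast(soh.RecId as nvarchar(20)) + '|' + Upper(soh.DataAreaId) + '|' + Cast(soh.Partition as nvarchar(20)) UniqueKey
, soh.SalesId
, soh.CustAccount
FROM dbo.SalesOrderHeader soh;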

Audit Metadata

Not all tables have fields that designate the date when a record was created or last modified, respectively the user who performed the respective action. The fields can be added manually when setting up the system, however that's seldom done. This makes it difficult to audit the records, and sometimes it's a challenge also for reporting, respectively for troubleshooting the differences between the DWH and the source system. Fortunately, the Export to Data Lake add-in adds a timestamp reflecting the time when the record was synchronized, though it can then be used only for the records synchronized after the first load. 

Tables vs. Entities

Data are modified in D365 F&O via a collection of entities, which are nothing but views that encapsulate the business logic, being based on the base tables or other views, respectively a combination of both. The Export to Data Lake (*) is based on the tables, while Link to Data Lake is based on data entities. 

Using the base tables means that the developer must reengineer the logic from the views. In some cases it might work to create the entities as views in the DWH environment, though some features might not be supported. It's the case of the serverless and dedicated SQL pools, which support only a subset of the features available under standard Azure SQL Server. 

The developer can try to replicate the logic from the entities, considering only the logic needed by the business, especially when only a subset of the functionality available in the entity was used. The newly created views can thus become more readable and maintainable. On the other hand, if the logic in the entity changes, the changes need to be reflected in the DWH views as well. 

Using the entity-based data makes sure that the data are consistent between environments. Unfortunately, Microsoft found out that it isn't so easy to synchronize the data at entity level. Moreover, there are multiple entities based on the same table that reflect only a subset of the columns or rows. Thus, to cover all the fields from a base table, one might be forced to synchronize multiple entities, leading thus to data duplication.  

In theory, both mechanisms can be used within the same environment, even if this approach is against the unique source of truth principle, when data are duplicated. 

Data Validation in the Data Lake

One scenario in which both sources are useful is when validating whether the synchronization mechanism worked as expected. Thus, one can compare the number of records and check whether there are differences that can't be mitigated. However, does it make sense to "duplicate" database objects only for this purpose?

Ideally, the validation of whether a record was synchronized should be done in the source environment (e.g. via a timestamp). That's difficult to achieve, especially when there's no direct access to the source database (as is the case for Production databases). Fortunately, Dataverse provides this functionality, even if it might not be bulletproof. 

In extremis, the most reliable approach is to copy the production environment on a sandbox and do a count of records for each table, using as baseline for comparison the time when the refresh occurred.

Base Enum Values

The lists of values that don't have their own tables are managed within the application as Base Enums and, naturally, only the numeric values are saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from the database, as one needs to convert the codes to meaningful values. Some of the mappings between the codes and values are documented in two system tables, and even in old language-based documentation, though both sources are far from complete. As an alternative, one can try to discover the values in the system. 

Unfortunately, the mappings need to be repeated when the Enum-based attribute is used in multiple places. One can reduce the mapping's duplication by encapsulating the logic into a view (aka "base view") and reusing it accordingly (see the logic for TDM.vEcoResProduct).

Even if the values for many of the Enums are stored in the EnumValueTable table, the Enum's name being available in the EnumIdTable table, it's not a good idea to retrieve the values via a JOIN in the business logic. This would complicate the business logic unnecessarily. A CASE is more efficient, even if occasionally more difficult to maintain (see the sketch below). Unfortunately, there's no timestamp to identify which values were added lately.
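A minimal sketch of such a CASE-based mapping inside a base view; the Enum codes and labels shown are illustrative, not the actual D365 F&O values:

-- hypothetical base view translating a Base Enum code into a meaningful value
CREATE OR ALTER VIEW dbo.vSalesOrderHeaderBase
AS
SELECT soh.SalesId
, soh.DataAreaId
, soh.SalesStatus
, CASE soh.SalesStatus
    WHEN 1 THEN 'Open order'
    WHEN 2 THEN 'Delivered'
    WHEN 3 THEN 'Invoiced'
    WHEN 4 THEN 'Canceled'
    ELSE 'Unknown'
  END SalesStatusName
FROM dbo.SalesOrderHeader soh;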

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.

22 February 2023

SQL Reloaded: Automatic Statistics Creation & Dropping for CETAS based on CSV File Format in Serverless SQL Pool

Introduction

The serverless SQL pool query optimizer uses statistics to calculate and compare the cost of various query plans, and then chooses the plan with the lowest cost. Automatic creation of statistics is turned on for the parquet file format, though for the CSV file format statistics are automatically created only when OPENROWSET is used. This means that when creating CETAS based on CSV, the statistics need to be created manually (see the example below). 
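For a single column, the manual statement looks as follows (the schema, table and column names are placeholders):

-- manually creating statistics for one column of a CSV-based external table
CREATE STATISTICS stat_CRM_Accounts_AccountId
ON CRM.Accounts (AccountId)
WITH FULLSCAN, NORECOMPUTE;

-- review the statistics defined on the table
SELECT stt.name
, STATS_DATE(stt.object_id, stt.stats_id) last_updated
FROM sys.stats stt
WHERE stt.object_id = OBJECT_ID('CRM.Accounts');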

This would be one more reason for holding the files in the Data Lake as parquet files. On the other hand, there are also many files already available in CSV format, respectively technologies that allow exporting data only/still as CSV. Moreover, transforming the files to parquet is not always technically feasible.

Using OPENROWSET could also help, though does it make sense to use a different mechanism for the CSV file format? In some scenarios it will. I prefer to have a unitary design, when possible. Moreover, even if some columns are not needed, they can still be useful for certain scenarios (e.g. troubleshooting, reevaluating their use, etc.). 
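For comparison, a minimal OPENROWSET-based query over CSV files, for which statistics are created automatically (the storage path and columns are placeholders):

-- OPENROWSET query over CSV files with automatic statistics creation
SELECT dat.AccountId
, dat.AccountName
FROM OPENROWSET(
      BULK 'https://<storage account>.dfs.core.windows.net/<container>/crm/accounts/*.csv'
    , FORMAT = 'CSV'
    , PARSER_VERSION = '2.0'
    , HEADER_ROW = TRUE
 ) AS dat;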

There are files, especially the ones coming from ERP (Enterprise Resource Planning) or similar systems, which have even a few hundred columns (on average between 50 and 100 columns). Manually creating the statistics for the respective tables will cost a lot of time and effort. To automate the process there are mainly three choices:
(1) Creating statistics for all the columns for a given set of tables (e.g. for a given schema).
(2) Finding a way to automatically identify the columns which are actually used.
(3) Storing the list of tables and columns on which statistics should be build (however the list needs to be maintained manually). 

Fortunately, (1) can be solved relatively easily, based on the available table metadata; however, it's not the best solution, as a lot of statistics will be created unnecessarily. (2) is possible under certain architectures or with additional effort. (3) takes time, though it's also an approachable solution.

What do we need?

For building the solution, we need table and statistics metadata, and the good news is that the old SQL Server queries still work. To minimize code repetition, it makes sense to encapsulate the logic in views. For table metadata one can use the sys.objects catalog view, as it is more general (one can replace sys.objects with sys.tables to focus only on tables):

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectColumns

-- create view
CREATE OR ALTER VIEW dbo.vObjectColumns
AS 
-- object-based column metadata
SELECT sch.name + '.' + obj.name two_part_name
, sch.Name schema_name
, obj.name object_name
, col.name column_name
, obj.type
, CASE 
	WHEN col.is_ansi_padded = 1 and LEFT(udt.name , 1) = 'n' THEN col.max_length/2
	ELSE col.max_length
  END max_length
, col.precision 
, col.scale
, col.is_nullable 
, col.is_identity
, col.object_id
, col.column_id
, udt.name as data_type
, col.collation_name
, ROW_NUMBER() OVER(PARTITION BY col.object_id ORDER BY col.column_id) ranking
FROM sys.columns col
     JOIN sys.types udt
       ON col.user_type_id = udt.user_type_id
     JOIN sys.objects obj
       ON col.object_id = obj.object_id
     JOIN sys.schemas as sch
       ON sch.schema_id = obj.schema_id

-- testing the view
SELECT obc.*
FROM dbo.vObjectColumns obc
WHERE obc.object_name LIKE '<table name>%'
  AND obc.schema_name = 'CRM'
  AND obc.type = 'U'
ORDER BY obc.two_part_name
, obc.Ranking

The view can be used also as basis for getting the defined stats:

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectStats

-- create view 
CREATE OR ALTER VIEW dbo.vObjectStats
AS
-- object-based column statistics
SELECT obc.two_part_name + '.' + QuoteName(stt.name) three_part_name
, obc.two_part_name
, obc.schema_name
, obc.object_name
, obc.column_name
, stt.name stats_name
, STATS_DATE(stt.[object_id], stt.stats_id) AS last_updated
, stt.auto_created
, stt.user_created
, stt.no_recompute
, stt.has_filter 
, stt.filter_definition
, stt.is_temporary 
, stt.is_incremental 
, stt.auto_drop 
, stt.stats_generation_method_desc
, stt.[object_id]
, obc.type 
, stt.stats_id
, stc.stats_column_id
, stc.column_id
FROM dbo.vObjectColumns obc
     LEFT JOIN sys.stats_columns stc 
	   ON stc.object_id = obc.object_id
	  AND stc.column_id = obc.column_id 
          LEFT JOIN sys.stats stt
            ON stc.[object_id] = stt.[object_id] 
           AND stc.stats_id = stt.stats_id

-- testing the view 
SELECT *
FROM dbo.vObjectStats obs
WHERE (obs.auto_created = 1 OR obs.user_created = 1)
  AND obs.type = 'U'
  AND obs.object_name = '<table name>'
ORDER BY obs.two_part_name
, obs.column_id

Now we have a basis for the next step. However, before using the stored procedure defined below, one should use the last query to check whether statistics were already defined on a table. For testing, also use a table for which you know that statistics are available.

Create Statistics

The code below is based on a similar stored procedure available in the Microsoft documentation (see [1]). It takes a table's column metadata, stores it in a temporary table and then loops through each record, creates the DDL script and runs it:

-- drop procedure (for cleaning)
--DROP PROCEDURE dbo.pCreateStatistics

-- create stored procedure
CREATE OR ALTER PROCEDURE dbo.pCreateStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(100)
)
AS
-- creates statistics for serverless SQL pool
BEGIN
	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	  schema_name nvarchar(50)
	, table_name nvarchar(128)
	, column_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obc.schema_name
	, obc.object_name
	, obc.column_name 
	, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
	FROM dbo.vObjectColumns obc
	WHERE obc.type = 'U' -- tables
	  AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	  AND IsNull(@table_name, obc.object_name) = obc.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
		SET @query = (SELECT 'CREATE STATISTICS '+ QUOTENAME('stat_' + schema_name + '_' + table_name + '_' + column_name) + ' ON '+ QUOTENAME(schema_name) + '.' + QUOTENAME(table_name) + '(' + QUOTENAME(column_name) + ')' 
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'create failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END


-- test stored procedure (various scenario)
EXEC dbo.pCreateStatistics '<schema name>', '<table name>' -- based on schema & table
EXEC dbo.pCreateStatistics '<schema name>', NULL -- based on a schema
EXEC dbo.pCreateStatistics NULL, '<table name>' -- based on a table

Notes:
IMPORTANT!!! I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Please note that rerunning the stored procedure without previously deleting the statistics on the tables in scope will make the procedure raise failures for each column (behavior by design), though the error messages can be suppressed by commenting out the code, if needed. One can introduce further validation, e.g. considering only the columns which don't have a statistic defined on them.

Further Steps?

What can we do to improve the code? It would be great if we could find a way to identify the columns which are used in the queries. It is possible to retrieve the queries run in serverless SQL pool, however identifying the tables and columns from there or a similar source is not a straightforward solution. 

The design of views based on the external tables can help in the process! I prefer to build on top of the external tables a first level of views (aka "base views") that include only the fields in use (needed by the business), ordered and "grouped" together based on their importance or certain characteristics. The views are based solely on the external table and thus contain no joins. They can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. However, for traceability the names of the columns don't change! This means that if the view's name is easily identifiable based on the external table's name, we can check the view's columns against the ones of the external table and create statistics only for the respective columns. Using a unique prefix (e.g. "v") to derive the views' names from the tables' names would do the trick.

To do that, we need to create a view that reflects the dependencies between objects (we'll be interested only in external tables vs views dependencies):

-- drop view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectsReferenced

-- create view
CREATE OR ALTER VIEW dbo.vObjectsReferenced
AS 
-- retrieving the objects referenced 
SELECT QuoteName(sch.name) + '.' + QuoteName(obj.name) AS two_part_name 
, obj.object_id 
, obj.schema_id 
, sch.name schema_name 
, obj.name object_name 
, obj.type
, QuoteName(scr.name) + '.'+ QuoteName(sed.referenced_entity_name) AS ref_two_part_name 
, obr.object_id ref_object_id
, obr.schema_id ref_schema_id 
, scr.name ref_schema_name 
, obr.name ref_object_name 
, obr.type ref_type
FROM sys.sql_expression_dependencies sed 
     JOIN sys.objects obj
       ON obj.object_id = sed.referencing_id 
	      JOIN sys.schemas as sch
	        ON obj.schema_id = sch.schema_id 
	 JOIN sys.objects obr
	   ON sed.referenced_id = obr.object_id
	      JOIN sys.schemas as scr
	        ON obr.schema_id = scr.schema_id

-- testing the view
SELECT top 10 *
FROM dbo.vObjectsReferenced
WHERE ref_type = 'U'

With this, the query used above to fill the table becomes:

-- fill table query with column selection 
SELECT obc.schema_name
, obc.object_name
, obc.column_name 
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
	AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	AND IsNull(@table_name, obc.object_name) = obc.object_name
	AND EXISTS ( -- select only columns referenced in views
	    SELECT * 
		FROM dbo.vObjectsReferenced obr 
		    JOIN dbo.vObjectColumns obt
			ON obr.object_id = obt.object_id 
		WHERE obt.type = 'V' -- view
		AND obr.object_name  =  'v' + obr.ref_object_name
		AND obc.object_id = obr.ref_object_id
		AND obc.column_name = obt.column_name);

This change will reduce the number of statistics created on average by 50-80%. Of course, there will also be cases in which further statistics need to be added manually. One can use this as input for an analysis of the columns used, store the metadata in a file, make changes to it, and base statistics' creation on it. 

Drop Statistics

Dropping the statistics comes down to using the dbo.vObjectStats view created above for the schema and/or table provided as parameter. The logic is similar to the statistics' creation:

-- drop stored procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS dbo.pDropStatistics

-- create procedure
CREATE OR ALTER PROCEDURE dbo.pDropStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(128)
)
AS
-- drop statistics for a schema and/or external table in serverless SQL pool
BEGIN

	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	 three_part_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obs.three_part_name
	, ROW_NUMBER() OVER(ORDER BY obs.three_part_name) ranking
	FROM dbo.vObjectStats obs
	WHERE obs.type = 'U' -- tables
	  AND IsNull(@schema_name, obs.schema_name) = obs.schema_name 
	  AND IsNull(@table_name, obs.object_name) = obs.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
   
		SET @query = (SELECT 'DROP STATISTICS ' + ddl.three_part_name
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'drop failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END

Note:
IMPORTANT!!!
I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Closing Thoughts

The solution for statistics' creation is not perfect, though it's a start! It would have been great if such a feature were provided by Microsoft, and probably it will be, given the importance of statistics in identifying an optimal plan. It would be interesting to understand how much statistics help in a distributed environment and what's the volume of data processed for this purpose. 

Please let me know if you have found other workarounds for automating statistics.

Happy coding!

References:
[1] Microsoft Learn (2022) Statistics in Synapse SQL (link)

