
13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information, please consult the documentation (see the links below)!

Last updated: 12-Feb-2025

[Microsoft Fabric] SQL Analytics Endpoint

  • {def} a service that listens actively for requests, providing a SQL-based experience for lakehouse Delta tables [1]
    • enables querying the data in the lakehouse using the T-SQL language and the TDS protocol [1]
    • created by default for each lakehouse and mirrored database provisioned in a workspace
      • every lakehouse has one SQL analytics endpoint [1]
    • manages the automatically generated tables so the workspace users can't modify them [1]
  • {feature} a background process is responsible for 
    • scanning lakehouse for changes [1]
      • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
        • tables are created with a minor delay [1]
        • data is refreshed automatically
          • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
        • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
    • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
      • {characteristic} transparently managed
        • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
        • when a change is detected in the lakehouse
          • a background process updates metadata [1]
          • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
      • {characteristic} low latency
        • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
        • the actual length of time can vary from a few seconds to minutes [1]
      • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
        • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
          •  ⇐ provides a seamless experience [3]
  • {feature} can be enriched by adding database objects (see the T-SQL sketch after this list)
    • schemas
    • views
    • procedures
    • other database objects
  • {feature} automatic metadata discovery
    • tracks changes committed to lakehouses [1]
      • is a single instance per Fabric workspace [1]
      • {issue} increased latency for changes to sync between lakehouses and the SQL analytics endpoint; this can be due to a large number of lakehouses in one workspace [1]
        • {resolution} migrate each lakehouse to a separate workspace [1]
          • allows automatic metadata discovery to scale [1]
      • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint
        • a newly created table in the lakehouse may not be visible yet [1]
        • many committed rows may not be visible in the endpoint [1]
        • ⇐ may be impacted by ETL processing that generates large volumes of changes
        • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]
          • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
      • {issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead, which impacts the time it takes to sync changes to the SQL analytics endpoint [1]
        • {recommendation} schedule regular lakehouse table maintenance operations [1]
      • {limitation} not all Delta features are understood by the automatic sync process [1]
  • {feature} endpoint reprovisioning
    • if the initial provisioning attempt fails, users have the option to try again [2]
      • ⇐ without the need to create an entirely new lakehouse [2]
      • {benefit} empowers users to self-mitigate provisioning issues in a convenient way in the UI, avoiding the need for complete lakehouse re-creation [2]
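
A minimal T-SQL sketch of such enrichment, assuming a hypothetical auto-generated Delta table dbo.SalesOrders (all table, column and object names below are placeholders, not from the documentation):

-- create a view over the auto-generated Delta table
CREATE VIEW dbo.vOpenSalesOrders
AS
SELECT SalesOrderId
     , CustomerId
     , OrderDate
     , Amount
FROM dbo.SalesOrders
WHERE Status = 'Open';
GO

-- create a stored procedure on top of the view
CREATE PROCEDURE dbo.pGetOpenSalesOrdersByCustomer
    @CustomerId int
AS
BEGIN
    SELECT SalesOrderId
         , OrderDate
         , Amount
    FROM dbo.vOpenSalesOrders
    WHERE CustomerId = @CustomerId;
END
GO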

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

23 January 2010

🧊🎡Data Warehousing: ETL (Part I: It's All about Synchronization)



In a previous post on ETL: SSIS packages vs. SQL code I mentioned that there are three types of synchronization - of scope, of business logic and of data. The first type targets synchronizing the filters used to select the data, the second synchronizing the logic used in data processing, where applicable, and the third working with the same unaltered copy of the data.

The synchronization of scope is achieved by enforcing the same set of constraints on related data elements with the purpose of keeping the referential integrity between the various data elements. They involve parent-child relations like Purchase Order (PO) Headers/Lines, or reference/referenced relations like PO Lines and Products or Vendors (see the sketch below). Such relationships can involve datasets coming from distinct systems, usually with different architectures - e.g., Products or Bill of Materials master data that could come from a Product Information Management (PIM) system.
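
As a minimal T-SQL sketch of scope synchronization (dbo.POHeaders, dbo.POLines and their columns are hypothetical placeholders), the same constraint is applied to every related extract:

-- extract the open PO headers
SELECT POH.POHeaderId, POH.VendorId, POH.Status
FROM dbo.POHeaders POH
WHERE POH.Status = 'Open';

-- extract the PO lines for the same scope
SELECT POL.POLineId, POL.POHeaderId, POL.ProductId
FROM dbo.POLines POL
     JOIN dbo.POHeaders POH
       ON POL.POHeaderId = POH.POHeaderId
WHERE POH.Status = 'Open';   -- the same constraint as for the headers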

The synchronization of business logic is usually a consequence of the synchronization of scope and concerns the set of business rules that must be applied in the logic used. For example, one may need to consider Products or Vendors only for the open POs. That would involve the duplication of the logic for open POs across multiple database objects, logic that needs to be synchronized accordingly. One can reference directly the object encapsulating the logic for open POs, though that would create a recursive reference as the logic references the Products and/or Vendors as well. With one or two tricks this can be avoided, as sketched below.
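
One such trick, sketched here in T-SQL, is to encapsulate the open-PO rule in a view that does not itself reference Products or Vendors, so that both extracts can reuse it without recursion (all object names are hypothetical placeholders):

-- the open-PO rule, defined once
CREATE VIEW dbo.vOpenPOLines
AS
SELECT POL.POLineId, POL.ProductId, POH.VendorId
FROM dbo.POLines POL
     JOIN dbo.POHeaders POH
       ON POL.POHeaderId = POH.POHeaderId
WHERE POH.Status = 'Open';
GO

-- Products considered only for the open POs
SELECT P.*
FROM dbo.Products P
WHERE EXISTS (SELECT 1 FROM dbo.vOpenPOLines OPL WHERE OPL.ProductId = P.ProductId);

-- Vendors considered only for the open POs
SELECT V.*
FROM dbo.Vendors V
WHERE EXISTS (SELECT 1 FROM dbo.vOpenPOLines OPL WHERE OPL.VendorId = V.VendorId);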

Synchronizations can also involve local logic, e.g. exchange rate conversions and other transformations, where a single user-defined function could encapsulate the logic (see the sketch below). No matter the techniques used, one must make sure that all the logic is kept in sync.
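
A minimal sketch of such a function, assuming a hypothetical dbo.ExchangeRates table holding the daily rates:

CREATE FUNCTION dbo.GetAmountInLocalCurrency (
    @Amount money
  , @Currency nvarchar(3)
  , @Date date
)
RETURNS money
AS
BEGIN
    -- pick the latest rate valid at the given date
    RETURN @Amount * (SELECT TOP 1 ER.Rate
                      FROM dbo.ExchangeRates ER
                      WHERE ER.Currency = @Currency
                        AND ER.ValidFrom <= @Date
                      ORDER BY ER.ValidFrom DESC);
END
GO

-- usage: SELECT dbo.GetAmountInLocalCurrency(100, N'USD', '2010-01-23');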

The synchronization of data requires working with the same unaltered data in the various threads of processing, either SSIS packages or SQL code; that's quite simple as a concept, though not always straightforward to achieve and occasionally ignored. For this purpose, it makes sense to load the data into an intermediary database or staging area (see the sketch below), even for slowly changing data sources like data warehouses, and base the business logic on the respective local dataset(s) rather than loading the data repeatedly for each package from the source, which can also involve considerable additional network traffic. Why is the alternative not the best approach?
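
A minimal staging sketch (server, database and table names are hypothetical placeholders):

-- rebuild the staging snapshot once per ETL run
IF OBJECT_ID('Staging.dbo.POHeaders') IS NOT NULL
    DROP TABLE Staging.dbo.POHeaders;

SELECT POH.POHeaderId, POH.VendorId, POH.Status, POH.CreationDate
INTO Staging.dbo.POHeaders
FROM SourceSystem.dbo.POHeaders POH
WHERE POH.Status = 'Open';

-- all packages then read from Staging.dbo.POHeaders, not from the source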

Supposing that two packages A and B are scheduled to run at the same time: even if the requests are sent simultaneously to the database, the simultaneity is relative, because most of the time the requests are queued, in general being processed in the order they arrive, though this depends on the legacy system's architecture and settings. Supposing that some time elapses between the moments the two requests are processed, there are good chances that a record was created, deleted, or updated in between.

The impact of changes in data could be minimal, though strange situations with unpredictable impact might result. There are chances of finding the issue when loading the data into the destination system, provided adequate validation is performed, though there are also good chances for the issues to remain undiscovered until later, with all the consequences resulting from this. Therefore, one should also build validation logic separately, when feasible (see the sketch below).
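
A minimal validation sketch over the staged datasets (names are hypothetical placeholders):

-- PO lines whose header was missed between the two extracts
SELECT POL.POHeaderId
FROM Staging.dbo.POLines POL
     LEFT JOIN Staging.dbo.POHeaders POH
       ON POL.POHeaderId = POH.POHeaderId
WHERE POH.POHeaderId IS NULL;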

One should target covering all three types of synchronization even if it complicates the design of the solutions, as it's recommended in general to apply defensive architecting/programming. The increase in complexity is relative if one considers the effort needed for the troubleshooting, plumbing or even redesign required to fix the issues.


Created: Jan-2010, Last Reviewed: Mar-2024

06 February 2007

🌁Software Engineering: Synchronization (Definitions)

"Enforcing constraints on the ordering of events occurring in different UEs. This is primarily used to ensure that shared resources are accessed by a collection of UEs in such a way that the program is correct regardless of how the UEs are scheduled." (Beverly A Sanders, "Patterns for Parallel Programming", 2004)

"Coordination in the execution of multiple threads. The most common cases of synchronization occur when you provide mutually exclusive access to shared resources or gather all threads at a point in the code before they are allowed to proceed." (Clay Breshears, "The Art of Concurrency", 2009)

"Timekeeping which requires the coordination of events to operate a system in unison." (Chen Liu et al, "Simultaneous MultiThreading Microarchitecture", 2010)

"The coordination, of tasks or threads, in order to obtain the desired runtime order. Commonly used to avoid undesired race conditions." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A technique for coordinating threads or processes to have appropriate execution order." (Masoud Hemmatpour et al, "Cost Evaluation of Synchronization Algorithms for Multicore Architectures", 2018)

"In multiprocessing systems, joining multiple independent processes in order to reach an agreement or commit to a certain sequence of actions." (Gen'ichi Yasuda, "A Formal Approach to the Distributed Software Control for Automated Multi-Axis Manufacturing Machines", 2018)

"Process of coordinating two or more activities in time." (Dharmendra T Patel, "Distributed Computing for Internet of Things (IoT)", 2019)
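
Staying in the T-SQL register of the posts above, a minimal sketch of mutual exclusion - the scenario most of these definitions describe - using SQL Server application locks (the resource name 'ETL_POHeaders' is an arbitrary placeholder):

BEGIN TRAN;

DECLARE @result int;
-- request exclusive access to the shared resource
EXEC @result = sp_getapplock
     @Resource = 'ETL_POHeaders'
   , @LockMode = 'Exclusive'
   , @LockOwner = 'Transaction'
   , @LockTimeout = 10000;       -- wait up to 10 seconds

IF @result >= 0                  -- lock granted
BEGIN
    -- protected section: e.g., rebuild the staging snapshot
    EXEC sp_releaseapplock @Resource = 'ETL_POHeaders', @LockOwner = 'Transaction';
END

COMMIT TRAN;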


