
10 November 2024

🏭🗒️Microsoft Fabric: Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Warehouse

  • {def} highly available relational data warehouse that can be used to store and query data in the Lakehouse
    • supports the full transactional T-SQL capabilities 
    • modernized version of the traditional data warehouse
    • unifies capabilities from Synapse Dedicated and Serverless SQL Pools
  • resources are managed elastically to provide the best possible performance
    • ⇒ no need to think about indexing or distribution
    • a new parser improves CSV file ingestion time
    • metadata is cached in addition to data
    • compute resources are assigned within milliseconds
    • multi-TB result sets are streamed to the client
  • leverages a distributed query processing engine
    • provides workloads with a natural isolation boundary [3]
      • true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]
  • {operation} data ingestion
    • involves moving data from source systems into the data warehouse [2]
      • the data becomes available for analysis [1]
    • via Pipelines, Dataflows, cross-database querying, or the COPY INTO command (see the T-SQL sketch after this list)
    • no need to copy data from the lakehouse to the data warehouse [1]
      • one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]
  • {operation} data storage
    • involves storing the data in a format that is optimized for analytics [2]
  • {operation} data processing
    • involves transforming the data into a format that is ready for consumption by analytical tools [1]
  • {operation} data analysis and delivery
    • involves analyzing the data to gain insights and delivering those insights to the business [1]
  • {operation} designing a warehouse (aka warehouse design)
    • standard warehouse design can be used
  • {operation} sharing a warehouse (aka warehouse sharing)
    • a way to provide users read access to the warehouse for downstream consumption
      • via SQL, Spark, or Power BI
    • the level of permissions can be customized to provide the appropriate level of access
  • {feature} mirroring 
    • provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric
      • any database can be accessed and managed centrally from within Fabric without having to switch database clients
      • data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload
  • {feature} v-order
    • write-time optimization to the parquet file format that enables lightning-fast reads under the MF compute engine [5]
  • {feature} caching
    • stores frequently accessed data and metadata in a faster storage layer [6]
  • {feature} snapshots
    • {def} read-only representation of a warehouse at a specific point in time [7]
  • {feature} automatic purging
    • routinely and systematically eliminating expired data [8]
    • {benefit} proactively helps maintain an efficient and cost-effective data infrastructure [8]
      • via garbage collection
        • background process that periodically identifies and cleans [8]
          • all the data and log files of dropped tables
          • aborted transactions
          • temporary tables
          • expired files
        • executes every 24 hours, when the warehouse is active [8] 
        • ensures the data warehouse remains optimized and efficient [8]
    • {goal} optimize storage costs
    • {goal} minimize maintenance overhead
    • {goal} adhere to data retention regulations
  • {concept} SQL analytics endpoint 
    • a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]
  • {concept} virtual warehouse
    • can contain data from virtually any source by using shortcuts [3]
  • {concept} cross database querying 
    • enables one to quickly and seamlessly leverage multiple data sources for fast insights with zero data duplication [3]
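
The ingestion options above translate directly into T-SQL. Below is a minimal sketch, not an authoritative recipe: the table, storage account, container and SAS token are hypothetical placeholders, and the COPY options shown (FILE_TYPE, FIRSTROW, CREDENTIAL) are only a subset of what the command accepts:

    -- load CSV files from Azure storage into an existing warehouse table
    COPY INTO dbo.Sales
    FROM 'https://<account>.blob.core.windows.net/<container>/sales/*.csv'
    WITH (
        FILE_TYPE = 'CSV',
        FIRSTROW = 2, -- skip the header row
        CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
    );

    -- alternatively, query the lakehouse directly via cross-database querying,
    -- with no copy involved (the lakehouse name is a placeholder)
    SELECT TOP (100) *
    FROM SalesLakehouse.dbo.Orders;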

    References:
    [1] Microsoft Learn (2023) Fabric: Get started with data warehouses in Microsoft Fabric [link]
    [2] Microsoft Learn (2023) Fabric: Microsoft Fabric decision guide: choose a data store [link]
    [3] Microsoft Learn (2024) Fabric: What is data warehousing in Microsoft Fabric? [link]
    [4] Microsoft Learn (2023) Fabric: Better together: the lakehouse and warehouse [link]
    [5] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
    [6] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]
    [7] Microsoft Learn (2024) Fabric: Warehouse Snapshots in Microsoft Fabric (Preview) [link]
    [8] Microsoft Fabric Updates Blog (2025) Intelligent Data Cleanup: Smart Purging for Smarter Data Warehouses [link]

    Resources:
    [R1] Microsoft Learn (2023) Fabric: Data warehousing documentation in Microsoft Fabric [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

    Acronyms:
    ETL - Extract, Transform, Load
    MF - Microsoft Fabric

    18 April 2024

    🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

    Data Warehousing Series

    Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency, fully managed service that allows replicating data from various systems into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though more database vendors will probably be targeted soon.

    For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is of huge importance, because they no longer need to worry about making the data available in Microsoft Fabric, which otherwise involves a considerable amount of work.

    Usually, at least for flexibility, transparency, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake, from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which is sometimes a 10-15-minute effort per table when the process is standardized, though in some cases the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed.

    Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth in each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region or data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, with all the implications that derive from this - not having up-to-date data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done.

    With mirroring, the replication of data between data sources and the analytics platform is handled in the background: after an initial replication, the changes in the source systems are reflected into OneLake with near real-time latency, which is amazing! This allows building near real-time reporting solutions that can help the business in many ways - reviewing (and correcting in the data source) records en masse, a faster overview of what's happening in the organization, a faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. From this perspective alone, probably 10-20% of the effort of building an analytics solution is eliminated.

    Mirroring in Microsoft Fabric (adapted after [2])

    According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), and stop, restart, or remove a table from mirroring. Moreover, through sharing, users can grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1].

    Data professionals and data citizens can then write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query. This opens a lot of opportunities, especially for creating an enterprise semantic model, which should be differentiated from the default semantic model created by the mirroring together with the SQL analytics endpoint.
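
    As a minimal sketch of what such a cross-database query could look like - the database, schema and table names below are hypothetical, assuming a mirrored database and a lakehouse SQL analytics endpoint in the same workspace:

        -- join mirrored operational data with lakehouse reference data
        SELECT c.CustomerName, SUM(s.Amount) AS TotalAmount
        FROM MirroredCrmDb.dbo.Sales AS s
        INNER JOIN SalesLakehouse.dbo.DimCustomer AS c
            ON c.CustomerID = s.CustomerID
        GROUP BY c.CustomerName;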

    Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability, performance, and direct consumption in Power BI.
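
    Time travel, for instance, can be exercised from the SQL side as well. A sketch assuming the statement-level time-travel syntax documented for Fabric warehouses (the table name and timestamp are placeholders; availability for a given endpoint should be checked against the documentation):

        -- query the table as it looked at a given (UTC) point in time
        SELECT *
        FROM dbo.Sales
        OPTION (FOR TIMESTAMP AS OF '2024-04-18T08:00:00');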


    References:
    [1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
    [2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
    [3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
    [4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

    🏭🗒️Microsoft Fabric: Mirroring [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 18-Apr-2024

    Mirroring in Microsoft Fabric (adapted after [2])

    [Microsoft Fabric] Mirroring

    • low-cost, low-latency, fully managed service that allows replicating data from various systems into OneLake [1]
      • supported for Azure SQL Database, Azure Cosmos DB, and Snowflake [1]
      • ⇒ replaces ETL pipelines
      • ⇐ manages the replication of data into OneLake and conversion to Parquet [1]
      • changes have a near real-time latency [3]
      • uses the source database’s CDC feature [4]
      • zero-ETL, fully managed, and high-performance data movement mechanism [8] 
    • {benefit} allows users to use a highly integrated, end-to-end, and easy-to-use service [1]
      • the mirrored data can be used in Power BI
        • ⇐ since tables are all v-ordered delta tables [3]
    • {benefit} allows users to accelerate their journey into Fabric [1]
      • the delta tables can be used in every Fabric experience
      • {restriction} replication of views is currently not supported [3]
    • {benefit} allows writing cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of Lakehouses in a single T-SQL query [1]
    • {benefit} free Mirroring compute and storage for replicas in OneLake [5]
      • storage for replicas is free up to a limit
        • based on the capacity size [5]
        • mirroring offers a free TB of mirroring storage for every CU purchased [5]
      • compute is free and does not consume capacity
        • ⇐ used to replicate data into OneLake [5]
    • landing zone in OneLake 
      • stores both the snapshot and change data [3]
        • ⇐ improves performance when converting files into delta verti-parquet [3]
    • {operation} enabling mirroring
      • by creating a secure connection to the operational data source
      • an entire database or individual tables can be replicated 
      • creates three items in the targeted workspace
        • the mirrored database
        • a SQL analytics endpoint
        • a Default semantic model
      • the same source database can be mirrored multiple times
        • ⇐ though usually not needed
          • scenario: replicating data to different types of environments
    • {operation} stopping the mirroring
      • stops the replication in the source database, but a copy of the tables is kept in OneLake [3]
    • {operation} restarting the mirroring 
      • results in all data being replicated from the start [3]
    • {operation} removing a table from mirroring
      • the table is no longer replicated and its data is deleted from OneLake [3]
    • {operation} sharing
      • users grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]
        • access is also granted to the SQL analytics endpoint and associated default semantic model [1]
        • shared mirrored databases can be found through either
          • Data Hub
          • Shared with Me section in Microsoft Fabric
      • triggers an initial replication
    • {requirement} [licensing] requires Power BI Premium, Fabric Capacity, or Trial Capacity
    • {feature} monitoring
      • {benefit} allows gaining insights into mirroring operations and into when the replica in Fabric OneLake was last refreshed [4]
    • [Azure SQL Database] 
      • {restriction} doesn't support Azure SQL Database logical servers behind an AVN or private networking [2]
        • {requirement} update the Azure SQL logical server firewall rules to Allow public network access [2]
        • {requirement} enable the Allow Azure services option to connect to your Azure SQL Database logical server [2]
      • {restriction} access through the Power BI Gateway or behind a firewall is unsupported [3]
      • authentication
        • SQL authentication with user name and password
        • Microsoft Entra ID
        • Service Principal
      • {troubleshooting} check whether the changes flow properly (see the T-SQL sketch after this list)
        • via sys.dm_change_feed_log_scan_sessions DMV
      • {troubleshooting} check if there are any problems reported
        • via sys.dm_change_feed_errors DMV
      • {troubleshooting} check if the mirroring was properly enabled
        • via sp_help_change_feed stored procedure 
      • {troubleshooting} disable mirroring
        • via sp_change_feed_disable_db stored procedure
    • [SQL Server] [preview] 
      • creates an initial snapshot in Fabric OneLake, after which data is kept in sync in near real-time [7]
      • [SQL Server 2016-2022] 
        • relies on CDC to capture an initial snapshot of all the tables selected for mirroring and thereafter replicate the changes [7]
          • the mirroring service connects to read the initial snapshot as well as the changes, pulls the data into OneLake and converts it into an analytics-ready format in Fabric [7]
        • via on-premises data gateway (OPDG) 
          • must be installed in the SQL Server environment [7]
        • applies to both VMs and on-premises installations [5]
      • [SQL Server 2025] supports the latest change feed technology for replicating changes to ensure that there is minimal impact to operational workload [5]
        • keeps track and replicates the initial snapshot and changes to the landing zone in OneLake which is then converted to an analytics-ready format by the mirroring engine in Fabric [7]
        • {requirement} on-premises data gateway
          • primarily used as a control plane to connect and authenticate the on-premises environment to Fabric [7]
        • {requirement} Arc Agent
          • handles outbound authentication from SQL Server to Fabric [7]
    • [Azure SQL MI]
      • {feature|preview} firewall support
        • via data gateway
          • ensures secure connections to the source databases via private endpoint [5]
            • removes the necessity of opening public access [5]
      • {feature|preview} mirror tables without PKs 
        • {benefit} increased flexibility
      • {feature|preview} expanded DDL
        • allows truncating tables while mirroring is actively replicating data [5]
    • {feature} open mirroring
      • extension of Mirroring capabilities in Fabric [5]
        • creates a copy of the data in OneLake and keeps it up to date [6]
        • provides APIs to replicate data from anywhere
        • data is converted to parquet or CSV format and the API or UI is used to load the data into the OneLake landing zone along with any additional changes [6]
        • the Fabric replication technology converts everything to a Delta format making it optimized and ready for AI and BI workloads [6]
      • {benefit} empowers vendors to build their own Mirroring solutions 
        • leverages the same Mirroring engine to help manage inserts/updates/deletes efficiently [5]
      • cheaper for read-heavy data sources [8]
      • great for unique data requirements and regulated environments [8]
    • {scenario} analyze operational data in near real-time [8]
    • {scenario} avoid managing ETL pipelines [8]
    • {scenario} single source of truth across both operational and analytical systems [8]
    • {scenario} building a Lakehouse architecture and want real-time updates [8]
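
    A minimal troubleshooting sketch for an Azure SQL Database source, run against the source database (not against Fabric), using the change feed objects listed above:

        -- check whether change feed scan sessions are progressing
        SELECT * FROM sys.dm_change_feed_log_scan_sessions;

        -- check whether any replication errors were reported
        SELECT * FROM sys.dm_change_feed_errors;

        -- confirm that mirroring (the change feed) was properly enabled
        EXEC sp_help_change_feed;

        -- disable mirroring on the source database, if needed
        EXEC sp_change_feed_disable_db;
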
    Acronyms:
    AVN - Azure Virtual Network
    CDC - Change Data Capture
    CU - Capacity Unit
    DDL - Data Definition Language
    DMV - Dynamic Management View
    ETL - Extract, Transform, Load
    MI - Managed Instance
    OPDG - On-Premises Data Gateway
    PK - Primary Key
    TB - terabyte

    References:
    [1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? [link]
    [2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] [link]
    [3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] [link]
    [4] Microsoft Fabric Updates Blog (2023) Introducing Mirroring in Microsoft Fabric [link]
    [5] Microsoft Fabric Updates Blog (2025) What’s new with Mirroring in Fabric at Microsoft Build 2025 [link]
    [6] Microsoft Fabric Updates Blog (2025) Announcing the General Availability of Open Mirroring [link]
    [7] Microsoft Fabric Updates Blog (2025) Mirroring for SQL Server in Microsoft Fabric (Preview) [link]
    [8] Microsoft Fabric Updates Blog (2025) Mirroring in Microsoft Fabric explained: benefits, use cases, and pricing demystified [link]

    Resources:
    [R1] Tutorial: Configure Microsoft Fabric mirrored databases from Azure SQL Database [Preview] [link]
    [R2] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb [link]
    [R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R4] Microsoft (2025) Fabric Roadmap [link]

    03 June 2012

    📦Data Migrations (DM): What is Data Migration?

    Data Migrations Series

    If you are working in a data-centric business, it's almost impossible for the average worker not to have heard this term, even tangentially. Considering the meaning of "migration" - the act or process of moving from one place to another - intuition might even tell what data migration is about: the process of moving data from one place to another. It's pretty basic, isn't it? Now, as data are moved over and over again between various places - for example between the various layers of an application, between databases, between storage media, and so on - we need some precision in defining the term, because not all of these can be considered examples of data migration. Actually, we can talk about data copying or data movement without speaking of data migration. So, what is data migration? Here are a few takes on defining it:

    “process of transferring data from one platform or operating system to another” (Babylon)

    "Data migration is the process of transferring data between storage types, formats, or computer systems." (Wikipedia)

     "Data migration is the movement of legacy data to new media and technologies as the older ones are displaced." (Toolbox)

     “The purpose of data migration is to transfer existing data to the new environment.” (Talend)

     “Data Migration is the process of moving data from one or more sources into a target application” (Utopia Inc.)

    “[…] is the one off selection, preparation and transportation of appropriate data, of the right quality, to the right place, at the right time.” (J. Morris)

    Summarizing the above definitions, data migration can be defined as “the process of selecting, assessing, converting, preparing, validating and moving data from one or more information systems to another system”. The definition isn't at all perfect, first of all because some of the terms need further explanation, secondly because any of the steps may be skipped or other important steps can be identified in the process, and thirdly because further clarifications are needed. Anyway, it offers some precision and, at least for this reason, could be preferred to the above definitions.

    So, to summarize, data migration supposes the movement of data from one or more information systems, referred to as source systems, to another one, the target system. Typically the new system replaces the old systems, which are retired, or they can continue to be used with reduced scope, for example for reporting purposes. Even if performed in stages, the movement is typically a one-time activity, so everything has to be perfect. That's the purpose of the other steps - to minimize the risks of something going wrong. The choice of steps and their complexity depends on the type of information systems involved, on the degree of resemblance between source and target, business needs, etc.

    As mentioned above, not everything that involves data movement can be considered data migration. For example, data integration involves the movement and combination of data from various information systems in order to provide a unified view. Data synchronization involves the movement of data in order to reflect the changes of data in one information system into another, when data from the two systems need to be consistent. Data mirroring involves the synchronization of data, though it implies an exact copy of the data, the mirroring occurring continuously in real time. Data backup involves the movement/copy of data at a given point in time for eventual restore in case of data loss. Data transfer refers to the movement of raw data between the layers of information systems. To make things even fuzzier, these types of data movement can be considered part of a data migration too, as data may need to be locally integrated, synchronized, transferred, mirrored or backed up. Data migration is, overall, a complex topic.

