
26 March 2025

🏭🗒️Microsoft Fabric: External Data Sharing [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 26-Mar-2025

External data sharing [1]

[Microsoft Fabric] External data sharing 

  • {def} feature that enables Fabric users to share data from their tenant with users in another Fabric tenant (aka cross-tenant sharing) [1]
    • the data is shared in-place from OneLake storage locations in the sharer's tenant [1]
      • ⇒ no data is actually copied to the other tenant [1]
      • creates a OneLake shortcut in the other tenant that points back to the original data in the sharer's tenant [1]
      • data is exposed as read-only [1]
      • data can be consumed by any OneLake compatible Fabric workload in that tenant [1]
    • {benefit} allows for efficient and secure data sharing without duplicating data
      • the shared data remains read-only for the consumer, ensuring data integrity and consistency [2]
      • multiple tables and folders can be shared at once [2]
  • {prerequisite} Fabric admins must turn on external data sharing both in the sharer's tenant and in the external tenant
    • by specifying who can create and accept external data shares [1]
    • users can share data residing in tables or files within supported Fabric items [1]
      • require standard Fabric read and reshare permissions for the item being shared [1]
      • the user creating the share invites a user from another tenant with a link to accept the external data share [1]
        • upon accepting the share, the recipient chooses a lakehouse where a shortcut to the shared data will be created [1]
        • the links work only for users in external tenants
          • for sharing data within the same OneLake storage accounts with users in the same tenant, use OneLake shortcuts [1]
        • {limitation} shortcuts contained in folders that are shared via external data sharing won't resolve in the consumer tenant [1]
    • access is enabled via a dedicated Fabric-to-Fabric authentication mechanism 
    • ⇐ doesn’t require Entra B2B guest user access [1]
  • {operation} create an external data share (in the provider tenant)
    • external data shares can be created for tables or files in lakehouses, warehouses, KQL databases, SQL databases, and mirrored databases [1]
    • {limitation} the sharer can't control who has access to the data in the consumer's tenant [1]
  • {operation} accept an external data share (in the consuming tenant)
    • only lakehouses can be chosen for the operation
    • the consumer can grant access to the data to anyone [1]
      • incl. guest users from outside the consumer's organization [1]
    • data can be transferred across geographic boundaries when it's accessed within the consumer's tenant [1]
  • {operation} revoke external data shares
    • any user in the sharing tenant with read and reshare permissions on an externally shared item can revoke the external data share at any time [1]
      • via Manage permissions >> External data shares tab
      • can be performed for any item the user has read and reshare permissions on [3]
      • {warning} a revoked external data share can't be restored [3] 
        • irreversibly severs all access from the receiving tenant to the shared data [3]
        • a new external data share can be created instead [3]
  • applies to
    • lakehouse
      • an entire lakehouse schema can be shared [2]
        • shares all the tables in the schema [2]
        • any changes to the schema are immediately reflected in the consumer’s lakehouse [2]
    • mirrored database
    • KQL database
    • OneLake catalog
  • can be consumed via 
    • Spark workloads
      • notebooks or Spark job definitions
    • lakehouse SQL Analytics Endpoint
    • semantic models
    • ⇐ data can be shared from a provider and consumed in-place via SQL queries or in a Power BI report [2] (see the T-SQL sketch after these notes)
  • {feature} external data sharing APIs
    • support service principals for admin and user operations
      • can be used to automate the creation or management of shares [2]
        • supports service principals or managed identity [2]
    • {planned} data warehouse support
      • share schemas or tables from a data warehouse [2]
    • {planned} shortcut sharing
      • share OneLake shortcuts using external data sharing [2]
      • ⇒ data residing outside of Fabric will be externally shareable
        • S3, ADLS Gen2, or any other supported shortcut location [2]
    • {planned} consumer APIs
      • consumer activities, including viewing share details and accepting shares, will soon be available via API [2]
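
A minimal T-SQL sketch of the consumption side, assuming the share was accepted into a lakehouse and the resulting shortcut surfaced through the SQL analytics endpoint as a table named SharedSales (a hypothetical name):

-- consumer tenant: query the shared data in place (read-only)
SELECT TOP (100) *
FROM [dbo].[SharedSales];

-- aggregations work as for any other lakehouse table
SELECT COUNT(*) AS record_count
FROM [dbo].[SharedSales];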

References:
[1] Microsoft Learn (2024) External data sharing in Microsoft Fabric [link]
[2] Microsoft Fabric Updates Blog (2025) External data sharing enhancements out now [link]
[3] Microsoft Learn (2024) Fabric: Manage external data shares [link]

Resources:
[R1] Analytics on Azure Blog (2024) External Data Sharing With Microsoft Fabric [link]
[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADLS - Azure Data Lake Storage
API - Application Programming Interface
B2B - business-to-business
KQL - Kusto Query Language

16 March 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part XII: Databases)

After reviewing the server and database properties, the next step is to review the database configurations using the sys.databases metadata. There's such an object even for warehouses and lakehouses, respectively for mirrored databases.

Except for the last attribute in the query, all the other attributes are supported by all the above-mentioned repositories. 

-- database metadata
SELECT db.name
, db.database_id
--, db.source_database_id
, db.owner_sid
, db.create_date
, db.compatibility_level
, db.collation_name
--, db.user_access
, db.user_access_desc
--, db.is_read_only
--, db.is_auto_close_on
--, db.is_auto_shrink_on
--, db.state
, db.state_desc
--, db.is_in_standby
--, db.is_cleanly_shutdown
--, db.is_supplemental_logging_enabled
--, db.snapshot_isolation_state
, db.snapshot_isolation_state_desc
--, db.is_read_committed_snapshot_on
--, db.recovery_model
, db.recovery_model_desc
, db.page_verify_option
, db.page_verify_option_desc
, db.is_auto_create_stats_on
--, db.is_auto_create_stats_incremental_on
, db.is_auto_update_stats_on
--, db.is_auto_update_stats_async_on
--, db.is_ansi_null_default_on
--, db.is_ansi_nulls_on
--, db.is_ansi_padding_on
--, db.is_ansi_warnings_on
--, db.is_arithabort_on
--, db.is_concat_null_yields_null_on
--, db.is_numeric_roundabort_on
--, db.is_quoted_identifier_on
--, db.is_recursive_triggers_on
--, db.is_cursor_close_on_commit_on
--, db.is_local_cursor_default
, db.is_fulltext_enabled
--, db.is_trustworthy_on
, db.is_db_chaining_on
--, db.is_parameterization_forced
, db.is_master_key_encrypted_by_server
--, db.is_query_store_on
--, db.is_published
--, db.is_subscribed
--, db.is_merge_published
--, db.is_distributor
--, db.is_sync_with_backup
, db.service_broker_guid
--, db.is_broker_enabled
--, db.log_reuse_wait
, db.log_reuse_wait_desc
--, db.is_date_correlation_on
--, db.is_cdc_enabled
--, db.is_encrypted
--, db.is_honor_broker_priority_on
--, db.replica_id
--, db.group_database_id
--, db.resource_pool_id
--, db.default_language_lcid
--, db.default_language_name
--, db.default_fulltext_language_lcid
--, db.default_fulltext_language_name
--, db.is_nested_triggers_on
--, db.is_transform_noise_words_on
--, db.two_digit_year_cutoff
--, db.containment
--, db.containment_desc
, db.target_recovery_time_in_seconds
--, db.delayed_durability
--, db.delayed_durability_desc
--, db.is_memory_optimized_elevate_to_snapshot_on
--, db.is_federation_member
--, db.is_remote_data_archive_enabled
--, db.is_mixed_page_allocation_on
, db.is_temporal_history_retention_enabled
--, db.catalog_collation_type
, db.catalog_collation_type_desc
, db.physical_database_name
--, db.is_result_set_caching_on
, db.is_accelerated_database_recovery_on
--, db.is_tempdb_spill_to_remote_store
--, db.is_stale_page_detection_on
, db.is_memory_optimized_enabled
, db.is_data_retention_enabled
--, db.is_ledger_on
--, db.is_change_feed_enabled
--, db.is_data_lake_replication_enabled
--, db.is_change_streams_enabled
--, db.data_lake_log_publishing
, db.data_lake_log_publishing_desc
, db.is_vorder_enabled
--, db.is_optimized_locking_on
FROM sys.databases db

Output (consolidated):

Attribute | Warehouse | Mirrored Db | Lakehouse | SQL database
name | Warehouse Test 001 | MirroredDatabase_1 | Lakehouse_DWH | SQL DB Test...
database_id | 5 | 6 | 7 | 28
owner_sid | 0x01010000000000051... | 0x01010000000000051... | x01010000000000051... | AAAAAAWQAAAAAA...
create_date | 2025-02-22T18:56:28.700 | 2025-03-16T14:57:56.600 | 2025-03-16T15:07:59.563 | 2025-02-22T03:01:53.7130000
compatibility_level | 160 | 160 | 160 | 160
collation_name | Latin1_General_100_BIN2_UTF8 | Latin1_General_100_BIN2_UTF8 | Latin1_General_100_BIN2_UTF8 | SQL_Latin1_General_CP1_CI_AS
user_access_desc | MULTI_USER | MULTI_USER | MULTI_USER | MULTI_USER
state_desc | ONLINE | ONLINE | ONLINE | ONLINE
snapshot_isolation_state_desc | ON | ON | ON | ON
recovery_model_desc | SIMPLE | SIMPLE | SIMPLE | FULL
page_verify_option | 0 | 0 | 0 | 2
page_verify_option_desc | NONE | NONE | NONE | CHECKSUM
is_auto_create_stats_on | 1 | 1 | 1 | 1
is_auto_update_stats_on | 1 | 1 | 1 | 1
is_fulltext_enabled | 1 | 1 | 1 | 1
is_db_chaining_on | 0 | 0 | 0 | 0
is_master_key_encrypted_by_server | 0 | 0 | 0 | 0
service_broker_guid | 1F2261FC-5031-... | 7D882362-567E-... | 0D8938AB-BA79-... | 2b74bed3-4405-...
log_reuse_wait_desc | NOTHING | NOTHING | NOTHING | NOTHING
target_recovery_time_in_seconds | 60 | 60 | 60 | 60
is_temporal_history_retention_enabled | 1 | 1 | 1 | 1
catalog_collation_type_desc | DATABASE_DEFAULT | DATABASE_DEFAULT | DATABASE_DEFAULT | SQL_Latin1_General_CP1_CI_AS
physical_database_name | Warehouse Test 001 | MirroredDatabase_1 | Lakehouse_DWH | 3f4a3e79-e53e-...
is_accelerated_database_recovery_on | 1 | 1 | 1 | 1
is_memory_optimized_enabled | 1 | 1 | 1 | 1
is_data_retention_enabled | 1 | 1 | 1 | 1
data_lake_log_publishing_desc | AUTO | UNSUPPORTED | UNSUPPORTED | UNSUPPORTED
is_vorder_enabled | 1 | 1 | 1 | 0

The output for the SQL database was formatted slightly differently and is_vorder_enabled is not available. Otherwise, the above query can be used in all environments. 

All attributes except the last two are known from the earlier versions of SQL Server. is_vorder_enabled reflects the current status of V-Order [2], while data_lake_log_publishing_desc reflects the current state of Delta Lake log publishing [3].
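
A narrowed-down version of the query above can be used to check only the two Fabric-specific attributes (drop is_vorder_enabled where it isn't available):

-- Fabric-specific attributes only
SELECT db.name
, db.is_vorder_enabled
, db.data_lake_log_publishing_desc
FROM sys.databases db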

Consolidating the output from different sources helps identify the differences and one can easily use Excel formulas for this.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2025) SQL Server 2022: sys.databases (Transact-SQL) [link]
[2] Microsoft Learn (2025) Fabric: Disable V-Order on Warehouse in Microsoft Fabric [link]
[3] Microsoft Learn (2025) Fabric: Delta Lake logs in Warehouse in Microsoft Fabric [link]

13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 12-Feb-2024

[Microsoft Fabric] SQL Analytics Endpoint

  • {def} a service that listens actively for requests providing a SQL-based experience for lakehouse Delta tables [1]
    • enables querying data in the lakehouse using the T-SQL language and the TDS protocol [1]
    • created by default for each lakehouse and mirrored database provisioned in a workspace
      • every lakehouse has one SQL analytics endpoint [1]
    • manages the automatically generated tables so the workspace users can't modify them [1]
  • {feature} a background process is responsible for 
    • scanning lakehouse for changes [1]
      • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
        • tables are created with a minor delay [1]
        • data is refreshed automatically
          • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
        • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
    • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
      • {characteristic} transparently managed
        • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
        • when a change is detected in the lakehouse
          • a background process updates metadata [1]
          • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
      • {characteristic} low latency
        • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
        • the actual length of time can vary from a few seconds to minutes [1]
      • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
        • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
          •  ⇐ provides a seamless experience [3]
  • {feature} can be enriched by adding database objects (see the T-SQL sketch after these notes)
    • schemas
    • views
    • procedures
    • other database objects
  • {feature} automatic metadata discovery
    • tracks changes committed to lakehouses [1]
      • is a single instance per Fabric workspace [1]
      • {issue} increased latency for changes to sync between lakehouses and the SQL analytics endpoint; it could be due to a large number of lakehouses in one workspace [1]
        • {resolution} migrate each lakehouse to a separate workspace [1]
          • allows automatic metadata discovery to scale [1]
      • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint [1]
        • e.g. a newly created table in the lakehouse isn't listed in the endpoint [1]
        • e.g. a large number of committed rows may not be visible in the endpoint [1]
        • ⇐ may be impacted by ETL processing that generates large volumes of changes
        • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]
          • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
      • {issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead, which impacts the time it takes to sync changes to the SQL analytics endpoint [1]
        • {recommendation} schedule regular lakehouse table maintenance operations [1]
      • {limitation} not all Delta features are understood by the automatic sync process [1]
  • {feature} endpoint reprovisioning
    • if the initial provisioning attempt fails, users have the option to try again [2]
      • ⇐ without the need to create an entirely new lakehouse [2]
      • {benefit} empowers users to self-mitigate provisioning issues in convenient way in the UI avoiding the need for complete lakehouse re-creation [2]
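
A minimal sketch of such an enrichment via T-SQL, run against the SQL analytics endpoint (the schema, view, and table names are made up for illustration):

-- create a schema for the additional objects
CREATE SCHEMA reporting;
GO

-- create a view over an automatically generated lakehouse table
CREATE VIEW reporting.vw_SalesSummary
AS
SELECT CustomerId
, SUM(Amount) AS TotalAmount
FROM dbo.SalesOrders
GROUP BY CustomerId;
GO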

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider: 

Area | Lakehouse | Warehouse | Eventhouse | Fabric SQL database | Power BI Datamart
Data volume | Unlimited | Unlimited | Unlimited | 4 TB | Up to 100 GB
Type of data | Unstructured, semi-structured, structured | Structured, semi-structured (JSON) | Unstructured, semi-structured, structured | Structured, semi-structured, unstructured | Structured
Primary developer persona | Data engineer, data scientist | Data warehouse developer, data architect, data engineer, database developer | App developer, data scientist, data engineer | AI developer, App developer, database developer, DB admin | Data scientist, data analyst
Primary dev skill | Spark (Scala, PySpark, Spark SQL, R) | SQL | No code, KQL, SQL | SQL | No code, SQL
Data organized by | Folders and files, databases, and tables | Databases, schemas, and tables | Databases, schemas, and tables | Databases, schemas, tables | Database, tables, queries
Read operations | Spark, T-SQL | T-SQL, Spark* | KQL, T-SQL, Spark | T-SQL | Spark, T-SQL
Write operations | Spark (Scala, PySpark, Spark SQL, R) | T-SQL | KQL, Spark, connector ecosystem | T-SQL | Dataflows, T-SQL
Multi-table transactions | No | Yes | Yes, for multi-table ingestion | Yes, full ACID compliance | No
Primary development interface | Spark notebooks, Spark job definitions | SQL scripts | KQL Queryset, KQL Database | SQL scripts | Power BI
Security | RLS, CLS**, table level (T-SQL), none for Spark | Object level, RLS, CLS, DDL/DML, dynamic data masking | RLS | Object level, RLS, CLS, DDL/DML, dynamic data masking | Built-in RLS editor
Access data via shortcuts | Yes | Yes | Yes | Yes | No
Can be a source for shortcuts | Yes (files and tables) | Yes (tables) | Yes | Yes (tables) | No
Query across items | Yes | Yes | Yes | Yes | No
Advanced analytics | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Time Series native elements, full geo-spatial and query capabilities | T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics | Interface for data processing with automated performance tuning
Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON | Table support for OLTP, JSON, vector, graph, XML, spatial, key-value | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format
Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency | Available instantly for querying | Available instantly for querying

It can be used as a map of what one needs to know for using each feature, respectively to identify how one can leverage previous experience, and here I'm referring to the many SQL developers. One must also consider the capabilities and limitations of each storage repository. 

However, what I'm missing is some references regarding the performance of data access, especially compared with on-premises workloads. Moreover, the devil hides in the details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given its scope. However, probably the most important aspect is whether it can scale with the volume of reads/writes as currently the SQL endpoint seems to lag. 
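
Since the warehouse supports multi-table transactions, the write pattern missing in the lakehouse looks roughly as below in T-SQL (table names are hypothetical; this applies to the warehouse only, as the lakehouse SQL analytics endpoint is read-only):

-- atomic write across two warehouse tables
BEGIN TRANSACTION;

INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate)
VALUES (1001, 42, '2024-12-18');

INSERT INTO dbo.OrderLines (OrderId, ProductId, Quantity)
VALUES (1001, 7, 3);

COMMIT TRANSACTION;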

The warehouse seems to be more versatile, though careful attention needs to be given to its design. 

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting how developers cope with its lack of functionality in some areas. 

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios. Starting with the 28th of March 2025, SQL databases will be ON by default, and tenant admins who don't want this must manually turn the setting OFF before that date [3].

Power BI datamarts have been in preview for a couple of years.


References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]
[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]
[3] Microsoft Fabric Update Blog (2025) Extending flexibility: default checkbox changes on tenant settings for SQL database in Fabric [link]
[4] Microsoft Fabric Update Blog (2025) Enhancing SQL database in Fabric: share your feedback and shape the future [link]
[5] Microsoft Fabric Update Blog (2025) Why SQL database in Fabric is the best choice for low-code/no-code Developers [link]

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Mar-2025

[Microsoft Fabric] Shortcut

  • {def} object that points to another internal or external storage location (aka shortcut) [1] and that can be used for data access
    • {goal} unifies existing data without copying or moving it [2]
      • ⇒ data can be used multiple times without being duplicated [2]
      • {benefit} helps to eliminate edge copies of data [1]
      • {benefit} reduces process latency associated with data copies and staging [1]
    • is a mechanism that allows to unify data across domains, clouds, and accounts through a namespace [1]
      • ⇒ allows creating a single virtual data lake for the entire enterprise [1]
      • ⇐ available in all Fabric experiences [1]
      • ⇐ behave like symbolic links [1]
    • independent object from the target [1]
    • appear as folder [1]
    • can be used by workloads or services that have access to OneLake [1]
    • transparent to any service accessing data through the OneLake API [1]
    • can point to 
      • OneLake locations
      • ADLS Gen2 storage accounts
      • Amazon S3 storage accounts
      • Dataverse
      • on-premises or network-restricted locations via OPDG 
  • {capability} create shortcut to consolidate data across artifacts or workspaces, without changing data's ownership [2]
  • {capability} data can be composed throughout OneLake without any data movement [2]
  • {capability} allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement [2]
    • ⇐ makes OneLake the first multi-cloud data lake [2]
  • {capability} provides support for industry standard APIs
    • ⇒ OneLake data can be directly accessed via shortcuts by any application or service [2]
  • {operation} creating a shortcut
    • can be created in 
      • lakehouses
      • KQL databases
        • ⇐ shortcuts are recognized as external tables [1]
    • can be created via 
      • Fabric UI 
      • REST API
    • can be created across items [1]
      • the item types don't need to match [1]
        • e.g. create a shortcut in a lakehouse that points to data in a data warehouse [1]
    • [lakehouse] tables folder
      • represents the managed portion of the lakehouse 
        • shortcuts can be created only at the top level [1]
          • ⇒ shortcuts aren't supported in other subdirectories [1]
        • if shortcut's target contains data in the Delta\Parquet format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table [1]
    • [lakehouse] files folder
      • represents the unmanaged portion of the lakehouse [1]
      • there are no restrictions on where shortcuts can be created [1]
        • ⇒ can be created at any level of the folder hierarchy [1]
        • ⇐ table discovery doesn't happen in the Files folder [1]
    • [lakehouse] all shortcuts are accessed in a delegated mode when querying through the SQL analytics endpoint [5] (see the T-SQL sketch after these notes)
      • the delegated identity is the Fabric user that owns the lakehouse [5]
        • {default} the owner is the user that created the lakehouse and SQL analytics endpoint [5]
          •  ⇐ can be changed in select cases 
          • the current owner is displayed in the Owner column in Fabric when viewing the item in the workspace item list
        • ⇒ the querying user is able to read from shortcut tables if the owner has access to the underlying data, not the user executing the query [5]
          • ⇐ the querying user only needs access to select from the shortcut table [5]
      •  {feature} OneLake data access roles 
        • {enabled} access to a shortcut is determined by whether the SQL analytics endpoint owner has access to see the target lakehouse and read the table through a OneLake data access role [5]
        • {disabled} shortcut access is determined by whether the SQL analytics endpoint owner has the Read and ReadAll permission on the target path [5]
  • {operation} renaming a shortcut
  • {operation} moving a shortcut
  • {operation} deleting a shortcut 
    • doesn't affect the target [1]
      • ⇐ only the shortcut object is deleted [1]
    • shortcuts don't perform cascading deletes [1]
    • moving, renaming, or deleting a target path can break the shortcut [1]
  • {operation} delete file/folder
    • a file or folder within a shortcut can be deleted when the permissions in the shortcut target allow it [1]
  • {permissions} users must have permissions in the target location to read the data [1]
    • when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1]
    • when accessing shortcuts through Power BI semantic models or T-SQL, the calling user’s identity is not passed through to the shortcut target [1]
      •  the calling item owner’s identity is passed instead, delegating access to the calling user [1]
    • OneLake manages all permissions and credentials
  • {feature} shortcut caching 
    • {def} mechanism used to reduce egress costs associated with cross-cloud data access [1]
      • when files are read through an external shortcut, the files are stored in a cache for the Fabric workspace [1]
        • subsequent read requests are served from cache rather than the remote storage provider [1]
        • cached files have a retention period of 24 hours
        • each time the file is accessed the retention period is reset [1]
        • if the file in remote storage provider is more recent than the file in the cache, the request is served from remote storage provider and the updated file will be stored in cache [1]
        • if a file hasn’t been accessed for more than 24hrs it is purged from the cache [1]
    • {restriction} individual files greater than 1 GB in size are not cached [1]
    • {restriction} only GCS, S3 and S3 compatible shortcuts are supported [1]
  • {feature} query acceleration
    • caches data as it lands in OneLake, providing performance comparable to ingesting data in Eventhouse [4]
  • {limitation} maximum number of shortcuts [1] 
    • per Fabric item: 100,000
    • in a single OneLake path: 10
    • direct shortcuts to shortcut links: 5
  • {limitation} ADLS and S3 shortcut target paths can't contain any reserved characters from RFC 3986 section 2.2 [1]
  • {limitation} shortcut names, parent paths, and target paths can't contain "%" or "+" characters [1]
  • {limitation} shortcuts don't support non-Latin characters [1]
  • {limitation} Copy Blob API not supported for ADLS or S3 shortcuts [1]
  • {limitation} copy function doesn't work on shortcuts that directly point to ADLS containers
    • {recommended} create ADLS shortcuts to a directory that is at least one level below a container [1]
  • {limitation} additional shortcuts can't be created inside ADLS or S3 shortcuts [1]
  • {limitation} lineage for shortcuts to Data Warehouses and Semantic Models is not currently available [1]
  • {limitation} it may take up to a minute for the Table API to recognize new shortcuts [1]
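
Querying a shortcut through the SQL analytics endpoint is a plain SELECT; the delegation described above happens behind the scenes. A small sketch with hypothetical names, where Sales_Shortcut is a shortcut under the Tables folder and Customers a regular lakehouse table:

-- a shortcut under Tables appears like any other table
SELECT TOP (100) s.*
FROM [dbo].[Sales_Shortcut] s;

-- joining local and shortcut tables works the same way
SELECT c.Region
, SUM(s.Amount) AS TotalAmount
FROM [dbo].[Sales_Shortcut] s
JOIN [dbo].[Customers] c
  ON s.CustomerId = c.CustomerId
GROUP BY c.Region;
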
References:
[1] Microsoft Learn (2024) Fabric: OneLake shortcuts [link]
[2] Microsoft Learn (2024) Fabric Analyst in a Day [course notes]
[3] Microsoft Learn (2024) Use OneLake shortcuts to access data across capacities: Even when the producing capacity is paused! [link]
[4] Microsoft Learn (2024) Fabric: Query acceleration for OneLake shortcuts - overview (preview) [link]
[5] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

Acronyms:
ADLS - Azure Data Lake Storage
API - Application Programming Interface
AWS - Amazon Web Services
GCS - Google Cloud Storage
KQL - Kusto Query Language
OPDG - on-premises data gateway

10 November 2024

🏭🗒️Microsoft Fabric: Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Warehouse

  • {def} highly available relational data warehouse that can be used to store and query data in the Lakehouse
    • supports the full transactional T-SQL capabilities 
    • modernized version of the traditional data warehouse
    • unifies capabilities from Synapse Dedicated and Serverless SQL Pools
  • resources are managed elastically to provide the best possible performance
    • ⇒ no need to think about indexing or distribution
    • a new parser improves CSV file ingestion time
    • metadata is cached in addition to data
    • compute resource assignment is improved to milliseconds
    • multi-TB result sets are streamed to the client
  • leverages a distributed query processing engine
    • provides workloads with a natural isolation boundary [3]
      • true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]
  • {operation} data ingestion
    • involves moving data from source systems into the data warehouse [2]
      • the data becomes available for analysis [1]
    • via Pipelines, Dataflows, cross-database querying, COPY INTO command
    • no need to copy data from the lakehouse to the data warehouse [1]
      • one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]
  • {operation} data storage
    • involves storing the data in a format that is optimized for analytics [2]
  • {operation} data processing
    • involves transforming the data into a format that is ready for consumption by analytical tools [1]
  • {operation} data analysis and delivery
    • involves analyzing the data to gain insights and delivering those insights to the business [1]
  • {operation} designing a warehouse (aka warehouse design)
    • standard warehouse design can be used
  • {operation} sharing a warehouse (aka warehouse sharing)
    • a way to provide users read access to the warehouse for downstream consumption
      • via SQL, Spark, or Power BI
    • the level of permissions can be customized to provide the appropriate level of access
  • {feature} mirroring 
    • provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric
      • any database can be accessed and managed centrally from within Fabric without having to switch database clients
      • data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload
  • {feature} v-order
    • write time optimization to the parquet file format that enables lightning-fast reads under the MF compute engine [5]
  • {feature} caching
    • stores frequently accessed data and metadata in a faster storage layer [6]
  • {concept} SQL analytics endpoint 
    • a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]
  • {concept} virtual warehouse
    • can contain data from virtually any source by using shortcuts [3]
  • {concept} cross database querying 
    • enables to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3]
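
    The cross-database querying mentioned above boils down to three-part naming in T-SQL. A minimal sketch, assuming a lakehouse named SalesLakehouse and a warehouse dimension table dbo.DimCustomer in the same workspace (both names hypothetical):

    -- query a lakehouse table directly from the warehouse (no data copy)
    SELECT TOP (100) s.*
    FROM [SalesLakehouse].[dbo].[SalesOrders] s;

    -- join warehouse and lakehouse data in the same statement
    SELECT c.CustomerName
    , SUM(s.Amount) AS TotalAmount
    FROM [SalesLakehouse].[dbo].[SalesOrders] s
    JOIN [dbo].[DimCustomer] c
      ON s.CustomerId = c.CustomerId
    GROUP BY c.CustomerName;

    For the ingestion path, the COPY INTO command loads files from Azure storage into a warehouse table; a sketch with a placeholder URL and SAS token (check the COPY INTO documentation for the exact options):

    -- load parquet files from ADLS Gen2/Blob storage into a warehouse table
    COPY INTO [dbo].[SalesOrders]
    FROM 'https://<account>.blob.core.windows.net/<container>/sales/*.parquet'
    WITH (
        FILE_TYPE = 'PARQUET',
        CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
    );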

    References:
    [1] Microsoft Learn (2023) Fabric: Get started with data warehouses in Microsoft Fabric (link)
    [2] Microsoft Learn (2023) Fabric: Microsoft Fabric decision guide: choose a data store (link)
    [3] Microsoft Learn (2024) Fabric: What is data warehousing in Microsoft Fabric? (link)
    [4] Microsoft Learn (2023) Fabric: Better together: the lakehouse and warehouse (link)
    [5] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
    [6] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]

    Resources:
    [R1] Microsoft Learn (2023) Fabric: Data warehousing documentation in Microsoft Fabric (link)
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

    Acronyms:
    ETL - Extract, Transfer, Load
    MF - Microsoft Fabric

    13 April 2024

    📊R Language: Using the lessR Package in Microsoft Fabric's Notebooks (Test Drive)

    I've started to use the R language again for data visualizations. Discovering the lessR package, which simplifies considerably the verbose syntax of the R language by encapsulating the functionality behind simple functions, I wondered whether it can be installed in Microsoft Fabric and used from notebooks. Besides the available documentation, for learning I also used David W Gerbing's book on R visualizations [3].

    In a new notebook, I used one cell for each installation or package retrieval (see [1], [2]):

    #installing packages
    #install.packages("tidyverse") #is in Microsoft Fabric preinstalled
    install.packages("lessR")
    
    #retrieve packages from library
    library("tidyverse")
    library("lessR")
    

    I attempted to read the data from an http location via the lessR Read function and it worked:

    d <- Read("http://lessRstats.com/data/employee.xlsx")
    

    head(d)

    However, attempting to use any of the lessR functions used for visualization displayed only the text output and not the visualizations. No matter what I did - suppressing the text, suppressing the generation of PDF files, the result was the same. It seems to be a problem with the output device, though I'm not sure how to solve this yet. 

    # suppressing the text
    style(quiet=TRUE)
    
    # reenabling the text
    style(quiet=FALSE)
    
    # suppressing PDF generation
    pdf(NULL)
    

    # retrieving current device used 
    options()$device

    I was able to run the ggplot2 scripts from [3], though only when lessR was also installed (each script should be run in its own cell, otherwise only the last plot is shown):

    # bar charts 
    ggplot(d) + geom_bar(aes(Dept)) 
    
    # histogram
    ggplot(d, aes(Salary)) + geom_histogram(binwidth=10000) 
    
    # integrated violin/box/scatterplot
    ggplot(d, aes(x="", y=Salary)) +
    geom_violin(fill="gray90", bw=9500, alpha=.3) +
    geom_boxplot(fill="gray75", outlier.color="black", width=0.25) +
    geom_jitter(shape=16, position=position_jitter(0.05)) +
    theme(axis.title.y=element_blank()) +
    coord_flip()
    
    # enhanced scatterplot 
    ggplot(d, aes(Years, Salary)) + geom_point() +
    geom_smooth(method=lm, color="black") +
    stat_ellipse(type="norm") +
    geom_vline(aes(xintercept=mean(Years, na.rm=TRUE)), color="gray70") +
    geom_hline(aes(yintercept=mean(Salary, na.rm=TRUE)), color="gray70")
    

    Similar results could be obtained by using the following lessR syntax in RStudio:

    # bar charts 
    BarChart(Dept)
    
    # histogram
    Histogram(Salary) 
    
    # integrated violin/box/scatterplot
    Plot(Salary)
    
    # enhanced scatterplot 
    Plot(Years, Salary, enhance=TRUE)
    

    Trying to see whether I can access the data from a lakehouse via SparkR, I downloaded the file from the support website [3], loaded the data into an available lakehouse (e.g. UAT), respectively loaded the data to a new table:

    -- creating the table
    CREATE TABLE [dbo].[employee](
    	[Name] [varchar](8000) NULL,
    	[Years] [int] NULL,
    	[Gender] [varchar](8000) NULL,
    	[Dept] [varchar](8000) NULL,
    	[Salary] [float] NULL,
    	[JobSat] [varchar](8000) NULL,
    	[Plan] [int] NULL,
    	[Pre] [int] NULL,
    	[Post] [int] NULL
    ) ON [PRIMARY]
    GO
    
    -- checking the data
    SELECT *
    FROM [dbo].[employee]

    I was able to access the content of the imported file via the following script:

    #access the file from lakehouse
    #csv_file <- "https://onelake.dfs.fabric.microsoft.com/<file_system>/<account_name>/Files/OpenSource/employee.csv"
    #csv_file <- "abfss://<file_system>.dfs.fabric.microsoft.com/<account_name>/Files/OpenSource/employee.csv"

    csv_file <- "Files/OpenSource/employee.csv"

    df <- read.df(csv_file, source= "csv", header = "true", inferSchema = "true")
    
    display(df)
    

    Initially, I wasn't able to access the table directly, though in the end I was able to retrieve the data (without and with the catalog's name):

    # creating a data frame via SparkSQL
    dfEmp <- sql("SELECT * FROM Employee")
    
    head(dfEmp)
    

    Comments:
    1) Once the session times out, it seems that one needs to rerun the scripts, which proves to be time-consuming as the installation takes about 5 minutes. 
    2) Being able to use lessR directly in Microsoft Fabric could be a real win given its simple syntax. I ran most of the tests from the book [3] plus some of the recommended scripts and the results are satisfactory. 
    3) The connection via the ABFS path to the lakehouse works as well, but not via URL. 

    Previous Post <<||>> Next Post

    17 March 2024

    🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

    Business Intelligence Series

    One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms, one can think of a data product as a box (or Lego piece) which takes data as inputs, performs several transformations on it, and produces several outputs (data, data visualizations, or a hybrid of data, visualizations and other content). 

    At high-level each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply. 

    For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs. 

    Data Products with a Data Mesh

    Finding such a partition is possible, though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to build something cohesive, performant, secure or well-structured. It's like building a house from modules; the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort. 

    Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerably high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible. 

    It's challenging to define the inputs and outputs for a Lego piece. Now imagine doing the same for a whole set of such pieces that depend on each other! This might work for small pieces of data and entities quite stable over their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon as the Lego pieces expand beyond their initial design. It's as if the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which, even if they work, shift the gravity center of the edifice in other directions. There will thus be limits to growth that can easily lead to duplication of functionality to overcome such challenges.

    Each new output or change in the initial input for these magic boxes involves a change to all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report: how many iterations and how much effort were involved? This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter the effort involved in design, there will always be changes and further iterations involved.

    Previous Post <<||>> Next Post

    References:
    [1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

    10 March 2024

    🏭📑Microsoft Fabric: Lakehouse [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 10-Mar-2024

    [Microsoft Fabric] Lakehouse

    • {def} data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location [3]
    • a unified platform that combines the capabilities of 
      • data lake
        • built on top of the OneLake scalable storage layer using Delta format tables [1]
          • supports ACID transactions through Delta Lake formatted tables for data consistency and integrity [1]
            • ⇒ scalable analytics solution that maintains data consistency [1]
        • {capability} scalable, distributed file storage
          • can scale automatically and provide high availability and disaster recovery [1]
        • {capability} flexible schema-on-read semantics
          • ⇒ the schema can be changed as needed [1]
          • ⇐ rather than having a predefined schema
        • {capability} big data technology compatibility
          • store all data formats 
          • can be used with various analytics tools and programming languages
          • use Spark and SQL engines to process large-scale data and support machine learning or predictive modeling analytics [1]
      • data warehouse
        • {capability} relational schema modeling
        • {capability} SQL-based querying
          • {feature} has a built-in SQL analytics endpoint
            • ⇐ the data can be queried by using SQL without any special setup [2]
        • {capability} proven basis for analysis and reporting
        • ⇐ unlocks data warehouse capabilities without the need to move data [2]
      • ⇐ a database built on top of a data lake 
        • ⇐  includes metadata
      • ⇐ a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location [2]
        • ⇒ single location for data engineers, data scientists, and data analysts to access and use data [1]
        • ⇒ it can easily scale to large data volumes of all file types and sizes
        • ⇒ it's easily shared and reused across the organization
    • supports data governance policies [1]
      • e.g. data classification and access control
    • can be created in any premium tier workspace [1]
      • appears as a single item within the workspace in which it was created [1]
        • ⇒ access is controlled at this level as well [1]
          • directly within Fabric
          • via the SQL analytics endpoint
    • permissions
      • granted either at the workspace or item level [1]
    • users can work with data via
      • lakehouse UI
        • add and interact with tables, files, and folders [1]
      • SQL analytics endpoint 
        • enables to use SQL to query the tables in the lakehouse and manage its relational data model [1]
    • two physical storage locations are provisioned automatically
      • tables
        • a managed area for hosting tables of all formats in Spark
          • e.g. CSV, Parquet, or Delta
        • all tables are recognized as tables in the lakehouse
        • delta tables are recognized as tables as well
      • files 
        • an unmanaged area for storing data in any file format [2]
        • any Delta files stored in this area aren't automatically recognized as tables [2]
        • creating a table over a Delta Lake folder in the unmanaged area requires to explicitly create a shortcut or an external table with a location that points to the unmanaged folder that contains the Delta Lake files in Spark [2]
      • ⇐ the main distinction between the managed area (tables) and the unmanaged area (files) is the automatic table discovery and registration process [2]
        • {concept} registration process
          • runs over any folder created in the managed area only [2]
    • {operation} ingest data into lakehouse
      • {medium} manual upload
        • upload local files or folders to the lakehouse
      • {medium} dataflows (Gen2)
        • import and transform data from a range of sources using Power Query Online, and load it directly into a table in the lakehouse [1]
      • {medium} notebooks
        • ingest and transform data, and load it into tables or files in the lakehouse [1]
      • {medium} Data Factory pipelines
        • copy data and orchestrate data processing activities, loading the results into tables or files in the lakehouse [1]
    • {operation} explore and transform data
      • {medium} notebooks
        •  use code to read, transform, and write data directly to the lakehouse as tables and/or files [1]
      • {medium} Spark job definitions
        • on-demand or scheduled scripts that use the Spark engine to process data in the lakehouse [1]
      • {medium} SQL analytics endpoint: 
        • run T-SQL statements to query, filter, aggregate, and otherwise explore data in lakehouse tables [1] (see the sketch after these notes)
      • {medium} dataflows (Gen2): 
        • create a dataflow to perform subsequent transformations through Power Query, and optionally land transformed data back to the lakehouse [1]
      • {medium} data pipelines: 
        • orchestrate complex data transformation logic that operates on data in the lakehouse through a sequence of activities [1]
          • (e.g. dataflows, Spark jobs, and other control flow logic).
    • {operation} analyze and visualize data
      • use the semantic model as the source for Power BI reports 
    • {concept} shortcuts
      • embedded references within OneLake that point to other files or storage locations
      • enable to integrate data into lakehouse while keeping it stored in external storage [1]
        • ⇐ allow to quickly source existing cloud data without having to copy it
        • e.g. different storage account or different cloud provider [1]
        • the user must have permissions in the target location to read the data [1]
        • data can be accessed via Spark, SQL, Real-Time Analytics, and Analysis Services
      • appear as a folder in the lake
      • {limitation} have limited data source connectors
        • {alternatives} ingest data directly into your lakehouse [1]
      • enable Fabric experiences to derive data from the same source to always be in sync
    • {concept} Lakehouse Explorer
      • enables to browse files, folders, shortcuts, and tables; and view their contents within the Fabric platform [1]
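
    A small T-SQL sketch of the read-only exploration possible through the SQL analytics endpoint, using a hypothetical Delta table dbo.Trips from the managed Tables area (files in the unmanaged Files area aren't exposed this way):

    -- filter and aggregate a lakehouse Delta table via the SQL analytics endpoint
    SELECT PickupZone
    , COUNT(*) AS TripCount
    , AVG(FareAmount) AS AvgFare
    FROM dbo.Trips
    WHERE TripDate >= '2024-01-01'
    GROUP BY PickupZone
    ORDER BY TripCount DESC;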

    References:
    [1] Microsoft Learn: Fabric (2023) Get started with lakehouses in Microsoft Fabric (link)
    [2] Microsoft Learn: Fabric (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)
    [3] Microsoft Learn: Fabric (2023) What is a lakehouse in Microsoft Fabric? [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    🏭🗒️Microsoft Fabric: Dataflows Gen2 [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 10-Mar-2024

    Dataflow (Gen2) Architecture [4]

    [Microsoft Fabric] Dataflow (Gen2) 

    •  new generation of dataflows that resides alongside the Power BI Dataflow (Gen1) [2]
    • allows to 
      • extract data from various sources
      • transform it using a wide range of transformation operations 
      • load it into a destination [1]
    • {goal} provide an easy, reusable way to perform ETL tasks using Power Query Online [1]
      • allows to promote reusable ETL logic 
        • ⇒ prevents the need to create more connections to the data source.
        • offer a wide variety of transformations
      • can be horizontally partitioned
    • {component} Lakehouse 
      • used to stage data being ingested
    • {component} Warehouse 
      • used as a compute engine and means to write back results to staging or supported output destinations faster
    • {component} Mashup Engine
      • extracts, transforms, or loads the data to staging or data destinations when either [4]
        • Warehouse compute cannot be used [4]
        • staging is disabled for a query [4]
    • {operation} creating a dataflow
      • can be created in a
        • Data Factory workload
        • Power BI workspace
        • Lakehouse
    • {operation} publishing a dataflow
      • generates dataflow's definition  
        • ⇐ the program that runs once the dataflow is refreshed to produce tables in staging storage and/or output destination [4]
        • used by the dataflow engine to generate an orchestration plan, manage resources, and orchestrate execution of queries across data sources, gateways, and compute engines, and to create tables in either the staging storage or data destination [4]
      • saves changes and runs validations that must be performed in the background [2]
    • {operation} refreshing a dataflow
    • {operation} running a dataflow 
      • can be run
        • manually
        • on a refresh schedule
        • as part of a Data Pipeline orchestration
    • {feature} author dataflows with Power Query
      • uses the full Power Query experience of Power BI dataflows [2]
    • {feature} shorter authoring flow
      • uses a step-by-step experience for getting the data into the dataflow [2]
        • the number of steps required to create dataflows was reduced [2]
      • a few new features were added to improve the experience [2]
    • {feature} Auto-Save and background publishing
      • changes made to a dataflow are autosaved to the cloud (aka draft version of the dataflow) [2]
        • ⇐ without having to wait for the validation to finish [2]
      • {functionality} save as draft 
        • stores a draft version of the dataflow every time you make a change [2]
        • seamless experience and doesn't require any input [2]
      • {concept} published version
        • the version of the dataflow that passed validation and is ready to refresh [5]
    • {feature} integration with data pipelines
      • integrates directly with Data Factory pipelines for scheduling and orchestration [2] 
    • {feature} high-scale compute
      • leverages a new, higher-scale compute architecture [2] 
        •  improves the performance of both transformations of referenced queries and get data scenarios [2]
        • creates both Lakehouse and Warehouse items in the workspace, and uses them to store and access data to improve performance for all dataflows [2]
    • {feature} improved monitoring and refresh history
      • integrate support for Monitoring Hub [2]
      • Refresh History experience upgraded [2]
    • {feature} get data via Dataflows connector
      • supports a wide variety of data source connectors
        • include cloud and on-premises relational databases
    • {feature|planned} incremental refresh 
      • enables you to incrementally extract data from data sources, apply Power Query transformations, and load into various output destinations [5]
    • {feature|planned} Fast Copy 
      • enables large-scale data ingestion directly utilizing the pipelines Copy Activity capability [6]
      • supports sources such Azure SQL Databases, CSV, and Parquet files in Azure Data Lake Storage and Blob Storage [6]
      • significantly scales up the data processing capacity providing high-scale ELT capabilities [6]
    • {feature|planned}Cancel refresh
      • enables to cancel ongoing Dataflow Gen2 refreshes from the workspace items view [6]
    • {feature} data destinations (see the T-SQL check after these notes)
      • allows to 
        • specify an output destination
        • separate ETL logic and destination storage [2]
      • every tabular data query can have a data destination [3]
        • available destinations
          • Azure SQL databases
          • Azure Data Explorer (Kusto)
          • Fabric Lakehouse
          • Fabric Warehouse
          • Fabric KQL database
        • a destination can be specified for every query individually [3]
        • multiple different destinations can be used within a dataflow [3]
        • connecting to the data destination is similar to connecting to a data source
        • {limitation} functions and lists aren't supported
      • {operation} creating a new table
        • {default} table name has the same name as the query name.
      • {operation} picking an existing table
      • {operation} deleting a table manually from the data destination 
        • doesn't recreate the table on the next refresh [3]
      • {operation} reusing queries from Dataflow Gen1
        • {method} export Dataflow Gen1 query and import it into Dataflow Gen2
          • export the queries as a PQT file and import them into Dataflow Gen2 [2]
        • {method} copy and paste in Power Query
          • copy the queries and paste them in the Dataflow Gen2 editor [2]
      • automatic settings:
        • {limitation} supported only for Lakehouse and Azure SQL database
        • {setting} Update method replace: 
          • data in the destination is replaced at every dataflow refresh with the output data of the dataflow [3]
        • {setting} Managed mapping: 
          • the mapping is automatically adjusted when republishing the data flow to reflect the change 
            • ⇒ doesn't need to be updated manually into the data destination experience every time changes occur [3]
        • {setting} Drop and recreate table: 
          • on every dataflow refresh the table is dropped and recreated to allow schema changes
          • {limitation} the dataflow refresh fails if any relationships or measures were added to the table [3]
      • update methods
        • {method} replace: 
          • on every dataflow refresh, the data is dropped from the destination and replaced by the output data of the dataflow.
          • {limitation} not supported by Fabric KQL databases and Azure Data Explorer 
        • {method} append: 
          • on every dataflow refresh, the output data from the dataflow is appended to the existing data in the data destination table
      • staging 
        • {default} enabled
          • allows to use Fabric compute to execute queries
            • ⇐ enhances the performance of query processing
          • the data is loaded into the staging location
            • ⇐ an internal Lakehouse location accessible only by the dataflow itself
          • [Warehouse] staging is required before the write operation to the data destination
            • ⇐ improves performance
            • {limitation} only loading into the same workspace as the dataflow is supported
          •  using staging locations can enhance performance in some cases
        • disabled
          • {recommendation} [Lakehouse] disable staging on the query to avoid loading twice into a similar destination
            • ⇐ once for staging and once for data destination
            • improves dataflow's performance
      • {scenario} use a dataflow to load data into the lakehouse and then use a notebook to analyze the data [2]
      • {scenario} use a dataflow to load data into an Azure SQL database and then use a data pipeline to load the data into a data warehouse [2]
    • {feature} fast copy
      • allows ingesting terabytes of data with the easy experience and the scalable back-end of the pipeline Copy Activity [7]
      • the feature must be enabled [7]
        • after enabling, Dataflows automatically switch the back-end when data size exceeds a particular threshold [7]
        • ⇐ there's no need to change anything during authoring of the dataflows
        • one can check the refresh history to see if fast copy was used [7]
        • ⇐ see the Engine type in the refresh history
        • {option} Require fast copy
      • {prerequisite} Fabric capacity is available [7]
      • {prerequisite} data files 
        • are in .csv or parquet format
        • have at least 100 MB
        • are stored in an ADLS Gen2 or a Blob storage account [6]
      • {prerequisite} [Azure SQL DB|PostgreSQL] >= 5 million rows in the data source [7]
      • {limitation} doesn't support [7] 
        • the VNet gateway
        • writing data into an existing table in Lakehouse
        • fixed schema
    • {benefit} extends data with consistent data, such as a standard date dimension table [1]
    • {benefit} allows self-service users access to a subset of data warehouse separately [1]
    • {benefit} optimizes performance with dataflows, which enable extracting data once for reuse, reducing data refresh time for slower sources [1]
    • {benefit} simplifies data source complexity by only exposing dataflows to larger analyst groups [1]
    • {benefit} ensures consistency and quality of data by enabling users to clean and transform data before loading it to a destination [1]
    • {benefit} simplifies data integration by providing a low-code interface that ingests data from various sources [1]
    • {limitation} not a replacement for a data warehouse [1]
    • {limitation} row-level security isn't supported [1]
    • {limitation} Fabric or Fabric trial capacity workspace is required [1]
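
    When a dataflow lands data in a warehouse or lakehouse destination (see the data destinations feature above), the result can be verified with plain T-SQL against that destination once the refresh completes; a minimal sketch with a hypothetical destination table dbo.Customers:

    -- sanity check against the dataflow's output destination
    -- (dbo.Customers and LoadDate are hypothetical; adapt to the actual destination table)
    SELECT COUNT(*) AS row_count
    , MAX(LoadDate) AS last_load
    FROM dbo.Customers;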


    Feature | Dataflow Gen2 | Dataflow Gen1
    Author dataflows with Power Query | ✓ | ✓
    Shorter authoring flow | ✓ |
    Auto-Save and background publishing | ✓ |
    Data destinations | ✓ |
    Improved monitoring and refresh history | ✓ |
    Integration with data pipelines | ✓ |
    High-scale compute | ✓ |
    Get Data via Dataflows connector | ✓ | ✓
    Direct Query via Dataflows connector | | ✓
    Incremental refresh | ✓* |
    Fast Copy | ✓* |
    Cancel refresh | ✓* |
    AI Insights support | | ✓
    Dataflow Gen1 vs Gen2 [2]


    Acronyms:
    ADLS - Azure Data Lake Storage 
    ETL - Extract, Transform, Load
    KQL - Kusto Query Language
    PQO - Power Query Online
    PQT - Power Query Template

    References:
    [1] Microsoft Learn (2023) Fabric: Ingest data with Microsoft Fabric (link)
    [2] Microsoft Learn (2023) Fabric: Getting from Dataflow Generation 1 to Dataflow Generation 2 (link)
    [3] Microsoft Learn (2023) Fabric: Dataflow Gen2 data destinations and managed settings (link)
    [4] Microsoft Learn (2023) Fabric: Dataflow Gen2 pricing for Data Factory in Microsoft Fabric (link)
    [5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow (link)
    [6] Microsoft Learn (2023) Fabric: What's new and planned for Data Factory in Microsoft Fabric (link)

    [7] Microsoft Learn (2023) Fabric: Fast copy in Dataflows Gen2 [link]

    Resources:
    [R1] Arshad Ali & Bradley Schacht (2024) Learn Microsoft Fabric (link)
    [R2] Microsoft Learn: Fabric (2023) Data Factory limitations overview (link)
    [R3] Microsoft Fabric Blog (2023) Data Factory Spotlight: Dataflow Gen2, by Miguel Escobar (link)
    [R4] Microsoft Learn: Fabric (2023) Dataflow Gen2 connectors in Microsoft Fabric (link) 
    [R5] Microsoft Learn: Fabric (2023) Pattern to incrementally amass data with Dataflow Gen2 (link)
    [R6] Fourmoo (2024) Microsoft Fabric – Comparing Dataflow Gen2 vs Notebook on Costs and usability, by Gilbert Quevauvilliers (link)
    [R7] Microsoft Learn: Fabric (2023) A guide to Fabric Dataflows for Azure Data Factory Mapping Data Flow users (link)
    [R8] Microsoft Learn: Fabric (2023) Quickstart: Create your first dataflow to get and transform data (link)
    [R9] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: copy activity, dataflow, or Spark (link)
    [R10] Microsoft Fabric Blog (2023) Dataflows Gen2 data destinations and managed settings, by Miquella de Boer  (link)
    [R11] Microsoft Fabric Blog (2023) Service principal support to connect to data in Dataflow, Datamart, Dataset and Dataflow Gen 2, by Miquella de Boer (link)
    [R12] Chris Webb's BI Blog (2023) Fabric Dataflows Gen2: To Stage Or Not To Stage? (link)
    [R13] Power BI Tips (2023) Let's Learn Fabric ep.7: Fabric Dataflows Gen2 (link)
    [R14] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
