
15 March 2025

🏭🗒️Microsoft Fabric: Azure Data Lake Storage Gen2 (ADLS Gen2) [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Microsoft Fabric] Azure Data Lake Storage Gen2 (ADLS Gen2)

  • {def} a set of capabilities built on Azure Blob Storage and dedicated to big data analytics [27]
  • {capability} low-cost storage
    • provides cloud storage that is less expensive than the storage offered by relational databases [25]
  • {capability} performant massive storage
    • designed to service multiple petabytes of information while sustaining hundreds of gigabits of throughput [27]
      • processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels [27]
  • {capability} tiered storage
  • {capability} high availability/disaster recovery
  • {capability} file system semantics
  • {capability} file-level security
  • {capability} scalability
  • supports a range of tools and programming languages that enable large amounts of data to be reported on, queried, and transformed [25]
  • included in customer subscriptions
    • customers can bring their data lake and integrate it with D365FO
  • combines BYOD and Entity store [25]
  • entity store 
    • staged in the data lake and provides a set of simplified (denormalized) data structures to make reporting easier [25]
      • users can be given direct access to the relevant data and can create reports [25]
    • instead of exporting data by using BYOD, customers can select the data that is staged in the data lake
      • the data in the data lake are synchronized via the D365FO data feed service [25]
      • reflects the updated D365FO data within a few minutes after the changes occur [25]
    • further data can be brought into the data lake
    • the data can be consumed via cloud-based services [25]
    • {benefit} no need to monitor and manage complex data export and orchestration schedules [25]
    • {benefit} no user intervention is required to update data in the data lake [25]
    • {benefit} cost-effective data storage
  • [licensing] included in a customer's subscription
    • the customer must pay for 
      • data storage
      • I/O costs that are incurred when data is read and written to the data lake [25]
      • I/O costs incurred because D365FO apps write data to the data lake or update the data in it [25]
  • {prerequisite} the data lakes must be provisioned in the same country or region as the D365 environment [25]
    • the stored data complies with the CDM folder standard
    • integration makes warm-path reporting available as the default reporting option
      • because data in a data lake is updated within minutes [25]
      • this approach is acceptable for most reporting scenarios (incl. near-real-time reporting) [25]
  • {benefit} designed to store large amounts of data [25]
  • {benefit} designed for big data analytics [25]
  • {benefit} has many associated services that enable analytics, data transformation, and the application of AI and machine learning [25]
    • hierarchical namespace 
      • organizes objects/files into a hierarchy of directories for efficient data access [27]
      • renaming or deleting a directory becomes a single atomic metadata operation on the directory [28]
      • no need to enumerate and process all objects that share the name prefix of the directory [28]
  • {feature} query acceleration 
    • enables applications and analytics frameworks to significantly optimize data processing by retrieving only the data that they require to perform a given operation [28]
    • accepts filtering predicates and column projections which enable applications to filter rows and columns at the time that data is read from disk [28]
      • only the data that meets the conditions of the predicate is transferred over the network to the application [28]
      • reduces network latency and compute cost [28]
        • a query acceleration request is specified via SQL
      • a request processes only one file
    • advanced relational features of SQL aren't supported [28] 
      • e.g. joins and GROUP BY aggregates
    • supports CSV and JSON formatted data as input to each request
    • compatible with the blobs in storage accounts that don't have a hierarchical namespace enabled on them [28]
    • designed for distributed analytics frameworks 
      • e.g. Apache Spark and Apache Hive
        • these engines include query optimizers that can incorporate knowledge of the underlying I/O service's capabilities when determining an optimal query plan for user queries [28]
        • these frameworks integrate query acceleration
          • ⇒ users see improved query latency and a lower total cost of ownership without having to make any changes to the queries 
      • includes a storage abstraction layer within the framework  [28]
    • designed for data processing applications that perform large-scale data transformations that might not directly lead to analytics insights [28]
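The predicate-pushdown idea behind query acceleration can be sketched in plain Python: the "server side" applies the row filter and column projection before anything is transferred, so only matching rows and projected columns cross the network. The function and sample data below are illustrative stand-ins, not the actual service API:

```python
import csv
import io

def pushdown_scan(csv_text, predicate, columns):
    """Simulate server-side filtering: only rows matching the predicate,
    and only the projected columns, are 'transferred' to the caller."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if predicate(row):
            yield {c: row[c] for c in columns}

data = "id,region,amount\n1,EU,100\n2,US,250\n3,EU,75\n"

# Equivalent in spirit to: SELECT id, amount FROM BlobStorage WHERE region = 'EU'
rows = list(pushdown_scan(data, lambda r: r["region"] == "EU", ["id", "amount"]))

shipped = sum(len(",".join(r.values())) for r in rows)
print(rows)                   # only EU rows, only two columns
print(shipped < len(data))    # less data crosses the network than a full read
```

The real feature evaluates the SQL predicate inside the storage service, one file per request; the point here is only that filtering before transfer reduces network and compute cost.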

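The hierarchical-namespace point can be illustrated with a toy model: in a flat blob namespace a "directory" is just a name prefix, so renaming it means rewriting every object that shares the prefix, while with a hierarchical namespace the directory is a real node and the rename is one metadata operation. Both stores here are hypothetical in-memory stand-ins:

```python
# Flat namespace: a rename must enumerate and rewrite every object
# sharing the prefix (one operation per object).
def rename_flat(store, old_prefix, new_prefix):
    ops = 0
    for key in list(store):
        if key.startswith(old_prefix + "/"):
            store[new_prefix + key[len(old_prefix):]] = store.pop(key)
            ops += 1
    return ops

# Hierarchical namespace: the directory is a single node, so a rename is
# one atomic metadata operation regardless of how many files it holds.
def rename_hns(tree, old_name, new_name):
    tree[new_name] = tree.pop(old_name)
    return 1

blobs = {"raw/a.csv": b"...", "raw/b.csv": b"...", "raw/sub/c.csv": b"..."}
print(rename_flat(blobs, "raw", "staged"))  # 3 per-object operations

hns = {"raw": {"a.csv": b"...", "b.csv": b"...", "sub": {"c.csv": b"..."}}}
print(rename_hns(hns, "raw", "staged"))     # 1 metadata operation
```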

References
[25] Azure Synapse Analytics (2020) Azure Data Lake overview [link]
[27] Introduction to Azure Data Lake Storage Gen2 [link]
[28] Azure Data Lake Storage query acceleration [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
AI - Artificial Intelligence
BYOD - Bring Your Own Database
CDM - Common Data Model
CSV - Comma-Separated Values
D365FO - Dynamics 365 Finance and Operations
I/O - Input/Output
JSON - JavaScript Object Notation

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 25-Mar-2025

[Microsoft Fabric] Shortcut

  • {def} object that points to another internal or external storage location [1] and that can be used for data access
    • {goal} unifies existing data without copying or moving it [2]
      • ⇒ data can be used multiple times without being duplicated [2]
      • {benefit} helps to eliminate edge copies of data [1]
      • {benefit} reduces process latency associated with data copies and staging [1]
    • is a mechanism that allows unifying data across domains, clouds, and accounts through a single namespace [1]
      • ⇒ allows creating a single virtual data lake for the entire enterprise [1]
      • ⇐ available in all Fabric experiences [1]
      • ⇐ behaves like a symbolic link [1]
    • independent object from the target [1]
    • appears as a folder [1]
    • can be used by workloads or services that have access to OneLake [1]
    • transparent to any service accessing data through the OneLake API [1]
    • can point to 
      • OneLake locations
      • ADLS Gen2 storage accounts
      • Amazon S3 storage accounts
      • Dataverse
      • on-premises or network-restricted locations via the on-premises data gateway (OPDG)
  • {capability} create shortcut to consolidate data across artifacts or workspaces, without changing data's ownership [2]
  • {capability} data can be composed throughout OneLake without any data movement [2]
  • {capability} allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement [2]
    • ⇐ makes OneLake the first multi-cloud data lake [2]
  • {capability} provides support for industry standard APIs
    • ⇒ OneLake data can be directly accessed via shortcuts by any application or service [2]
  • {operation} creating a shortcut
    • can be created in 
      • lakehouses
      • KQL databases
        • ⇐ shortcuts are recognized as external tables [1]
    • can be created via 
      • Fabric UI 
      • REST API
    • can be created across items [1]
      • the item types don't need to match [1]
        • e.g. create a shortcut in a lakehouse that points to data in a data warehouse [1]
    • [lakehouse] tables folder
      • represents the managed portion of the lakehouse 
        • shortcuts can be created only at the top level [1]
          • ⇒ shortcuts aren't supported in other subdirectories [1]
        • if shortcut's target contains data in the Delta\Parquet format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table [1]
    • [lakehouse] files folder
      • represents the unmanaged portion of the lakehouse [1]
      • there are no restrictions on where shortcuts can be created [1]
        • ⇒ can be created at any level of the folder hierarchy [1]
        • ⇐ table discovery doesn't happen in the Files folder [1]
    • [lakehouse] all shortcuts are accessed in a delegated mode when querying through the SQL analytics endpoint [5]
      • the delegated identity is the Fabric user that owns the lakehouse [5]
        • {default} the owner is the user that created the lakehouse and SQL analytics endpoint [5]
          •  ⇐ can be changed in select cases 
          • the current owner is displayed in the Owner column in Fabric when viewing the item in the workspace item list
        • ⇒ the querying user can read from shortcut tables if the owner has access to the underlying data; access is checked against the owner, not the user executing the query [5]
          • ⇐ the querying user only needs access to select from the shortcut table [5]
      •  {feature} OneLake data access roles 
        • {enabled} access to a shortcut is determined by whether the SQL analytics endpoint owner has access to see the target lakehouse and read the table through a OneLake data access role [5]
        • {disabled} shortcut access is determined by whether the SQL analytics endpoint owner has the Read and ReadAll permission on the target path [5]
  • {operation} renaming a shortcut
  • {operation} moving a shortcut
  • {operation} deleting a shortcut 
    • doesn't affect the target [1]
      • ⇐ only the shortcut object is deleted [1]
    • shortcuts don't perform cascading deletes [1]
    • moving, renaming, or deleting a target path can break the shortcut [1]
  • {operation} delete file/folder
    • file or folder within a shortcut can be deleted when the permissions in the shortcut target allows it [1]
  • {permissions} users must have permissions in the target location to read the data [1]
    • when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1]
    • when accessing shortcuts through Power BI semantic models or T-SQL, the calling user’s identity is not passed through to the shortcut target [1]
      •  the calling item owner’s identity is passed instead, delegating access to the calling user [1]
    • OneLake manages all permissions and credentials
  • {feature} shortcut caching 
    • {def} mechanism used to reduce egress costs associated with cross-cloud data access [1]
      • when files are read through an external shortcut, the files are stored in a cache for the Fabric workspace [1]
        • subsequent read requests are served from cache rather than the remote storage provider [1]
        • cached files have a retention period of 24 hours
        • each time the file is accessed the retention period is reset [1]
        • if the file in remote storage provider is more recent than the file in the cache, the request is served from remote storage provider and the updated file will be stored in cache [1]
        • if a file hasn't been accessed for more than 24 hours, it is purged from the cache [1]
    • {restriction} individual files greater than 1 GB in size are not cached [1]
    • {restriction} only GCS, S3 and S3 compatible shortcuts are supported [1]
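The caching rules above (24-hour retention reset on each access, re-fetch when the remote file is newer, files over 1 GB never cached) can be sketched as a small in-memory model; the class and store layout are illustrative, not Fabric internals:

```python
RETENTION = 24 * 3600       # 24-hour retention window
MAX_CACHED = 1 * 1024**3    # files larger than 1 GB are never cached

class ShortcutCache:
    def __init__(self, remote):
        self.remote = remote    # {name: (mtime, bytes)} - the external store
        self.cache = {}         # {name: (cached_mtime, bytes, last_access)}

    def read(self, name, now):
        remote_mtime, data = self.remote[name]
        entry = self.cache.get(name)
        # purge entries not accessed within the retention window
        if entry and now - entry[2] > RETENTION:
            del self.cache[name]
            entry = None
        # serve from cache only if present and not superseded remotely
        if entry and entry[0] >= remote_mtime:
            self.cache[name] = (entry[0], entry[1], now)  # access resets retention
            return entry[1], "cache"
        if len(data) <= MAX_CACHED:                       # small files get cached
            self.cache[name] = (remote_mtime, data, now)
        return data, "remote"

remote = {"sales.parquet": (100.0, b"v1")}
c = ShortcutCache(remote)
print(c.read("sales.parquet", now=1000)[1])  # first read comes from remote
print(c.read("sales.parquet", now=2000)[1])  # second read served from cache
remote["sales.parquet"] = (3000.0, b"v2")
print(c.read("sales.parquet", now=4000)[1])  # newer upstream file forces re-fetch
```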
  • {feature} query acceleration
    • caches data as it lands in OneLake, providing performance comparable to ingesting data in Eventhouse [4]
  • {limitation} maximum number of shortcuts [1] 
    • per Fabric item: 100,000
    • in a single OneLake path: 10
    • direct shortcuts to shortcut links: 5
  • {limitation} ADLS and S3 shortcut target paths can't contain any reserved characters from RFC 3986 section 2.2 [1]
  • {limitation} shortcut names, parent paths, and target paths can't contain "%" or "+" characters [1]
  • {limitation} shortcuts don't support non-Latin characters [1]
  • {limitation} the Copy Blob API isn't supported for ADLS or S3 shortcuts [1]
  • {limitation} copy function doesn't work on shortcuts that directly point to ADLS containers
    • {recommended} create ADLS shortcuts to a directory that is at least one level below a container [1]
  • {limitation} additional shortcuts can't be created inside ADLS or S3 shortcuts [1]
  • {limitation} lineage for shortcuts to Data Warehouses and Semantic Models isn't currently available [1]
  • {limitation} it may take up to a minute for the Table API to recognize new shortcuts [1]
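The naming limitations above lend themselves to an up-front check. A small sketch follows; it approximates the "Latin characters" rule with an ASCII test and treats '/' as the path-segment separator, both of which are simplifying assumptions rather than documented behavior:

```python
# RFC 3986 section 2.2 reserved characters (gen-delims + sub-delims)
RESERVED = set(":/?#[]@!$&'()*+,;=")

def valid_shortcut_name(name):
    """Check a shortcut name against the documented restrictions:
    no '%' or '+' characters, Latin characters only (approximated as ASCII)."""
    return "%" not in name and "+" not in name and name.isascii()

def valid_target_segment(segment):
    """Check one path segment of an ADLS/S3 target path for RFC 3986
    section 2.2 reserved characters ('/' itself separates segments)."""
    return not (set(segment) & RESERVED)

print(valid_shortcut_name("sales_2024"))   # True
print(valid_shortcut_name("sales+2024"))   # False: '+' not allowed
print(valid_shortcut_name("売上2024"))      # False: non-Latin characters
print(valid_target_segment("raw-data"))    # True
print(valid_target_segment("raw;data"))    # False: ';' is reserved
```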

References:
[1] Microsoft Learn (2024) Fabric: OneLake shortcuts [link]
[2] Microsoft Learn (2024) Fabric Analyst in a Day [course notes]
[3] Microsoft Learn (2024) Use OneLake shortcuts to access data across capacities: Even when the producing capacity is paused! [link]
[4] Microsoft Learn (2024) Fabric: Query acceleration for OneLake shortcuts - overview (preview) [link]
[5] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

Acronyms:
ADLS - Azure Data Lake Storage
API - Application Programming Interface
AWS - Amazon Web Services
GCS - Google Cloud Storage
KQL - Kusto Query Language
OPDG - on-premises data gateway

About Me

Koeln, NRW, Germany
IT Professional with more than 25 years of experience in IT covering the full life cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.