SQL Troubles: ADLS


15 March 2025

🏭🗒️Microsoft Fabric: Azure Data Lake Storage Gen2 (ADLS Gen2) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Microsoft Fabric] Azure Data Lake Storage Gen2 (ADLS Gen2)

{def} a set of capabilities built on Azure Blob Storage and dedicated to big data analytics [27]
{capability} low-cost storage

provides cloud storage that is less expensive than the cloud storage that relational databases provide [25]

{capability} performant massive storage

designed to service multiple petabytes of information while sustaining hundreds of gigabits of throughput [27]

processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels [27]

{capability} tiered storage
{capability} high availability/disaster recovery
{capability} file system semantics
{capability} file-level security
{capability} scalability
supports a range of tools and programming languages that enable large amounts of data to be reported on, queried, and transformed [25]
included in customer subscriptions

customers can bring their data lake and integrate it with D365FO

combines BYOD and Entity store [25]
entity store

staged in the data lake and provides a set of simplified (denormalized) data structures to make reporting easier [25]

users can be given direct access to the relevant data and can create reports [25]

instead of exporting data by using BYOD, customers can select the data that is staged in the data lake

the data in the data lake are synchronized via the D365FO data feed service [25]
reflects the updated D365FO data within a few minutes after the changes occur [25]

further data can be brought into the data lake
the data can be consumed via cloud-based services [25]
{benefit} no need to monitor and manage complex data export and orchestration schedules [25]
{benefit} no user intervention is required to update data in the data lake [25]
{benefit} cost-effective data storage

[licensing] included in a customer's subscription

the customer must pay for

data storage
I/O costs that are incurred when data is read and written to the data lake [25]
I/O costs because D365FO apps write data to the data lake or update the data in it [25]

{prerequisite} the data lakes must be provisioned in the same country or region as the D365 environment [25]

the stored data comply with the CDM (Common Data Model) folder standard
integration makes warm-path reporting available as the default reporting option

because data in a data lake is updated within minutes [25]
approach acceptable for most reporting scenarios (incl. near-real-time reporting) [25]

{benefit} designed to store large amounts of data [25]
{benefit} designed for big data analytics [25]
{benefit} has many associated services that enable analytics, data transformation, and the application of AI and machine learning [25]

hierarchical namespace

organizes objects/files into a hierarchy of directories for efficient data access [27]
renaming or deleting a directory becomes a single atomic metadata operation on the directory [28]
no need to enumerate and process all objects that share the name prefix of the directory [28]
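
The difference can be illustrated with a toy model. This is a minimal sketch, not ADLS code: the two in-memory stores and the names `flat_rename` and `hns_rename` are hypothetical and only count how many entries each kind of rename has to touch.

```python
def flat_rename(store: dict[str, bytes], old_prefix: str, new_prefix: str) -> int:
    """Flat namespace: every object sharing the prefix is enumerated and re-keyed."""
    touched = 0
    for key in [k for k in store if k.startswith(old_prefix)]:
        store[new_prefix + key[len(old_prefix):]] = store.pop(key)
        touched += 1
    return touched


def hns_rename(root: dict, old_name: str, new_name: str) -> int:
    """Hierarchical namespace: the directory entry is re-linked once; children stay untouched."""
    root[new_name] = root.pop(old_name)
    return 1


# 1000 files under "raw/2024", once as flat keys and once as nested directories
flat = {f"raw/2024/file{i}.csv": b"" for i in range(1000)}
tree = {"raw": {"2024": {f"file{i}.csv": b"" for i in range(1000)}}}

flat_ops = flat_rename(flat, "raw/", "archive/")
tree_ops = hns_rename(tree, "raw", "archive")
print(flat_ops, tree_ops)  # 1000 1
```

In the flat store the cost grows with the number of objects under the prefix, while the hierarchical store performs one operation regardless of directory size.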

{feature} query acceleration

enables applications and analytics frameworks to significantly optimize data processing by retrieving only the data that they require to perform a given operation [28]
accepts filtering predicates and column projections which enable applications to filter rows and columns at the time that data is read from disk [28]

only the data that meets the conditions of a predicate are transferred over the network to the application [28]
reduces network latency and compute cost [28]

a query acceleration request is specified via SQL

a request processes only one file

advanced relational features of SQL aren't supported [28]

e.g. joins and GROUP BY aggregates

supports CSV and JSON formatted data as input to each request
compatible with the blobs in storage accounts that don't have a hierarchical namespace enabled on them [28]
designed for distributed analytics frameworks

e.g. Apache Spark and Apache Hive

these engines include query optimizers that can incorporate knowledge of the underlying I/O service's capabilities when determining an optimal query plan for user queries [28]
these frameworks integrate query acceleration

⇒ users see improved query latency and a lower total cost of ownership without having to make any changes to the queries

includes a storage abstraction layer within the framework [28]

designed for data processing applications that perform large-scale data transformations that might not directly lead to analytics insights [28]
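
The effect of a filtering predicate plus a column projection on a single CSV file can be sketched locally. This is only a simulation under stated assumptions: `query_csv`, the sample data, and the column names are hypothetical, and the real service evaluates a SQL expression on the storage side rather than a Python callable.

```python
import csv
import io
from typing import Callable


def query_csv(blob_text: str, columns: list[str],
              predicate: Callable[[dict], bool]) -> list[dict]:
    """Return only the requested columns of the rows that satisfy the predicate."""
    reader = csv.DictReader(io.StringIO(blob_text))
    return [{c: row[c] for c in columns} for row in reader if predicate(row)]


# Hypothetical single-file input, as one request processes only one file
BLOB = "id,country,amount\n1,DE,10\n2,US,25\n3,DE,40\n"

# Keep rows for one country and project two of the three columns
result = query_csv(BLOB, ["id", "amount"], lambda r: r["country"] == "DE")
print(result)
```

Only two rows and two columns leave the function, which mirrors the idea that less data crosses the network than a full-file read would transfer.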


References
[25] Azure Synapse Analytics (2020) Azure Data Lake overview [link]

[27] Introduction to Azure Data Lake Storage Gen2 [link]

[28] Azure Data Lake Storage query acceleration [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
AI - Artificial Intelligence
BYOD - Bring Your Own Database
CSV - Comma-Separated Values
I/O - Input/Output

JSON - JavaScript Object Notation

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 29-May-2025

[Microsoft Fabric] Shortcut

{def} object that points to another internal or external storage location (aka shortcut) [1] and that can be used for data access

serves as a virtual pointer to data stored in other locations [6]
{goal} unifies existing data without copying or moving it [2]

⇒ data can be used multiple times without being duplicated [2]
{benefit} helps to eliminate edge copies of data [1]
{benefit} reduces process latency associated with data copies and staging [1]

is a mechanism that allows unifying data across domains, clouds, and accounts through a single namespace [1]

⇒ allows creating a single virtual data lake for the entire enterprise [1]
⇐ available in all Fabric experiences [1]
⇐ behave like symbolic links [1]

independent object from the target [1]
appear as folders [1]
can be used by workloads or services that have access to OneLake [1]
transparent to any service accessing data through the OneLake API [1]

can point to

OneLake locations
ADLS Gen2 storage accounts
Amazon S3 storage accounts
Dataverse
on-premises or network-restricted locations via OPDG

{capability} create shortcut to consolidate data across artifacts or workspaces, without changing data's ownership [2]
{capability} data can be composed throughout OneLake without any data movement [2]
{capability} allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement [2]

⇐ makes OneLake the first multi-cloud data lake [2]

{capability} provides support for industry standard APIs

⇒ OneLake data can be directly accessed via shortcuts by any application or service [2]

{operation} creating a shortcut

can be created in

lakehouses
KQL databases

⇐ shortcuts are recognized as external tables [1]

can be created via

Fabric UI
REST API

can be created across items [1]

the item types don't need to match [1]

e.g. create a shortcut in a lakehouse that points to data in a data warehouse [1]
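
Creating a shortcut via the REST API amounts to posting a small JSON document that names the shortcut, its parent path, and its target. The sketch below only builds such a request and sends nothing. The base URL, route, and field names (`path`, `name`, `target`, `adlsGen2`, `location`, `subpath`, `connectionId`) reflect my reading of the OneLake shortcuts API and are assumptions to be verified against the current documentation; the IDs and URLs are placeholders.

```python
import json

# Assumed base URL of the Fabric REST API; verify before use
FABRIC_API = "https://api.fabric.microsoft.com/v1"


def build_adls_shortcut_request(workspace_id: str, item_id: str, name: str,
                                parent_path: str, storage_url: str,
                                subpath: str, connection_id: str) -> tuple[str, dict]:
    """Build (url, body) for a hypothetical 'create shortcut' call; nothing is sent."""
    url = f"{FABRIC_API}/workspaces/{workspace_id}/items/{item_id}/shortcuts"
    body = {
        "path": parent_path,   # where the shortcut appears inside the item
        "name": name,          # shortcut name shown as a folder
        "target": {
            "adlsGen2": {
                "location": storage_url,      # storage account endpoint
                "subpath": subpath,           # container plus directory
                "connectionId": connection_id,
            }
        },
    }
    return url, body


url, body = build_adls_shortcut_request(
    workspace_id="<workspace-id>", item_id="<lakehouse-id>",
    name="sales_raw", parent_path="Files/landing",
    storage_url="https://<account>.dfs.core.windows.net",
    subpath="/container/sales", connection_id="<connection-id>",
)
print(url)
print(json.dumps(body, indent=2))
```

The subpath points one level below the container, in line with the recommendation for ADLS shortcuts noted under the limitations.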

[lakehouse] tables folder

represents the managed portion of the lakehouse

shortcuts can be created only at the top level [1]

⇒ shortcuts aren't supported in other subdirectories [1]

if a shortcut's target contains data in the Delta\Parquet format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table [1]

[lakehouse] files folder

represents the unmanaged portion of the lakehouse [1]
there are no restrictions on where shortcuts can be created [1]

⇒ can be created at any level of the folder hierarchy [1]
⇐ table discovery doesn't happen in the Files folder [1]
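
The placement rules for the two lakehouse folders can be expressed as a small check. A minimal sketch, assuming paths are written relative to the lakehouse root with `/` separators; `can_create_shortcut` is a hypothetical helper, not a Fabric API.

```python
from pathlib import PurePosixPath


def can_create_shortcut(parent_path: str) -> bool:
    """Apply the notes' rule: Tables only at the top level, Files at any level."""
    parts = PurePosixPath(parent_path).parts
    if not parts:
        return False
    if parts[0] == "Tables":
        # Managed area: shortcuts are allowed directly under Tables only
        return len(parts) == 1
    if parts[0] == "Files":
        # Unmanaged area: any depth of the folder hierarchy is fine
        return True
    return False


for candidate in ("Tables", "Tables/sales", "Files", "Files/raw/2024/eu"):
    print(candidate, can_create_shortcut(candidate))
```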

[lakehouse] all shortcuts are accessed in a delegated mode when querying through the SQL analytics endpoint [5]

the delegated identity is the Fabric user that owns the lakehouse [5]

{default} the owner is the user that created the lakehouse and SQL analytics endpoint [5]

⇐ can be changed in select cases
the current owner is displayed in the Owner column in Fabric when viewing the item in the workspace item list

⇒ the querying user is able to read from shortcut tables if the owner has access to the underlying data, not the user executing the query [5]

⇐ the querying user only needs access to select from the shortcut table [5]

{feature} OneLake data access roles

{enabled} access to a shortcut is determined by whether the SQL analytics endpoint owner has access to see the target lakehouse and read the table through a OneLake data access role [5]
{disabled} shortcut access is determined by whether the SQL analytics endpoint owner has the Read and ReadAll permission on the target path [5]
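
The delegated check made by the SQL analytics endpoint can be summarized as two conditions: the owner must be able to read the target, and the querying user must be allowed to select from the shortcut table. The following is a simplified model of that reading of [5]; the `Lakehouse` class, its fields, and the user names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Lakehouse:
    owner: str                                            # delegated identity
    select_grants: set[str] = field(default_factory=set)  # users allowed to SELECT


def can_query_shortcut_table(lakehouse: Lakehouse, querying_user: str,
                             target_readers: set[str]) -> bool:
    """Owner needs access to the target; the querying user only needs SELECT."""
    owner_can_read_target = lakehouse.owner in target_readers
    user_can_select = querying_user in lakehouse.select_grants
    return owner_can_read_target and user_can_select


lh = Lakehouse(owner="olga", select_grants={"ana", "ben"})
target_readers = {"olga"}  # ana has no direct access to the underlying data

print(can_query_shortcut_table(lh, "ana", target_readers))   # True
print(can_query_shortcut_table(lh, "carl", target_readers))  # False
```

Note that `ana` can read the shortcut table even though she is not in `target_readers`, which is exactly the consequence the notes point out for delegated mode.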

{operation} renaming a shortcut
{operation} moving a shortcut
{operation} deleting a shortcut

doesn't affect the target [1]

⇐ only the shortcut object is deleted [1]

shortcuts don't perform cascading deletes [1]
moving, renaming, or deleting a target path can break the shortcut [1]

{operation} delete file/folder

a file or folder within a shortcut can be deleted when the permissions in the shortcut target allow it [1]

{permissions} users must have permissions in the target location to read the data [1]

when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1] (aka passthrough auth model [6])

ensures that any user accessing the shortcut is only able to see whatever they have access to in the target [6]
the security from the target ‘flows across’ the shortcut to restrict access in the source lakehouse [6]
OneLake to OneLake shortcuts support only passthrough mode [6]

ensures that the source system retains full control over its data [6]

⇐ there’s no need to replicate or redefine access controls for the shortcut [6]
{benefit} reduces administrative overhead since security policies only need to be maintained in one place [6]
{constraint} security cannot be modified directly from the downstream item [6]

any changes to access permissions must be made at the source location [6]
the source remains the single point of truth for access control [6]

⇐ ensures consistency
⇐ minimizes the risk of misconfiguration [6]

{type} delegated auth mode

shortcuts access data by using some intermediate credential

e.g. another user or an account key
allow for permission management to be separated or ‘delegated’ to another team or downstream user to manage [6]

always break the flow of security from one system to another [6]
all delegated shortcuts in OneLake can have OneLake security roles defined for them [6]

all shortcuts from OneLake to external systems are delegated [6]

e.g. AWS S3 or Google Cloud Storage
allows users to connect to the external system without being given direct access [6]
OneLake security can then be configured on the shortcut to limit what data in the external system can be accessed [6]

when accessing shortcuts through Power BI semantic models or T-SQL, the calling user’s identity is not passed through to the shortcut target [1]

the calling item owner’s identity is passed instead, delegating access to the calling user [1]

OneLake manages all permissions and credentials
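
Which identity ends up authorizing a read depends on the shortcut type and on the engine used. The function below is a deliberately simplified reading of the notes, not a description of OneLake internals; the parameter names and the values `"external"`, `"onelake"`, `"t-sql"`, and `"semantic-model"` are hypothetical labels.

```python
def effective_identity(target_kind: str, engine: str, calling_user: str,
                       item_owner: str, stored_credential: str) -> str:
    """Return the identity that is checked against the shortcut target."""
    if target_kind == "external":
        # Shortcuts from OneLake to external systems are delegated:
        # an intermediate credential is used instead of the caller
        return stored_credential
    if engine in {"t-sql", "semantic-model"}:
        # The calling item owner's identity is passed, not the calling user's
        return item_owner
    # OneLake to OneLake shortcut read directly: passthrough of the caller
    return calling_user


print(effective_identity("onelake", "spark", "ana", "olga", "account-key"))
print(effective_identity("onelake", "t-sql", "ana", "olga", "account-key"))
print(effective_identity("external", "spark", "ana", "olga", "account-key"))
```

In the passthrough case the target's security flows across the shortcut; in the other two cases that flow is broken and access has to be governed on the shortcut side.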

{type} OneLake to OneLake shortcuts

ideal for ensuring the hub retains control over sensitive or regulated data [6]

each downstream team

can then only consume the data they are allowed to [6]
has the freedom to create its own reports or combine the hub data with other data that they own [6]

{concept} hub-and-spoke model

allows managing data access across multiple teams or departments [6]
{component} hub

the central data repository where core datasets are stored [6]
security policies are meticulously defined to ensure robust control [6]

{component} spokes

individual teams or departments access the hub’s data through shortcuts [6]

{advantage} enables centralized governance while allowing decentralized consumption and use of data [6]
can be leveraged in various ways to create efficient and secure data architectures [6]

{type} delegated shortcuts

allow data to be shared securely and centralized across clouds without copying it [6]

the data that already exists in various cloud storage accounts is consolidated in OneLake through the use of delegated shortcuts [6]
a new lakehouse is created as the consolidation point [6]
each external data source is connected via a delegated shortcut [6]

the admin can define OneLake security roles to govern access
granularity: row, column, schemas or shortcuts [6]

⇒ no user will have direct access to the external data ⇐ they will be limited to only what the admin allows through OneLake security [6]
⇐ once the data is consolidated, it can be combined with the hub-and-spoke model to create a composite architecture that keeps both upstream and downstream data safe [6]

{feature} shortcut caching

{def} mechanism used to reduce egress costs associated with cross-cloud data access [1]

when files are read through an external shortcut, the files are stored in a cache for the Fabric workspace [1]

subsequent read requests are served from cache rather than the remote storage provider [1]
cached files have a retention period of 24 hours
each time the file is accessed the retention period is reset [1]
if the file in remote storage provider is more recent than the file in the cache, the request is served from remote storage provider and the updated file will be stored in cache [1]
if a file hasn’t been accessed for more than 24hrs it is purged from the cache [1]

{restriction} individual files greater than 1 GB in size are not cached [1]
{restriction} only GCS, S3 and S3 compatible shortcuts are supported [1]
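
The caching rules can be modeled in a few lines. This is a toy simulation under stated assumptions, not the Fabric implementation: `ShortcutCache` and its fields are hypothetical, time is passed in explicitly as seconds, and 1 GB is taken to mean 1024^3 bytes.

```python
from dataclasses import dataclass

RETENTION_SECONDS = 24 * 60 * 60   # 24-hour retention period
MAX_CACHED_BYTES = 1024 ** 3       # assumption: 1 GB read as 1024^3 bytes


@dataclass
class Entry:
    modified: float      # modification time of the cached copy
    last_access: float   # resets the retention period on every access


class ShortcutCache:
    def __init__(self) -> None:
        self.entries: dict[str, Entry] = {}

    def read(self, path: str, size: int, remote_modified: float, now: float) -> str:
        """Return 'cache' or 'remote' depending on where the read is served from."""
        entry = self.entries.get(path)
        if entry and now - entry.last_access > RETENTION_SECONDS:
            del self.entries[path]          # not accessed for more than 24 hours
            entry = None
        if entry and entry.modified >= remote_modified:
            entry.last_access = now         # access resets the retention period
            return "cache"
        if size <= MAX_CACHED_BYTES:
            self.entries[path] = Entry(remote_modified, now)  # store fresh copy
        else:
            self.entries.pop(path, None)    # files over the limit are not cached
        return "remote"


HOUR = 3600
cache = ShortcutCache()
trace = [
    cache.read("a.parquet", 10, remote_modified=100.0, now=0),          # first read
    cache.read("a.parquet", 10, remote_modified=100.0, now=1 * HOUR),   # cached
    cache.read("a.parquet", 10, remote_modified=100.0, now=24 * HOUR),  # kept alive
    cache.read("a.parquet", 10, remote_modified=100.0, now=49 * HOUR),  # purged
    cache.read("a.parquet", 10, remote_modified=200.0, now=50 * HOUR),  # remote newer
]
print(trace)
```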

{feature} query acceleration

caches data as it lands in OneLake, providing performance comparable to ingesting data in Eventhouse [4]

{limitation} maximum number of shortcuts [1]

per Fabric item: 100,000
in a single OneLake path: 10
direct shortcuts to shortcut links: 5

{limitation} ADLS and S3 shortcut target paths can't contain any reserved characters from RFC 3986 section 2.2 [1]
{limitation} shortcut names, parent paths, and target paths can't contain "%" or "+" characters [1]
{limitation} shortcuts don't support non-Latin characters [1]
{limitation} Copy Blob API not supported for ADLS or S3 shortcuts [1]
{limitation} copy function doesn't work on shortcuts that directly point to ADLS containers

{recommended} create ADLS shortcuts to a directory that is at least one level below a container [1]

{limitation} additional shortcuts can't be created inside ADLS or S3 shortcuts [1]
{limitation} lineage for shortcuts to Data Warehouses and Semantic Models is not currently available [1]
{limitation} it may take up to a minute for the Table API to recognize new shortcuts [1]
shortcuts introduce unique considerations when it comes to security [6]
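
Some of the numeric and character limitations listed earlier lend themselves to a pre-flight check before a shortcut is created. A minimal sketch: `validate_shortcut` and its parameters are hypothetical, the counts passed in are assumed to include the shortcut being created, and only the "%"/"+" rule and two of the numeric limits are covered.

```python
FORBIDDEN_CHARS = set("%+")
MAX_SHORTCUTS_PER_PATH = 10    # shortcuts in a single OneLake path
MAX_SHORTCUT_CHAIN = 5         # direct shortcuts to shortcut links


def validate_shortcut(name: str, parent_path: str, target_path: str,
                      shortcuts_in_path: int, chain_length: int) -> list[str]:
    """Return a list of violated limitations; an empty list means no issue found."""
    problems = []
    checked = (("name", name), ("parent path", parent_path), ("target path", target_path))
    for label, value in checked:
        if FORBIDDEN_CHARS & set(value):
            problems.append(f"{label} contains '%' or '+'")
    if shortcuts_in_path > MAX_SHORTCUTS_PER_PATH:
        problems.append("more than 10 shortcuts in a single OneLake path")
    if chain_length > MAX_SHORTCUT_CHAIN:
        problems.append("more than 5 chained shortcut links")
    return problems


ok = validate_shortcut("sales", "Files/landing", "container/sales", 1, 1)
bad = validate_shortcut("sales+eu", "Files/landing", "container/100%", 11, 6)
print(ok)
print(bad)
```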


References:
[1] Microsoft Learn (2024) Fabric: OneLake shortcuts [link]
[2] Microsoft Learn (2024) Fabric Analyst in a Day [course notes]

[3] Microsoft Learn (2024) Use OneLake shortcuts to access data across capacities: Even when the producing capacity is paused! [link]

[4] Microsoft Learn (2024) Fabric: Query acceleration for OneLake shortcuts - overview (preview) [link]

[5] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

[6] Microsoft Fabric Update Blog (2025) Understanding OneLake Security with Shortcuts [link]

Acronyms:

ADLS - Azure Data Lake Storage
API - Application Programming Interface

AWS - Amazon Web Services

GCS - Google Cloud Storage

KQL - Kusto Query Language

OPDG - on-premises data gateway
