SQL Troubles: 🏭🗒️Microsoft Fabric: Azure Data Lake Storage Gen2 (ADLS Gen2) [Notes]

15 March 2025

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Microsoft Fabric] Azure Data Lake Storage Gen2 (ADLS Gen2)

{def} a set of capabilities built on Azure Blob Storage and dedicated to big data analytics [27]
{capability} low-cost storage

provides cloud storage that is less expensive than the cloud storage that relational databases provide [25]

{capability} performant massive storage

designed to service multiple petabytes of information while sustaining hundreds of gigabits of throughput [27]

processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels [27]

{capability} tiered storage
{capability} high availability/disaster recovery
{capability} file system semantics
{capability} file-level security
{capability} scalability
supports a range of tools and programming languages that enable large amounts of data to be reported on, queried, and transformed [25]
included in customer subscriptions

customers can bring their data lake and integrate it with D365FO

combines BYOD and Entity store [25]
entity store

staged in the data lake and provides a set of simplified (denormalized) data structures to make reporting easier [25]

users can be given direct access to the relevant data and can create reports [25]

instead of exporting data by using BYOD, customers can select the data that is staged in the data lake

the data in the data lake are synchronized via the D365FO data feed service [25]
reflects the updated D365FO data within a few minutes after the changes occur [25]

further data can be brought into the data lake
the data can be consumed via cloud-based services [25]
{benefit} no need to monitor and manage complex data export and orchestration schedules [25]
{benefit} no user intervention is required to update data in the data lake [25]
{benefit} cost-effective data storage

[licensing] included in a customer's subscription

the customer must pay for

data storage
I/O costs that are incurred when data is read and written to the data lake [25]
I/O costs because D365FO apps write data to the data lake or update the data in it [25]
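The cost components listed above can be added up in a small sketch. The function name, its parameters, and all rates below are hypothetical placeholders chosen for illustration, not actual Azure prices; the real rates depend on region, tier, and redundancy.

```python
def monthly_data_lake_cost(stored_gb, read_ops, write_ops,
                           price_per_gb, price_per_10k_reads, price_per_10k_writes):
    """Sum the three cost components: storage, reads, and writes.

    write_ops is assumed to include the writes and updates made by the
    D365FO apps as well as those made by the customer's own tools.
    """
    storage = stored_gb * price_per_gb
    reads = read_ops / 10_000 * price_per_10k_reads
    writes = write_ops / 10_000 * price_per_10k_writes
    return {
        "storage": storage,
        "reads": reads,
        "writes": writes,
        "total": storage + reads + writes,
    }


# Placeholder volumes and rates (assumptions, not published prices).
cost = monthly_data_lake_cost(
    stored_gb=500,
    read_ops=2_000_000,
    write_ops=1_000_000,
    price_per_gb=0.02,
    price_per_10k_reads=0.005,
    price_per_10k_writes=0.06,
)
for component, amount in cost.items():
    print(f"{component}: {amount:.2f}")
```

The point of the sketch is only that I/O is billed separately from storage, so frequent updates from the apps add to the bill even when the stored volume stays constant.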

{prerequisite} the data lakes must be provisioned in the same country or region as the D365 environment [25]

the stored data comply with the CDM (Common Data Model) folder standard
integration makes warm-path reporting available as the default reporting option

because data in a data lake is updated within minutes [25]
approach acceptable for most reporting scenarios (incl. near-real-time reporting) [25]

{benefit} designed to store large amounts of data [25]
{benefit} designed for big data analytics [25]
{benefit} has many associated services that enable analytics, data transformation, and the application of AI and machine learning [25]

hierarchical namespace

organizes objects/files into a hierarchy of directories for efficient data access [27]
renaming or deleting a directory becomes a single atomic metadata operation on the directory [28]
no need to enumerate and process all objects that share the name prefix of the directory [28]
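A minimal sketch of why a directory rename is cheaper with a hierarchical namespace. This is a toy in-memory model, not the ADLS Gen2 implementation; `FlatStore` and `HierarchicalStore` are hypothetical names, and only one directory level is modeled.

```python
class FlatStore:
    """Flat namespace: a 'directory' is only a shared name prefix."""

    def __init__(self):
        self.objects = {}  # full object name -> content

    def rename_directory(self, old, new):
        # Every object that shares the prefix must be enumerated and renamed.
        touched = 0
        for name in [n for n in self.objects if n.startswith(old + "/")]:
            self.objects[new + name[len(old):]] = self.objects.pop(name)
            touched += 1
        return touched


class HierarchicalStore:
    """Hierarchical namespace: a directory is a real entry that owns its files."""

    def __init__(self):
        self.root = {}  # directory name -> {file name -> content}

    def rename_directory(self, old, new):
        # One metadata update on the directory entry; the files are untouched.
        self.root[new] = self.root.pop(old)
        return 1


files = [f"part-{i:04d}.csv" for i in range(1000)]

flat = FlatStore()
flat.objects = {f"raw/{f}": b"" for f in files}

hier = HierarchicalStore()
hier.root["raw"] = {f: b"" for f in files}

flat_ops = flat.rename_directory("raw", "staged")
hier_ops = hier.rename_directory("raw", "staged")
print(flat_ops, hier_ops)  # prints: 1000 1
```

In the flat model the work grows with the number of objects under the prefix, and a failure midway leaves the rename half done; in the hierarchical model it is one operation regardless of directory size.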

{feature} query acceleration

enables applications and analytics frameworks to significantly optimize data processing by retrieving only the data that they require to perform a given operation [28]
accepts filtering predicates and column projections which enable applications to filter rows and columns at the time that data is read from disk [28]

only the data that meets the conditions of a predicate is transferred over the network to the application [28]
reduces network latency and compute cost [28]

a query acceleration request is specified via SQL

a request processes only one file

advanced relational features of SQL aren't supported [28]

e.g. joins and GROUP BY aggregates

supports CSV and JSON formatted data as input to each request
compatible with the blobs in storage accounts that don't have a hierarchical namespace enabled on them [28]
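The effect of predicates and column projections can be illustrated with a small local simulation. It makes no call to Azure: `query_one_file` is a hypothetical stand-in for the storage service, the sample data is invented, and a Python callable takes the place of the SQL expression that a real request would carry.

```python
import csv
import io

# Invented sample file; in practice this would be a CSV blob in the storage account.
CSV_BLOB = (
    "order_id,country,amount\n"
    "1,DE,120.50\n"
    "2,US,80.00\n"
    "3,DE,15.25\n"
    "4,FR,300.00\n"
)


def query_one_file(blob_text, columns, predicate):
    """Filter rows and project columns before anything is returned to the caller.

    Like a query acceleration request, it works on a single file and does
    no joins or aggregates.
    """
    reader = csv.DictReader(io.StringIO(blob_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if predicate(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


# Roughly equivalent to: SELECT order_id, amount FROM <file> WHERE country = 'DE'
result = query_one_file(
    CSV_BLOB,
    columns=["order_id", "amount"],
    predicate=lambda row: row["country"] == "DE",
)

print(result, end="")
print(len(CSV_BLOB), "characters stored,", len(result), "characters returned")
```

Only the two matching rows and the two requested columns leave the "service", which is where the savings in network transfer and client-side compute come from.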
designed for distributed analytics frameworks

e.g. Apache Spark and Apache Hive

these engines include query optimizers that can incorporate knowledge of the underlying I/O service's capabilities when determining an optimal query plan for user queries [28]
these frameworks integrate query acceleration

⇒ users see improved query latency and a lower total cost of ownership without having to make any changes to the queries

includes a storage abstraction layer within the framework [28]

designed for data processing applications that perform large-scale data transformations that might not directly lead to analytics insights [28]

References
[25] Azure Synapse Analytics (2020) Azure Data Lake overview [link]

[27] Introduction to Azure Data Lake Storage Gen2 [link]

[28] Azure Data Lake Storage query acceleration [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
AI - Artificial Intelligence
BYOD - Bring Your Own Database
CSV - Comma-Separated Values
I/O - Input/Output
JSON - JavaScript Object Notation