Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.
Last updated: 21-Jan-2025
[Microsoft Fabric] Azure Data Lake Storage Gen2 (ADLS Gen2)
- {def} a set of capabilities built on Azure Blob Storage and dedicated to big data analytics [27]
- {capability} low-cost storage
- provides cloud storage that is less expensive than the cloud storage that relational databases provide [25]
- {capability} performant massive storage
- designed to service multiple petabytes of information while sustaining hundreds of gigabits of throughput [27]
- processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels [27]
- {capability} tiered storage
- {capability} high availability/disaster recovery
- {capability} file system semantics
- {capability} file-level security
- {capability} scalability
- supports a range of tools and programming languages that enable large amounts of data to be reported on, queried, and transformed [25]
- included in customer subscriptions
- customers can bring their data lake and integrate it with D365FO
- combines BYOD and Entity store [25]
- entity store
- staged in the data lake and provides a set of simplified (denormalized) data structures to make reporting easier [25]
- users can be given direct access to the relevant data and can create reports [25]
- instead of exporting data by using BYOD, customers can select the data that is staged in the data lake
- the data in the data lake are synchronized via the D365FO data feed service [25]
- reflects the updated D365FO data within a few minutes after the changes occur [25]
- further data can be brought into the data lake
- the data can be consumed via cloud-based services [25]
- {benefit} no need to monitor and manage complex data export and orchestration schedules [25]
- {benefit} no user intervention is required to update data in the data lake [25]
- {benefit} cost-effective data storage
- [licensing] included in a customer's subscription
- the customer must pay for
- data storage
- I/O costs that are incurred when data is read and written to the data lake [25]
- I/O costs because D365FO apps write data to the data lake or update the data in it [25]
- {prerequisite} the data lakes must be provisioned in the same country or region as the D365 environment [25]
- the stored data comply with the CDM (Common Data Model) folder standard
- integration makes warm-path reporting available as the default reporting option
- because data in a data lake is updated within minutes [25]
- approach acceptable for most reporting scenarios (incl. near-real-time reporting) [25]
- {benefit} designed to store large amounts of data [25]
- {benefit} designed for big data analytics [25]
- {benefit} has many associated services that enable analytics, data transformation, and the application of AI and machine learning [25]
- hierarchical namespace
- organizes objects/files into a hierarchy of directories for efficient data access [27]
- renaming or deleting a directory becomes a single atomic metadata operation on the directory [28]
- no need to enumerate and process all objects that share the name prefix of the directory [28]
- {feature} query acceleration
- enables applications and analytics frameworks to significantly optimize data processing by retrieving only the data that they require to perform a given operation [28]
- accepts filtering predicates and column projections which enable applications to filter rows and columns at the time that data is read from disk [28]
- only the data that meets the conditions of a predicate is transferred over the network to the application [28]
- reduces network latency and compute cost [28]
- a query acceleration request is specified via SQL
- a request processes only one file
- advanced relational features of SQL aren't supported [28]
- e.g. joins and GROUP BY aggregates
- supports CSV and JSON formatted data as input to each request
- compatible with the blobs in storage accounts that don't have a hierarchical namespace enabled on them [28]
- designed for distributed analytics frameworks
- e.g. Apache Spark and Apache Hive
- these engines include query optimizers that can incorporate knowledge of the underlying I/O service's capabilities when determining an optimal query plan for user queries [28]
- these frameworks integrate query acceleration
- ⇒ users see improved query latency and a lower total cost of ownership without having to make any changes to the queries
- includes a storage abstraction layer within the framework [28]
- designed for data processing applications that perform large-scale data transformations that might not directly lead to analytics insights [28]
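The filtering behaviour described for query acceleration (row predicates plus column projections applied to a single CSV file, so that only matching data reaches the application) can be illustrated with a small local sketch. This is not the Azure SDK: `CSV_BLOB`, `query_single_file` and the sample columns are hypothetical names made up for illustration, and the filtering runs in plain Python rather than in the storage service.

```python
import csv
import io

# Hypothetical sample data standing in for one CSV file in the data lake.
CSV_BLOB = (
    "id,region,amount\n"
    "1,EU,100\n"
    "2,US,250\n"
    "3,EU,75\n"
)


def query_single_file(csv_text, columns, predicate):
    """Local stand-in for a storage-side query on ONE file.

    Applies a row predicate and a column projection and returns only
    the data that satisfies both (assumption: header row present).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: row[name] for name in columns}  # column projection
        for row in reader
        if predicate(row)                      # row filter
    ]


# Roughly the intent of a request such as:
#   SELECT id, amount FROM BlobStorage WHERE region = 'EU'
result = query_single_file(
    CSV_BLOB,
    columns=["id", "amount"],
    predicate=lambda row: row["region"] == "EU",
)
print(result)
```

In the sketch only two of the three rows and two of the three columns are returned to the caller, which mirrors why less data has to cross the network. Joins or aggregates across files are deliberately absent, in line with the limitations listed above.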
References:
[25] Azure Synapse Analytics (2020) Azure Data Lake overview [link]
[27] Introduction to Azure Data Lake Storage Gen2 [link]
[28] Azure Data Lake Storage query acceleration [link]
Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
Acronyms:
AI - Artificial Intelligence
BYOD - Bring Your Own Database
CSV - Comma-Separated Values
I/O - Input/Output
JSON - JavaScript Object Notation