15 March 2025

🏭🗒️Microsoft Fabric: Azure Data Lake Storage Gen2 (ADLS Gen2) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Microsoft Fabric] Azure Data Lake Storage Gen2 (ADLS Gen2)

  • {def} a set of capabilities built on Azure Blob Storage and dedicated to big data analytics [27]
  • {capability} low-cost storage
    • provides cloud storage that is less expensive than the cloud storage that relational databases provide [25]
  • {capability} performant massive storage
    • designed to service multiple petabytes of information while sustaining hundreds of gigabits of throughput [27]
      • processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels [27]
  • {capability} tiered storage
  • {capability} high availability/disaster recovery
  • {capability} file system semantics
  • {capability} file-level security
  • {capability} scalability
  • supports a range of tools and programming languages that enable large amounts of data to be reported on, queried, and transformed [25]
  • included in customer subscriptions
    • customers can bring their data lake and integrate it with D365FO
  • combines BYOD and Entity store [25]
  • entity store 
    • staged in the data lake and provides a set of simplified (denormalized) data structures to make reporting easier [25]
      • users can be given direct access to the relevant data and can create reports [25]
    • instead of exporting data by using BYOD, customers can select the data that is staged in the data lake
      • the data in the data lake are synchronized via the D365FO data feed service [25]
      • reflects the updated D365FO data within a few minutes after the changes occur [25]
    • further data can be brought into the data lake
    • the data can be consumed via cloud-based services [25]
    • {benefit} no need to monitor and manage complex data export and orchestration schedules [25]
    • {benefit} no user intervention is required to update data in the data lake [25]
    • {benefit} cost-effective data storage
  • {licensing} included in a customer's subscription
    • the customer must pay for 
      • data storage
      • I/O costs that are incurred when data is read and written to the data lake [25]
      • I/O costs because D365FO apps write data to the data lake or update the data in it [25]
  • {prerequisite} the data lakes must be provisioned in the same country or region as the D365 environment [25]
    • the stored data comply with the Common Data Model (CDM) folder standard
    • integration makes warm-path reporting available as the default reporting option
      • because data in a data lake is updated within minutes [25]
      • approach acceptable for most reporting scenarios (incl. near-real-time reporting) [25]
  • {benefit} designed to store large amounts of data [25]
  • {benefit} designed for big data analytics [25]
  • {benefit} has many associated services that enable analytics, data transformation, and the application of AI and machine learning [25]
  • {feature} hierarchical namespace
    • organizes objects/files into a hierarchy of directories for efficient data access [27]
    • renaming or deleting a directory becomes a single atomic metadata operation on the directory [28]
    • no need to enumerate and process all objects that share the name prefix of the directory [28]
  • {feature} query acceleration 
    • enables applications and analytics frameworks to significantly optimize data processing by retrieving only the data that they require to perform a given operation [28]
    • accepts filtering predicates and column projections which enable applications to filter rows and columns at the time that data is read from disk [28]
      • only the data that meets the conditions of a predicate is transferred over the network to the application [28]
      • reduces network latency and compute cost [28]
        • a query acceleration request is specified via SQL
      • a request processes only one file
    • advanced relational features of SQL aren't supported [28] 
      • e.g. joins and GROUP BY aggregates
    • supports CSV and JSON formatted data as input to each request
    • compatible with the blobs in storage accounts that don't have a hierarchical namespace enabled on them [28]
    • designed for distributed analytics frameworks 
      • e.g. Apache Spark and Apache Hive
        • these engines include query optimizers that can incorporate knowledge of the underlying I/O service's capabilities when determining an optimal query plan for user queries [28]
        • these frameworks integrate query acceleration
          • ⇒ users see improved query latency and a lower total cost of ownership without having to make any changes to the queries 
      • includes a storage abstraction layer within the framework [28]
    • designed for data processing applications that perform large-scale data transformations that might not directly lead to analytics insights [28]
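
The query acceleration notes above (filtering predicates plus column projections applied at read time) can be illustrated with a small local simulation. This is only a sketch under stated assumptions: it does not call Azure or any SDK, the sample data and the helper name read_with_pushdown are hypothetical, and the real service expresses the filter as a SQL statement rather than a Python callable. It shows only that fewer bytes reach the caller when rows and columns are trimmed before the data is returned.

```python
import csv
import io

# Hypothetical sample file contents (not taken from the source notes).
CSV_DATA = (
    "order_id,country,amount\n"
    "1,DE,120.50\n"
    "2,US,80.00\n"
    "3,DE,15.25\n"
    "4,FR,60.10\n"
)


def read_with_pushdown(raw, columns, predicate):
    """Mimic a storage-side read on a single file: keep only the rows that
    satisfy the predicate and only the requested columns, then return the
    reduced CSV text to the caller."""
    reader = csv.DictReader(io.StringIO(raw))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if predicate(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


# Roughly the effect of selecting order_id and amount where country = 'DE'.
result = read_with_pushdown(
    CSV_DATA,
    columns=["order_id", "amount"],
    predicate=lambda r: r["country"] == "DE",
)

full_size = len(CSV_DATA.encode("utf-8"))    # bytes if the whole file is read
reduced_size = len(result.encode("utf-8"))   # bytes after filter and projection

print(result, end="")
print(f"full file: {full_size} bytes, after pushdown: {reduced_size} bytes")
```

As in the notes, the sketch works on one file per request and uses no joins or aggregates; anything beyond a row filter and a column list would have to happen in the calling application.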

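The hierarchical namespace notes above (a directory rename as a single metadata operation versus touching every object that shares a name prefix) can be sketched with two toy in-memory stores. This is a minimal illustration, not how the storage service is implemented: the dictionaries, function names, and paths are all hypothetical, and atomicity itself is not modeled, only the number of entries that have to be touched.

```python
def rename_prefix_flat(objects, old, new):
    """Flat namespace: 'directories' are only name prefixes, so a rename has
    to rewrite every object whose name starts with the old prefix.
    Returns the new store and the number of objects that were rewritten."""
    renamed = {}
    ops = 0
    for name, data in objects.items():
        if name.startswith(old + "/"):
            renamed[new + name[len(old):]] = data
            ops += 1  # one rewrite per object under the prefix
        else:
            renamed[name] = data
    return renamed, ops


def rename_dir_hierarchical(root, old, new):
    """Hierarchical namespace: a directory is a real node, so a rename
    re-links a single entry in its parent, whatever the subtree size.
    Returns the number of entries touched."""
    root[new] = root.pop(old)
    return 1


# Hypothetical layout: two files under raw/2025 and one under curated.
flat = {
    "raw/2025/a.csv": b"a",
    "raw/2025/b.csv": b"b",
    "curated/c.csv": b"c",
}
tree = {
    "raw": {"2025": {"a.csv": b"a", "b.csv": b"b"}},
    "curated": {"c.csv": b"c"},
}

flat_after, flat_ops = rename_prefix_flat(flat, "raw", "landing")
tree_ops = rename_dir_hierarchical(tree, "raw", "landing")

print(f"flat namespace: {flat_ops} object rewrites")
print(f"hierarchical namespace: {tree_ops} directory entry update")
```

In the flat case the work grows with the number of objects under the prefix, while the hierarchical case stays at one update, which is the point the notes make about not having to enumerate objects by prefix.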

References
[25] Azure Synapse Analytics (2020) Azure Data Lake overview [link]
[27] Introduction to Azure Data Lake Storage Gen2 [link]
[28] Azure Data Lake Storage query acceleration [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
AI - Artificial Intelligence
BYOD - Bring Your Own Database
CSV - Comma-Separated Values
I/O - Input/Output
JSON - JavaScript Object Notation

