
12 March 2024

Microsoft Fabric: OneLake (Notes)

Disclaimer: This is a work in progress intended to consolidate information from various sources.
Last updated: 12-Mar-2024

Microsoft Fabric & OneLake

OneLake

  • a single, unified, logical data lake for the whole organization [2]
    • designed to be the single place for all an organization's analytics data [2]
    • provides a single, integrated environment for data professionals and the business to collaborate on data projects [1]
    • stores all data in a single open format [1]
    • its data is governed by default
    • combines storage locations across different regions and clouds into a single logical lake, without moving or duplicating data
      • similar to how Office applications are prewired to use OneDrive
      • saves time by eliminating the need to move and copy data 
  • comes automatically with every Microsoft Fabric tenant [2]
    • automatically provisions with no extra resources to set up or manage [2]
    • used as the native store without needing any extra configuration [1]
  • accessible by all analytics engines in the platform [1]
    • all the compute workloads in Fabric are preconfigured to work with OneLake
      • compute engines have their own security models (aka compute-specific security) 
        • always enforced when accessing data using that engine [3]
        • these security conditions may not apply to users in certain Fabric roles when they access OneLake directly [3]
  • built on top of ADLS Gen2 [1]
    • supports the same ADLS Gen2 APIs and SDKs to be compatible with existing ADLS Gen2 applications [2] (see the Python sketch after this list)
    • inherits its hierarchical structure
    • provides a single-pane-of-glass file-system namespace that spans across users, regions and even clouds
  • data can be stored in any format
    • incl. Delta, Parquet, CSV, JSON
    • data can be addressed in OneLake as if it's one big ADLS storage account for the entire organization [2]
  • uses a layered security model built around the organizational structure of experiences within Microsoft Fabric [3]
    • derived from Microsoft Entra authentication [3]
    • compatible with user identities, service principals, and managed identities [3]
    • using Microsoft Entra ID and Fabric components, one can build robust security mechanisms across OneLake, keeping data safe while also reducing copies and minimizing complexity [3]
  • hierarchical in nature 
    • {benefit} simplifies management across the organization
    • its data is divided into manageable containers for easy handling
    • can have one or more capacities associated with it
      • different items consume different amounts of capacity at a given time
      • offered through Fabric SKU and Trials
  • {component} OneCopy
    • allows reading data from a single copy, without moving or duplicating it [1]
  • {concept} Fabric tenant
    • a dedicated space for organizations to create, store, and manage Fabric items.
      • there's often a single instance of Fabric for an organization, and it's aligned with Microsoft Entra ID [1]
        • ⇒ one OneLake per tenant
      • maps to the root of OneLake and is at the top level of the hierarchy [1]
    • can contain any number of workspaces [2]
  • {concept} capacity
    • a dedicated set of resources that is available at a given time to be used [1]
    • defines the ability of a resource to perform an activity or to produce output [1]
  • {concept} domain
    • a way of logically grouping together workspaces in an organization that is relevant to a particular area or field [1]
    • can have multiple [subdomains]
      • {concept} subdomain
        • a way of fine-tuning the logical grouping of the data
  • {concept} workspace 
    • a collection of Fabric items that brings together different functionality in a single tenant [1]
      • different data items appear as folders within those containers [2]
      • always lives directly under the OneLake namespace [4]
      • {concept} data item
        • a subtype of item that allows data to be stored within it using OneLake [4]
        • all Fabric data items store their data automatically in OneLake in Delta Parquet format [2]
      • {concept} Fabric item
        • a set of capabilities bundled together into a single component [4] 
        • can have permissions configured separately from the workspace roles [3]
        • permissions can be set by sharing an item or by managing the permissions of an item [3]
    • acts as a container that leverages capacity for the work that is executed [1]
      • provides controls for who can access the items in it [1]
        • security can be managed through Fabric workspace roles
      • enables different parts of the organization to distribute ownership and access policies [2]
      • is part of a capacity that is tied to a specific region and billed separately [2]
      • is the primary security boundary for data within OneLake [3]
    • represents a single domain or project area where teams can collaborate on data [3]
  • [encryption] encrypted at rest by default using Microsoft-managed key [3]
    • the keys are rotated appropriately per compliance requirements [3]
    • data is encrypted and decrypted transparently using 256-bit AES encryption, one of the strongest block ciphers available, and it is FIPS 140-2 compliant [3]
    • {limitation} encryption at rest using customer-managed key is currently not supported [3]
  • {general guidance} write access
    • users must be part of a workspace role that grants write access [4] 
    • this rule applies to all data items, so scope workspaces to a single team of data engineers [4]
  • {general guidance} lake access:
    • users must be part of the Admin, Member, or Contributor workspace roles, or have the item shared with them with ReadAll access [4]
  • {general guidance} general data access 
    • any user with Viewer permissions can access data through the warehouses, semantic models, or the SQL analytics endpoint for the Lakehouse [4] 
  • {general guidance} object-level security:
    • give users access to a warehouse or lakehouse SQL analytics endpoint through the Viewer role and use SQL DENY statements to restrict access to certain tables [4] (see the pyodbc sketch after this list)
  • {feature|preview} trusted workspace access
    • allows securely accessing firewall-enabled Storage accounts by creating OneLake shortcuts to them, and then using the shortcuts in Fabric items [5]
    • based on [workspace identity]
    • {benefit} provides secure seamless access to firewall-enabled Storage accounts from OneLake shortcuts in Fabric workspaces, without the need to open the Storage account to public access [5]
    • {limitation} available for workspaces in Fabric capacities F64 or higher
  • {concept} workspace identity
    • a unique identity that can be associated with workspaces that are in Fabric capacities
    • enables OneLake shortcuts in Fabric to access Storage accounts that have [resource instance rules] configured
    • {operation} creating a workspace identity
      • Fabric creates a service principal in Microsoft Entra ID to represent the identity [5]
  • {concept} resource instance rules
    • a way to grant access to specific resources based on the workspace identity or managed identity [5] 
    • {operation} create resource instance rules 
      • created by deploying an ARM template with the resource instance rule details [5]
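
Because OneLake exposes the same ADLS Gen2 endpoints, existing tooling can address it as if it were one big storage account for the whole tenant. Below is a minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages and hypothetical workspace and lakehouse names (MyWorkspace, MyLakehouse); the OneLake DFS endpoint takes the place of the storage-account URL, the workspace plays the role of the container (file system), and data items appear as top-level folders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# OneLake exposes a single ADLS Gen2 (DFS) endpoint for the whole tenant;
# workspaces map to file systems and items appear as top-level folders.
ONELAKE_URL = "https://onelake.dfs.fabric.microsoft.com"

credential = DefaultAzureCredential()  # Microsoft Entra ID based authentication
service = DataLakeServiceClient(account_url=ONELAKE_URL, credential=credential)

# The workspace name acts as the file system (container) name.
file_system = service.get_file_system_client("MyWorkspace")

# List the files stored under a lakehouse item (hypothetical names);
# data items show up as folders such as "MyLakehouse.Lakehouse/Files".
for path in file_system.get_paths(path="MyLakehouse.Lakehouse/Files"):
    print(path.name)

In Spark, the same data can be addressed with an abfss URI of the form abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<item>.<item type>/<path>.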
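
As an illustration of the object-level security guidance above, the sketch below issues a DENY statement against the SQL analytics endpoint from Python. The endpoint name, database, table and role are hypothetical placeholders (copy the real endpoint from the item's SQL connection string); it assumes the pyodbc package and the ODBC Driver 18 for SQL Server with interactive Microsoft Entra authentication.

import pyodbc

# Connection string placeholders: the server, database, table and role names
# below are made up for illustration.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<sql-analytics-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=MyLakehouse;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # Users keep Viewer access to the endpoint, while reads on a sensitive
    # table are explicitly denied for a given database role (or user).
    cursor.execute("DENY SELECT ON dbo.SensitiveTable TO [SalesAnalysts];")
    conn.commit()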
Acronyms:
ADLS - Azure Data Lake Storage
AES - Advanced Encryption Standard 
ARM - Azure Resource Manager
FIPS - Federal Information Processing Standard
SKU - Stock Keeping Unit

References:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn (2023) OneLake, the OneDrive for data (link)
[3] Microsoft Learn (2023) OneLake security (link)
[4] Microsoft Learn (2023) Get started securing your data in OneLake (link)
[5] Microsoft Fabric Updates Blog (2024) Introducing Trusted Workspace Access for OneLake Shortcuts, by Meenal Srivastva (link)



13 February 2024

Business Intelligence: A One Man Show II (In the Cusps of Complexity)

Business Intelligence Series

Today I watched on YouTube the Power BI Tips episode "One Person to Do Everything", which I had missed last week. Its main topic is based on Christopher Laubenthal's article "Why one person can't do everything in the data space". The author's arguments rest on an analogy between the various data areas and a college's functional structure. Reading the article, I must say that a poorly chosen analogy only makes a messy topic messier!

One of the most confusing things is that there are so many data-related, context-dependent roles with considerable overlap that it becomes more and more difficult to understand what they cover. The author considers the roles of Data Architect, Data Engineer, Database Administrator (DBA), Data Analyst, Information Designer and Data Scientist. However, for every aspect of a data architecture there are also developers on the database (backend) and reporting (frontend) side. Moreover, there are other data professionals on the management side for the various knowledge areas of Data Management: Data Governance, Data Strategy, Data Security, Data Operations, etc. There are also roles at the border between the business and the technical side, like Data Stewards, Business Analysts, Data Citizens, etc.

There are two main aspects here. From a historical perspective, many of these roles appeared when a new set of requirements or a new layer was added to the architecture. First, it was perhaps the DBA, whose primary task was to administer the database. Being a keeper of the data with some knowledge of the data entities, he/she could easily export data for the various reporting needs. In time such activities were taken over by a second category of data professionals. Then the data were moved to Decision Support Systems and later to Data Warehouses and Data Lakes/Lakehouses, an evolution that required other professionals to address the challenges of each layer. Every activity performed on the data requires a certain type of knowledge that can, in the end, result in a new role designation.

The second perspective results from the management of data and the knowledge areas associated with it. While in small organizations with one or two systems in place one doesn't need to talk about Data Operations, in big organizations, where a data center or something similar may be in place, Data Operations can easily become a topic of its own, requiring a management structure for its "effective and efficient" handling. The same can happen in the other knowledge areas and their interaction with the business. There's an inherent tendency to answer complexity with complexity, which in the long term can work to the detriment of any business. In extremis, organizations tend to have a whole team in each area, which can further increase the overall complexity by a small to not-so-small magnitude.

Fortunately, one of the benefits of technological advancement is that much of the complexity can be moved somewhere else, and these are the areas where the cloud brings the most advantages. Parts of the architecture, or all of it, can be deployed into the cloud and managed by cloud providers and third parties on an on-demand basis at stable costs. Moreover, with the increasing maturity and integration of the various layers, the impact of the various roles in the overall picture is reduced considerably, as areas like governance, security or operations are built in as services, thus requiring fewer resources.

With Microsoft Fabric, all the data needed for reporting becomes, in theory, easily available in OneLake. Unfortunately, there is another type of complexity that gets dumped on other professionals' shoulders, and these aspects need to be considered further.


Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)
[2] Power BI tips (2024) Ep.292: One Person to Do Everything (link)


