Showing posts with label notes. Show all posts
Showing posts with label notes. Show all posts

09 December 2024

🏭🗒️Microsoft Fabric: Microsoft Fabric [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 8-Dec-2024

Microsoft Fabric 

  • {goal}complete (end-to-end) analytics platform [6]
    • {characteristic} unified
      • {objective} provides a single, integrated environment for all the organization
        • {benefit} data professionals and the business users can collaborate on data projects [5] and solutions
    • {characteristic}serverless SaaS model (aka SaaS-ified)
      • {objective} provisioned automatically with the tenant [6]
      • {objective} highly scalable [5]
      • {objective} cost-effectiveness [5]
      • {objective} accessible 
        • ⇐ from anywhere with an internet connection [5]
      • {objective} continuous updates
        • ⇐ provided by Microsoft
      • {objective} continuous maintenance 
        • ⇐ provided by Microsoft
      • provides a set of integrated services that enable to ingest, store, process, and analyze data in a single environment [5]
    • {objective} secure
    • {objective} governed
  • {goal} lake-centric
    • {characteristic} OneLake-based
      • all workloads automatically store their data in the OneLake workspace folders [6]
      • all the data is organized in an intuitive hierarchical namespace [6]
      • data is automatically indexed [6]
      • provides a set of features 
        • discovery
        • MIP labels
        • lineage
        • PII scans
        • sharing
        • governance
        • compliance
    • {characteristic} one copy
      • available for all computes 
      • all compute engines store their data automatically in OneLake
        •  the data is stored in a (single) common format
          •  delta parquet file format
            • open standards format
            • the storage format for all tabular data in Microsoft Fabric 
        • ⇐ the data is directly accessible by all the engines [6]
          • ⇐ no import/export needed
      • all compute engines are fully optimized to work with Delta Parquet as their native format [6]
      • a shared universal security model is enforced across all the engines [6]
    • {characteristic} open at every tier
  • {goal} empowering
    • {characteristic} intuitive
    • {characteristic} built into M365
    • {characteristic} insight to action
  • {goal} AI-powered
    • {characteristic} Copilot accelerated 
    • {characteristic} ChatGPT enabled
    • {characteristic} AI-driven insights
  •  complete analytics platform
    • addresses the needs of all data professionals and business users who target harnessing the value of data 
  • {feature} scales automatically
    • the system automatically allocates an appropriate number of compute resources based on the job size
    • the cost is proportional to total resource consumption, rather than size of cluster or number of resources allocated 
    •  jobs in general complete faster (and usually, at less overall cost)
      • ⇒ not need to specify cluster sizes
  • natively supports 
    • Spark
    • data science
    • log-analytics
    • real-time ingestion and messaging
    • alerting
    • data pipelines, and 
    • Power BI reporting 
    • interoperability with third-party services 
      • from other vendors that support the same open 
  • data virtualization mechanisms 
    • {feature} mirroring [notes]
    • {feature} shortcuts [notes]
      • allow users to reference data without copying it
      • {benefit} make other domain data available locally without the need for copying data
  • {feature} tenant (aka Microsoft Fabric tenantMF tenant)
    • a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
    • can contain any number of workspaces
  • {feature} workspaces
    • {definition} a collection of items that brings together different functionality in a single environment designed for collaboration
    • associated with a domain [3]
  • {feature} domains [notes]
    • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • subdomains
      • a way for fine tuning the logical grouping data under a domain [1]
        • subdivisions of a domain

Acronyms:
API - Application Programming Interface
M365 - Microsoft 365
MF - Microsoft Fabric
PII - Personal Identification Information
SaaS - software-as-a-service

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric [link]
[2] Microsoft Learn: Fabric (2024) Governance overview and guidance [link]
[3] Microsoft Learn: Fabric (2023) Fabric domains [link]
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam [link]
[5] Microsoft Learn: Fabric (2024) Introduction to end-to-end analytics using Microsoft Fabric [link]
[6] 
Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]

🏭🗒️Microsoft Fabric: Delta Lake [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 1-Apr-2024

Delta Lake 

  • {definition} an optimized open source storage layer that runs on top of a data lake [1]
    • the default storage format in a Fabric lakehouse [1]
    • stores data in Parquet file format
    • is a variant of log-structured files 
    • initially developed at Databricks
    • fully compatible with Apache Spark APIs [1]
  • {characteristic} high reliability
  • {characteristic} secure
  • {characteristic} performant
    • provides low latency
  • {feature} data indexing
    • indexes are created and maintained on the ingested data  [1]
      • increases the querying speed significantly [1]
  • {feature} data skipping
    • file statistics are maintains so that data subsets relevant to the query are used instead of entire tables - this partition pruning avoids processing data that is not relevant to the query [1]
    • helps complex queries to read only the relevant subsets to fulfil query [1]
  • {feature} multidimensional clustering
    • uses the Z-ordering algorithm
    • enables data skipping this.
  • {feature} compaction
    • compacts or combines multiple small files into more efficient larger ones [1]
      • speeds up query performance
    • storing and accessing  small files can be processing-intensive, slow and inefficient from a storage utilization perspective [1]
  • {feature} data caching
    • highly accessed data is automatically cached to speed access for queries
  • {feature} ACID transactions
    • "all or nothing" ACID transaction approach is employed to prevent data corruption
      • ⇐ partial or failed writes risk corrupting the data [1]
  • {feature} snapshot isolation (aka SI)
    • ensures that multiple writers can write to a dataset simultaneously without interfering with jobs reading the dataset [1]
  • {feature} schema enforcement
    • data can be stored using a schema
    • {benefit} helps ensure data integrity for ingested data by providing schema enforcement [1]
      • potential data corruption with incorrect or invalid schemas is avoided [1]
  • {feature} checkpointing 
    • employed to provide a robust exactly once delivery semantic [1]
    • {benefit} ensures that data is neither missed nor repeated erroneously [1]
  • {feature} UPSERTS and DELETES support
    • provide a more convenient way of dealing with such changes  [1]
  • {feature} unified streaming and batch data processing
    • both batch and streaming data are handled via a direct integration with Structured Streaming for low latency updates  [1]
      • {benefit} simplifies the system architecture  [1]
      • {benefit} results in shorter time from data ingest to query result  [1]
    • can concurrently write batch and streaming data to the same data table [1]
  • {feature} schema evolution
    •  schema is inferred from input data
    • {benefit} reduces the effort for dealing with schema impact of changing business needs at multiple levels of the pipeline/data stack [1]
  • {feature} scalable metadata handling
  • {feature} predictive optimization
    • removes the need to manually manage maintenance operations for delta tables [8]
    • {enabled} automatically identifies tables that would benefit from maintenance operations, and then optimizes their storage
  • {feature} historical retention
    • {default} maintains a history of all changes made [4]
    • {benefit} enhanced regulatory compliance and audit
    • {recommendation} keep historical data only for a certain period of time to reduce storage costs [4]
  • {feature} time travel
    • {benefit} support for data rollback 
    • {benefit} lets users query point-in-time snapshots [5]
  • {best practice} all writes and reads should go through Delta Lake [1]
    • {benefit} ensure consistent overall behavior [1]
  • {best practice}run OPTIMIZE Regularly
    • {exception} should not be run on base or staging tables [1]
  • {best practice} run VACUUM Regularly
    • cleans up expired snapshots that are no longer required [1]
  • {best practice} use MERGE INTO to batch changes 
    • {benefit} allows to efficiently rewrite queries to implement updates to archived data and compliance workflows [5]
  • {best practice} use DELETE commands
    • {benefit} ensures proper progression of the change [1] 
    • {warning} manually deleting files from the underlying storage is likely to break the table [1]

Acronyms:
ACID - atomicity, consistency, isolation, durability

References:
[1] Azure Databricks (2023) Delta Lake on Azure Databricks
[2] Josep Aguilar-Saborit et al, POLARIS: The Distributed SQL Engine in Azure Synapse, PVLDB, 13(12), 2020 (link)
[3] Josep Aguilar-Saborit et al, Extending Polaris to Support Transactions 2024
[4] Implement medallion lakehouse architecture in Microsoft Fabric (link)
[5] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment13(12) (link)
[6] 
Bennie Haelen & Dan Davis (2024) Delta Lake: Up and Running Modern Data Lakehouse Architectures with Delta Lake

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 8-Dec-2024

[Microsoft Fabric] Shortcut

  • {def} object that points to other internal or external storage location (aka shortcut) [1] and that can be used for data access
    • {goal} unifies existing data without copying or moving it [2]
      • ⇒ data can be used multiple times without being duplicated [2]
      • {benefit} helps to eliminate edge copies of data [1]
      • {benefit} reduces process latency associated with data copies and staging [1]
    • is a mechanism that allows to unify data across domains, clouds, and accounts through a namespace [1]
      • ⇒ allows creating a single virtual data lake for the entire enterprise [1]
      • ⇐ available in all Fabric experiences [1]
      • ⇐ behave like symbolic links [1]
    • independent object from the target [1]
    • appear as folder [1]
    • can be used by workloads or services that have access to OneLake [1]
    • transparent to any service accessing data through the OneLake API [1]
    • can point to 
      • OneLake locations
      • ADLS Gen2 storage accounts
      • Amazon S3 storage accounts
      • Dataverse
      • on-premises or network-restricted locations via PDF 
  • {capability} create shortcut to consolidate data across artifacts or workspaces, without changing data's ownership [2]
  • {capability} data can be compose throughout OneLake without any data movement [2]
  • {capability} allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement [2]
    • ⇐ makes OneLake the first multi-cloud data lake [2]
  • {capability} provides support for industry standard APIs
    • ⇒ OneLake data can be directly accessed via shortcuts by any application or service [2]
  • {operation} creating a shortcut
    • can be created in 
      • lakehouses
      • KQL databases
        • ⇐ shortcuts are recognized as external tables [1]
    • can be created via 
      • Fabric UI 
      • REST API
    • can be created across items [1]
      • the item types don't need to match [1]
        • e.g. create a shortcut in a lakehouse that points to data in a data warehouse [1]
    • [lakehouse] tables folder
      • represents the managed portion of the lakehouse 
        • shortcuts can be created only at the top level [1]
          • ⇒ shortcuts aren't supported in other subdirectories [1]
        • if shortcut's target contains data in the Delta\Parquet format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table [1]
    • [lakehouse] files folder
      • represents the unmanaged portion of the lakehouse [1]
      • there are no restrictions on where shortcuts can be created [1]
        • ⇒ can be created at any level of the folder hierarchy [1]
        • ⇐table discovery doesn't happen in the Files folder [1]
  • {operation} renaming a shortcut
  • {operation} moving a shortcut
  • {operation} deleting a shortcut 
    • doesn't affect the target [1]
      • ⇐ only the shortcut object is deleted [1]
        • ⇐ the shortcut target remains unchanged [1]
    • shortcuts don't perform cascading deletes [1]
    • moving, renaming, or deleting a target path can break the shortcut [1]
  • {operation} delete file/folder
    • file or folder within a shortcut can be deleted when the permissions in the shortcut target allows it [1]
  • {permissions} users must have permissions in the target location to read the data [1]
    • when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1]
    • when accessing shortcuts through Power BI semantic models or T-SQL, the calling user’s identity is not passed through to the shortcut target [1]
      •  the calling item owner’s identity is passed instead, delegating access to the calling user [1]
    • OneLake manages all permissions and credentials
  • {feature} shortcut caching 
    • {def} mechanism used to reduce egress costs associated with cross-cloud data access [1]
      • when files are read through an external shortcut, the files are stored in a cache for the Fabric workspace [1]
        • subsequent read requests are served from cache rather than the remote storage provider [1]
        • cached files have a retention period of 24 hours
        • each time the file is accessed the retention period is reset [1]
        • if the file in remote storage provider is more recent than the file in the cache, the request is served from remote storage provider and the updated file will be stored in cache [1]
        • if a file hasn’t been accessed for more than 24hrs it is purged from the cache [1]
    • {restriction} individual files greater than 1GB in size are not cached [1]
    • {restriction} only GCS, S3 and S3 compatible shortcuts are supported [1]
  • {limitation} maximum number of shortcuts [1] 
    • per Fabric item: 100,000
    • in a single OneLake path: 10
    • direct shortcuts to shortcut links: 5
  • {limitation} ADLS and S3 shortcut target paths can't contain any reserved characters from RFC 3986 section 2.2 [1]
  • {limitation} shortcut names, parent paths, and target paths can't contain "%" or "+" characters [1]
  • {limitation} shortcuts don't support non-Latin characters[1]
  • {limitation} Copy Blob API not supported for ADLS or S3 shortcuts[1]
  • {limitation} copy function doesn't work on shortcuts that directly point to ADLS containers
    • {recommended} create ADLS shortcuts to a directory that is at least one level below a container [1]
  • {limitation} additional shortcuts can't be created inside ADLS or S3 shortcuts [1]
  • {limitation} lineage for shortcuts to Data Warehouses and Semantic Models is not currently available[1]
  • {limitation} it may take up to a minute for the Table API to recognize new shortcuts [1]
References:
[1] Microsoft Fabric (2024) OneLake shortcuts [link]
[2] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]

Acronyms:
ADLS - Azure Data Lake Storage
AWS - Amazon Web Services
GCS - Google Cloud Storage
KQL - Kusto Query Language
OPDG - on-premises data gateway

10 November 2024

🏭🗒️Microsoft Fabric: Data Mesh [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 23-May-2024

[Microsoft Fabric] Data Mesh
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    •   a centrally managed network of decentralized data products
  • {concept} landing zone
    • typically a subscription that needs to be governed by a common policy [7]
      • {downside} creating one landing zone for every project can lead to too many landing zones to manage
        • {alternative} landing zones based on a business domain [7] 
    •  resources must be managed efficiently in a way that each team is given access to only their resources [7]
      •   shared resources might be need with separate management and common access to all [7]
    • need to be linked together into a mesh
      • via peer-to-peer networks
  • {concept} connectivity hub
  • {feature} resource group
    • {definition} a container that holds related resources for an Azure solution 
    • can be associated with a data product
      • when the data product becomes obsolete, the resource group can be deleted [7]
  • {feature} subscription
    • {definition} a logical unit of Azure services that are linked to an Azure account
    • can be associated as a landing zone governed by a policy [7]
  • {feature} tenant (aka Microsoft Fabric tenantMF tenant)
    • a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
    • can contain any number of workspaces
  • {feature} workspaces
    • {definition} a collection of items that brings together different functionality in a single environment designed for collaboration
    • associated with a domain [3]
  • {feature} domains
    • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • some tenant-level settings for managing and governing data can be delegated to the domain level [2]
  • {feature} subdomains
    • a way for fine tuning the logical grouping data under a domain [1]
    • subdivisions of a domain
  • {concept} deployment template

References
[1] Microsoft Learn: Fabric (2023) Fabric domains (link)
[2] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link
[3] Data mesh: A perspective on using Azure Synapse Analytics to build data products, by Amanjeet Singh (link)
[4] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale
[5] Marthe Mengen (2024) How do you set up your Data Mesh in Microsoft Fabric? (link)
[6] Administering Microsoft Fabric - Considering Data Products vs Domains vs Workspaces, by Paul Andrew (link)
[7] Aniruddha Deswandikar (2024) Engineering Data Mesh in Azure Cloud

🏭🗒️Microsoft Fabric: Data Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric
Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Data Warehouse

  • highly available relational data warehouse that can be used to store and query data in the Lakehouse
    • supports the full transactional T-SQL capabilities 
    • modernized version of the traditional data warehouse
  • unifies capabilities from Synapse Dedicated and Serverless SQL Pools
  • modernized with key improvements
  • resources are managed elastically to provide the best possible performance
    • ⇒ no need to think about indexing or distribution
    • a new parser gives enhanced CSV file ingestion time
    • metadata is now cached in addition to data
    • improved assignment of compute resources to milliseconds
    • multi-TB result sets are streamed to the client
  • leverages a distributed query processing engine
    • provides with workloads that have a natural isolation boundary [3]
      • true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]
  • {operation} data ingestion
    • involves moving data from source systems into the data warehouse [2]
      • the data becomes available for analysis [1]
    • via Pipelines, Dataflows, cross-database querying, COPY INTO command
    • no need to copy data from the lakehouse to the data warehouse [1]
      • one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]
  • {operation} data storage
    • involves storing the data in a format that is optimized for analytics [2]
  • {operation} data processing
    • involves transforming the data into a format that is ready for consumption by analytical tools [1]
  • {operation} data analysis and delivery
    • involves analyzing the data to gain insights and delivering those insights to the business [1]
  • {operation} designing a warehouse (aka warehouse design)
    • standard warehouse design can be used
  • {operation} sharing a warehouse (aka warehouse sharing)
    • a way to provide users read access to the warehouse for downstream consumption
      • via SQL, Spark, or Power BI
    • the level of permissions can be customized to provide the appropriate level of access
  • {feature} mirroring 
    • provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric
      • any database can be accessed and managed centrally from within Fabric without having to switch database clients
      • data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload
  • {concept}SQL analytics endpoint 
    • a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]
  • {concept}virtual warehouse
    • can containing data from virtually any source by using shortcuts [3]
  • {concept} cross database querying 
    • enables to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3]
References:
[1] Microsoft Learn: Fabric (2023) Get started with data warehouses in Microsoft Fabric (link
[2] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: choose a data store (link)
[3] Microsoft Learn: Fabric (2024) What is data warehousing in Microsoft Fabric? (link)
[4] Microsoft Learn: Fabric (2023) Better together: the lakehouse and warehouse (link)

Resources:
[1] Microsoft Learn: Fabric (2023) Data warehousing documentation in Microsoft Fabric (link)


15 June 2024

🗒️Graphical Representation: Bar & Column Charts [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 15-Jun-2024

Bar & Column Charts with Variations
Bar & Column Charts (Graphs) 

  • {definition} graphical representation of categorical data with rectangular figures (aka boxes) whose heights (column chart) or lengths (bar chart) are proportional to the values that they represent
  • {benefit} allow to visually encode/decode quantitative information-size as magnitude and area based on the relative position of the end of the box along the common scale
    • if the width of the box is the same, it's enough to compare the length
      • ⇒ the basis of comparison is one-dimensional [1]
      • ⇐ orient the reader to the relative magnitudes of the boxes
    • area is typically encoded when the width varies
      • ⇐ encoding by area is a poor encoding method as it can mislead
    • can represent negative and positive values 
    • one of the most useful, simple, and adaptable techniques in graphic presentation [1]
      • easily understood by readers
      • sometimes avoided because they are so common
      • almost everything could be a bar chart
    • the length of each bar is proportional to the quantity or amount of each category represented [1]
      • ⇒the zero line must be shown [1]
      • ⇒the scale must not be broken [1]
        • {exception} an excessively long bar in a series of bars may be broken off at the end, and the amount involved shown directly beyond it [1]
  • {benefit} allow to visually represent categorical data
    • ⇒ occasionally represented without scales, grid lines or tick marks
    • the more data elements are presented, the more difficult it becomes to navigate and/or display the data
  • {benefit} allow us to easily compare magnitudes 
    • sometimes without looking at the actual values
  • {type} bar chart
    • the box is shown horizontally
    • represents magnitude by length
    • allows comparing different items as of a specific time
  • {type} column chart
    • the box is shown vertically
    • represents magnitude by height
    • allows comparing different items over time
      • ⇐ it still displays discrete points
    • recommended for comparing similar items for different time periods [2]
    • effective way to show most types of comparisons [2]
  • {subtype} stacked chart
    • variation of bar/column charts in which the boxes of a dimension's components are staked over each other
      • {exception} spaces can be used between boxes if the values aren't cumulative [3]
    • {benefit} allows encoding a further dimension where the values are staked within the same box
    • {drawback} do not show data structure well
      • ⇒ make it challenging to compare values across boxes
  • {subtype} 100-percent chart
    • variation of stacked chart in which the magnitude totals to 100%
    • {benefit} allows to display part to whole relationships
      • ⇐ preferable to circle chart's angle and area comparison [1]
  • {subtype} clustered chart (aka grouped chart)
    • variation of bar/column charts that allows encoding further quantitative information in distinct boxes tacked together which occasionally overlap
      • ⇐ if there's space, it is usually kept to a minimum
      • e.g. can be used to display multiple data series 
    • can be used with a secondary axis
    • {benefit} allows comparisons within the cluster/group as well between clusters/groups
    • {drawback} more challenging to make comparisons across points
  • {subtype} area chart (variable-width/variwide chart/graph
    • variation of bar/column charts in which the height/width have significance being proportional to some measure or characteristics of the data elements represented [3]
    • {benefit} allow encoding a further dimension as part of the area
  • {subtype} deviation chart 
    • variation of bar/column charts that display positive and negative values 
  • {subtype} joined chart
    • variation of bar/column charts in which the boxes are tacked together
    • {benefit} allow to better use the space available 
  • {subtype} paired chart 
    • variation of bar/column charts in which the boxes are paired in mirror based on an axis
      • e.g. the values of one data series are displayed to the left, while the values for a second data series are displayed to the right 
    • {benefit} allows to study the correlation and/or other relationships between the values of two data series
    • the hidden axes can have different scales 
  • {subtype} circular chart (aka radial chart)
    • variation of bar/column charts in which the boxes are wrapped into a circle, the various categories being uniformly spaced along the radial or category axis [3]
    • the value scale can have any upper or lower value and can progress in either direction [3]
    • {benefit} useful to represent data that have a circular dimension in an aesthetic form
      • e.g. months, hours
  • {subtype} waterfall chart (aka progressing chart)
    • variation of bar/column charts in which the boxes are displayed progressively, the start of a box corresponding the end of the previous box 
    • time and activity charts can be considered as variations of this subtype [3]
    • {advantage} allows to determine cumulative values, respectively the increase/decrease between consecutive boxes
  • {subtype}composite chart (aka mixed chartcombination chart, overlay chart)
    • variation of bar/column charts in besides boxes are used other graphic types of encoding (line, area)
      • ⇐ the different data graphics are overlaid on one another [3]
    • {benefit} allows to improve clarity or highlight the relationships between several data series [3]
    • {drawback} overlaying can result in clutter 
  • used to  
    • display totals, averages or frequencies
    • display time series
    • display the relationship between two or more items
    • make a comparison among several items
    • make a comparison between parts and the whole
  • can be confounded with 
    • [histograms]
      • show distribution through the frequency of quantitative values against defined intervals of quantitative values
      • used for continuous numerical data or data that can be effectively modelled as continuous
      • it doesn't have spaces between bars
        • ⇐ older use of bar/column charts don't use spaces
        • if this aspect is ignored, histograms can be considered as a special type of area chart
    • [vertical line chart] (aka price chart, bar chart)
      • vertical line charts are sometimes referred as bar charts (see [3])
  • things to consider
    • distance between bars
      • the more distant the bars, the more difficult it becomes to make comparisons and the accuracy of judgment decreases
    • sorting
      • sorting the bars/columns by their size facilitates comparisons, though it can impede items' search, especially when there are many categories involved
        • {exception} not recommended for time series
    • clutter
      • displaying too many items in a cluster and/or too many labels can lead to clutter
      • {recommendation} display at maximum 3-4 clustered boxes
    • color
      • one should follow the general recommendations 
    • trend lines
      • can be used especially with time series especially to represent the linear regression line
    • dual axis
      • {benefit} allows to compare the magnitudes of two data series by employing a secondary axis
    • overlapping
      • overlapping boxes can make charts easier to read
    • symbols
      • can be used to designate reference points of comparison for each of the bars [3]
  • {alternative} pie chart
    • can be used to dramatize comparisons in relation to the whole [2]
    • one should consider the drawbacks 
  • {alternative} choropleth maps
    • more adequate for geographical dimensions
    • provide minimal encoding 
  • {alternative} line charts
    • can be much more informative
    • provides an optimal dat-ink ratio
    • reduces the chart junk feeling
  • {alternative} dot plots
    • are closer to the original data

References:
[1] Anna C Rogers (1961) "Graphic Charts Handbook"
[2] Robert Lefferts (1981) "Elements of Graphics: How to prepare charts and graphs for effective reports"
[3] Robert L Harris (1996) "Information Graphics: A Comprehensive Illustrated Reference"

23 May 2024

🏭🗒️Microsoft Fabric: Domains [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 29-May-2024

Domains & Entities

[Microsoft Fabric] Domains

  • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • associated with workspaces 
      • {benefit} allows to group data into business domains [1]
      • all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [1]
      • {benefit} enables a better consumption experience [1]
        • simplify discovery and consumption 
  • provide a management boundary between tenant and workspace enabling domain admins to have more granular control over multiple workspaces [6]
    • some tenant-level settings for managing and governing data can be delegated to the domain level [2]
  • allow to achieve federated governance [7]
    • by delegating settings to domain admins
      • ⇒ allow provide more granular control over business area [7]
  • [security] domain roles
    • Fabric admins (or higher)
      • can create and edit domains
      • can specify domain admins and contributors
      • can associate workspaces with domains [4]
      • can see, edit, and delete all domains in the admin portal [4]
      • domain admins
    • business owners or experts of a domain
      • can update the domain description
      • can define contributors
      • can associate workspaces with the domain [4]
      • can define and update the domain image
      • can override tenant settings for any specific settings the tenant admin has delegated to the domain level [4]
      • can't delete the domain, change the domain name, or add/delete other domain admins
      • can only see and edit the domains they're admins of.
    • domain contributors
      • ⇐ must be a workspace admin
      • can associate their workspaces with a domain or change the current domain association
      • don’t have access to the Domains page in the admin portal
    • domain users
      • can share lakehouse with other domain users without giving access to workspace and other artifacts [4]
  • {concept} default domain
    • a domain that has been specified as the default domain for specific users and/or security groups [3]
      • ⇒ when these users/security groups create/update a new/unassigned workspace, that workspace will automatically be assigned to that domain [3]
      • ⇒ generally automatically become domain contributors of the workspaces that are assigned in this manner [3]
  • {feature} subdomains
    • a way for fine tuning the logical grouping data under a domain [1]
      • subdivisions of a domain
      • only one level is supported in the hierarchy
    • visible as part of the domains filter and as part of the item location path
    • no setup available
      • [planned] some domain settings will be added to subdomains as well

Acronyms:
MF - Microsoft Fabric

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link
[5] Microsoft Fabric Updates Blog (2024) Easily implement data mesh architecture with domains in Fabric, by Naama Tsafrir 
(link)
[6] Microsoft (2024) Microsoft Fabric Domains – Data Mesh [with Naama Tsafrir & Assaf Shemesh]
[7] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]


14 May 2024

🗒️Graphical Representation: Sparklines [Notes]

Example Sparklines within Groups

Sparkline

  • {definition} "small, intense, simple, word-sized graphic with typographic resolution" [1]
  • {timeline}
    • [1993] initially considered as 'intense continuous time-series' by Tufte [2]
    • [2006] the term is introduced by Tufte [1]
    • [2009] introduced in Microsoft Excel 2010
    • ⇐ see also [3] for a broader timeline
  • {characteristic} small
    • considered as tiny charts 
  • {characteristic} word-sized graphic (aka word-like)
    • comparable to words and letters
    • their distributions on a page are like sentences and paragraphs [1]
  • {characteristic} inline
    • can be everywhere a word or number can be
    • e.g. embedded in a sentence, table, headline, map, spreadsheet, graphic [1]
  • {characteristic} minimalistic
    • no grid lines 
    • other visual elements are used occasionally, though kept to a minimum 
  • {characteristic} approximate
    • it isn't meant to give precise values, though is precise enough for its scope
  • {characteristic} compact
    • vastly increase the amount of data within readers' eye-span
    • aggregates pattern along with plenty of local detail [1]
    • allows for speed and convenience [1]
  • {characteristic} provides context
    • enables us to put numbers in context 
  • {characteristic} typographic resolution
    • "work at intense resolutions, at the level of good typography and cartography" [1]
  • forms
    • line graph
    • bar chart 
    • win/loss
  • scope
    • show trends in a series of values
    • highlight maximum and minimum values
    • show recent change in relation to past changes
    • make comparisons across the lines and/or within groups
  • supports
    • time series
      • e.g.  seasonal increases or decreases, economic cycles
    • binary data
      • e.g. presence/absence, occurrence/non-occurrence, win/loss [1]
    • multivariate data
      • can simultaneously accommodate several variables
  • {recommendation} position a sparkline near its data for greatest impact

References:
[1] Edward R Tufte (2006) "Beautiful Evidence"
[2] Edward R Tufte (1983) "The Visual Display of Quantitative Information"
[3] Wikipedia (2023) Sparklines (link)

07 May 2024

🏭🗒️Microsoft Fabric: The Metrics Layer [Notes] [new feature]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 07-May-2024

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

[new feature] Metrics Layer (Metrics Store)

  • {definition}an abstraction layer available between the data store(s) and end users which allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse
    • ⇐ {important} feature still in private preview 
  • {goal} extend existing infrastructure 
    • {benefit} leverages and extends existing features
  • {goal} provide consistent definitions and descriptions [1]
    • consistent definitions that include besides business logic additional dimensions and filters [1]
    • ⇒ {benefit} allows to standardize the metrics across the organization
    • ⇒ {benefit} enforce to enforce a SSoT
  • {goal} easy management 
    • via management views 
    • [feature] lineage 
    • [feature] source control
    • [feature] duplicate identification
    • [feature] push updates to downstream uses of the metrics 
  • {goal}searchable and discoverable metrics 
    • {feature} integration
      • based on Sempy fabric package
        • ⇐ a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric
  • {goal}trust
    • [feature] trust indicators
    • {benefit} facilitates report's adoption
  • {feature} metric set 
    • {definition} a Fabric item that groups together a set of metrics into a mini-model
    • {benefit} allows to reduce the overall complexity of semantic models, while being easy to evolve and consume
    • associated with a single domain
      • ⇒ supports the data mesh architecture
    • shareable 
      • can be shared with other users
    • {action} create metric set
      • creates the actual artifact, to which metrics can be added 
  • {feature} metric
    •  {definition} a way to elevate the measures from the various semantic models existing in the organization
    • tied to the original semantic model
      • ⇒ {benefit} allows to see how a metric is used across the solutions 
    • reusable
      • can be reused in other fabric artifacts
        • new reports on the Power BI service
        • notebooks 
          • by copying the code
      • can be reused in Power BI
        • via OneLake data hub menu element
      • can be chained 
        • changes are propagated downstream 
    • materializable 
      • its output can be persisted to OneLake by saving it a delta table into a lakehouse
      • {misuse} data is persisted unnecessarily
    • {action} elevate metric
      • copies measure's definition and description
      • ⇒  implies restructuring, refactoring, moving, and testing a lot of code in the process
      • {misuse} data professionals build everything as metrics
    • {action} update metric
    • {action} add filters to metric 
    • {action} add dimensions to metric
    • {action} materialize metric 

Acronyms:
SSoT - single source of truth ()

References:
[1] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[2] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)

20 April 2024

⚡️🗒️Power BI: Visual Calculations [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 20-Apr-2024

[feature] Visual Calculations (aka Visual Calcs)

  • {definition} a type of DAX calculation that's defined and executed directly on a visual [1]
  • {benefit} make it easier to create calculations that were previously hard to create
    • leads to simpler DAX, easier maintenance, and better performance [1]
    • reuse the results from its components [2]
  • still in public preview
    • ⇒ must be enabled manually
  • are stored on the visual
    • ⇒ can refer to any data in the visual 
      • incl. columns, measures, or other visual calculations [1]
    • ⇒ anything in the model must be added to the visual before the visual calculation can refer to it [1]
    •  ⇒ they can refer to the visual structure
      • ⇒ leads to more flexibility
  • combine the simplicity of context from calculated columns with the on-demand calculation flexibility from measures [1]
    • ⇐ the context is "visible"
  • operate on aggregated data instead of the detail level [1]
    • ⇒ leads to performance benefits
  • introduce a set of functions specific to visual calculations [1]
    • {category} medium-level functions
      • {function} COLLAPSE
        • the calculation is evaluated at a higher level of the axis [1]
      • {function} COLLAPSEALL
        • the calculation is evaluated at the total level of the axis [1]
      • {function} EXPAND
        • the calculation is evaluated at a lower level of the axis [1]
      • {function}EXPANDALL
        • the calculation is evaluated at the leaf level of the axis [1]
      • {function} FIRST
        • refers to the first row of an axis [1]
      • {function} ISATLEVEL
        • reports whether a specified column is present at the current level.
      • {function} LAST
        • refers to the last row of an axis [1]
      • {function} NEXT
        • refers to a next row of an axis [1]
      • {function} PREVIOUS
        • refers to a previous row of an axis [1]
      • {function} RANGE
        • refers to a slice of rows of an axis [1]
    • {category} high-level functions
      • {function} MOVINGAVERAGE
        • adds a moving average on an axis [1]
      • {function} RUNNINGSUM
        • adds a running sum on an axis  [1]
    • {category} low-level functions
      • ⇐ {exception} are available in standard DAX
      • {function} INDEX
      • {function} OFFSET
      • {function} RANK
      • {function} ROWNUMBER
      • {function}WINDOW
      • {function} ORDERBY
      • {function} PARTITIONBY
      • {function} MATCHBY
  • {default} most of them are evaluated row-by-row [1]
    • ⇐ like a calculated column
    • there's no need to add an aggregation function [1]
      •  it's better not to add such aggregates when they're not necessary [1]
  • {operation} create calculation
    • adds the visual calculation to the visual
  • {operation} hide calculation
    • calculations that aren't needed in the visual can be hidden [2]
  • {operation} copy calculation
    • copies the calculation between visuals and if intermediary steps are not there, they will be copied as well [planned] [2]
  • {feature} templates 
    • ready available calculation constructs 
    • {benefit} make it easier to write common calculations [1]
  • {feature|planned} support for Scanner API [2]
  • {parameter} axis 
    • influences how the visual calculation traverses the visual matrix [1]
    • {default} set to the first axis in the visual
    • {value} ROWS
      • the visual calculation is evaluated row-by-row in the visual matrix, from top to bottom. [1]
    • {value} COLUMNS
      • the visual calculation is evaluated row-by-row in the visual matrix, from left to right [1]
    • {value} ROWS COLUMNS
      • calculates vertically across rows from top to bottom, continuing column by column from left to right [1]
    • {value} COLUMNS ROWS
      • calculates horizontally across columns from left to right, continuing row by row from top to bottom [1]
    • {warning} not all visuals provide all axes, and some visuals provide no axes [1]
  • {parameter| reset 
    • influences if and when the function resets its value to 0 or switches to a different scope while traversing the visual matrix [1]
    • expects there to be multiple levels on the axis [1]
      • ⇐ use PARTITIONBY if there's only one level on the axis [1]
    • {value|default} NONE
      • means the visual calculation is never restarted [1]
    • {value} HIGHESTPARENT 
      • resets the calculation when the value of the highest parent on the axis changes [1]
    • {value} LOWESTPARENT 
      • resets the calculations when the value of the lowest parent on the axis changes [1]
    • {value} numerical value
      • refers to the fields on the axis, with the highest field being one [1]
  • {limitation} functions that rely on model relationships  aren't available
    • e.g. USERELATIONSHIP, RELATED or RELATEDTABLE
  • {limitation} not all visual types are supported [1]
    • ⇐ for the full list of limitations see [1]
  • {limitation} one can't filter on visual calculations [1]
  • {limitation} underlying data can't be exported [1]
  • {limitation} don't support conditional formatting

References
[1] Microsoft Learn (2024) Power BI: Using visual calculations [preview] (link)
[2] SSBI Central (2024) Visual Calculations - Making DAX easier, with Jeroen ter Heerdt(link)

18 April 2024

🏭🗒️Microsoft Fabric: Mirroring [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 18-Apr-2024

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

[Microsoft Fabric] Mirroring

  • low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]
    • supported in Azure SQL Database, Azure Cosmos DB, Snowflake [1]
    • ⇒ replaces ETL pipelines
    • ⇐ manages the replication of data into OneLake and conversion to Parquet [1]
    • changes have a near real-time latency [3]
    • uses the source database’s CDC feature [4]
  • {benefit} allows users to use a highly integrated, end-to-end, and easy-to-use service [1]
    • the mirrored data can be used in Power BI
      • ⇐ since tables are all v-ordered delta tables [3]
  • {benefit} the delta tables can then be used in every Fabric experience, allowing users to accelerate their journey into Fabric [1]
    • {restriction} replication of views is currently not supported [3]
  • {benefit} allows to write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of Lakehouses in a single T-SQL query [1]
  • landing zone in OneLake 
    • stores both the snapshot and change data [3]
      • ⇐ improves performance when converting files into delta verti-parquet [3]
  • {operation} enable the mirroring 
    • by creating a secure connection to the operational data source
    • an entire database or individual tables can be replicated 
    • creates three items in the targeted workspace:
      • the mirrored database
      • a SQL analytics endpoint
      • a Default semantic model
    • the same source database can be mirrored multiple times
      • ⇐ though usually not needed
        • scenario: replicating data to different types of environments
  • {operation} stopping the mirroring
    • stops the replication in the source database, but a copy of the tables is kept in OneLake [3]
  • {operation} restarting the mirroring 
    • results in all data being replicated from the start [3]
  • {operation} remove a table from mirroring
    • the table is no longer replicated and its data is deleted from OneLake [3]
  • {operation} sharing
    • users grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]
      • access is also granted to the SQL analytics endpoint and associated default semantic model [1]
      • shared mirrored databases can be found through either
        • Data Hub
        • Shared with Me section in Microsoft Fabric
    • triggers an initial replication
  • {requirement} [licensing] requires Power BI Premium, Fabric Capacity, or Trial Capacity
  • {feature} monitoring
    • {benefit} allows to gain insights into mirroring operations and when the replica in Fabric OneLake was last refreshed [4]
  • [Azure SQL Database] 
    • {restriction} doesn't support Azure SQL Database logical servers behind an AVN or private networking [2]
      • {requirement} update the Azure SQL logical server firewall rules to Allow public network access [2]
      • {requirement} enable the Allow Azure services option to connect to your Azure SQL Database logical server [2]
    • {restriction}
      •  access through the Power BI Gateway or behind a firewall is unsupported [3]
    • authentication
      • SQL authentication with user name and password
      • Microsoft Entra ID
      • Service Principal
    • {troubleshooting} check if the changes properly flow
      • via sys.dm_change_feed_log_scan_sessions DMV
    • {troubleshooting} check if there are any problems reported
      • via sys.dm_change_feed_errors DMV
    • {troubleshooting} check if the mirroring was properly enabled
      • via sp_help_change_feed stored procedure 
    • {troubleshooting} disable mirroring
      • via sp_change_feed_disable_db stored procedure
Acronyms:
AVN - Azure Virtual Network
CDC - Change Data Capture
ETL - Extract, Transfer, Load
DMV - Dynamic Management View

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2023) (link)

Resources:
[1] Tutorial: Configure Microsoft Fabric mirrored databases from Azure SQL Database [Preview] (link)
[2] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

06 April 2024

🏭🗒️Microsoft Fabric: Data Governance [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 23-May-2024

[Microsoft Fabric] Data Governance

  • {definition}set of capabilities that help organizations to manage, protect, monitor, and improve the discoverability of data, so as to meet data governance (and compliance) requirements and regulations [2]
  • several built-in governance features are available to manage and control the data within Fabric (MF)  [1]
  • {feature} endorsement [aka content endorsement
    • {definition} formal process performed by admins to endorse MF items
    • {benefit} allows admins to designate specific MF items as trusted and approved for use across the organization [1]
      • establishes trust in data assets by promoting and certifying specific MF items [1]
        • users know which assets they can trust and rely on for accurate information [1]
      • endorsed assets are identified with a badge that indicates they have been reviewed and approved [1]
    • {scope} applies to all MF items except dashboards [1]
    • {benefit} helps admin manage the overall growth of items across your environment [1]
  • {feature} promoting [aka content promoting
    • {definition} formal process performed by contributors or admins to promote content
    • promoted content appears with a Promoted badge in the MF portal [1]
      • workspace members with the contributor or admin role can promote content within a workspace [1]
      • MF admin can promote content across the organization [1]
  • {feature} certification [aka content certification]
    • {definition} formal process that involves a review of the content by a designated reviewer and managed by the admin [1]
      • can be customized to meet organization’s needs [1]
      • users can request item certification from an admin [1]
        • via Request certification from the More menu [1]
      • the certified content appears with a Certified badge in the Fabric portal [1]
    • {benefit} allows organizations to label items considered to be quality items [1]
      • an organization can certify items to identify them an as authoritative sources for critical information [1]
        • ⇐ all Fabric items except Power BI dashboards can be certified [1]
    • {benefit} allows to specify certifiers who are experts in the domain [1]
    • domain level settings
      • enable or disable certification of items that belong to the domain [1]
    • provides a URL to documentation that is relevant to certification in the domain [1]
  • {feature} tenant (aka Microsoft Fabric tenant, MF tenant)
    • a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
    • can contain any number of workspaces
  • {feature} workspaces
    • {definition} a collection of items that brings together different functionality in a single environment designed for collaboration
    • can be assigned to teams or departments based on governance requirements and data boundaries [2]
    • are associated with domains [3]
      • ⇐ {benefit} allows to group data into business domains
      • all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [3]
        • ⇐ {benefit} enables a better consumption experience [1]
        • {benefit} enables better discoverability and governance [2]
  • {feature} domains [Notes]
    • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • allows to group data by business domains
      • ⇒{benefit} allows business domains to manage their data according to their specific regulations, restrictions, and needs [3]
    • {feature} subdomains
      • {definition} a way for fine tuning the logical grouping data under a domain [1]
        • ⇐ subdivisions of a domain
  • {feature} labeling
    • default labeling, label inheritance, and programmatic labeling, 
    • {benefit} help achieve maximal sensitivity label coverage across MF [2]
    • once labeled, data remains protected even when it's exported out of MF via supported export paths [2]
    • [Purview Audit] compliance admins can monitor activities on sensitivity labels
  • {feature|preview} folders
    • {definition} a way of logically grouping MF items
  • {feature|preview} tags
    • {benefit} allow managing Fabric items for enhanced compliance, discoverability, and reuse
  • {feature} scanner API
    • a set of admin REST APIs 
    • {benefit} allows to scan MF items for sensitive data [1]
    • can be used to scan both structured and unstructured data [1]
    • {concept} metadata scanning
      • facilitates governance of data by enabling cataloging and reporting on all the metadata of organization's Fabric items [1]
      • it needs to be set up by Admin before metadata scanning can be run [1]
  • {concept} data lineage
    • {definition} 
    • {benefit} allows to track the flow of data through Fabric [1]
    • {benefit} allows to see where data comes from, how it's transformed, and where it goes [1]
    • {benefit} helps understand the data available in Fabric, and how it's being used [1]
  • {concept} Fabric item (aka MF item)
    • {definition} a set of capabilities within an experience
      • form the building blocks of the Fabric platform
    • {type} data warehouse
    • {type} data pipeline
    • {type} semantic model
    • {type} reports
    • {type} dashboards
    • {type} notebook
    • {type} lakehouse
    • {type} metric set

Acronyms:
API - Application Programming Interface
MF - Microsoft Fabric

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link

31 March 2024

🏭🗒️Microsoft Fabric: Polaris [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Last updated: 31-Mar-2024

[Microsoft Fabric] Polaris

  • {definition} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]
    • the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
    • petabyte-scale execution [1]
    • highly-available micro-service architecture
      • data and query processing is packaged into units (aka tasks) [1]
        • can be readily moved across compute nodes and re-started at the task level [1]
    • can run directly over data in HDFS and in managed transactional stores [1]
  • [Azure Synapse] designed initially to execute read-only queries [1]
    • ⇐ the architecture behind serverless SQL pool
    • uses a completely new scale-out framework based on a distributed SQL Server query engine [1]
        • fully compatible with T-SQL
        • leverages SQL Server single-node runtime and QO [1]
  • [Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]
    • incl. updates, deletes and bulk loads [2]
    • based on [delta tables] and [delta lake]
      • the delta lake supports currently only transactions within one table [4]
    • ⇐ the architecture behind lakehouses
  • {goal} converge DWH and big data workloads [1]
    • the query engine scales-out for relational data and heterogeneous datasets stored in DFSs[1]
      • needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing
  • {goal} separate compute and state for cloud-native execution [1]
    • all services within a pool are stateless
      • data is stored durably in remote storage and is abstracted via data cells [1]
        • ⇐ data is naturally decoupled from compute nodes
    • the metadata and transactional log state is off-loaded to centralized services [[1]
    • multiple compute pools can transactionally access the same logical database [1]
  • {goal} cloud-first [2]
    • {benefit} leverages elasticity
    • transactions need to be resilient to node failures on dynamically changing topologies [2]
      •  ⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]
    • must ensure disaggregation of metadata and transactional state from compute nodes [2]
      • ⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]
        • ⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]
  • {goal} use optimized native columnar, immutable and open storage format [2]
    • uses delta format 
      • ⇐ optimized to handle read-heavy workloads with low contention [2] 
  • {goal} leverage the full potential of vectorized query processing for SQL [2]
  • {goal} support zero-copy data sharing with other services in the lake [2]
  • {goal} support read-heavy workloads with low contention [2]
  • {goal} support lineage-based features [2]
    • by taking advantage of delta table capabilities 
  • {goal} provide full SQL SI transactional support [2]
    • {benefit} all traditional DWH requirements are met [2]
      • incl. multi-table and multi-statement transactions [2]
        • ⇐ Polaris is the only system that supports this [2]
        • the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
        • mixes of transactions are supported as well
  • {objective} no cross-component state sharing [2] 
    • {principle} encapsulation of state within each component to avoid sharing state across nodes [2]
    • SI and the isolation of state across components allows to execute transactions as if they were queries [2]
      • ⇒ makes read and write transactions indistinguishable [2]
        • ⇒ allows to fully leverage its optimized distributed execution framework [2]
  • {objective} support snapshot Isolation (SI) semantics [2]
    • implemented over versioned data
    • allows reads (R) and writes (W) to proceed concurrently over their own data snapshot 
      • R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2] 
      • ⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]
        • follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4]
      • ⇐  R transactions at the snapshot isolation level create no contention
        •  ⇒  any number of R transactions can run concurrently [4]
    • the immutable data representation in LSTs allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]
      • similar to how temporary tables are discarded during query processing failures [2]
  • {feature} resize live workloads [1]
    • scales resources with the workloads automatically
  • {feature} deliver predictable performance at scale [1]
    • scales computational resources based on workloads' needs
  • {feature} efficiently handle both relational and unstructured data [1]
  • {feature} flexible, fine-grained task monitoring
    • a task is the finest grain of execution 
  • {feature} global resource-aware scheduling
    • enables much better resource utilization and concurrency than traditional DWHs
      • capable of handling partial query restarts
      • maintains a global view of multiple queries
    • it is planned to build on this a global view with autonomous workload management features
  • {feature} multi-layered data caching model
    • leverages 
      • SQL Server buffer pools for cashing columnar data
      • SSD caching
    • the delta table and its log are are immutable, they can be safely cached on cluster nodes [4]
  • {feature} tracks data lineage natively
    • the transaction log can also be used to audit logging based on the commit Info records [4]
  • {feature} versioning
    • maintain all versions as data is updated [1]
  • {feature} time-travel
    • {benefit} allows users query point-in-time snapshots
    • {benefit)} allows to roll back erroneous updates to the data.
  • {feature} table cloning
    • {benefit} allows to create a point-in-time snapshot of the data based on its metadata
  • {concept} state 
    • allows to drive the end-to-end life cycle of a SQL statement with transactional guarantees and top tier performance [1]
    • comprised of 
      • cache
      • metadata
      • transaction logs
      • data
    • [on-premises architecture] all state is in the compute layer
      • relies on small, highly stable and homogenous clusters with dedicated hardware for Tier-1 performance
      • {downside} expensive
      • {downside} hard to maintain
      • {downside} limited scalability
        • cluster capacity is bounded by machine sizes because of the fixed topology
  • {concept}[stateful architecture
    • the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]
      • ⇒ when a compute node fails, the state of non-committed transactions is lost [1] 
        •  ⇒ the in-flight transactions fail as well [1]
    • often also couples metadata describing data distributions and mappings to compute nodes [1] 
      • ⇒ a compute node effectively owns responsibility for processing a subset of the data [1] 
        • its ownership cannot be transferred without a cluster restart [1]
    • {downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]
  • {concept} [stateless compute architecture
    • requires that compute nodes hold no state information [1]
      • ⇒ all data, transactional logs and metadata need to be externalized [1]
    • {benefit} allows applications to 
      • partially restart the execution of queries in the event of compute node failures [1] 
      • adapt to online changes of the cluster topology without failing in-flight transactions [1] 
    • caches need to be as close to the compute as possible [1] 
      • since they can be lazily reconstructed from persisted data they don’t necessarily need to be decoupled from compute [1] 
        • the coupling of caches and compute does not make the architecture stateful [1] 
  • {concept} [cloud] decoupling of compute and storage
    • provides more flexible resource scaling
      • the 2 layers can scale up and down independently adapting to user needs [1] 
      • customers pay for the compute needed to query a working subset of the data [1] 
    • is not the same as decoupling compute and state [1] 
      • if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1] 
Acronyms:
ADLS - Azure Data Lake Storage
CRUD - Create, Read, Update, Delete
DCP - distributed computation platform 
DFS - Distributed File System
DWH - data warehouse
HDFS - Hadoop DFS
SI - Semantic Isolation 
SSD - Solid-State Drive

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12)  (link)
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions (link)
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive (link)
[4] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)

29 March 2024

🗄️🗒️Data Management: Data [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 29-Mar-2024

Data

  • {definition} raw, unrelated numbers or entries that represent facts, concepts,  events,  and/or associations 
  • categorized by
    • domain
      • {type} transactional data
      • {type} master data
      • {type} configuration data
        • {subtype}hierarchical data
        • {subtype} reference data
        • {subtype} setup data
        • {subtype} policy
      • {type} analytical data
        • {subtype} measurements
        • {subtype} metrics
        • {subtype} 
    • structuredness
      • {type} structured data
      • {type} semi-structured data
      • {type} unstructured data
    • statistical usage as variable
      • {type} categorical data (aka qualitative data)
        • {subtype} nominal data
        • {subtype} ordinal data
        • {subtype} binary data
      • {type} numerical data (aka quantitative data)
        • {subtype} discrete data
        • {subtype} continuous data
    • size
      • {type} small data
      • {type} big data
  • {concept} transactional data
    • {definition} data that describe business transactions and/or events
    • supports the daily operations of an organization
    • commonly refers to data created and updated within operational systems
    • support applications that automated key business processes 
    • usually stored in normalized tables
  • {concept} master data
    • {definition}"data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]
      • the key business entities on which transaction are executed
    • the dimensions around on which analysis is conducted
      • used to categorize, evaluate and aggregate transactional data
    • can be shared across more than one transactional applications
    • there are master data similar to most organizations, but also master data specific to certain industries 
    • often appear in more than one area within the business
    • represent one version of the truth
    • can be further divided into specialized subsets
    • {concept} master data entity
      • core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies 
      • may be classified within a hierarchy
        • the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways
          • e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization
    • do not change as frequent like transactional data
      • less volatile than transactional data 
      • there are master data that don’t change at all 
        • e.g. geographic locations
    • strategic asset of the business 
    • needs to be managed with the same diligence as other strategic assets
  • {concept} metadata 
    • {definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]
      • data about data
    • refers to 
      • database schemas for OLAP & OLTP systems
      • XML document schemas
      • report definitions
      • additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
      • application configuration data
  • {concept} analytical data
    • {definition} data that supports analytical activities 
      • e.g. decision making, reporting queries and analysis 
    • comprises
      • numerical values
      • metrics
      • measurements
    • stored in OLAP repositories 
      • optimized for decision support 
      • enterprise data warehouses
      • departmental data marts
      • within table structures designed to support aggregation, queries and data mining 
  • {concept} hierarchical data 
    • {definition} data that reflects a hierarchy 
      • relationships between data are represented in hierarchies
    • typically appears in analytical applications
    • {concept} hierarchy
      • "a classification structure arranged in levels of detail from the broadest to the most detailed level" [2]
      • {concept} natural hierarchy
        • stem from domain-based attributes
        • represent an intrinsic structure of the dat
          • they are natural for the data
            • e.g. product taxonomy (categories/subcategories)
        • useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems
          • common way of analyzing data in OLAP applications
          • common way of filtering data in OLTP applications
      • {concept} explicit hierarchy
        • organize data according to business needs
        • entity members can be organized in any way
        • can be ragged
          • the hierarchy can end at different levels
      • {concept} derived hierarchy
        • domain-based attributes form natural hierarchies 
        • relationships between entities must already exist in a model
        • can be recursive
  • {concept} structured data
    • {definition} "data that has a strict metadata defined"
  • {concept} unstructured data 
    • {definition} data that doesn't follow predefined metadata
    • involves all kinds of documents 
    • can appear in a database, in a file, or even in printed material
  • {concept} semi-structured data 
    • {definition} structured data stored within unstructured data,
    • data typically in XML form
      • XML is widely used for data exchange
    • can appear in stand-alone files or as part of a database (as a column in a table)
    • useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema
References:
[1] The Art of Service (2017) Master Data Management Course 
[2] DAMA International (2011) "The DAMA Dictionary of Data Management", 
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.