
17 March 2025

🏭🗒️Microsoft Fabric: Z-Order [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information, please consult the documentation (see the links below)!

Last updated: 17-Mar-2025

[Microsoft Fabric] Z-Order
  • {def} technique to collocate related information in the same set of files [2]
    • ⇐ reorganizes the layout of each data file so that similar column values are strategically collocated near one another for maximum efficiency [1]
    • {benefit} efficient query performance
      • reduces the amount of data to read [2] for certain queries
        • when the data is appropriately ordered, more files can be skipped [3]
        • particularly important for the ordering of multiple columns [3]
    • {benefit} data skipping
      • automatically skips irrelevant data, further enhancing query speeds
        • via data-skipping algorithms [2]
    • {benefit} flexibility
      • can be applied to multiple columns, making it versatile for various data schemas
    • aims to produce evenly-balanced data files with respect to the number of tuples
      • ⇐ but not necessarily data size on disk [2]
        • ⇐ the two measures are most often correlated [2]
          • ⇐ but there can be situations when that is not the case, leading to skew in optimize task times [2]
    • via the ZORDER BY clause (see the sketch after this outline)
      • applicable to columns with high cardinality commonly used in query predicates [2]
      • multiple columns can be specified as a comma-separated list
        • {warning} the effectiveness of the locality drops with each extra column [2]
          • has tradeoffs
            • it’s important to analyze query patterns and select the right columns when Z-Ordering data [3]
        • {warning} using columns that do not have statistics collected on them is ineffective and wastes resources [2]
          • statistics collection can be configured on certain columns by reordering columns in the schema, or by increasing the number of columns to collect statistics on [2]
      • {characteristic} not idempotent
        • every time it is executed, it tries to create a new clustering of data across all files in a partition [2]
          • it includes new and existing files that were part of previous Z-Ordering [2]
      • compatible with V-Order
    • {concept} [Databricks] liquid clustering (see the sketch after this outline)
      • replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance [4] [6]
        • ⇐ not compatible with table partitioning or ZORDER [4] [6]
      • tables created with liquid clustering enabled have numerous Delta table features enabled at creation [4] [6]
      • provides flexibility to redefine clustering keys without rewriting existing data [4] [6]
        • ⇒ allows data layout to evolve alongside analytic needs over time [4] [6]
      • applies to 
        • streaming tables 
        • materialized views
      • {scenario} tables often filtered by high cardinality columns [4] [6]
      • {scenario} tables with significant skew in data distribution [4] [6]
      • {scenario} tables that grow quickly and require maintenance and tuning effort [4] [6]
      • {scenario} tables with concurrent write requirements [4] [6]
      • {scenario} tables with access patterns that change over time [4] [6]
      • {scenario} tables where a typical partition key could leave the table with too many or too few partitions [4] [6]
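
The following is a minimal Spark SQL sketch of both techniques, using a hypothetical sales table and hypothetical column names. The ZORDER BY syntax follows the Delta Lake documentation [2], and the CLUSTER BY syntax follows [4] [6]; verify both against the current documentation before use.

-- Z-Order: rewrite the table's files so that rows with similar
-- CustomerId/OrderDate values are collocated in the same files
OPTIMIZE sales
ZORDER BY (CustomerId, OrderDate);

-- the rewrite can be restricted via a predicate on the partition
-- column (here assumed to be OrderYear)
OPTIMIZE sales
WHERE OrderYear = 2024
ZORDER BY (CustomerId);

-- liquid clustering: declared at table creation instead of
-- PARTITIONED BY / ZORDER, with which it is not compatible
CREATE TABLE sales_lc (
    CustomerId BIGINT,
    OrderDate  DATE,
    Amount     DECIMAL(18,2)
) USING DELTA
CLUSTER BY (CustomerId);

-- clustering keys can be redefined without rewriting existing data
ALTER TABLE sales_lc CLUSTER BY (OrderDate);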

    References:
    [1] Bennie Haelen & Dan Davis (2024) Delta Lake Up & Running: Modern Data Lakehouse Architectures with Delta Lake
    [2] Delta Lake (2023) Optimizations [link]
    [3] Delta Lake (2023) Delta Lake Z Order, by Matthew Powers [link]
    [4] Delta Lake (2025) Use liquid clustering for Delta tables [link]
    [5] Databricks (2025) Delta Lake table format interoperability [link]
    [6] Microsoft Learn (2025) Use liquid clustering for Delta tables [link]

    Resources:
    [R1] Azure Guru (2024) Z Order in Delta Lake - Part 1 [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    MF - Microsoft Fabric 

    🏭🗒️Microsoft Fabric: V-Order [Notes]

    Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information, please consult the documentation (see the links below)!

    Last updated: 17-Mar-2025

    [Microsoft Fabric] V-Order
    • {def} write-time optimization to the Parquet file format that enables fast reads under the MF compute engine [2]
      • all parquet engines can read the files as regular parquet files [2]
      • results in smaller and therefore faster-to-read files [5]
        • {benefit} improves read performance 
        • {benefit} decreases storage requirements
        • {benefit} optimizes resources' usage
          • reduces the compute resources required for reading data
            • e.g. network bandwidth, disk I/O, CPU usage
      • still conforms to the open-source Parquet file format [5]
        • they can be read by non-Fabric tools [5]
      • delta tables created and loaded by Fabric items automatically apply V-Order
        • e.g. data pipelines, dataflows, notebooks [5]
      • Delta tables and their features are orthogonal to V-Order [2]
        • e.g. Z-Order, compaction, vacuum, time travel
        • table properties and optimization commands can be used to control the V-Order of the partitions [2] (see the sketch after this outline)
      • compatible with Z-Order [2]
      • not all files have this optimization applied [5]
        • e.g. Parquet files uploaded to a Fabric lakehouse, or files that are referenced by a shortcut
        • ⇐ the files can still be read, though the read performance likely won't be as fast as for an equivalent Parquet file that's had V-Order applied [5]
      • required by certain features
        • [hash encoding] to assign a numeric identifier to each unique value contained in the column [5]
      • {command} OPTIMIZE 
        • optimizes a Delta table to coalesce smaller files into larger ones [5]
        • can apply V-Order to compact and rewrite the Parquet files [5]
    • [warehouse] 
      • works by applying certain operations on Parquet files
        • special sorting
        • row group distribution
        • dictionary encoding
        • compression 
      • enabled by default
      • ⇒ compute engines require less network, disk, and CPU resources to read data from storage [1]
        • provides cost efficiency and performance [1]
          • the effect of V-Order on performance can vary depending on tables' schemas, data volumes, query, and ingestion patterns [1]
        • fully compliant with the open-source Parquet format [1]
          • ⇐ all parquet engines can read it as regular parquet files [1]
      • required by certain features
        • [Direct Lake mode] depends on V-Order
      • {operation} disable V-Order
        • causes any new Parquet files produced by the warehouse engine to be created without V-Order optimization [3]
        • irreversible operation
          • once disabled, it cannot be enabled again [3]
        • {scenario} write-intensive warehouses
          • warehouses dedicated to staging data as part of a data ingestion process [1]
        • {warning} consider the effect of V-Order on performance before deciding to disable it [1]
          • {recommendation} test how V-Order affects the performance of data ingestion and queries before deciding to disable it [1]
        • via ALTER DATABASE CURRENT SET VORDER = OFF; [3]
      • {operation} check current status
        • via SELECT name, is_vorder_enabled FROM sys.databases; [post]
    • {feature} [lakehouse] Load to Table
      • allows loading a single file or a folder of files into a table [6]
      • tables are always loaded using the Delta Lake table format with V-Order optimization enabled [6]
    • [Direct Lake semantic model] 
      • data is prepared for fast loading into memory [5]
        • makes less demands on capacity resources [5]
        • results in faster query performance [5]
          • because less memory needs to be scanned [5]
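
    The following is a minimal sketch of the levers described above, using a hypothetical sales table. The Spark session property and the Delta table property are the ones named in [2] and may change between Fabric runtimes; the warehouse commands are the ones from [3]. Verify all of them against the current documentation before use.

    -- [Spark] enable/disable V-Order writing for the current session
    SET spark.sql.parquet.vorder.enabled = TRUE;

    -- [Spark] control V-Order per table via a Delta table property
    ALTER TABLE sales SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true');

    -- [Spark] compact small files and rewrite them with V-Order applied
    OPTIMIZE sales VORDER;

    -- [warehouse] stop producing V-Ordered files (irreversible) [3]
    ALTER DATABASE CURRENT SET VORDER = OFF;

    -- [warehouse] check the current status
    SELECT name, is_vorder_enabled FROM sys.databases;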

    References:
    [1] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]
    [2] Microsoft Learn (2024) Delta Lake table optimization and V-Order [link]
    [3] Microsoft Learn (2024) Disable V-Order on Warehouse in Microsoft Fabric [link]
    [4] Miles Cole (2024) To V-Order or Not: Making the Case for Selective Use of V-Order in Fabric Spark [link]
    [5] Microsoft Learn (2024) Understand storage for Direct Lake semantic models [link]
    [6] Microsoft Learn (2025) Fabric: Load to Delta Lake table [link]

    Resources:
    [R1] Serverless.SQL (2024) Performance Analysis of V-Ordering in Fabric Warehouse: On or Off?, by Andy Cutler [link]
    [R2] Redgate (2023) Microsoft Fabric: Checking and Fixing Tables V-Order Optimization, by Dennes Torres [link]
    [R3] Sandeep Pawar (2023) Checking If Delta Table in Fabric is V-order Optimized [link]
    [R4] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    MF - Microsoft Fabric

    14 January 2025

    🧭Business Intelligence: Perspectives (Part XXII: Breaking Queries' Complexity)

    Business Intelligence Series

    Whether standalone or encapsulated in database objects, queries can become complex over time, and thus difficult to comprehend and maintain. One can reduce the cognitive load by identifying the aspects that support one’s intuition - order, similarity and origin being probably the most important aspects that help in coping with the inherent complexity.

    One should start with the table having the lowest level of detail, usually a transaction table that forms the backbone of a certain feature. For example, for Purchase Orders this could be, depending on the case, the distribution or the line level. If Invoices are added to the logic, and there can be multiple invoice lines for a record from the former logic, then this becomes the new level of detail. Further on, if General Ledger lines are added, these most likely become the lowest level of detail, and so on.

    This approach allows one to keep a consistent way of building the queries while validating the record count, making sure that no duplicates are added by the logic (see the sketch below). Conversely, one can also start directly from the table with the lowest level of detail and add tables successively based on their level of detail, though the database engine may, depending on the case, generate a suboptimal plan. In addition, checking the cardinality may involve writing a second query.
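
    As a minimal illustration (all table and column names are hypothetical), the grain can be validated each time a table is added; if the two counts below diverge, the invoice lines have become the new level of detail:

    -- grain check after adding invoice lines to purchase order lines
    SELECT COUNT(*) AS RecordCount
         , COUNT(DISTINCT POL.PurchaseOrderLineId) AS BackboneCount
    FROM PurchaseOrderLines POL
         LEFT JOIN InvoiceLines INL
           ON POL.PurchaseOrderLineId = INL.PurchaseOrderLineId;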

    One should try to keep the tables belonging to the same entity together, when possible (e.g. Purchase Order vs. Vendor information). This reduces the volume of work required to manage, review, test and understand the query later. If the blocks are too big, then occasionally it makes sense to move pieces of logic into CTEs (Common Table Expressions), or, much better, into views that allow one to better encapsulate and partition the logic.

    CTEs allow splitting the logic into logical steps, occasionally making it possible to troubleshoot the logic piece by piece, though one should keep a balance between maintainability and detail. In extremis, people may unnecessarily create an additional CTE for each join. The longer and the more fragmented a query, the more difficult it becomes to troubleshoot and even understand. Readability is better achieved through indentation, by keeping together the things that belong together, respectively by partitioning the logic into logical blocks that derive from the model.

    Keeping things together should also be applied to the other elements. For example, the join constraints should be limited to the fields participating in the join, and, if possible, all the other constraints should be moved to the WHERE clause (see the sketch below). Bringing the join constraints into the WHERE clause, an approach used in the past, decreases query readability no matter how well the WHERE clause is structured, and occasionally can lead to constraints’ duplication or, worse, to missing pieces of logic.
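
    A short sketch of these conventions (all names hypothetical): the tables belonging to the Purchase Order entity are kept together in a CTE, the join constraints are limited to the join fields in the ON clauses, and the remaining constraints are moved to the WHERE clause:

    WITH Orders AS (                  -- Purchase Order entity kept together
        SELECT POH.PurchaseOrderId
             , POH.VendorId
             , POL.ProductId
             , POL.Quantity
        FROM PurchaseOrderHeaders POH
             JOIN PurchaseOrderLines POL
               ON POH.PurchaseOrderId = POL.PurchaseOrderId -- join fields only
    )
    SELECT ORD.PurchaseOrderId
         , VND.VendorName
         , ORD.ProductId
         , ORD.Quantity
    FROM Orders ORD
         JOIN Vendors VND
           ON ORD.VendorId = VND.VendorId
    WHERE ORD.Quantity > 0;           -- non-join constraints in WHERE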

    The order of the attributes should follow a logic that makes sense. One approach is to start with the attributes from the tables with the lowest cardinality, which identify entities uniquely, and move toward the attributes that have a higher level of detail. However, not all attributes are equally important, and thus one might have to compromise and create multiple groups of data coming from different levels.

    One should keep in mind that the more random the order of the attributes is, the more difficult it becomes to validate the data, as one needs to jump multiple times either in the query or in the mask. Ideally, one should find a balance between the two perspectives. Having an intuitive logic for how the attributes are listed increases queries’ readability, maintainability and ease of troubleshooting. The more random the attributes’ listing, the higher the effort for the same.


    26 December 2011

    📉Graphical Representation: Order (Just the Quotes)

    "Numerical facts, like other facts, are but the raw materials of knowledge, upon which our reasoning faculties must be exerted in order to draw forth the principles of nature. [...] Numerical precision is the soul of science [...]" (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

    "Tables are [...] the backbone of most statistical reports. They provide the basic substance and foundation on which conclusions can be based. They are considered valuable for the following reasons:" (1) Clarity - they present many items of data in an orderly and organized way." (2) Comprehension - they make it possible to compare many figures quickly." (3) Explicitness - they provide actual numbers which document data presented in accompanying text and charts." (4) Economy - they save space, and words." (5) Convenience - they offer easy and rapid access to desired items of information." (Peter H Selby, "Interpreting Graphs and Tables", 1976)

    "The preparation of well-designed graphics is both an art and a skill. There are many different ways to go about the task, and readers are urged to develop their own approaches. Graphics can be creative and fun. At the same time, they require a degree of orderly and systematic work." (Robert Lefferts, "Elements of Graphics: How to prepare charts and graphs for effective reports", 1981)

    "Unlike some art forms. good graphics should be as concrete, geometrical, and representational as possible. A rectangle should be drawn as a rectangle, leaving nothing to the reader's imagination about what you are trying to portray. The various lines and shapes used in a graphic chart should be arranged so that it appears to be balanced. This balance is a result of the placement of shapes and lines in an orderly fashion." (Robert Lefferts, "Elements of Graphics: How to prepare charts and graphs for effective reports", 1981)

    "[…] the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between pies. […] Given their low data-density and failure to order numbers along a visual dimension, pie charts should never be used." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

    "The time-series plot is the most frequently used form of graphic design. With one dimension marching along to the regular rhythm of seconds, minutes, hours, days, weeks, months, years, centuries, or millennia, the natural ordering of the time scale gives this design a strength and efficiency of interpretation found in no other graphic arrangement." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

    "We envision information in order to reason about, communicate, document, and preserve that knowledge - activities nearly always carried out on two-dimensional paper and computer screen. Escaping this flatland and enriching the density of data displays are the essential tasks of information design." (Edward R Tufte, "Envisioning Information", 1990)

    "Many of the applications of visualization in this book give the impression that data analysis consists of an orderly progression of exploratory graphs, fitting, and visualization of fits and residuals. Coherence of discussion and limited space necessitate a presentation that appears to imply this. Real life is usually quite different. There are blind alleys. There are mistaken actions. There are effects missed until the very end when some visualization saves the day. And worse, there is the possibility of the nearly unmentionable: missed effects." (William S Cleveland, "Visualizing Data", 1993)

    "Visual thinking can begin with the three basic shapes we all learned to draw before kindergarten: the triangle, the circle, and the square. The triangle encourages you to rank parts of a problem by priority. When drawn into a triangle, these parts are less likely to get out of order and take on more importance than they should. While the triangle ranks, the circle encloses and can be used to include and/or exclude. Some problems have to be enclosed to be managed. Finally, the square serves as a versatile problem-solving tool. By assigning it attributes along its sides or corners, we can suddenly give a vague issue a specific place to live and to move about." (Terry Richey, "The Marketer's Visual Tool Kit", 1994)

    "Often many tracings are shown together. Extraneous parts of the tracings must be eliminated and relevant tracings should be placed in a logical order. Repetitious labels should be eliminated and labels added that will fully clarify your information." (Mary H Briscoe, "Preparing Scientific Illustrations: A guide to better posters, presentations, and publications" 2nd ed., 1995)

    "Averages, ranges, and histograms all obscure the time-order for the data. If the time-order for the data shows some sort of definite pattern, then the obscuring of this pattern by the use of averages, ranges, or histograms can mislead the user. Since all data occur in time, virtually all data will have a time-order. In some cases this time-order is the essential context which must be preserved in the presentation." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

    "The acquisition of information is a flow from noise to order - a process converting entropy to redundancy. During this process, the amount of information decreases but is compensated by constant re-coding. In the recoding the amount of information per unit increases by means of a new symbol which represents the total amount of the old. The maturing thus implies information condensation. Simultaneously, the redundance decreases, which render the information more difficult to interpret." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

    "A bar graph typically presents either averages or frequencies. It is relatively simple to present raw data" (in the form of dot plots or box plots). Such plots provide much more information. and they are closer to the original data. If the bar graph categories are linked in some way - for example, doses of treatments - then a line graph will be much more informative. Very complicated bar graphs containing adjacent bars are very difficult to grasp. If the bar graph represents frequencies. and the abscissa values can be ordered, then a line graph will be much more informative and will have substantially reduced chart junk." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

    "A useful feature of a stem plot is that the values maintain their natural order, while at the same time they are laid out in a way that emphasises the overall distribution of where the values are concentrated (that is, where the longer branches are). This enables you easily to pick out key values such as the median and quartiles." (Alan Graham, "Developing Thinking in Statistics", 2006)

    "Where there is no natural ordering to the categories it can be helpful to order them by size, as this can help you to pick out any patterns or compare the relative frequencies across groups. As it can be difficult to discern immediately the numbers represented in each of the categories it is good practice to include the number of observations on which the chart is based, together with the percentages in each category." (Jenny Freeman et al, "How to Display Data", 2008)

    "[...] without conscious effort, the brain always tries to close the distance between observed phenomena and knowledge or wisdom that can help us survive. This is what cognition means. The role of an information architect is to anticipate this process and generate order before people’s brains try to do it on their own." (Alberto Cairo, "The Functional Art", 2011)

    "A viewer’s eye must be guided to 'read' the elements in a logical order. The design of an exploratory graphic needs to allow for the additional component of discovery - guiding the viewer to first understand the overall concept and then engage her to further explore the supporting information." (Felice C Frankel & Angela H DePace, "Visual Strategies", 2012)

    "With further similarities to small multiples, heatmaps enable us to perform rapid pattern matching to detect the order and hierarchy of different quantitative values across a matrix of categorical combinations. The use of a color scheme with decreasing saturation or increasing lightness helps create the sense of data magnitude ranking." (Andy Kirk, "Data Visualization: A successful design process", 2012)

    "Compared to the rainbow colormap, the heat map uses a smaller set of hues, but adds luminance as a way to order colors in an intuitive manner. Compared to the two-hue colormap, the heat map uses more hues, thus allowing one to discriminate between more data values." (Alexandru Telea, "Data Visualization: Principles and Practice" 2nd Ed., 2015)

    "Will you be encountering each other for the first time through this communication, or do you have an established relationship? Do they already trust you as an expert, or do you need to work to establish credibility? These are important considerations when it comes to determining how to structure your communication and whether and when to use data, and may impact the order and flow of the overall story you aim to tell." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

    "Complementary colors send a message of opposition but also of balance. A chart with saturated complementary colors is an aggressively colored chart in which the colors fight (equally) for their share of attention. Apply this rule when you intend to represent very distinct variables or those that for some reason you want to show as contrasting each other. Do not use complementary colors when variables have some form of continuity or order." (Jorge Camões, "Data at Work: Best practices for creating effective charts and information graphics in Microsoft Excel", 2016)

    "Ranks do not explain how much one item varies from another. Ranked data is ordinal; that is, the data is categorical and has a sequence (e.g., who finished the race first, second, and third). That’s it! Ranked data can be used for showing the order of the data points. […] When working with ranked data, you cannot make inferences about the variance in the data; all you can say with certainty is which item is ranked higher than the others, not how much higher." (Andy Kriebel & Eva Murray, "#MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time", 2018)

    "Beyond the design of individual charts, the sequence of data visualizations creates grammar within the exposition. Cohesive visualizations follow common narrative structures to fully express their message. Order matters. " (Vidya Setlur & Bridget Cogley, "Functional Aesthetics for data visualization", 2022)
