22 February 2025

🧩IT: The Annotated Laws that Govern IT Professionals' Lives - Part I

"A bad idea executed to perfection is still a bad idea." (Norman R Augustine) [Augustine's Law]

"Bad code executed by powerful machines is still bad code." [sql-troubles]

"A great many problems do not have accurate answers, but do have approximate answers, from which sensible decisions can be made." (Berkeley's Law)

"It's easier to take/sell approximations as accurate answers than to find accurate answers. In time people will see no difference in between." [sql-troubles]

"About the time you finish doing something, you know enough to start." (James C Kinser) [Kinser's Law]

"By the time you finish something, the problem changed." [sql-troubles]

"People will more likely repeat their known mistakes than trying something new." [sql-troubles]

"The ofter a method failed, the higher the chances for it to succeed when used by somebody else." [sql-troubles]

"People tend to reuse a method that previously failed (multiple times) than try something new." [sql-troubles]

"By the time we start something, somebody else solved already the problem." [sql-troubles]

"Adding manpower to a late software project makes it later." (Fred P Brooks, "The Mythical Man-Month: Essays", 1975) [Brook's Law]

"Adding manpower seldom solves a problem that requires intelligent effort." [sql-troubles]

"The easiest way to make a project on time is to to move the deadline as suited." [sql-troubles]

"An object will fall so as to do the most damage." [Law of selective gravity]

"A bug will appear to do the most damage." [sql-troubles]

"Anything can be made to work if you fiddle with it long enough." (Wyszkowski's second law)
"Some problems do require infinite time." [sql-troubles]

"Build a system that even a fool can use, and only a fool will want to use it." [Shaw's principle]

"Doing it the hard way is always easier." (Murphy's paradox)

"Doing it the easy way is always harder." [sql-troubles]

"Don't force it - get a bigger hammer." [Anthony's law of force]

"Don't optimize it, get a more powerful machine." [sql-troubles]

"Every solution breeds new problems." [Murphy's laws]

"Every new problem multiplies the possible solutions." [sql-troubles]

"It's easier to change the problem to fit the solution." [sql-troubles]

"Everyone has a scheme that will not work." [Howe's law]

"Any scheme can work by accident." [sql-troubles]

"It takes more than an accident for a scheme to work." [sql-troubles]

"Everything goes wrong all at once." (Quantized revision of Murphy's law)

"Small events converge toward bigger events." [sql-troubles]

"Things already went wrong before we observe them as such." [sql-troubles]

"If a problem causes many meetings, the meetings eventually become more important than the problem." (Arthur Bloch, "Murphy's Law (Price/Stern/Sloan", 1977) (Hendrickson’s Law)

"More meetings tend to create more problems." [sql-troubles]

 "Fewer meetings tend to create more problems." [sql-troubles]

"If a project is not worth doing at all, it's not worth doing well." (Gordon's first law)

"The more a project is not worth doing, the more attention will attract."  [sql-troubles]

"If an experiment works, something has gone wrong." [Finagle's first law]

"If anything can go wrong, it will." [Murphy's laws]

"Things go wrong at a faster pace than one can find solutions." [sql-troubles]

"If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it." [Murphy's Laws]

"It's enough one way, for things to result in catastrophes." [sql-troubles]

"Sometimes it's better to do nothing than make things worse." [sql-troubles]

"Once all the known wrong solutions were exhausted, one discovers a new wrong solution." [sql-troubles]

"If they know nothing of what you are doing, they suspect you are doing nothing." (Robert J Graham et al, "The Complete Idiot's Guide to Project Management", 2007)  [Graham's Law]

"People are good at ignoring the obvious." [sql-troubles]

"The more one explains, the more one is misunderstood." [sql-troubles] 

"If you mess with a thing long enough, it'll break." [Schmidt's law]

"Things break by design." [sql-troubles]

"One can learn to break things, by simply playing with them." [sql-troubles] 

"It's easier to break than design things. One can find thousands ways on how to break the same thing." [sql-troubles] 

"In any collection of data, the figure most obviously correct, beyond all need of checking, is the mistake." (Finagle's third law)

"In any collection of data there's at least a mistake." [sql-troubles]

"In any given set of circumstances, the proper course of action is determined by subsequent events." [McDonald's corollary to Murphy's laws]

"In crises that force people to choose among alternative courses of action, most people will choose the worst one possible." (Rudin's law)

"People go wrong with confidence." [sql-troubles]

"The more alternatives, the higher the chances to go wrong." [sql-troubles] 

"Information necessitating a change of design will be conveyed to the designer after - and only after - the plans are complete." [First law of revision:]

"In simple cases, presenting one obvious right way versus one obvious wrong way, it is often wiser to choose the wrong way so as to expedite subsequent revision." (First corollary

"The designer will get ahead of the design." [sql-troubles] 

"It is impossible to make anything foolproof because fools are so ingenious." (Murphy's second corollary)

"It works better if you plug it in." (Sattinger's law)

"It works longer if you don't plug it in." [sql-troubles]

"It's not a question of IF the car will break down, but WHEN it will break down." (Murphy's theory of automobiles)

"It's not a question of IF a program will break down, but when the code will break down." [sql-troubles]

"The longer a program runs smoothly, the higher the chances that will break down soon." [sql-troubles]

"Left to themselves, things tend to go from bad to worse." (Murphy's first corollary)

"The more on tries to fix things, the faster everything goes worse." [sql-troubles]

"Logic is a systematic method of coming to the wrong conclusion with confidence." (Manly's maxim)

 "One doesn't need logic to arrive at the right conclusion." [sql-troubles]

"Matter will be damaged in direct proportion to its value." (Murphy's constant)

"Most problems have either many answers or no answer. Only a few problems have a single answer." [Berkeley's Law]

"It's better to have a multitude of approximate solutions than one correct solution." [sql-troubles]

"Negative expectations yield negative results. Positive expectations yield negative results." (Non-reciprocal law of expectations)

"Negative results yield when there are no expectations." [sql-troubles]

"No matter how many things have gone wrong, there remains at least one more thing that will go wrong." (Murphy's law of the infinite)

"Things can go wrong in a multitude of ways." [sql-troubles]

"No matter how minor the job is, it's still over $50." (Murphy's rule of auto repair)

"No matter what the experiment's result, there will always be someone eager to: (i) misinterpret it, (ii) fake it, or (c) believe it supports his own pet theory." (Finagle's second law)
"It's easier to fake the experiment to get the right results." [sql-troubles]
"Nothing ever goes away." (Commoner's second law of ecology)
"Things do go away, but tend to come back." [sql-troubles]

"Nothing is as easy as it looks." (Murphy's first corollary)

"All things look simple until one dives deeper." [sql-troubles]

"Nothing is ever so bad that it can't get worse." (Gattuso's extension of Murphy's Law)

"Once a job is fouled up, anything done to improve it only makes it worse." (Finagle's fourth law)

"Once a mistake is corrected, a second mistake will become apparent." (Murphy's law of revision)

"Correcting mistakes introduces other mistakes." [sql-troubles]

"The chief cause of problems is solutions." [Sevareid's Law]

"The more time you spend in reporting on what you are doing, the less time you have to do anything. Stability is achieved when you spend all your time doing nothing but reporting on the nothing you are doing." [Cohn's Law]

"Reporting increases the needs for more information. The less one reports, the lower the need for further information." [sql-troubles]

"The more innocuous the modification appears to be, the further its influence will extend and the more plans will have to be redrawn." [H B Fyfe's second law of revision]

"The only thing more costly than stretching the schedule of an established development program is accelerating it, which is itself the most costly action known to man." (Norman R Augustine, "Augustine's Laws", 1983) [Law of economic unipolarity]

"The other line moves faster." (Etorre's observation)

"The other team moves faster." [sql-troubles]

"If you change lines, the one you just left will start to move faster than the one you are now in." (O'Brien's variation

"If you change a line, the whole codes breaks." [sql-troubles]

"The longer you wait in line, the greater the likelihood that you are in the wrong line." (The Queue Principal)

"The longer you wait for a deliverable, the greater the likelihood that it contains bugs." [sql-troubles]

"The perceived usefulness of an article is inversely proportional to its actual usefulness once bought and paid for." (Glatum's law of materialistic acquisitiveness)

"The probability of anything happening is in inverse ratio to its desirability." (Gumperson's law)

"The solution to a problem changes the problem." [Peers's Law]

"A problem to a solution changes thr solution." [sql-troubles]

"The tasks to do immediately are the minor ones; otherwise, you’ll forget them. The major ones are often better to defer. They usually need more time for reflection. Besides, if you forget them, they’ll remind you." [Wolf ’s Law of Management]

"There are two states to any large project: Too early to tell and too late to stop." (Ernest Fitzgerald) [Fitzgerald's Law]

"There is a solution to every problem; the only difficulty is finding it." [Evvie Nef's Law]

"There is a solution to every problem we are not trying to solve." [sql-troubles]

"Finding problems is easier than finding solutions." [sql-troubles]

"One stumbles upon the same problen twice." [sql-troubles]

"There is no mechanical problem so difficult that it cannot be solved by brute strength and ignorance. [William's Law]

"There's no software problem so difficult that can't be solved by brute force and ignorance." [sql-troubles]

"There's always one more bug." (Lubarsky's law of cybernetic entomology)

"Software solutions diverge to a set of bugs." [sql-troubles

"Things get worse under pressure." [Murphy's law of thermodynamics]

"Things get worse also without pressure." [sql-troubles]

"Things go right gradually, but things go wrong all at once." (Murphy's asymmetry principle)

"Tolerances will accumulate unidirectionally toward maximum difficulty of assembly. (Klipstein's law)

"Two wrongs are only the beginning." (Kohn's corollary to Murphy's law)

"One wrong can be the beginning of another." [sql-troubles]

"When all else fails, read the instructions." [Cahn's axiom]

"Even if you read the instructions, things fall." [sql-troubles]

"When an error has been detected and corrected, it will be found to have been correct in the first place." [Scott's second law]

"Any two related problems may look the same when regarded from same perspective." [sql-troubles]

"When in doubt, use a bigger hammer." [Dobbins’ Law]

"When taking something apart to fix a minor malfunction, you will cause a major malfunction." (Murphy's second law of construction)

"Whenever you set out to do something, something else must be done first." (Murphy's sixth corollary)

"While the difficulties and dangers of problems tend to increase at a geometric rate, the knowledge and manpower qualified to deal with these problems tend to increase linearly." [Dror's First Law]

"Beyond a point, the problems are so complex that people can't differentiate between geometric and linear rates." [sql-troubles]


21 February 2025

🧩IT: Idioms, Sayings, Proverbs and Other Words of Wisdom

In IT setups one can hear many idioms, sayings and other types of words of wisdom that make the audience smile, even if some words seem to rub salt into the wounds. These are some of the idioms met in IT meetings or literature. Frankly, it's worth writing more about each of them, and this is the purpose of the "project".

"A bad excuse is better than none"

"A bird in the hand is worth two in the bush": a working solution is worth more than hypothetically better solutions. 

"A drowning man will clutch at a straw": a drowning organization will clutch to the latest hope

"A friend in need (is a friend indeed)": 

"A journey of a thousand miles begins with a single step"

"A little learning is a dangerous thing"

"A nail keeps a shoe, a shoe a horse, a horse a man, a man a castle" (cca 1610): A nail keeps the shoe

"A picture is worth a thousand words"

"A stitch in time (saves nine)"

"Actions speak louder than words"

"All good things must come to an end"

"All generalizations are false" [attributed to Mark Twain, Alexandre Dumas (Père)]: Cutting though Complexity

"All the world's a stage, And all [...] merely players": A look forward

"All roads lead to Rome"

"All is well that ends well"

"An ounce of prevention is worth a pound of cure"

"Another day, another dollar"

"As you sow so shall you reap"

"Beauty is in the eye of the beholder"

"Better late than never": SQL Server and Excel Data

"Better safe than sorry": Deleting obsolete companies

"Big fish eat little fish"

"Better the Devil you know (than the Devil you do not)": 

"Calm seas never made a good sailor"

"Count your blessings"

"Dead men tell no tales"

"Do not bite the hand that feeds you"

"Do not change horses in midstream"

"Do not count your chickens before they are hatched"

"Do not cross the bridge till you come to it"

"Do not judge a book by its cover"

"Do not meet troubles half-way"

"Do not put all your eggs in one basket"

"Do not put the cart before the horse"

"Do not try to rush things; ignore matters of minor advantage" (Confucius): A tale of two cities II

"Do not try to walk before you can crawl"

"Doubt is the beginning, not the end, of wisdom"

"Easier said than done"

"Every cloud has a silver lining"

"Every little bit helps"

"Every picture tells a story"

"Failing to plan is planning to fail"Planning correctly misunderstood...

"Faith will move mountains"

"Fake it till you make it"

"Fight fire with fire"

"First impressions are the most lasting"

"First things first": Ways of looking at data

"Fish always rots from the head downwards"

"Fools rush in (where angels fear to tread)" (Alexander Pope, "An Essay on Criticism", cca. 1711): A tale of two cities II

"Half a loaf is better than no bread"

"Haste makes waste"

"History repeats itself"

"Hope for the best, and prepare for the worst"

"If anything can go wrong, it will" (Murphy's law)

"If it ain't broke, don't fix it.": Approaching a query

"If you play with fire, you will get burned"

"If you want a thing done well, do it yourself"

"Ignorance is bliss"

"Imitation is the sincerest form of flattery"

"It ain't over till/until it's over"

"It is a small world"

"It is better to light a candle than curse the darkness"

"It is never too late": A look backAll-knowing developers are back...

"It's a bad plan that admits of no modification." (Publilius Syrus)Planning Correctly Misunderstood I

"It’s not an adventure until something goes wrong." (Yvon Chouinard)Documentation - Lessons learned

"It is not enough to learn how to ride, you must also learn how to fall"

"It takes a whole village to raise a child"

"It will come back and haunt you"

"Judge not, that ye be not judged"

"Kill two birds with one stone"

"Knowledge is power, guard it well"

"Learn a language, and you will avoid a war" (Arab proverb)

"Less is more"

"Life is what you make it"

"Many hands make light work"

"Moderation in all things"

"Money talks"

"More haste, less speed"

"Necessity is the mother of invention"

"Never judge a book by its cover"

"Never say never"

"Never too old to learn"

"No man can serve two masters"

"No pain, no gain"

"No plan ever survived contact with the enemy.' (Carl von Clausewitz)Planning Correctly Misunderstood I

"Oil and water do not mix"

"One-man show": series

"One man's trash is another man's treasure"

"One swallow does not make a summer"

"Only time will tell": The Software Quality Perspective and AI, Microsoft FabricIt’s all about Partnership IIAccess vs. LightSwitch

"Patience is a virtue"

"Poke the bear": Mea Culpa - A Look Forward

"Practice makes perfect"

"Practice what you preach"

"Prevention is better than cure"

"Rules were made to be broken"

"Seek and ye shall find"

"Some are more equal than others" (George Orwell, "Animal Farm")

"Spoken words fly away, written words remain." ["Verba volant, scripta manent"]: Documentation - Lessons learned

"Strike while the iron is hot"

"Technology is dead": Dashboards Are Dead & Other Crapprogramming is dead

"The best defense is a good offense"

"The bets are off":  A look forward

"The bigger they are, the harder they fall"

"The devil is in the detail": Copilot Stories Part IV, Cutting through ComplexityMore on SQL DatabasesThe Analytics MarathonThe Choice of Tools in PM, Who Messed with My Data?

"The die is cast"

"The exception which proves the rule"

"The longest journey starts with a single step"

"The pursuit of perfection is a fool's errand"

"There are two sides to every question"

"There is no smoke without fire"

"There's more than one way to skin a cat" (cca. 1600s)

"There is no I in team"

"There is safety in numbers"

"Those who do not learn from history are doomed to repeat it" (George Santayana)

"Time is money"

"To learn a language is to have one more window from which to look at the world" (Chinese proverb)[5

"Too little, too late"

"Too much of a good thing"

"Truth is stranger than fiction"

"Two birds with one stone": Deleting sequential data...

"Two heads are better than one": Pair programming

"Two wrongs (do not) make a right"

"United we stand, divided we fall"

"Use it or lose it"

"Unity is strength"

"Variety is the spice of life." (William Cowper)

"Virtue is its own reward"

"Well begun is half done"

"What does not kill me makes me stronger"

"Well done is better than well said"

"What cannot be cured must be endured"

"What goes around, comes around"

"When life gives you lemons, make lemonade"

"When the cat is away, the mice will play"

"When the going gets tough, the tough get going"

"Where there is a will there is a way"

"With great power comes great responsibility"

"Work expands so as to fill the time available"

"You are never too old to learn": All-Knowing Developers are Back in Demand?

"You can lead a horse to water, but you cannot make it drink"

"You cannot make an omelet without breaking eggs"

"(You cannot) teach an old dog new tricks"

"You must believe and not doubt at all": Believe and not doubt

"Zeal without knowledge is fire without light"


References:
[1] Wikipedia (2024) List of proverbial phrases [link]

20 February 2025

💠🛠️🗒️SQL Server: Folding [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 20-Feb-2025

[SQL Server] Folding

  • {def} the process by which the optimizer is able to properly determine certain types of indexable expressions even when the column in the expression is involved in a subexpression or nestled in a function 
    • is an optimization over older versions of SQL Server in which the optimizer was unable to use an index to service a query clause when the table column was involved in an expression or buried in a function [1]
    • {type} constant folding 
      • some constant expression is evaluated early
      • foldable constant expressions [2]
        • arithmetical expressions 
        • logical expressions 
        • built-in functions whose input doesn’t depend on contextual information
          • e.g. SET options, language settings, database options, encryption keys
          • deterministic built-in functions are foldable, with some exceptions
        • certain forms of the LIKE predicate
        • [SQL Server 2012+] deterministic methods of CLR user-defined types [3]
        • [SQL Server 2012+] deterministic scalar-valued CLR user-defined functions [3]
      • nonfoldable expressions [2]
        • expressions whose results depend on a local variable or parameter
        • user-defined functions
          • both T-SQL and CLR
        • expressions whose results depend on language settings.
        • expressions whose results depend on SET options.
        • expressions whose results depend on server configuration options.
        • nonconstant expressions such as an expression whose result depends on the value of a column.
        • nondeterministic functions
        • if the output is a large object type, then the expressions are not folded 
          • e.g. text, image, nvarchar(max), varchar(max), varbinary(max), XML
      • {benefit} the expression does not have to be evaluated repeatedly at run time [2]
      • {benefit} the value of the expression after it is evaluated is used by the query optimizer to estimate the size of the result set of the portion of the query [2]
        • e.g. TotalDue > 117.00 + 1000.00 (see the sketch after these notes)
    • {type} nonconstant folding
      • some expressions that are not constant folded but whose arguments are known at compile time, whether the arguments are parameters or constants, are evaluated by the result-set size (cardinality) estimator that is part of the optimizer during optimization [2]
      • deterministic functions: 
        • e.g. UPPER, LOWER, RTRIM
        • e.g. DATEPART( YY only ), GetDate, CAST, CONVERT
      • operators 
        • arithmetic operators: +, -, *, /, unary -, 
        • logical operators: AND, OR, NOT
        • comparison operators: <, >, <=, >=, <>, LIKE, IS NULL, IS NOT NULL
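
A minimal T-SQL sketch of the constant-folding behavior described above; the table dbo.SalesOrderHeader and its columns are assumptions used only for illustration:

  -- foldable: 117.00 + 1000.00 is evaluated once at compile time, so the
  -- predicate behaves like TotalDue > 1117.00 and can use an index on TotalDue
  SELECT SalesOrderID, TotalDue
  FROM dbo.SalesOrderHeader
  WHERE TotalDue > 117.00 + 1000.00;

  -- not folded: the result depends on a local variable, so the expression
  -- is evaluated at run time rather than during compilation
  DECLARE @surcharge money = 1000.00;
  SELECT SalesOrderID, TotalDue
  FROM dbo.SalesOrderHeader
  WHERE TotalDue > 117.00 + @surcharge;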


    References:
    [1] Ken Henderson (2003) Guru's Guide to SQL Server Architecture and Internals
    [2] Microsoft Learn (2012) SQL Server: Troubleshooting Poor Query Performance: Constant Folding and Expression Evaluation During Cardinality Estimation [link]
    [3] Microsoft Learn (2025) SQL Server: Query processing architecture guide [link]
    [4] SQLShack (2021) Query Optimization in SQL Server for beginners, by Esat Erkec [link]

    💠🛠️🗒️SQL Server: Nulls [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

    Last updated: 20-Feb-2025

    [SQL Server] Null

    • {def} keyword that indicates that the value is unknown [1]
      • different from an empty or zero value [1]
      • no two null values are equal [1]
        • comparisons between two null values, or between a null value and any other value, return unknown because the value of each NULL is unknown [1]
      • indicates that the value is
        • unknown
        • not applicable
        • to be added later
        • ⇒ can't be used as information that is required to distinguish one row in a table from another row in a table [1]
    • can be assigned to a value by
      • explicitly stating NULL in an INSERT or UPDATE statement [1]
      • leaving a column out of an INSERT statement [1]
    • {recommendation} test for null values in queries 
      • via IS NULL or IS NOT NULL in the WHERE clause [1]
      • when present in data, logical and comparison operators can potentially return a third result of UNKNOWN instead of just TRUE or FALSE [1]
        •  ⇐ three-valued logic can be the source for many application errors [1]
      • ⇐ parameters and variables not explicitly initialized can cause problems in code
    • {recommendation} handle null values in logic
      • via ISNULL or COALESCE functions (see the sketch after these notes)
    • {constraint} [primary keys] if any of the columns considered in a primary key contain NULL values, the PRIMARY KEY constraint can’t be created [3]
    • {constraint} [UNIQUE constraint] allows the columns that make up the constraint to allow NULLs, but it doesn’t allow all key columns to be NULL for more than one row [3]
    • [data warehouse] nullability of columns
      • {best practice} define columns as NOT NULL when appropriate 
        • {benefit} helps the Query Optimizer 
        • {benefit} reduces in some cases the storage space required for the data
        • {benefit} allows SQL Server to avoid unnecessary encoding in columnstore indexes and during batch mode execution [2]
      • {example} [SQL Server 2000+] bigint column
        • when the value is defined as NOT NULL, the value fits into a single CPU register
          • ⇒ operations on the value can be performed more quickly
        • a nullable bigint column requires another, 65th bit to indicate NULL values
          • SQL Server avoids cross-register data storage by storing some of the row values (usually the highest or lowest values) in main memory using special markers to indicate it in the data that resides in the CPU cache [2]
            • ⇒ adds extra load during execution
      • {recommendation} avoid nullable columns in data warehouse environments [2]
        • ⇐ the recommendation can apply also to OLTP databases
          • there are database designs that enforce NOT NULL values for all attributes
            • e.g. Dynamics AX 2009/365 F&O
            • {benefit} eliminates the need to test for null values in legacy code
      • {recommendation} use CHECK and UNIQUE constraints or indexes when overhead introduced by constraints or unique indexes is acceptable [2]
      • {recommendation} consider using filtered indexes instead of normal indexes for columns with many null values
        • minimizes the waste of storage space
        • ⇐ understand the characteristics of the columns used in the queries [3]
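
    A short T-SQL sketch of the three-valued logic pitfalls and handling functions mentioned above; the table dbo.Customers and its columns are assumptions used only for illustration:

      -- '= NULL' evaluates to UNKNOWN for every row, so no rows are returned
      SELECT CustomerID FROM dbo.Customers WHERE Region = NULL;

      -- the correct test for missing values
      SELECT CustomerID FROM dbo.Customers WHERE Region IS NULL;

      -- replacing NULLs in output: ISNULL takes a single replacement value,
      -- while COALESCE returns the first non-NULL argument from a list
      SELECT CustomerID,
             ISNULL(Region, 'n/a') AS Region,
             COALESCE(Phone, Fax, 'no contact') AS Contact
      FROM dbo.Customers;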


    References:
    [1] Microsoft Learn (2024) SQL Server 2022: NULL and UNKNOWN (T-SQL)
    [2] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
    [3] Kalen Delaney et al (2013) Microsoft SQL Server 2012 Internals, Microsoft Press, ISBN: 978-0-7356-5856-1

    16 February 2025

    💠🛠️🗒️SQL Server: Columnstore Indexes [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

    Last updated: 15-Feb-2025

    [SQL Server] columnstore indexes (CI)

    • {def} a technology for storing, retrieving and managing data by using a columnar data format (aka columnstore)
      • store compressed data on a per-column rather than a per-row basis [5]
    • {benefit} designed for analytics and data warehousing workloads
      • data warehousing
        • {scenario} store fact tables and large dimension tables
          • ⇐ tend to require full table scans rather than table seeks
      • analytics workloads
        • {scenario} [SQL Server 2016 SP1] can be used for real-time analytics on operational databases
          • ⇐ an updatable nonclustered columnstore index can be created on a rowstore table
    • {benefit} performance increase 
      • can achieve up to 100x better performance [4]
      • offers an order of magnitude better performance than a rowstore index
        • {feature} uses batch mode execution
          • improves query performance typically by two to four times
        • have high performance gains for analytic queries that scan large amounts of data, especially on large tables (>1 million rows) 
    • {benefit} reduces significantly the data warehouse storage costs
      • {feature} data compression
        • ⇒ provides high compression rates, typically by 10 times
          • ⇒ reduces total I/O from the physical media
            • ⇐ queries often select only a few columns from a table
            • minimizes or eliminates system I/O bottlenecks
          • reduces significantly the memory footprint
            • ⇒ query performance can improve 
              • because SQL Server can perform more query and data operations in memory
    • {benefit} built in memory
      • ⇒ sufficient memory must be available 
    • {benefit} part of the database engine
      • no special hardware is needed
    • {concept} columnstore 
      • {def} data structure logically organized as a table with rows and columns, and physically stored in a column-wise data format
        • stores values from the same domain which commonly have similar values
          • when a query references a column, then only that column is fetched from disk [3]
            • ⇐ the columns not requested are skipped 
              • ⇒ they are not loaded into memory 
          • when a query is executed, the rows must be reconstructed
            • ⇒ row reconstruction takes some time and uses some CPU and memory resources [3]
          • [SQL Server 2016] columnstore index on rowstore tables
            • columnstore is updated when data changes in the rowstore table
          • both indexes work against the same data
    • {concept} rowstore
      • {def} data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format
        • ⇐ the traditional way to store relational table data
        • refers to a table where the underlying data storage format is either
          • a heap
          • a clustered index
          • a memory-optimized table
    • {concept} rowstore index
      • performs best on queries that seek into the data, when searching for a particular value, or for queries on a small range of values
        • ⇒ appropriate for transactional workloads 
          • because they tend to require mostly table seeks instead of table scans
    • {concept} rowgroup
      • {def} a group of rows that are compressed into columnstore format at the same time
        • {constraint} has a maximum number of rows per rowgroup, which is 1,048,576 (= 2^20) rows
        • contains one column segment for every column in the table
        • can have more than one delta rowgroup that form the deltastore
          • e.g. when multiple threads create columnstore indexes using parallel execution plans [5]
            • ⇐ each thread will work with its own subset of data, creating separate rowgroups [5]
          • [partitions] each table partition has its own set of row groups [5]
            • ⇐  too many partitions may prevent workloads from benefiting from a CCI [11]
              • ⇐ data aren’t pushed into a compressed columnstore segment until the rowgroup limit is reached
      • {event} rowgroup is compressed
        • marked as read-only [16]
        • a compressed rowgroup is considered as fragmented when either 
          • row number < rowgroup limit but dictionary size reached the maximum
            • nothing can be done to increase the number of rows [15]
            • the trim_reason is other than DICTIONARY_SIZE
          • it has nonzero deleted rows that exceeds a minimum threshold [15]
      • {event} all data from rowgroup deleted 
        • transitions from COMPRESSED into TOMBSTONE state
        • later removed by the tuple-mover background process
      • {event} rows in the columnstore indexes can be moved to different locations
        • row-ids in the nonclustered indexes aren’t updated
          • ⇐ the mappings between old and new row locations are stored in an internal structure (aka mapping index) 
      • {event} rowgroup build
        • all column data are combined on a per-row group basis, encoded and compressed [5]
          • the rows within a row group can be rearranged if that helps to achieve a better compression rate [5]
    • {feature} data compression
      • the table is sliced into rowgroups, and each rowgroup is compressed in a column-wise manner
        • the number of rows in the rowgroup must be 
          • large enough to improve compression rates
          • small enough to benefit from in-memory operations
            • having too many small rowgroups decreases columnstore index’s quality
      • uses its own compression mechanism 
        • ⇒ row or page compression cannot be used on it [3]
        • [SQL Server 2016] page compression has been removed
          • ⇐ in some cases, page compression disallowed the creation of columnstore indexes with a very large number of columns [5]
    • {feature} compression delay
      • computed when a delta rowgroup is closed [7]
      • keeps the ‘active’ rows in delta rowgroup and only transition these rows to compressed rowgroup after a specified delay [7]
        • ⇐ reduces the overall maintenance overhead of NCCI [7]
        • ⇒ leads to a larger number of delta rowgroups [7]
      • {best practice} if the workload is primarily inserting data and querying it, the default COMPRESSION_DELAY of 0 is the recommended option [7]
      • {best practice} [OLTP workload] if > 10% rows are marked deleted in recently compressed rowgroups, then consider a value that accommodates the behavior [7]
        • via: CREATE NONCLUSTERED COLUMNSTORE INDEX ... WITH (COMPRESSION_DELAY = 150)
    • {feature} data encoding
      • all values in the data are replaced with 64-bit integers using one of two encoding algorithms
      • {concept} dictionary encoding
        • stores distinct values from the data in a separate structure (aka dictionary)
          • every value in a dictionary has a unique ID assigned [5]
            • the ID is used for replacement
      • {concept} global dictionary
        • shared across all segments that belong to the same index partition [5]
      • {concept} local dictionary
        • created for individual segments using values that are not present in the global dictionary
      • {concept} value-based encoding
        • mainly used for numeric and integer data types that do not have enough duplicated values [5]
          • dictionary encoding would be inefficient [5]
        • converts integer and numeric values to a smaller range of 64-bit integers in 2 steps
          • {step} [numeric data types] are converted to integers using the minimum positive exponent (aka magnitude that allows this conversion) [5]
            • {goal} convert all numeric values to integers [5]
            • [integer data types] the smallest negative exponent is chosen that can be applied to all values without losing their precision [5]
              • {goal} reduce the interval between the minimum and maximum values stored in the segment [5]
          • {step} the minimum value (aka base value) in the segment is identified and subtracted it from all other values [5]
            • ⇒ makes the minimum value in the segment number 0 [5]
        • after encoding the data are compressed and stored as a LOB allocation unit
    • {concept} column segment 
      • {def} a column of data from within the rowgroup
      • is compressed together and stored on physical media
      • SQL Server loads an entire segment to memory when it needs to access its data
    • {concept} segment metadata 
      • store metadata about each segment 
        • e.g. minimum and maximum values
        • ⇐ segments that do not have the required data are skipped [5]
    • {concept} deltastore
      • {def} all of the delta rowgroups of a columnstore index
      • its operations are handled behind the scenes
        • can be in either states
          • {state} open (aka open delta store) 
            • accepts new rows and allow modifications and deletions of data
          • {state} closed (aka closed data store)
            • a delta store is closed when it reaches its rowgroup limit
    • {concept} delta rowgroup 
      • {def} a clustered B-tree index that's used only with columnstore indexes
      • improves columnstore compression and performance by storing rows until the number of rows reaches the rowgroup limit and are then moved into the columnstore
      • {event} reaches the maximum number of rows
        • it transitions from an ‘open’ to ‘closed’ state
        • a closed rowgroup is compressed by the tuple-mover and stored into the columnstore as COMPRESSED rowgroup
      • {event} compressed
        • the existing delta rowgroup transitions into TOMBSTONE state to be removed later by the tuple-mover when there is no reference to it
    • {concept} tuple-mover 
      • background process that checks for closed row group
        • if it finds a closed rowgroup, it compresses the delta rowgroup and stores it into the columnstore as a COMPRESSED rowgroup
    • {concept} clustered columnstore index (CCI) 
      • is the primary storage for the entire table
      • {characteristic} updatable
        • has two structures that support data modifications
          • ⇐ both use the B-Tree format to store data [5]
          • ⇐ created on demand [5]
          • delete bitmap
            • indicates which rows were deleted from a table
            • upon deletion the row continues to be stored into the rowgroup
            • during query execution SQL Server checks the delete bitmap and excludes deleted rows from the processing [5]
          • delta store
            • includes newly inserted rows
            • updating a row triggers the deletion of the existing row and insertion of a new version of a row to a delta store
              • ⇒ the update does not change the row data
              • ⇒ the updated row is inserted to a delete bitmap
          • [partitions] each partition can have a single delete bitmap and multiple delta stores
            • ⇐ this makes each partition self-contained and independent from other partitions
              • ⇒ allows performing a partition switch on tables that have clustered columnstore indexes defined [5]
      • {feature} supports minimal logging for batch sizes >= rowgroup’s limit [12]
      • [SQL Server 2017] supports non-persisted computed columns in clustered columnstore indexes [2]
      • store some data temporarily into a clustered index (aka deltastore) and a btree list of IDs for deleted rows
        • ⇐ {benefit} reduces fragmentation of the column segments and improves performance
        • combines query results from both the columnstore and the deltastore to return the correct query results
      • [partitions] too many partitions can hurt the performance of a clustered columnstore index [11]
    • {concept} nonclustered columnstore index (NCCI)
      • {def} a secondary index that's created on a rowstore table
        • is defined as one or more columns of the table and has an optional condition that filters the rows
        • designed to be used for workloads involving a mix of transactional and analytics operations
        • functions the same as a clustered columnstore index
          • ⇐ has same performance optimizations (incl. batchmode operators)
          • {exception} doesn’t support persisted computed columns
            • can’t be created on a columnstore index that has a computed column [2]
          • however behave differently between the various versions of SQL Server
            • [SQL Server 2012|2014] {restriction} readonly
        • contains a copy of part or all of the rows and columns in the underlying table
          • include a row-id , which is either the address of
            • a row in a heap table 
            • a clustered index key value
              • includes all columns from the clustered index even when not explicitly defined in the CREATE statement
                • the not specified columns will not be available in the sys.index_columns view
        • [SQL Server 2016] multiple nonclustered rowstore indexes can be created on a columnstore index and perform efficient table seeks on the underlying columnstore
          • ⇒ once created, makes it possible to drop one or more btree nonclustered indexes
        • enables real-time operational analytics where the OLTP workload uses the underlying clustered index while analytics run concurrently on the columnstore index
    • {concept} batch mode execution (aka vector-based execution, vectorized execution)
      • {def} query processing method used to process multiple rows together in groups of rows, or batches, rather than one row at a time
        • SQL Server can push a predicate to the columnstore index scan operator, preventing unnecessary rows from being loaded into the batch [5]
        • queries can process up to 900 rows together
          • enables efficient query execution (by a 3-4x factor) [4]
          • ⇐ the size of the batches varies to fit into the CPU cache
          • ⇒ reduces the number of times that the CPU needs to request external data from memory or other components [5]
        • improves the performance of aggregations, which can be calculated on a per-batch rather than a per-row basis [5]
        • tries to minimize the copy of data between operators by creating and maintaining a special bitmap that indicates if a row is still valid in the batch [5]
          • ⇐ subsequent operators will ignore the non-valid rows
          • every operator has a queue of work items (batches) to process [5]
          • worker threads from a shared pool pick items from queues and process them while migrating from operator to operator [5]
      • is closely integrated with, and optimized around, the columnstore storage format.
        • columnstore indexes use batch mode execution
          • ⇐ improves query performance typically by two to four times
    • {concept} tuple mover
      • single-threaded process that works in the background, preserving system resources
        • runs every five minutes
      • converts closed delta stores to row groups that store data in a column-based storage format [5]
        • can be disabled via trace flag 634
        • ⇐ the conversion of closed delta stores to row groups can be forced by reorganizing an index [5]
          • runs in parallel using multiple threads
            • decreases significantly conversion time at a cost of extra CPU load and memory usage [5]
          • via: ALTER INDEX REORGANIZE command
      • it doesn’t prevent other sessions from inserting new data into a table [5]
      • deletions and data modifications would be blocked for the duration of the operation [5]
        • {recommendation} consider forcing index reorganization manually to reduce execution, and therefore locking, time [5]
      • [columnstore index] considered fragmented if it has
        • multiple delta rowgroups
        • deleted rows
      • require maintenance like that of regular B-Tree indexes [5]
        • {issue} partially populated row groups
        • {issue} overhead of delta store and delete bitmap scans during query execution
        • rebuilding the columnstore index addresses the issues
        • the strategy depends on the volatility of the data and the ETL processes implemented in the system [5]
          • {recommendation} rebuild indexes when a table has a considerable volume of deleted rows and/or a large number of partially populated rowgroups [5]
          • {recommendation} rebuild partition(s) that still have a large number of rows in open delta stores after the ETL process has completed, especially if the ETL process does not use a bulk insert API [5]
        • creating/dropping/disabling/rebuilding functions like any other index
      • columnstore statistics 
        • a statistics object is created at the time of columnstore index creation; however, it is neither populated nor updated afterward [5]
          • ⇐ SQL Server relies on segment information, B-Tree indexes (when available), and column-level statistics when deciding if a columnstore index needs to be used [5]
          • it is beneficial to create missing column-level statistics on the columns that participate in a columnstore index and are used in query predicates and as join keys [5]
            • ⇐ statistics rarely update automatically on very large tables [5]
              • ⇒ statistics must be updated ‘manually’
        • [SQL Server 2019] included into the schema-only clone of a database functionality [8]
          • enables performance troubleshooting without the need to manually capture the statistics information
          • via: DBCC CLONEDATABASE
      • [in-memory tables] 
        • {limitation} a columnstore index must include all the columns and can’t have a filtered condition [2]
        • {limitation} queries on columnstore indexes run only in InterOP mode, and not in the in-memory native mode [2]
      • {operation} designing columnstore indexes
        • {best practice} understand as much as possible data’s characteristics
        • {best practice} identify workload’s characteristics
      • {operation} create a clustered columnstore index
        • via CREATE CLUSTERED COLUMNSTORE INDEX command (see the T-SQL sketch after these notes)
        • not needed to specify any columns in the statement
          • ⇐ the index will include all table columns
      • {operation} index rebuilding 
        • forces SQL Server to remove deleted rows physically from the index and to merge the delta stores’ and row groups’ data [5]
          • all column segments are recreated with row groups fully populated [5]
        • [<SQL Server 2019] offline operation
        • [SQL Server 2019 Enterprise] online operation
          • ⇒ higher availability 
          • ⇐ pausing and resuming create and rebuild operations are not supported [11]
        • very resource intensive process
        • holds a schema modification (Sch-M) lock on the table
          • ⇒ prevents other sessions from accessing it [5]
          • ⇐ the overhead can be mitigated by using table/index partitioning
            • ⇒ indexes will be rebuild on a partition basis for those partition with volatile data [5]
      • {operation} index reorganization 
        • [<SQL Server 2019] a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups
        • [SQL Server 2019] a background merge task also works to merge COMPRESSED rowgroups from where a large number of rows has been deleted
          • ⇐ after merging smaller rowgroups, the index quality should be improved.
          • the tuple-mover is helped by a background merge task that automatically compresses smaller OPEN delta rowgroups that have existed for some time as determined by an internal threshold, or merges COMPRESSED rowgroups from where a large number of rows has been deleted
          • via: ALTER INDEX REORGANIZE command
        • [SQL Server 2016] performs additional defragmentation
          • removes deleted rows from row groups that have 10 or more percent of the rows logically deleted [5]
          • merges closed row groups together, keeping the total number of rows less than or equal than rowgroup’s limit [5]
          • ⇐ both processes can be done together [5]
        • [SQL Server 2014] the only action performed is compressing and moving the data from closed delta stores to rowgroups [5] 
          • ⇐ delete bitmap and open delta stores stay intact [5]
        • via: ALTER INDEX REORGANIZE
          • uses all available system resources while it is running [5]
            • ⇒ speeds up the execution process 
            • reduce the time during which other sessions cannot modify or delete data in a table [5]
          • close and compress all open row groups
            • via: ALTER INDEX REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)
            • row groups aren’t merged during this operation [5]
      • {operation} estimate compression savings
        • [SQL Server 2019] COLUMNSTORE and COLUMNSTORE_ARCHIVE added
          • allows estimating the space savings if either of these indexes are used on a table [8]
          • {limitation} not available in all editions 
        • via: sp_estimate_data_compression_savings 
      • {operation} [bulk loads] when the number of rows is less than 102,400, all the rows go directly to the deltastore
        • [large bulk load] most of the rows go directly to the columnstore without passing through the deltastore
          • some rows at the end of the bulk load might be too few in number to meet the minimum size of a rowgroup
            • ⇒ the final rows go to the deltastore instead of the columnstore
        • bulk insert operations provide the number of rows in the batch as part of the API call [5]
          • best results are achieved by choosing a batch size that is divisible by rowgroup’s limit [5]
            • ⇐ guarantees that every batch produces one or several fully populated row groups [5]
              • ⇒ reduce the total number of row groups in a table [5]
              • ⇒ improves query performance
            • ⇐ the batch size shouldn’t exceed rowgroup’s limit [5]
              • row groups can still be created on the fly in a manner similar to a bulk insert when the size of the insert batch is close to or exceeds rowgroup’s limit [5]
    • {operation} [non-bulk operations] trickle inserts go directly to a delta store
    • {feature} parallel inserts
      • [SQL Server 2016] requires following conditions for parallel insert on CCI [6]
        • must specify TABLOCK
        • no NCI on the clustered columnstore index
        • no identity column
        • database compatibility is set to 130
    • {recommendation} minimize the use of string columns in fact tables [5]
      • string data use more space
      • their encoding involves additional overhead during batch mode execution [5]
      • queries with predicates on string columns may have less efficient execution plans that also require significantly larger memory grants as compared to their non-string counterparts [5]
    • {recommendation} [SQL Server 2012|2014] do not push string predicates down toward the lowest operators in execution plans.
    • {recommendation} add another dimension table and replace the string value in the fact table with a synthetic, integer-based ID key that references a new table [5]
    • {operation} upgrading to SQL Server 2016
      • make sure that queries against the tables with columnstore indexes can utilize parallelism if the database compatibility level is less than 130 [5]
    • {feature} [SQL Server 2019] automated columnstore index maintenance [8]
    • {improvement} [SQL Server 2019] better columnstore metadata memory management
    • {improvement} [SQL Server 2019] low-memory load path for columnstore tables
    • {improvement} [SQL Server 2019] improved performance for bulk loading to columnstore indexes
    • {improvement} [SQL Server 2019] server startup process has been made faster for databases that use in-memory columnstore tables for HTAP
    • {feature} DMVs
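
    A minimal T-SQL sketch tying together several of the operations mentioned above; the table names, index names and column lists are assumptions used only for illustration:

      -- clustered columnstore index: becomes the primary storage of the table;
      -- no column list is specified because it always includes all columns
      CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
      ON dbo.FactSales;

      -- nonclustered columnstore index on a rowstore table for real-time analytics;
      -- COMPRESSION_DELAY (in minutes) keeps 'hot' rows longer in the delta rowgroup
      CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
      ON dbo.Orders (OrderDate, CustomerID, TotalDue)
      WITH (COMPRESSION_DELAY = 150);

      -- force closed (and open) delta rowgroups into compressed rowgroups
      ALTER INDEX CCI_FactSales ON dbo.FactSales
      REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

      -- one of the DMVs mentioned above: inspect rowgroup states, trim reasons and
      -- deleted rows to decide between REORGANIZE and REBUILD
      SELECT OBJECT_NAME(object_id) AS table_name, row_group_id, state_desc,
             total_rows, deleted_rows, trim_reason_desc
      FROM sys.dm_db_column_store_row_group_physical_stats
      WHERE object_id = OBJECT_ID('dbo.FactSales');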

    References:
    [1] SQL Docs (2020) Columnstore indexes: Overview [link]
    [2] Microsoft Learn (2024) SQL: What's new in columnstore indexes  [link]
    [3] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
    [4] SQL Docs (2019) Columnstore indexes - Query performance [link]
    [5] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
    [6] Microsoft Learn (2016) Columnstore Index: Parallel load into clustered columnstore index from staging table [link]
    [7] Microsoft Learn (2016) Columnstore Index Defragmentation using REORGANIZE Command [link]
    [8] Microsoft (2018) Microsoft SQL Server 2019: Technical white paper [link]

    Acronyms:
    CCI - clustered columnstore index
    CI - columnstore index
    DBCC - Database Console Commands
    DMV - Dynamic Management View
    ETL - Extract, Transform, Load
    HTAP - Hybrid Transactional/Analytical Processing 
    LOB - Large Object
    NCCI - nonclustered columnstore index
    OLTP - On-Line Transaction Processing
    SP - Service Pack

    15 February 2025

    🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

    Business Intelligence Series
    There’s a saying that applies to many contexts ranging from software engineering to data analysis and visualization related solutions: "fools rush in where angels fear to tread" [1]. Much earlier, an adage attributed to Confucius provides a similar perspective: "do not try to rush things; ignore matters of minor advantage". Ignoring this advice, there's a drive in rapid prototyping to jump in with both feet before checking how solid the ground is, often even without adequate experience in the field. That’s understandable to some degree – people want to see progress and value fast, without building a foundation or getting an understanding of what’s happening, respectively of what’s possible, often ignoring the full extent of the problems.

    A prototype helps to bring the requirements closer to what one intends to achieve, though, as practice often shows, the gap between the initial steps and the final solution requires many iterations, sometimes even too many to make a solution cost-effective. There’s almost always a tradeoff between costs and quality, respectively time and scope. Sooner or later, one must compromise somewhere in between, even if the solution is not optimal. The fuzzier the requirements and what’s achievable with a set of data, the harder it gets to find the sweet spot.

    Even if people understand the steps, constraints and further aspects of a process relatively easily, making sense of the data generated by it, respectively using the respective data to optimize the process can take a considerable effort. There’s a chain of tradeoffs and constraints that apply to a certain situation in each context, that makes it challenging to always find optimal solutions. Moreover, optimal local solutions don’t necessarily provide the optimum effect when one looks at the broader context of the problems. Further on, even if one brought a process under control, it doesn’t necessarily mean that the process works efficiently.

    This is the broader context in which data analysis and visualization topics need to be placed to build useful solutions, to make a sensible difference in one’s job. Especially when the data and processes look numb, one needs to find the perspectives that lead to useful information, respectively knowledge. It’s not realistic to expect to find new insight in any set of data. As experience often proves, insight is rarer than finding gold nuggets. Probably, the most important aspect in gold mining is to know where to look, though it also requires luck, research, the proper use of tools, effort, and probably much more.

    One of the problems in working with data is that usually data is analyzed and visualized in aggregates at different levels, often without identifying and depicting the factors that determine why data take certain shapes. Even if a well-suited set of dimensions is defined for data analysis, data are usually still considered in aggregate. Having the possibility to change between aggregates and details is quintessential for data’s understanding, or at least for getting an understanding of what's happening in the various processes. 

    There is one aspect of data modeling, respectively analysis and visualization that’s typically ignored in BI initiatives – process-wise there is usually data which is not available and approximating the respective values to some degree is often far from the optimal solution. Of course, there’s often a tradeoff between effort and value, though the actual value can be quantified only when gathering enough data for a thorough first analysis. It may also happen that the only benefit is getting a deeper understanding of certain aspects of the processes, respectively business. Occasionally, this price may look high, though searching for cost-effective solutions is part of the job!


    References:
    [1] Alexander Pope (cca. 1711) An Essay on Criticism

    14 February 2025

    🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

     Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 14-Feb-2025

    [Microsoft Fabric] Partitions

    • {def} a data organization technique used to split a large dataset into smaller, more manageable, nonoverlapping subsets (aka partitions, shards)
      • a partition is defined based on one or more fields
      • each partition contains a subset of the data
      • each partition can be stored and processed independently
    • {goal} improve performance, scalability, and manageability of large data tables
    • {benefit} allows splitting large tables into smaller, manageable partitions based on specific criteria [2]
      • e.g., date ranges, regions, categories, entities
    • {benefit} allows improving query performance, as queries can target specific partitions [2]
      • reduces the amount of data scanned [2]
    • {benefit} allows for more efficient data loading [2]
    • {benefit} facilitates the management of big tables [2]
      • maintenance tasks can be performed on individual partitions  [2]
      • obsolete data partitions can be removed with no overhead, and new partitions can be added as needed [2]
    • applies to
      • backups
      • indexing
        • allows optimizing query performance for specific subsets of data
      • statistics
    • performance can be affected by
      • the choice of partition columns for a delta table [1]
      • the number and size of the partitions produced by the partitioning column [1]
      • a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]
        • ⇐ negatively impacts performance of the metadata discovery scan for changes [1]
        • {recommendation} if the cardinality of a column is high, choose another column for partitioning [1]
      • the size of each partition can also affect performance
        • {recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
        • {recommendation} follow the best practices for delta tables maintenance [1]
        • a large volume of small-sized parquet files increases the time it takes to sync the changes [1]
          • ⇐ one can end up with a large number of parquet files in a delta table due to [1]
            • over-partitioning
              • partitions with a high number of unique values [1]
              • {recommendation} choose a partition column that 
                • doesn't have a high cardinality [1]
                • results in individual partition size of at least 1 GB [1]
            • many small files
              • batch and streaming data ingestion rates might result in small files
                • depends on the frequency and size of changes being written to a lakehouse [1]
              • {recommendation} implement regular lakehouse table maintenance [1] 
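
    A minimal sketch of the above recommendations in Spark SQL, as it could be run in a Fabric notebook against a lakehouse; the sales table and its columns are hypothetical, and the partition columns are deliberately low-cardinality (Year/Month rather than, e.g., OrderID) so that each partition stays large:

    -- create a Delta table partitioned by low-cardinality columns,
    -- so each partition stays large (ideally at least ~1 GB)
    CREATE TABLE sales (
        OrderID BIGINT,
        OrderDate DATE,
        Amount DECIMAL(18,2),
        Year INT,
        Month INT
    )
    USING DELTA
    PARTITIONED BY (Year, Month);

    -- queries that filter on the partition columns read only the matching partitions
    SELECT SUM(Amount)
    FROM sales
    WHERE Year = 2024 AND Month = 12;

    -- regular table maintenance: compact the many small parquet files
    -- produced by batch/streaming ingestion, then clean up obsolete files
    OPTIMIZE sales;
    VACUUM sales;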


    References:
    [1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
    [2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance [link]
    [3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

    Resources:
    [R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    13 February 2025

    🧊💠🗒️Data Warehousing: Table Partitioning in SQL Server [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

    Last updated: 13-Feb-2025

    [Data Warehousing] Table Partitioning

    • {def} the spreading of data across multiple tables based on a set of rules to balance large amounts of data across disks or nodes
      • data is distributed based on a function that defines a range of values for each partition [2] 
        • the table is partitioned by applying the partition scheme to the values in a specified column [2]
    • {operation} partition creation
      • [large partitioned table]
        • two auxiliary nonindexed empty tables with the same structure as the fact table, including constraints and data compression options, should be created [4]
          • first table: has a check constraint that guarantees that all its data fits exactly one empty partition of the fact table
            • the constraint must be created on the partitioning column [4]
            • a columnstore index can be created on the fact table, as long as it is aligned with the table [4]
          • second table: used for minimally logged deletions of large portions of data
            • a partition from the fact table can be switched to this empty table, the version without the check constraint [4]
            • the table can then be truncated
            • after truncation, the table is ready to accept the next partition from the fact table for the next minimally logged deletion [4]
        • for minimally logged inserts, new data should be bulk inserted into the auxiliary table that has the check constraint [4]
          • the INSERT operation can be minimally logged because the table is empty [4]
          • create a columnstore index on this auxiliary table, using the same structure as the columnstore index on the fact table [4]
          • switch data from this auxiliary table to a partition of the fact table [4]
          • drop the columnstore index on the auxiliary table, and change the check constraint to guarantee that all of the data for the next load can be switched to the next empty partition of the fact table [4]
          • the auxiliary table is then prepared for new bulk loads again [4]
    • {operation} [Query Optimizer] partition elimination 
      • process in which SQL Server accesses only those partitions needed to satisfy query filters [4]
    • {operation} partition switching 
      • {def} process that switches a block of data from one table or partition to another table or partition [4]
      • types of switches
        • reassign all data from a nonpartitioned table to an empty existing partition of a partitioned table [4]
        • switch a partition of one partitioned table to a partition of another partitioned table [4]
        • reassign all data from a partition of a partitioned table to an existing empty nonpartitioned table [4]
    • {benefit} improves query performance [1]
      • by partitioning a table across filegroups [1]
        • specific ranges of data can be placed on different disk spindles [1]
          • can improve I/O performance [1]
            • ⇐ if the disk storage is already configured as a RAID 10 or RAID 5 array [1]
              • ⇒ this usually has little benefit [1]
        • using a mix of fast solid state storage for recent, frequently accessed data, and mechanical disks for older, less queried rows [1]
          • use partitioning to balance disk performance against storage costs [1]
        • biggest performance gain from partitioning in a data warehouse is realized when queries return a range of rows that are filtered on the partitioning key [1]
          • the query optimizer can eliminate partitions that are not within the filter range [1]
            • dramatically reduce the number of rows that need to be read [1]
      • reduces contention [3]
        • can reduce the number of rows included in a table scan [3]
    • {benefit} more granular manageability [1]
      • some maintenance operations can be performed at partition level instead of on the whole table [1]
        • e.g. indexes can be created and rebuilt on a per-partition basis [1]
        • e.g. compression can be applied to individual partitions [1]
        • e.g. by mapping partitions to filegroups, partitions can be backed up and restored independently [1]
          • enables backing up older data once and then configuring the backed-up partitions as read-only [1]
          • future backups can be limited to the partitions that contain new or updated data [1]
    • {benefit} improved data load performance
      • enables loading many rows very quickly by switching a staging table with a partition
        • can dramatically reduce the time taken by ETL data loads [1]
          • with the right planning, it can be achieved with minimal requirements to drop or rebuild indexes [1]
    • {best practice} partition large fact tables
      • tables of around 50 GB or more
      • ⇐ in general, fact tables benefit from partitioning more than dimension tables [1]
    • {best practice} partition on an incrementing date key [1]
      • assures that the most recent data are in the last partition and the earliest data are in the first partition [1]
    • {best practice} design the partition scheme for ETL and manageability [1]
      • the query performance gains realized by partitioning are small compared to the manageability and data load performance benefits [1]
        • ideally partitions should reflect the ETL load frequency
          • because this simplifies the load process [1]
          • merge partitions periodically to reduce the overall number of partitions (for example, at the start of each year) [1]
            • e.g., the monthly partitions for the previous year can be merged into a single partition for the whole year [1]
    • {best practice} maintain an empty partition at the start and end of the table [1]
      • simplifies the loading of new rows [1]
      • when new records need to be added, split the empty partition
        • ⇐ to create two empty partitions
      • then switch the staged data with the first empty partition [1]
        • ⇐ loads the data into the table and leaves the second empty partition you created at the end of the table, ready for the next load [1]
      • a similar technique can be used to archive or delete obsolete data at the beginning of the table [1]
    • {best practice} choose the proper granularity
      • it should be aligned to the business requirements [2]
    • {best practice} create at least one filegroup in addition to the primary one
      • set it as the default filegroup
        • data tables are thus separated from system tables [2]
      • create dedicated filegroups for extremely large fact tables [2]
        • place the fact tables on their own logical disks [2]
      • use a file and a filegroup separate from the fact and dimension tables [2]
        • {exception} staging tables that will be switched with partitions to perform fast loads [2]
          • staging tables must be created on the same filegroup as the partition with which they will be switched [2]
    • {def} partition scheme 
      • a scheme that maps partitions to filegroups
    • {def} partition function 
      • object that maps rows to partitions by using values from specific columns (aka partitioning columns)
      • performs logical mapping
    • {def} aligned index 
      • index built on the same partition scheme as its base table [4]
        • if all indexes are aligned with their base table, switching a partition is a metadata operation only [4]
          • ⇒ it’s very fast [4]
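
    A minimal end-to-end sketch in T-SQL of the concepts above - partition function, partition scheme, an aligned staging table with a check constraint, a metadata-only switch, and partition elimination. All object names, columns and boundary values are hypothetical:

    -- partition function: logical mapping of values to partitions (RANGE RIGHT, monthly)
    CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('2025-01-01', '2025-02-01', '2025-03-01');

    -- partition scheme: maps the partitions to filegroups (here all to PRIMARY)
    CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

    -- fact table partitioned on an incrementing date key
    CREATE TABLE dbo.FactSales (
        OrderDate date NOT NULL,
        ProductID int NOT NULL,
        Amount money NOT NULL
    ) ON psOrderDate (OrderDate);

    -- staging table with the same structure, on the same filegroup as the target
    -- partition, with a check constraint matching the (empty) February partition
    CREATE TABLE dbo.FactSales_Stage (
        OrderDate date NOT NULL,
        ProductID int NOT NULL,
        Amount money NOT NULL,
        CONSTRAINT CK_Stage_OrderDate
            CHECK (OrderDate >= '2025-02-01' AND OrderDate < '2025-03-01')
    ) ON [PRIMARY];

    -- after bulk loading the staging table (minimally logged while it is empty),
    -- switch it into partition 3 of the fact table: a metadata-only operation
    ALTER TABLE dbo.FactSales_Stage
        SWITCH TO dbo.FactSales PARTITION 3;

    -- partition elimination: only the partitions covered by the filter are read
    SELECT SUM(Amount)
    FROM dbo.FactSales
    WHERE OrderDate >= '2025-02-01' AND OrderDate < '2025-03-01';

    -- keep an empty partition at the end of the table: split it before the next load
    ALTER PARTITION FUNCTION pfOrderDate() SPLIT RANGE ('2025-04-01');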

    References:
    [1] 20467A - Designing Business Intelligence Solutions with Microsoft SQL Server 2012
    [2] 20463C - Implementing a Data Warehouse with Microsoft SQL Server
    [3] 10777A - Implementing a Data Warehouse with Microsoft SQL Server 2012
    [4] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)
    [5] Microsoft Learn (2009) How to Implement an Automatic Sliding Window in a Partitioned Table on SQL Server 2005 [link]

    🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 12-Feb-2025

    [Microsoft Fabric] SQL Analytics Endpoint

    • {def} a service that actively listens for requests, providing a SQL-based experience for lakehouse Delta tables [1]
      • enables querying the data in the lakehouse using the T-SQL language and the TDS protocol [1]
      • created by default for each lakehouse and mirrored database provisioned in a workspace
        • every lakehouse has one SQL analytics endpoint [1]
      • manages the automatically generated tables so the workspace users can't modify them [1]
    • {feature} a background process is responsible for 
      • scanning lakehouse for changes [1]
        • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
          • tables are created with a minor delay [1]
          • data is refreshed automatically
            • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
          • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
      • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
        • {characteristic} transparently managed
          • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
          • when a change is detected in the lakehouse
            • a background process updates metadata [1]
            • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
        • {characteristic} low latency
          • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
          • the actual length of time can vary from a few seconds to minutes [1]
        • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
          • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
            •  ⇐ provides a seamless experience [3]
    • {feature} can be enriched by adding database objects
      • schemas
      • views
      • procedures
      • other database objects
    • {feature} automatic metadata discovery
      • tracks changes committed to lakehouses [1]
        • is a single instance per Fabric workspace [1]
        • {issue} increased latency for changes to sync between lakehouses and the SQL analytics endpoint
          • can be due to a large number of lakehouses in one workspace [1]
          • {resolution} migrate each lakehouse to a separate workspace [1]
            • allows automatic metadata discovery to scale [1]
        • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint
          • e.g., a newly created table in the lakehouse may not be visible [1]
          • e.g., many committed rows may not be visible in the endpoint [1]
          • ⇐ may be caused by ETL processing that generates large volumes of changes
          • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor's Refresh ribbon option [1]
            • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
        • {issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead, which impacts the time it takes to sync changes to the SQL analytics endpoint [1]
          • {recommendation} schedule regular lakehouse table maintenance operations [1]
        • {limitation} not all Delta features are understood by the automatic sync process [1]
    • {feature} endpoint reprovisioning
      • if the initial provisioning attempt fails, users have the option to try again [2]
        • ⇐ without the need to create an entirely new lakehouse [2]
        • {benefit} empowers users to self-mitigate provisioning issues in a convenient way in the UI, avoiding the need for complete lakehouse re-creation [2]
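
    A minimal T-SQL sketch of enriching the endpoint with database objects; the lakehouse table dbo.sales and the schema/view names are hypothetical, and the Delta tables themselves remain read-only through the endpoint:

    -- runs against the SQL analytics endpoint (T-SQL over TDS)
    CREATE SCHEMA reporting;
    GO

    -- a view over an automatically generated lakehouse table
    CREATE VIEW reporting.vw_MonthlySales
    AS
    SELECT YEAR(OrderDate) AS [Year],
           MONTH(OrderDate) AS [Month],
           SUM(Amount) AS TotalAmount
    FROM dbo.sales
    GROUP BY YEAR(OrderDate), MONTH(OrderDate);
    GO

    SELECT * FROM reporting.vw_MonthlySales;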

    References:
    [1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
    [3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]