Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!
Last updated: 17-Mar-2024
[Microsoft Fabric] Caching
- {def} technique that improves the performance of data processing by storing frequently accessed data and metadata in a faster storage layer [1]
- e.g. local memory, local SSD disk
- ⇐ subsequent requests can be served faster, directly from the cache [1]
- if a set of data has been previously accessed by a query, any subsequent queries will retrieve that data directly from the in-memory cache [1]
- local memory operations are notably faster compared to fetching data from remote storage [1]
- ⇐ significantly diminishes IO latency [1]
- fully transparent to the user
- consistently active and operates seamlessly in the background [1]
- orchestrated and maintained by MF
- users have no option to manually clear the cache [1]
- provides transactional consistency
- ensures that any modifications to the data in storage made after it was initially loaded into the in-memory cache still result in consistent data [1]
- when the cache reaches its capacity threshold and new data is read for the first time, the objects that have gone unused for the longest time are evicted from the cache [1]
- this eviction makes room for the incoming data and keeps cache utilization optimal [1] (see the sketch below)
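The general behavior described above (repeated reads served from memory, least-recently-used entries evicted once capacity is reached) can be illustrated with a minimal, engine-agnostic Python sketch; the class, the capacity value and the storage reader are purely hypothetical and not part of Fabric's actual implementation:

```python
from collections import OrderedDict
from typing import Any, Callable


class ReadThroughLRUCache:
    """Toy read-through cache with least-recently-used eviction.

    The real Fabric cache is fully managed by the engine and transparent to
    users; this sketch only mimics the behavior described above: the first
    access reads from (slow) remote storage, repeated accesses are served
    from memory, and the entries unused for the longest time are evicted
    once the capacity threshold is reached.
    """

    def __init__(self, fetch_from_storage: Callable[[str], Any], capacity: int = 3):
        self._fetch = fetch_from_storage          # hypothetical remote-storage reader
        self._capacity = capacity                 # illustrative capacity threshold
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Any:
        if key in self._entries:
            self._entries.move_to_end(key)        # cache hit: mark as most recently used
            return self._entries[key]
        value = self._fetch(key)                  # cache miss: read from remote storage
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)     # evict the least recently used entry
        return value


# toy usage: the lambda stands in for a slow read from remote storage
cache = ReadThroughLRUCache(fetch_from_storage=lambda key: f"data for {key}")
cache.get("orders")   # miss -> fetched from storage and cached
cache.get("orders")   # hit  -> served directly from memory
```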
- {type} in-memory cache
- data in cache is organized in a compressed columnar format (aka columnar storage) [1]
- ⇐ optimized for analytical queries
- each column of data is stored separately [1]
- {benefit} allows for better compression [1]
- since similar data values are stored together [1]
- {benefit} reduces the memory footprint
- when queries need to perform operations on a specific column, the engine can work more efficiently, speeding up query execution [1] (see the sketch below)
- it doesn't have to process unnecessary data from other columns
- can perform operations on multiple columns simultaneously [1]
- taking advantage of modern multi-core processors [1]
- when the engine retrieves data from storage, the data is transformed from its original file-based format into highly optimized structures in the in-memory cache [1]
- {scenario} analytical workloads where queries involve scanning large amounts of data to perform aggregations, filtering, and other data manipulations [1]
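Why a columnar layout compresses well and speeds up single-column operations can be shown with a small Python sketch; the table, the values and the run-length encoding used here are only illustrative stand-ins for the engine's internal formats:

```python
# The same toy table, laid out row-wise vs. column-wise.
rows = [
    {"country": "AT", "amount": 10},
    {"country": "AT", "amount": 12},
    {"country": "DE", "amount": 11},
    {"country": "DE", "amount": 15},
]

# Columnar layout: each column is stored (and scanned) separately.
columns = {
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

def run_length_encode(values):
    """Toy compression: similar values stored next to each other compress well."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

print(run_length_encode(columns["country"]))  # [['AT', 2], ['DE', 2]]

# A query aggregating a single column never touches the other columns.
print(sum(columns["amount"]))                 # 48
```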
- {type} disk cache
- complementary extension to the in-memory cache
- any data loaded into the in-memory cache is also serialized to the SSD cache [1]
- data removed from the in-memory cache remains within the SSD cache for an extended period
- when a subsequent query requests the data, it is retrieved from the SSD cache into the in-memory cache more quickly than re-reading it from remote storage [1] (see the sketch below)
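The interplay between the two tiers (entries evicted from memory stay on local disk and are promoted back into memory on the next request) can be sketched as follows; the file-based "SSD tier", the capacity and the serialization format are assumptions made purely for illustration and not Fabric's actual mechanism:

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Toy two-tier cache: a small in-memory tier backed by a local disk tier."""

    def __init__(self, fetch_from_storage, memory_capacity=2):
        self._fetch = fetch_from_storage
        self._capacity = memory_capacity
        self._memory: OrderedDict = OrderedDict()
        self._disk_dir = Path(tempfile.mkdtemp(prefix="ssd_cache_"))  # stand-in for the local SSD

    def _disk_path(self, key: str) -> Path:
        return self._disk_dir / f"{key}.bin"

    def get(self, key: str):
        if key in self._memory:                     # 1. in-memory hit
            self._memory.move_to_end(key)
            return self._memory[key]
        disk_file = self._disk_path(key)
        if disk_file.exists():                      # 2. disk-cache hit: promote back to memory
            value = pickle.loads(disk_file.read_bytes())
        else:                                       # 3. full miss: read from remote storage
            value = self._fetch(key)
        self._put_in_memory(key, value)
        return value

    def _put_in_memory(self, key, value):
        self._memory[key] = value
        self._disk_path(key).write_bytes(pickle.dumps(value))  # cached entries are also serialized to disk
        if len(self._memory) > self._capacity:
            self._memory.popitem(last=False)        # evict from memory only; the disk copy remains


cache = TwoTierCache(fetch_from_storage=lambda key: f"data for {key}")
cache.get("sales")      # storage -> memory + disk
cache.get("orders")
cache.get("customers")  # evicts "sales" from memory; its disk copy remains
cache.get("sales")      # served from the disk tier, promoted back to memory
```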
- {issue} cold run (aka cold cache) performance
- the first 1-3 executions of a query perform noticeably slower than subsequent executions [2]
- if the first run's performance is crucial, try manually creating statistics (aka pre-warming the cache) [2]
- otherwise, one can rely on the automatic statistics generated in the first query run and leveraged in subsequent runs [2]
- as long as the underlying data does not change significantly [2] (see the sketch below)
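One simple way to observe the cold-run effect is to time the same query over consecutive executions; the sketch below assumes a pyodbc connection to the warehouse's SQL endpoint, and the connection string, table and query are hypothetical placeholders to be replaced with your own:

```python
import time

import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

# Hypothetical connection string and query - replace with your own endpoint and table.
CONNECTION_STRING = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-fabric-sql-endpoint>;"
    "Database=<your-warehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
)
QUERY = "SELECT COUNT(*) FROM dbo.FactSales WHERE SaleDate >= '2023-01-01'"

with pyodbc.connect(CONNECTION_STRING) as connection:
    cursor = connection.cursor()
    for run in range(1, 6):
        start = time.perf_counter()
        cursor.execute(QUERY)
        cursor.fetchall()
        elapsed = time.perf_counter() - start
        # the first 1-3 runs are typically slower (cold cache, statistics being built);
        # later runs benefit from the in-memory/disk caches and the existing statistics
        print(f"run {run}: {elapsed:.2f} s")
```

If the first run's performance matters, the same session could be pre-warmed beforehand, e.g. by creating the needed statistics manually, as suggested in [2].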
- differentiated from
Acronyms:
IO - Input/Output
MF - Microsoft Fabric
SSD - Solid State Drive