Area | Lakehouse | Warehouse | Eventhouse | Fabric SQL database | Power BI Datamart |
Data volume | Unlimited | Unlimited | Unlimited | 4 TB | Up to 100 GB |
Type of data | Unstructured, semi-structured, structured | Structured, semi-structured (JSON) | Unstructured, semi-structured, structured | Structured, semi-structured, unstructured | Structured |
Primary developer persona | Data engineer, data scientist | Data warehouse developer, data architect, data engineer, database developer | App developer, data scientist, data engineer | AI developer, App developer, database developer, DB admin | Data scientist, data analyst |
Primary dev skill | Spark (Scala, PySpark, Spark SQL, R) | SQL | No code, KQL, SQL | SQL | No code, SQL |
Data organized by | Folders and files, databases, and tables | Databases, schemas, and tables | Databases, schemas, and tables | Databases, schemas, tables | Database, tables, queries |
Read operations | Spark, T-SQL | T-SQL, Spark* | KQL, T-SQL, Spark | T-SQL | Spark, T-SQL |
Write operations | Spark (Scala, PySpark, Spark SQL, R) | T-SQL | KQL, Spark, connector ecosystem | T-SQL | Dataflows, T-SQL |
Multi-table transactions | No | Yes | Yes, for multi-table ingestion | Yes, full ACID compliance | No |
Primary development interface | Spark notebooks, Spark job definitions | SQL scripts | KQL Queryset, KQL Database | SQL scripts | Power BI |
Security | RLS, CLS**, table level (T-SQL), none for Spark | Object level, RLS, CLS, DDL/DML, dynamic data masking | RLS | Object level, RLS, CLS, DDL/DML, dynamic data masking | Built-in RLS editor |
Access data via shortcuts | Yes | Yes | Yes | Yes | No |
Can be a source for shortcuts | Yes (files and tables) | Yes (tables) | Yes | Yes (tables) | No |
Query across items | Yes | Yes | Yes | Yes | No |
Advanced analytics | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Interface for large-scale data processing, built-in data parallelism, and fault tolerance | Time Series native elements, full geo-spatial and query capabilities | T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics | Interface for data processing with automated performance tuning |
Advanced formatting support | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format | Full indexing for free text and semi-structured data like JSON | Table support for OLTP, JSON, vector, graph, XML, spatial, key-value | Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format |
Ingestion latency | Available instantly for querying | Available instantly for querying | Queued ingestion, streaming ingestion has a couple of seconds latency | Available instantly for querying | Available instantly for querying |
It can be used as a map for what is needed to know for using each feature, respectively to identify how one can use the previous experience, and here I'm referring to the many SQL developers. One must consider also the capabilities and limitations of each storage repository.
However, what I'm missing is some references regarding the performance for data access, especially compared with on-premise workloads. Moreover, the devil hides in details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!
For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given its scope. However, probably the most important aspect is whether it can scale with the volume of reads/writes as currently the SQL endpoint seems to lag.
The warehouse seems to be more versatile, though careful attention needs to be given to its design.
The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting how developers cope with its lack of functionality in some areas.
Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios.
Power BI datamarts have been in preview for a couple of years.
References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]
[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]
No comments:
Post a Comment