28 March 2025

🏭🗒️Microsoft Fabric: Hadoop [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information, please consult the documentation (see the links below)!

Last updated: 28-Mar-2025

[Microsoft Fabric] Hadoop

  • Apache software library
    • backend technology that makes storing data and running large-scale parallel computations possible
    • open-source framework 
    • widely adopted 
    • implements its own distributed file system (HDFS)
      • enables applications to scale to petabytes of data employing commodity hardware
    • based on the MapReduce API 
      • software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
      • {function} Mapper
        • consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]
        • ⇐ the analysis usually involves filtering and sorting operations [2]
      • {function} Reducer
        • consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2]
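The Mapper/Reducer contract described above can be simulated locally in plain Python, without Hadoop; the word-count job below is purely illustrative, with the in-memory sort standing in for the framework's shuffle phase:

```python
# Minimal local simulation of the MapReduce contract: the mapper emits
# (key, value) tuples, the reducer combines them into a smaller result.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Consume one input line and emit (word, 1) tuples."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Summarize all counts emitted for one word into a single total."""
    return word, sum(counts)

def run_job(lines):
    # Shuffle/sort phase: gather all emitted tuples and sort them by key,
    # which is what the framework does between the map and reduce stages.
    tuples = sorted(t for line in lines for t in mapper(line))
    return dict(reducer(key, (v for _, v in group))
                for key, group in groupby(tuples, key=itemgetter(0)))
```

In a real cluster the mapper and reducer run as separate tasks on many nodes; only the per-key grouping guarantee is the same as here.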
  • {benefit} economical, scalable storage model
    • can run on commodity hardware that in turn utilizes commodity disks
      • the price point per terabyte is lower than that of almost any other technology [1]
  • {benefit} massive scalable IO capability
    • aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
    • adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]
      • ⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]
  • {characteristic} reliability
    • enabled by fault-tolerant design
    • ability to recover via MapReduce re-execution
      • ⇐ the framework detects task failures on one node of the distributed system and restarts the programs on other healthy nodes
    • data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1] 
      • ⇐ failure of a server does not result in a loss of data [1]
        • ⇐ the job continues even if a server fails
          • ⇐ the processing switches to another server [1]
      • every piece of data is usually replicated across three nodes
        • ⇐ can be located on separate server racks to avoid any single point of failure [1]
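The three-way, rack-aware replication described above can be sketched as follows. This is a simplification of HDFS's actual placement policy, and all names (`place_replicas`, the rack layout) are illustrative:

```python
REPLICATION_FACTOR = 3  # HDFS default, configurable via dfs.replication

def place_replicas(block_id, racks):
    """Pick REPLICATION_FACTOR nodes for one block so that replicas span
    at least two racks; losing one server (or one whole rack) then never
    loses the block. `racks` maps rack name -> list of node names."""
    rack_names = sorted(racks)
    # Simplified policy: one replica on a first rack, the remaining
    # replicas on a second rack (roughly mirroring HDFS rack awareness).
    first_rack = rack_names[hash(block_id) % len(rack_names)]
    other_rack = next(r for r in rack_names if r != first_rack)
    return [racks[first_rack][0]] + racks[other_rack][:REPLICATION_FACTOR - 1]
```

Real HDFS additionally weighs network distance and free space when choosing nodes; the point here is only that replicas deliberately cross rack boundaries.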
  • {characteristic} scalable processing model
    • MapReduce represents a widely applicable and scalable distributed processing model
    • capable of brute-forcing acceptable performance for almost all algorithms [1]
      • not the most efficient implementation for all algorithms
  • {characteristic} schema on read
    • the imposition of structure can be delayed until the data is accessed
    • ⇐ as opposed to the schema on write model 
    • ⇐ used by relational data warehouses
    • data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]
      • {advantage} data can be quickly ingested in its various forms [1]
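Schema on read can be illustrated in a few lines: the raw bytes land in storage untouched, and each consumer imposes its own column names and types only when reading. The function and data below are illustrative:

```python
import csv
import io

# Raw data is stored as-is, with no upfront schema (schema on write
# would require converting it to a structured format before loading).
raw = "2025-03-28,clickstream,42\n2025-03-28,purchase,7\n"

def read_with_schema(raw_text, columns, casts):
    """Impose structure at read time: apply the caller's chosen column
    names and type conversions to the same untyped raw text."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {col: cast(val) for col, cast, val in zip(columns, casts, row)}
```

Two consumers can read the same file with different schemas, which is exactly the flexibility that schema on write gives up in exchange for upfront validation.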
  • {architecture} Hadoop 1.0
    • mixed nodes
      • the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]
        • each server supplies both data storage and processing capacity (CPU and memory) [1]
    • specialized nodes
      • job tracker node 
        • coordinates the scheduling of jobs run on the Hadoop cluster [1]
      • name node 
        • a directory service that maintains the mapping between files in HDFS and their blocks on the data nodes [1]
    • {disadvantage} architecture limited to MapReduce workloads [1]
    • {disadvantage} it provides limited flexibility with regard to scheduling and resource allocation [1]
  • {architecture} Hadoop 2.0 
    • layers on top of the Hadoop 1.0 architecture [1]
    • {concept} YARN (aka Yet Another Resource Negotiator)
      • improves scalability and flexibility by splitting the roles of the Job Tracker into two processes [1]
        • {process} Resource Manager 
          • controls access to the cluster's resources (memory, CPU)
        • {process} Application Master 
          • (one per job) controls task execution
    • treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]
      • allows Hadoop to run tasks based on more complex processing models [1]
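This demotion of MapReduce to "one framework among many" is visible in Hadoop 2.x configuration: the execution framework is selected explicitly via the `mapreduce.framework.name` property in mapred-site.xml (a sketch; `local` and `classic` are the other documented values):

```xml
<!-- mapred-site.xml: run classic MapReduce as just one YARN framework -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```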
  • {concept} Distributed File System 
    • a protocol used for storage and replication of data [1]

Acronyms:
DFS - Distributed File System
DWH - Data Warehouse
HDFS - Hadoop Distributed File System
YARN - Yet Another Resource Negotiator 

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] Microsoft Learn (2024) What is Apache Hadoop in Azure HDInsight?

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric?
