Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!
Last updated: 28-Mar-2024
[Microsoft Fabric] Hadoop
- Apache software library
- backend technology that makes storing data and running large-scale parallel computations possible
- open-source framework
- widely adopted
- implements its own distributed file system (HDFS)
- enables applications to scale to petabytes of data employing commodity hardware
- based on MapReduce API
- software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
- {function} Mapper
- consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]
- ⇐ the analysis usually involves filtering and sorting operations [2]
- {function} Reducer
- consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2]
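The Mapper/Reducer contract above can be sketched in plain Python with the classic word-count example. This is illustrative only; a real Hadoop job would be written against the Java MapReduce API or Hadoop Streaming, and the function names here are made up for the sketch.

```python
# Word-count sketch of the Mapper/Reducer roles (illustrative, not the Hadoop API).
from collections import defaultdict

def mapper(line):
    """Consume a line of input and emit (key, value) tuples: (word, 1)."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    """Consume the tuples emitted by the mapper and combine them into
    a smaller summary result: total count per word."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["the quick brown fox", "the lazy dog"]
tuples = [pair for line in lines for pair in mapper(line)]
counts = reducer(tuples)
# counts["the"] == 2, counts["fox"] == 1
```

In a real cluster, the framework runs many mapper instances in parallel over splits of the input, shuffles the emitted tuples by key, and feeds each key's tuples to a reducer; the sketch collapses that into a single process.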
- {benefit} economical, scalable storage model
- can run on commodity hardware that in turn utilizes commodity disks
- the price point per terabyte is lower than that of almost any other technology [1]
- {benefit} massive scalable IO capability
- aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
- adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]
- ⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]
- {characteristic} reliability
- enabled by fault-tolerant design
- fault-tolerant MapReduce execution
- ⇐ detects task failure on one node of the distributed system and restarts the work on other healthy nodes
- data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1]
- ⇐ failure of a server does not result in a loss of data [1]
- ⇐ the job continues even if a server fails
- ⇐ the processing switches to another server [1]
- every piece of data is usually replicated across three nodes
- ⇐ can be located on separate server racks to avoid any single point of failure [1]
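The three-way, rack-aware replication described above can be sketched as a toy placement function. This is a simplification under stated assumptions (one replica per rack, round-robin node choice); HDFS's actual placement policy is more involved, and the names here are invented for the sketch.

```python
# Toy sketch of three-way block replication across racks (illustrative;
# not HDFS's real placement algorithm).
def place_replicas(block_id, racks, replication=3):
    """Pick one node from each of `replication` distinct racks, so the
    failure of a single server or rack never loses all copies."""
    chosen = []
    for rack in racks[:replication]:
        # round-robin within the rack based on the block id
        chosen.append(rack[block_id % len(rack)])
    return chosen

racks = [["r1n1", "r1n2"], ["r2n1", "r2n2"], ["r3n1", "r3n2"]]
replicas = place_replicas(7, racks)
# 7 % 2 == 1, so the second node of each rack is chosen:
# replicas == ["r1n2", "r2n2", "r3n2"]
```

Because each replica lands on a different rack, a failed server (or a whole rack) still leaves live copies, which is why a job can simply switch to another server and continue.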
- {characteristic} scalable processing model
- MapReduce represents a widely applicable and scalable distributed processing model
- capable of brute-forcing acceptable performance for almost all algorithms [1]
- not the most efficient implementation for all algorithms
- {characteristic} schema on read
- the imposition of structure can be delayed until the data is accessed
- ⇐ as opposed to the schema on write mode
- ⇐ used by relational data warehouses
- data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]
- {advantage} data can be quickly ingested in its various forms [1]
- ⇐ this is sometimes referred to as schema on read [1]
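The schema-on-read idea above can be shown with a minimal sketch: raw lines are landed as-is on the write path, and structure (field names, types) is imposed only when the data is read. The field names and delimiter below are illustrative assumptions, not part of any Hadoop API.

```python
# Schema-on-read sketch (illustrative): ingest raw, parse on access.
raw_store = []  # stands in for files landed on HDFS without conversion

def ingest(line):
    """Write path: append the raw record; no schema is enforced."""
    raw_store.append(line)

def read_as_events(delimiter=","):
    """Read path: only now impose a (timestamp, user, action) structure."""
    for line in raw_store:
        ts, user, action = line.split(delimiter)
        yield {"timestamp": ts, "user": user, "action": action}

ingest("2024-03-28T10:00,alice,login")
ingest("2024-03-28T10:05,bob,logout")
events = list(read_as_events())
# events[0]["user"] == "alice"
```

A schema-on-write system would instead validate and normalize each record at ingest time; here, malformed lines would only surface when (and if) they are read.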
- {architecture} Hadoop 1.0
- mixed nodes
- the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]
- each server supplies both data storage and processing capacity (CPU and memory) [1]
- specialized nodes
- job tracker node
- coordinates the scheduling of jobs run on the Hadoop cluster [1]
- name node
- a sort of directory that provides the mapping from blocks on data nodes to files on HDFS [1]
- {disadvantage} architecture limited to MapReduce workloads [1]
- {disadvantage} it provides limited flexibility with regard to scheduling and resource allocation [1]
- {architecture} Hadoop 2.0
- layers on top of the Hadoop 1.0 architecture [1]
- {concept} YARN (aka Yet Another Resource Negotiator)
- improves scalability and flexibility by splitting the roles of the Job Tracker into two processes [1]
- {process} Resource Manager
- controls access to the cluster's resources (memory, CPU)
- {process} Application Master
- (one per job) controls task execution
- treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]
- allows Hadoop to run tasks based on more complex processing models [1]
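The Resource Manager / Application Master split above can be sketched as a toy scheduler: one cluster-wide process grants containers, and a per-job process uses them to run that job's tasks. All class and method names below are invented for the sketch; this is not the YARN API.

```python
# Toy sketch of the YARN role split (illustrative, not the YARN API).
class ResourceManager:
    """Cluster-wide: controls access to the cluster's resources."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, wanted):
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, n):
        self.free += n

class ApplicationMaster:
    """One per job: controls that job's task execution."""
    def __init__(self, rm, tasks):
        self.rm, self.tasks = rm, tasks

    def run(self):
        done = 0
        while done < len(self.tasks):
            # ask the RM for enough containers for the remaining tasks
            granted = self.rm.allocate(len(self.tasks) - done)
            done += granted          # run one task per granted container
            self.rm.release(granted) # containers return to the pool
        return done

rm = ResourceManager(total_containers=4)
am = ApplicationMaster(rm, tasks=["map-1", "map-2", "map-3", "map-4", "map-5", "reduce-1"])
finished = am.run()
# finished == 6, and all 4 containers are free again
```

Because the Resource Manager knows nothing about what runs inside a container, MapReduce becomes just one of several frameworks an Application Master can drive on the cluster.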
- {concept} Distributed File System
- a file system used for the storage and replication of data across the cluster [1]
References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data