
28 March 2025

🏭🗒️Microsoft Fabric: Hadoop [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 28-Mar-2025

[Microsoft Fabric] Hadoop

  • Apache software library
    • backend technology that makes storing data and running large-scale parallel computations possible
    • open-source framework 
    • widely adopted 
    • implements the HDFS
      • enables applications to scale to petabytes of data employing commodity hardware
    • based on MapReduce API 
      • software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
      • {function} Mapper
        • consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]
        • ⇐ the analysis usually involves filtering and sorting operations [2]
      • {function} Reducer
        • consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2] (see the word-count sketch after this list)
  • {benefit} economical, scalable storage model
    • can run on commodity hardware that in turn utilizes commodity disks
      • the price point per terabyte is lower than that of almost any other technology [1]
  • {benefit} massive scalable IO capability
    • aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
    • adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]
      • ⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]
  • {characteristic} reliability
    • enabled by fault-tolerant design
    • fault-tolerant MapReduce execution
      • ⇐ task failure on one node of the distributed system is detected and the programs are restarted on other healthy nodes
    • data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1] 
      • ⇐ failure of a server does not result in a loss of data [1]
        • ⇐ the job continues even if a server fails
          • ⇐ the processing switches to another server [1]
      • every piece of data is usually replicated across three nodes (see the replication sketch after this list)
        • ⇐ can be located on separate server racks to avoid any single point of failure [1]
  • {characteristic} scalable processing model
    • MapReduce represents a widely applicable and scalable distributed processing model
    • capable of brute-forcing acceptable performance for almost all algorithms [1]
      • not the most efficient implementation for all algorithms
  • {characteristic} schema on read
    • the imposition of structure can be delayed until the data is accessed
    • ⇐ as opposed to the schema-on-write model used by relational data warehouses
    • data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]
      • {advantage} data can be quickly ingested in its various forms [1] (see the schema-on-read sketch after this list)
        • ⇐ this deferred imposition of structure is what the term 'schema on read' refers to [1]
  • {architecture} Hadoop 1.0
    • mixed nodes
      • the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]
        • each server supplies both data storage and processing capacity (CPU and memory) [1]
    • specialized nodes
      • job tracker node 
        • coordinates the scheduling of jobs run on the Hadoop cluster [1]
      • name node 
        • a sort of directory that provides the mapping from blocks on data nodes to files in HDFS [1]
    • {disadvantage} architecture limited to MapReduce workloads [1]
    • {disadvantage} provides limited flexibility with regard to scheduling and resource allocation [1]
  • {architecture} Hadoop 2.0 
    • layers on top of the Hadoop 1.0 architecture [1]
    • {concept} YARN (aka Yet Another Resource Negotiator)
      • improves scalability and flexibility by splitting the roles of the Job Tracker into two processes [1] (see the YARN client sketch after this list)
        • {process} Resource Manager 
          • controls access to the cluster's resources (memory, CPU)
        • {process} Application Master
          • (one per job) controls task execution
    • treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]
      • allows Hadoop to run tasks based on more complex processing models [1]
  • {concept} Distributed File System 
    • a protocol used for storage and replication of data [1]
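
To make the Mapper/Reducer pair described above concrete, the following is a minimal word-count sketch against the classic Hadoop MapReduce Java API (essentially the canonical example from the Hadoop documentation); the input and output paths passed as arguments are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: consumes input lines and emits (word, 1) tuples
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit a key-value pair
            }
        }
    }

    // Reducer: sums the tuples emitted by the Mapper into a smaller, combined result
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}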
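
As a sketch of the replication behavior described under reliability, the snippet below uses Hadoop's Java FileSystem API to read and change a file's replication factor; the path /data/example.txt is hypothetical, and the default factor of 3 comes from the dfs.replication setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // with dfs.replication = 3, each block is stored on three data nodes
        System.out.println("Current replication: " + status.getReplication());

        // request a different replication factor for this file only
        fs.setReplication(file, (short) 2);
    }
}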
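
To illustrate schema on read, the sketch below loads a raw text file from HDFS as-is and imposes structure only at the moment of reading; the comma-separated "id,name,amount" layout and the path /raw/orders.txt are assumptions made for the example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaOnRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // the file was ingested as-is, with no upfront conversion to a normalized format
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/raw/orders.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // structure is imposed here, at read time
                String[] fields = line.split(","); // assumed layout: id,name,amount
                System.out.println("id=" + fields[0] + ", amount=" + fields[2]);
            }
        }
    }
}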
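
Finally, a small YARN client sketch: the snippet below connects to the Resource Manager via the standard YarnClient API and lists the applications it currently tracks; the cluster connection details are assumed to come from a yarn-site.xml on the classpath:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();

        // each application listed here is driven by its own Application Master
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}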

Acronyms:
DFS - Distributed File System
DWH - Data Warehouse
HDFS - Hadoop Distributed File System
YARN - Yet Another Resource Negotiator 

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] Microsoft Learn (2024) What is Apache Hadoop in Azure HDInsight? [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

07 February 2018

🔬Data Science: Hadoop (Definitions)

"An Apache-managed software framework derived from MapReduce and Bigtable. Hadoop allows applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency. Two major components of Hadoop exist: a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An open-source software platform developed by Apache Software Foundation for data-intensive applications where the data are often widely distributed across different hardware systems and geographical locations." (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Technology designed to house Big Data; a framework for managing data" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"an Apache-managed software framework derived from MapReduce. Big Table Hadoop enables applications based on MapReduce to run on large clusters of commodity hardware. Hadoop is designed to parallelize data processing across computing nodes to speed up computations and hide latency. The two major components of Hadoop are a massively scalable distributed file system that can support petabytes of data and a massively scalable MapReduce engine that computes results in batch." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"An open-source framework that is built to process and store huge amounts of data across a distributed file system." (Jason Williamson, "Getting a Big Data Job For Dummies", 2015)

"Open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"A batch processing infrastructure that stores fi les and distributes work across a group of servers. The infrastructure is composed of HDFS and MapReduce components. Hadoop is an open source software platform designed to store and process quantities of data that are too large for just one particular device or server. Hadoop’s strength lies in its ability to scale across thousands of commodity servers that don’t share memory or disk space." (Benoy Antony et al, "Professional Hadoop®", 2016)

"Apache Hadoop is an open-source framework for processing large volume of data in a clustered environment. It uses simple MapReduce programming model for reliable, scalable and distributed computing. The storage and computation both are distributed in this framework." (Kaushik Pal, 2016)

"A framework that allow for the distributed processing for large datasets." (Neha Garg & Kamlesh Sharma, "Machine Learning in Text Analysis", 2020)

 "Hadoop is an open source implementation of the MapReduce paper. Initially, Hadoop required that the map, reduce, and any custom format readers be implemented and deployed to the cluster. Eventually, higher level abstractions were developed, like Apache Hive and Apache Pig." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"A batch processing infrastructure that stores files and distributes work across a group of servers." (Oracle)

"an open-source framework that is built to enable the process and storage of big data across a distributed file system." (Analytics Insight)

"Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications. Hadoop works by distributing large data sets and analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads that can be run in parallel. Hadoop can process both structured and unstructured data, and scale up reliably from a single server to thousands of machines." (Databricks) [source]

"Hadoop is an open source software framework for storing and processing large volumes of distributed data. It provides a set of instructions that organizes and processes data on many servers rather than from a centralized management nexus." (Informatica) [source]

