SQL Troubles: open source

28 March 2025

🏭🗒️Microsoft Fabric: Hadoop [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 28-Mar-2025

[Microsoft Fabric] Hadoop

Apache software library

backend technology that makes storing data and running large-scale parallel computations possible
open-source framework
widely adopted
implements special versions of the HDFS

enables applications to scale to petabytes of data employing commodity hardware

based on MapReduce API

software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
{function} Mapper

consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]

⇐ the analysis usually involves filtering and sorting operations [2]

{function} Reducer

consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2]
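
The Mapper/Reducer split described above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not the Hadoop API: the names `mapper`, `shuffle`, `reducer` and `run_job` are hypothetical, and the grouping step that a real cluster performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Consume one input line and emit a (word, 1) tuple per word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key (simulates the step between the phases)."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: list[int]) -> tuple[str, int]:
    """Summarize all values of one key into a single, smaller result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Run the map, shuffle and reduce steps in sequence on one machine."""
    emitted = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(emitted)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["big data on Hadoop", "Hadoop stores big data"])
print(counts)
```

On a cluster each call to `mapper` could run on a different node, which is where the parallelization comes from; the sketch only shows the data flow.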

{benefit} economical, scalable storage model

can run on commodity hardware that in turn utilizes commodity disks

the price point per terabyte is lower than that of almost any other technology [1]

{benefit} massive scalable IO capability

aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]

⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]

{characteristic} reliability

enabled by fault-tolerant design
MapReduce execution can recover from task failures

⇐ detects a task failure on one node of the distributed system and restarts the task on other healthy nodes

data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1]

⇐ failure of a server does not result in a loss of data [1]

⇐ the job continues even if a server fails

⇐ the processing switches to another server [1]

every piece of data is usually replicated across three nodes

⇐ can be located on separate server racks to avoid any single point of failure [1]
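
Why spreading three copies across racks avoids a single point of failure can be sketched as follows. This is a simplified illustration, not HDFS's actual placement policy: the function names, the round-robin strategy and the rack and node names are all assumptions made for the example.

```python
from itertools import cycle


def place_replicas(racks: dict[str, list[str]], replicas: int = 3) -> list[str]:
    """Pick `replicas` distinct nodes, visiting racks in turn to spread the copies."""
    total_nodes = sum(len(nodes) for nodes in racks.values())
    if replicas > total_nodes:
        raise ValueError("not enough nodes for the requested replication")
    remaining = {rack: list(nodes) for rack, nodes in racks.items()}
    chosen: list[str] = []
    for rack in cycle(sorted(remaining)):
        if len(chosen) == replicas:
            break
        if remaining[rack]:
            chosen.append(remaining[rack].pop(0))
    return chosen


def surviving_copies(placement: list[str], failed: set[str]) -> list[str]:
    """Return the replicas that are still reachable after some nodes failed."""
    return [node for node in placement if node not in failed]


# Hypothetical cluster layout: rack name -> data nodes in that rack.
cluster = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1"]}
placement = place_replicas(cluster)
print(placement)
# Even if the whole of rack-a goes down, two copies remain readable.
print(surviving_copies(placement, failed={"a1", "a2"}))
```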

{characteristic} scalable processing model

MapReduce represents a widely applicable and scalable distributed processing model
capable of brute-forcing acceptable performance for almost all algorithms [1]

not the most efficient implementation for all algorithms

{characteristic} schema on read

the imposition of structure can be delayed until the data is accessed
⇐ as opposed to the schema on write mode
⇐ used by relational data warehouses
data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]

{advantage} data can be quickly ingested from various forms [1]

this is sometimes referred to as schema on read [1]
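
A minimal sketch of the schema-on-read idea, assuming a small CSV payload stored exactly as it arrived. The data, the column names and the `read_with_schema` helper are hypothetical; the point is only that structure is imposed by each reader at access time, so two consumers can read the same raw data with different schemas.

```python
import csv
import io

# Raw data is kept as loaded; nothing was validated or converted at write time.
RAW_FILE = "id,amount,country\n1,19.90,DE\n2,n/a,US\n3,5,FR\n"


def read_with_schema(raw: str, schema: dict) -> list[dict]:
    """Apply column converters while reading; unparseable values become None."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw)):
        row = {}
        for column, convert in schema.items():
            try:
                row[column] = convert(record[column])
            except (KeyError, TypeError, ValueError):
                row[column] = None
        rows.append(row)
    return rows


# Two readers impose two different structures on the same stored bytes.
sales = read_with_schema(RAW_FILE, {"id": int, "amount": float})
geo = read_with_schema(RAW_FILE, {"id": int, "country": str})
print(sales)
print(geo)
```

With schema on write, the row containing `n/a` would have been rejected or cleaned during loading; here the decision is deferred to whoever reads the data.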

{architecture} Hadoop 1.0

mixed nodes

the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]

each server supplies both data storage and processing capacity (CPU and memory) [1]

specialized nodes

job tracker node

coordinates the scheduling of jobs run on the Hadoop cluster [1]

name node

sort of directory that provides the mapping from blocks on data nodes to files on HDFS [1]

{disadvantage} architecture limited to MapReduce workloads [1]
{disadvantage} it provides limited flexibility with regard to scheduling and resource allocation [1]
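
The directory role of the name node in this architecture can be pictured as two lookup tables. The sketch below is an illustration under stated assumptions, not Hadoop code: the `NameNode` class, its methods, the block identifiers and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Holds metadata only: which blocks form a file and where each block lives."""

    file_blocks: dict[str, list[str]] = field(default_factory=dict)
    block_locations: dict[str, list[str]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: list[tuple[str, list[str]]]) -> None:
        """Register a file as an ordered list of (block_id, data_nodes) pairs."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = list(nodes)

    def locate(self, path: str) -> list[tuple[str, list[str]]]:
        """Return, for each block of the file, the data nodes holding a copy."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


name_node = NameNode()
name_node.add_file(
    "/logs/2025-03-28.txt",
    [("blk_1", ["node1", "node2", "node3"]), ("blk_2", ["node2", "node3", "node4"])],
)
print(name_node.locate("/logs/2025-03-28.txt"))
```

A client would ask the directory where the blocks are and then read the block contents from the data nodes themselves.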

{architecture} Hadoop 2.0

layers on top of the Hadoop 1.0 architecture [1]
{concept} YARN (aka Yet Another Resource Negotiator)

improves scalability and flexibility by splitting the roles of the Job Tracker into two processes [1]

{process} Resource Manager

controls access to the cluster's resources (memory, CPU)

{process} Application Manager

(one per job) controls task execution

treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]

allows Hadoop to run tasks based on more complex processing models [1]
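
The division of labour between the two processes can be sketched as below. This is a toy model, not the YARN API: the class names follow the wording of these notes, and the container sizes, job names and the sequential execution are assumptions made to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResourceManager:
    """Cluster-wide process: decides who may use memory and CPU."""

    free_memory_gb: int
    free_cpus: int

    def allocate(self, memory_gb: int, cpus: int) -> bool:
        """Grant the request only if enough capacity is still free."""
        if memory_gb <= self.free_memory_gb and cpus <= self.free_cpus:
            self.free_memory_gb -= memory_gb
            self.free_cpus -= cpus
            return True
        return False

    def release(self, memory_gb: int, cpus: int) -> None:
        """Hand capacity back once a task has finished."""
        self.free_memory_gb += memory_gb
        self.free_cpus += cpus


class ApplicationManager:
    """One instance per job: requests resources, then runs the job's tasks."""

    def __init__(self, job_name: str, resource_manager: ResourceManager) -> None:
        self.job_name = job_name
        self.rm = resource_manager

    def run(self, tasks: list[Callable[[], int]], memory_gb: int = 2, cpus: int = 1) -> list[int]:
        results = []
        for task in tasks:
            if not self.rm.allocate(memory_gb, cpus):
                raise RuntimeError(f"{self.job_name}: cluster has no free capacity")
            try:
                results.append(task())
            finally:
                self.rm.release(memory_gb, cpus)
        return results


rm = ResourceManager(free_memory_gb=8, free_cpus=4)
# A MapReduce-style job and a job using another processing model share the cluster.
word_lengths = ApplicationManager("mapreduce-like", rm).run(
    [lambda: len("hadoop"), lambda: len("yarn")]
)
total = ApplicationManager("other-model", rm).run([lambda: sum(range(5))])
print(word_lengths, total, rm)
```

Because the resource manager knows nothing about what a task does, any framework that can request resources this way can run next to MapReduce.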

{concept} Distributed File System

a protocol used for storage and replication of data [1]

Acronyms:

DFS - Distributed File System

DWH - Data Warehouse

HDFS - Hadoop Distributed File System

YARN - Yet Another Resource Negotiator

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data

[2] Microsoft Learn (2024) What is Apache Hadoop in Azure HDInsight?

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric?

11 April 2007

🌁Software engineering: Open Source (Definitions)

"A style of licensing software based on the principle that anyone should be able to copy, use, and improve upon a program's source code, although other restrictions may apply." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A movement in the software industry that makes programs available along with the source code used to create them so others can inspect and modify how programs work." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"Software provided with all the source code available, enabling developers to contribute changes. Most open software is available for free; although, many companies also license commercial software on an open basis." (Jon Radoff, "Game On: Energize Your Business with Social Media Games", 2011)

"software created by the worldwide user community. Open source software is generally free, can be modified by anyone, and usually doesn't have any single “owner.”" (Bill Holtsnider & Brian D Jaffe, "IT Manager's Handbook" 3rd Ed, 2012)

"A movement in the software industry that makes programs available along with the source code used to create them so that others can inspect and modify how programs work. Changes to source code are shared with the community at large." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Software is open source if the source code is available to anyone who has access to the software." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A copyright or licensing system that, compared with conventional commercial licensing schemes, allows wide use and modification of the material." (Mike Harwood, "Internet Security: How to Defend Against Attackers on the Web" 2nd Ed., 2015)

"A program in which the source code is available to the general public for use or modification from its original design free of charge. Common Open Source licenses include the GNU General Public License, GNU Library General Public License, Artistic License, BSD license, Mozilla Public License, and other similar licenses listed at http://www.opensource.org/licenses. Open Source code is typically created as a collaborative effort in which programmers improve on the code and share the changes within the community." (James R Kalyvas & Michael R Overly, "Big Data: A Business and Legal Guide", 2015)

[Open source platform:] "refers to any program whose source code is made available for use or modification by other users or developers. An open source platform is usually developed as a public collaboration and made freely available." (Accenture)

"Software for which the source code is available under an open licence. Not only can the software be used for free, but users with the necessary technical skills can inspect the source code, modify it and run their own versions of the code, helping to fix bugs, develop new features, etc. Some large open source software projects have thousands of volunteer contributors. The Open Definition was heavily based on the earlier Open Source Definition, which sets out the conditions under which software can be considered open source." (Open Data Handbook)

06 October 2006

⛩️Eric S Raymond - Collected Quotes

"Good programmers know what to write. Great ones know what to rewrite." (Eric S Raymond, "The Cathedral and the Bazaar", 1999)

"If you have the right attitude, interesting problems will find you." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Often, the most striking and innovative solutions come from realizing that your concept of the problem was wrong." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Smart data structures and dumb code works a lot better than the other way around." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Software is largely a service industry operating under the persistent but unfounded delusion that it is a manufacturing industry." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"The next best thing to having good ideas is recognizing good ideas from your users. Sometimes the latter is better." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"To solve an interesting problem, start by finding a problem that is interesting to you." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"When writing gateway software of any kind, take pains to disturb the data stream as little as possible - and never throw away information unless the recipient forces you to!" (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Ugly programs are like ugly suspension bridges: they're much more liable to collapse than pretty ones, because the way humans (especially engineer-humans) perceive beauty is intimately related to our ability to process and understand complexity. A language that makes it hard to write elegant code makes it hard to write good code." (Eric S. Raymond, "Why Python?", Linux Journal, 2000)

"A software system is transparent when you can look at it and immediately see what is going on. It is simple when what is going on is uncomplicated enough for a human brain to reason about all the potential cases without strain." (Eric S Raymond, "The Art of UNIX Programming", 2003)

"All OO languages show some tendency to suck programmers into the trap of excessive layering. Object frameworks and object browsers are not a substitute for good design or documentation, but they often get treated as one. Too many layers destroy transparency: It becomes too difficult to see down through them and mentally model what the code is actually doing. The Rules of Simplicity, Clarity, and Transparency get violated wholesale, and the result is code full of obscure bugs and continuing maintenance problems." (Eric S. Raymond, "The Art of Unix Programming", 2003)

"Programmer time is expensive; conserve it in preference to machine time." (Eric S Raymond, "The Art of UNIX Programming", 2003)

"The combination of threads, remote-procedure-call interfaces, and heavyweight object-oriented design is especially dangerous [...] if you are ever invited onto a project that is supposed to feature all three, fleeing in terror might well be an appropriate reaction." (Eric S Raymond, "The Art of UNIX Programming", 2003)

"The only way to write complex software that won't fall on its face is to hold its global complexity down - to build it out of simple pieces connected by well-defined interfaces, so that most problems are local and you can have some hope of fixing or optimizing a part without breaking the whole." (Eric S Raymond, "The Art of UNIX Programming", 2003)
