Showing posts with label open source. Show all posts

06 October 2025

🏭🗒️Microsoft Fabric: Git [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 6-Oct-2025

[Microsoft Fabric] Git

  • {def} an open source, distributed version control platform
    • enables developers to commit their work to a local repository and then sync their copy of the repository with the copy on the server [1]
    • to be differentiated from centralized version control 
      • where clients must synchronize code with a server before creating new versions of code [1]
    • provides tools for isolating changes and later merging them back together
  • {benefit} simultaneous development
    • everyone has their own local copy of code and works simultaneously on their own branches
      •  Git works offline since almost every operation is local
  • {benefit} faster release
    • branches allow for flexible and simultaneous development
  • {benefit} built-in integration
    • integrates into most tools and products
      •  every major IDE has built-in Git support
        • this integration simplifies the day-to-day workflow
  • {benefit} strong community support
    • the volume of community support makes it easy to get help when needed
  • {benefit} works with any team
    • using Git with a source code management tool increases a team's productivity 
      • by encouraging collaboration, enforcing policies, automating processes, and improving visibility and traceability of work
    • the team can either
      • settle on individual tools for version control, work item tracking, and continuous integration and deployment, or
      • choose a solution that supports all of these tasks in one place
        • e.g. GitHub, Azure DevOps
  • {benefit} pull requests
    • used to discuss code changes with the team before merging them into the main branch
    • help ensure code quality and spread knowledge across the team
    • platforms like GitHub and Azure DevOps offer a rich pull request experience
  • {benefit} branch policies
    • protect important branches by preventing direct pushes, requiring reviewers, and ensuring a clean build
      • used to ensure that pull requests meet requirements before completion
    • teams can configure their solution to enforce consistent workflows and processes across the team
  • {feature} continuous integration
  • {feature} continuous deployment
  • {feature} automated testing
  • {feature} work item tracking
  • {feature} metrics
  • {feature} reporting 
  • {operation} commit
    • snapshot of all files at a point in time [1]
      •  every time work is saved, Git creates a commit [1]
      •  identified by a unique cryptographic hash of the committed content [1]
      •  everything is hashed
      •  it's impossible to make changes, lose information, or corrupt files without Git detecting it [1]
    •  create links to other commits, forming a graph of the development history [2A]
    • {operation} revert code to a previous commit [1]
    • {operation} inspect how files changed from one commit to the next [1]
    • {operation} review information e.g. where and when changes were made [1]
  • {operation} branch
    •  lightweight pointers to work in progress
    •  each developer saves changes to their own local code repository
      • there can be many different changes based on the same commit
        •  branches manage this separation
      • once work created in a branch is finished, it can be merged back into the team's main (or trunk) branch
    • main branch
      • contains stable, high-quality code from which programmers release
    • feature branches 
      • contain work in progress, which is merged into the main branch upon completion
      • allow isolating development work and minimizing conflicts among multiple developers [2]
    •  release branch
      •  by separating the release branch from development in progress, it's easier to manage stable code and ship updates more quickly
  • if a file hasn't changed from one commit to the next, Git uses the previously stored file [1]
  • files are in one of three states
    • {state} modified
      • when a file is first modified, the changes exist only in the working directory
        • they aren't yet part of a commit or the development history
    • {state} staged
      • the developer must stage the changed files to be included in the commit
      • the staging area contains all changes to include in the next commit
      • staging lets developers pick which file changes to save in a commit, breaking down large changes into a series of smaller commits
        • by reducing the scope of commits, it's easier to review the commit history
    • {state} committed
      • once the developer is happy with the staged files, the files are packaged as a commit with a message describing what changed
        • this commit becomes part of the development history
  • {best practice} set up a shared Git repository and CI/CD pipelines [2]
    • enables effective collaboration and deployment in PBIP [2]
    • enables implementing version control in PBIP [2]
      • it’s essential for managing project history and collaboration [2]
      • allows tracking changes throughout the model lifecycle [2]
      • enables effective governance and collaboration
    •  provides robust version tracking and collaboration features, ensuring traceability
  • {best practice} use descriptive commit messages [2]
    • ensures clarity and facilitates collaboration in version control [2]
  • {best practice} avoid sharing Git credentials [2]
    • sharing credentials compromises security and accountability [2]
      •  can lead to potential breaches [2]
  • {best practice} define naming conventions for files and communicate them accordingly [2]
  • {best practice} avoid merging changes directly into the master branch [2]
    • {risk} this can lead to integration issues [2]
  • {best practice} use git merge for integrating changes from one branch to another [2]
    • {benefit} ensures seamless collaboration [2]
  • {best practice} avoid skipping merges [2]
    • failing to merge regularly can lead to complex conflicts and integration challenges [2]
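
The notes above say that every commit is identified by a cryptographic hash of its content and that an unchanged file is reused from one commit to the next. A minimal Python sketch of that idea follows. It assumes Git's default SHA-1 object format and only covers file contents ("blobs"), not whole commits; the function name git_blob_id is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git gives to a file's content (a "blob").

    Assumption: the repository uses the default SHA-1 object format.
    Git hashes a small header ("blob <size>" plus a NUL byte) followed
    by the raw bytes of the file.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"hello world\n"
edited = b"hello world!\n"

# Identical content always yields the same ID, so an unchanged file
# can be stored once and referenced by later commits.
print(git_blob_id(original) == git_blob_id(b"hello world\n"))

# Any change to the content yields a different ID, which is how
# silent modification or corruption becomes detectable.
print(git_blob_id(original) != git_blob_id(edited))
```

On a machine with Git installed, the same ID should be reported by git hash-object for a file with the same bytes.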

References:
[1] Microsoft Learn (2022) DevOps: What is Git? [link]
[2] M Anand, Microsoft Fabric Analytics Engineer Associate: Implementing Analytics Solutions Using Microsoft Fabric (DP-600), 2025 

Acronyms:
PBIP - Power BI Project
CI/CD - Continuous Integration and Continuous Deployment
IDE - Integrated Development Environment
 

28 March 2025

🏭🗒️Microsoft Fabric: Hadoop [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 28-Mar-2025

[Microsoft Fabric] Hadoop

  • Apache software library
    • backend technology that makes storing data and running large-scale parallel computations possible
    • open-source framework 
    • widely adopted 
    • implements special versions of the HDFS
      • enables applications to scale to petabytes of data employing commodity hardware
    • based on MapReduce API 
      • software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
      • {function} Mapper
        • consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]
        • ⇐ analysis usually involves filtering and sorting operations [2]
      • {function} Reducer
        • consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2]
  • {benefit} economical scalable storage model
    • can run on commodity hardware that in turn utilizes commodity disks
      • the price point per terabyte is lower than that of almost any other technology [1]
  • {benefit} massive scalable IO capability
    • aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
    • adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]
      • ⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]
  • {characteristic} reliability
    • enabled by fault-tolerant design
    • ability to replicate by MapReduce execution
      • ⇐ detects task failure on one node of the distributed system and restarts the tasks on other healthy nodes
    • data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1] 
      • ⇐ failure of a server does not result in a loss of data [1]
        • ⇐ the job continues even if a server fails
          • ⇐ the processing switches to another server [1]
      • every piece of data is usually replicated across three nodes
        • ⇐ can be located on separate server racks to avoid any single point of failure [1]
  • {characteristic} scalable processing model
    • MapReduce represents a widely applicable and scalable distributed processing model
    • capable of brute-forcing acceptable performance for almost all algorithms [1]
      • not the most efficient implementation for all algorithms
  • {characteristic} schema on read
    • the imposition of structure can be delayed until the data is accessed
    • ⇐ as opposed to the schema on write model
    • ⇐ used by relational data warehouses
    • data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]
      • {advantage} data can be quickly ingested in its various forms [1]
        • this is sometimes referred to as schema on read [1]
  • {architecture} Hadoop 1.0
    • mixed nodes
      • the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]
        • each server supplies both data storage and processing capacity (CPU and memory) [1]
    • specialized nodes
      • job tracker node 
        • coordinates the scheduling of jobs run on the Hadoop cluster [1]
      • name node 
        • sort of directory that provides the mapping from blocks on data nodes to files on HDFS [1]
      • {disadvantage} architecture limited to MapReduce workloads [1]
      • {disadvantage} it provides limited flexibility with regard to scheduling and resource allocation [1]
  • {architecture} Hadoop 2.0 
    • layers on top of the Hadoop 1.0 architecture [1]
    • {concept} YARN (aka Yet Another Resource Negotiator)
      • improves scalability and flexibility by splitting the roles of the Job Tracker into two processes [1]
        • {process} Resource Manager 
          • controls access to the cluster's resources (memory, CPU)
        • {process} Application Master 
          • (one per job) controls task execution
    • treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]
      • allows Hadoop to run tasks based on more complex processing models [1]
  • {concept} Distributed File System 
    • a protocol used for storage and replication of data [1]
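
The Mapper and Reducer roles described above can be illustrated with a word count. The sketch below runs in a single Python process and does not use the Hadoop API; the function names (mapper, shuffle, reducer, run_job) and the sample input are assumptions made for illustration. On a real cluster the map and reduce calls would be spread across many nodes and the grouping step would be handled by the framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Consume one input record and emit (key, value) tuples."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Summarize all values of one key into a smaller, combined result."""
    return (key, sum(values))


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the map, shuffle, and reduce phases over the input lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(pairs)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["big data big clusters", "data on commodity hardware"])
print(counts)
```

Because each mapper call only sees one line and each reducer call only sees one key, the work can be split across machines without the functions needing to know about each other, which is the property the notes refer to as work parallelization.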

Acronyms:
DFS - Distributed File System
DWH - Data Warehouse
HDFS - Hadoop Distributed File System
YARN - Yet Another Resource Negotiator 

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] Microsoft Learn (2024) What is Apache Hadoop in Azure HDInsight? [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

11 April 2007

🌁Software engineering: Open Source (Definitions)

"A style of licensing software based on the principle that anyone should be able to copy, use, and improve upon a program's source code, although other restrictions may apply." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"A movement in the software industry that makes programs available along with the source code used to create them so others can inspect and modify how programs work." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"Software provided with all the source code available, enabling developers to contribute changes. Most open software is available for free; although, many companies also license commercial software on an open basis." (Jon Radoff, "Game On: Energize Your Business with Social Media Games", 2011)

"software created by the worldwide user community. Open source software is generally free, can be modified by anyone, and usually doesn't have any single “owner.”" (Bill Holtsnider & Brian D Jaffe, "IT Manager's Handbook" 3rd Ed, 2012)

"A movement in the software industry that makes programs available along with the source code used to create them so that others can inspect and modify how programs work. Changes to source code are shared with the community at large." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"Software is open source if the source code is available to anyone who has access to the software." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"A copyright or licensing system that, compared with conventional commercial licensing schemes, allows wide use and modification of the material." (Mike Harwood, "Internet Security: How to Defend Against Attackers on the Web" 2nd Ed., 2015)

"A program in which the source code is available to the general public for use or modification from its original design free of charge. Common Open Source licenses include the GNU General Public License, GNU Library General Public License, Artistic License, BSD license, Mozilla Public License, and other similar licenses listed at http://www.opensource.org/licenses. Open Source code is typically created as a collaborative effort in which programmers improve on the code and share the changes within the community." (James R Kalyvas & Michael R Overly, "Big Data: A Business and Legal Guide", 2015)

[Open source platform:] "refers to any program whose source code is made available for use or modification by other users or developers. An open source platform is usually developed as a public collaboration and made freely available." (Accenture)

"Software for which the source code is available under an open licence. Not only can the software be used for free, but users with the necessary technical skills can inspect the source code, modify it and run their own versions of the code, helping to fix bugs, develop new features, etc. Some large open source software projects have thousands of volunteer contributors. The Open Definition was heavily based on the earlier Open Source Definition, which sets out the conditions under which software can be considered open source." (Open Data Handbook)

06 October 2006

⛩️Eric S Raymond - Collected Quotes

"Good programmers know what to write. Great ones know what to rewrite." (Eric S Raymond, "The Cathedral and the Bazaar", 1999)

"If you have the right attitude, interesting problems will find you." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Often, the most striking and innovative solutions come from realizing that your concept of the problem was wrong." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Smart data structures and dumb code works a lot better than the other way around." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Software is largely a service industry operating under the persistent but unfounded delusion that it is a manufacturing industry." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"The next best thing to having good ideas is recognizing good ideas from your users. Sometimes the latter is better." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"To solve an interesting problem, start by finding a problem that is interesting to you." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging." (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"When writing gateway software of any kind, take pains to disturb the data stream as little as possible - and never throw away information unless the recipient forces you to!" (Eric S Raymond, "The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary", 1999)

"Ugly programs are like ugly suspension bridges: they're much more liable to collapse than pretty ones, because the way humans (especially engineer-humans) perceive beauty is intimately related to our ability to process and understand complexity. A language that makes it hard to write elegant code makes it hard to write good code." (Eric S. Raymond, "Why Python?", Linux Journal, 2000)

"A software system is transparent when you can look at it and immediately see what is going on. It is simple when what is going on is uncomplicated enough for a human brain to reason about all the potential cases without strain." (Eric S Raymond, "The Art of UNIX Programming", 2003)

"All OO languages show some tendency to suck programmers into the trap of excessive layering. Object frameworks and object browsers are not a substitute for good design or documentation, but they often get treated as one. Too many layers destroy transparency: It becomes too difficult to see down through them and mentally model what the code is actually doing. The Rules of Simplicity, Clarity, and Transparency get violated wholesale, and the result is code full of obscure bugs and continuing maintenance problems." (Eric S. Raymond, "The Art of Unix Programming", 2003)

"Programmer time is expensive; conserve it in preference to machine time." (Eric S Raymond, "The Art of UNIX Programming", 2003)

"The combination of threads, remote-procedure-call interfaces, and heavyweight object-oriented design is especially dangerous [...] if you are ever invited onto a project that is supposed to feature all three, fleeing in terror might well be an appropriate reaction." (Eric S Raymond, "The Art of UNIX Programming", 2003)

"The only way to write complex software that won't fall on its face is to hold its global complexity down - to build it out of simple pieces connected by well-defined interfaces, so that most problems are local and you can have some hope of fixing or optimizing a part without breaking the whole." (Eric S Raymond, "The Art of UNIX Programming", 2003)


About Me

Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.