SQL Troubles

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Microsoft Azure: Azure Data Factory (ADF)

{definition} pay-per-use serverless cloud-based data integration service that orchestrates and automates the movement and transformation of both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads found at its second generation [11]
- allows creating data-driven flows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities
{benefit} cost-effective
- pay-as-you-go model against the Azure subscription with no up-front costs
- low price-to-performance ratio
  - ⇐ cost effective and performant at the same time
- fully managed serverless cloud service that scales on demand [2]
  - ⇒requires zero hardware maintenance [1]
  - ⇒can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
- provides additional cost-saving functionality [11]
  - {feature} it takes care of the provisioning and teardown of the cluster once the job has executed [11]
{benefit} powerful
- allows ingesting on-premise and cloud-based data sources
- high-performance hybrid connectivity
  - over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
- orchestrate at scale
  - on-demand compute
  - Big Data workloads are scaled over multiple nodes to chunk data in parallel [11]
- {feature} [ADFv2] monitoring
  - richer and natively integrating it with Azure Monitor and OMS [11]
    - includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
- {feature} [ADFv2] control flow functionality
  - lets define complex workflows using programmatic or UI mechanisms
    - allows defining parameters at pipeline level [11]
    - includes custom state passing and looping containers [11]
    - pipelines can be authored via additional tools
      - e.g. PowerShell, .NET, Python, REST APIs
      - ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
{benefit} intelligent
- autonomous ETL allows unlocking operational efficiencies and enable citizen integrators [2]
{benefit} enterprise-grade security:
- provides same security standards as any other Microsoft service [11]
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
{benefit} backwards compatibility
- {feature} [ADFv2] allows rehosting SSIS solutions [2]
  - ⇒ helpful for modernizing data warehouse solutions
{prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
{limitation} availability
- the service isn’t available in all regions
  - an instance can be made available in other region to trigger the job on customer’s computer environment [1]
    - ⇐ the time for executing the job on the compute environment doesn’t change [1]
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
- activity types
  - data movement activities
  - data transformation activities
  - control activities
    - control how the pipeline works and interacts with the data [10]
    - allow executing pipelines [10]
    - allow running a foreach statement or Lookup activities [10]
{concept] pipeline
- logical grouping of activities that together perform a task [1]
  - the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
  - two activities can be chained by setting the output data set of one activity as the input dataset of the other activity
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
- data in a pipeline is referred to by different names
  - ⇐ based on the amount of modification that has been performed
  - raw data
    - data with no processing applied [10]
      - ⇒does not yet have a schema applied
    - stored in the message encoding format used to send tracking events such as JSON.
    - can be organized into meaningful data stores and data lakes [10]
      - ⇐ further used in decision-making
    - it's common to send all tracking events as raw events
      - ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
  - processed data
    - raw data that has been decoded in the event-specific formats with the schema applied
      - e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
    - usually stored in different event tables and destination in a data pipeline [10]
  - cooked data
    - processed data that has been aggregated or summarized [10]
- {concept} pipeline parameters
  - similar to SSIS package parameters
    - ⇐ need to be set from outside packages
  - can be passed from the parent pipeline
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
- identifies data structures within different (linked) data stores [1]
  - ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
  - once created, it can be used with activities in a pipeline [10]
    - e.g. a dataset can be an input or output dataset of a copy activity
{concept} linked service
- defines the information needed by ADF to connect to external resources at runtime
  - much like connection strings which define the connection information [10]
- used to represent
  - {concept} data store
    - holds the input-output data to the ADF
    - e.g. tables, files, folders, and documents
  - {concept} compute resource
    - can host the execution of an activity [1]
{concept} scheduler triggers
- allow pipelines to be triggered on a wall-clock schedule [10]
  - pipelines and triggers have an n-m relationship
    - multiple triggers can kick off a single pipeline
    - the same trigger can kick off multiple pipelines
  - manual triggers trigger pipelines on demand [10]
- once defined, it must be started to begin triggering the pipeline [10]
- comes into effect only after publishing the solution to ADF [10]
  - ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in trigger definition [10]
- there is a cost associated with each pipeline run
  - {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
  - {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]

Previous Post <<||>> Next Post

Acronyms:

Azure Data Factory (ADF)

Continuous Integration/Continuous Deployment (CI/CD)

Extract Load Transform (ELT)

Extract Transform Load (ETL)

Independent Software Vendors (ISVs)

Operations Management Suite (OMS)

pay-as-you-go (PAYG)

SQL Server Integration Services (SSIS)

Resources:

[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College

[2] Microsoft (2021) Azure Data Factory [source]

[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]

[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]

[10] Coursera (2021) Data Processing with Azure [source]

[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

07 March 2021

💼Project Management: Methodologies (Part II: Agile Manifesto Reloaded II - Requirements Management)

Independently of its scope and the methodology used, each software development project is made of the same blocks/phases arranged eventually differently. It starts with Requirements Managements (RM) subprocesses in which the functional and non-functional requirements are gathered, consolidated, prioritized and brought to a form which facilitates their understanding and estimation. It’s an iterative process as there can be overlapping in functionality, requirements that don’t bring any significant benefit when compared with the investment, respectively new aspects are discovered during the internal discussions or with the implementer.

As output of this phase, it’s important having a list of requirements that reflect customer’s needs in respect to the product(s) to be implemented. Once frozen, the list defines project’s scope and is used for estimating the costs, sketching a draft of the final solution, respectively of reaching a contractual agreement with the implementer. Ideally the set of requirements should be completed and be coherent while reflecting customer’s needs. It allows thus in theory to agree upon costs as well about an architecture and other important aspects (responsibilities/accountability).

Typically, each new requirement considered after this stage needs to go through a Change Management (CM) process in which it gets formulated to the needed level of detail, a cost, effort and impact analysis is performed, respectively the budget for it is approved or the change gets rejected. Ideally small changes can be considered as part of a buffer budget upfront, however in the end each change comes with a cost and project delays.

Some changes can come late in the project and can have an important impact on the whole architecture when important aspects were missed upfront. Moreover, when the number of changes goes beyond a certain limit it can lead to what is known as scope creep, with important consequences on project’s costs, timeline and quality. Therefore, to minimize the impact on the project, the number of changes needs to be kept to a minimum, typically considering only the critical changes, while the others can be still implemented after project’s end.

The agile manifesto’s principles impose an important constraint on the requirements - changing requirements is a good practice even late in the process – an assumption - best requirements emerge from self-organizing teams, and probably one implication – the requirements need to be defined together with the implementer.

The way changing requirements are handled seem to provide more flexibility though it’s actually a constraint imposed on the CM process which interfaces with the RM processes. Without a proper CM in place, any requirement might arrive to be implemented, independently on whether is feasible or not. This can easily make project’s costs explode, sometimes unnecessarily, while accommodating extreme behavior like changing the same functionality frequently, handling exceptions extensively, etc.

It’s usually helpful to define the requirements together with the implementer, as this can bring more quality in the process, even if more time needs to be invested. However, starting from a solid set of requirements is a critical factor for project’s success. The manifesto makes no direct statement about this. Just iterates that good requirements emerge from self-organizing teams which is not necessarily the case.

The users who in theory can define the requirements best are the ones who have the deepest knowledge about an organization’s processes and IT architecture, typically the key users and/or IT experts. Self-organization revolves around how a team organizes itself and handles the various activities, though there’s no guarantee that it will address the important aspects, no matter how motivated the team is, how constant the pace, how excellent the technical details were handled or how good the final product works.

Previous Post <<||>>Next Post

💼Project Management: Methodologies (Part I: Agile Manifesto Reloaded I - An Introduction)

There are so many books written on agile methodologies, each attempting to depict the realities of software development projects. There are many truths considered in them, though they seem to blend in a complex texture in which the writer takes usually the position of a preacher in which the sins of the traditional technologies are contrasted with the agile principles. In extremis everything done in the past seems to be wrong, while the agile methods seem to be a panacea, which is seldom the case.

There are already 20 years since the agile manifesto was published and the methodologies adhering to the respective principles don’t seem to provide the expected success, suffering from the same chronical symptoms of their predecessors - they are poorly understood and implemented, tend to function after hammer’s principle, respectively the software development projects still deliver poor results. Moreover, there are more and more professionals who raise their voice against agile practices.

Frankly, the principles behind the agile manifesto make sense. A project should by definition satisfy stakeholders’ requirements, ideally through regular deliveries that incorporate the needed functionality while gradually seeking to get early feedback from customers, respectively involve the customer through all project’s duration, working together to deliver a feasible product. Moreover, self-organizing teams, face-to-face meetings, constant pace, technical excellence should allow minimizing the waste, respectively maximizing the efficiency in the project. Further aspects like simplicity, good design and architecture should establish a basis for success.

Re-reading the agile manifesto, even if each read pulls from experience more and more pro and cons, the manifesto continues to look like a Christmas wish-list. Even if the represented ideas make sense and satisfy a specific need, they are difficult to achieve in a project’s context and setup. Each wish introduces a constraint that brings with it its own limitations. Unfortunately, each policy introduced by a methodology follows the same pattern, no matter of the methodology considered. Moreover, the wishes cover only a small subset from a project’s texture, are general and let lot of space for interpretation and implementation, though the same can be said about any principles that don’t provide a coherent worldview or a conceptual model.

The software development industry needs a coherent worldview that reflects its assumptions, models, characteristics, laws and challenges. Software Engineering (SE) attempts providing such a worldview though unfortunately is too complex for many and there seem to be a big divide when considered in respect to the worldviews introduced by the various Project Management (PM) methodologies. Studying one or two PM methodologies, learning a few programming languages and even the hand on experience on a few projects won’t fill the gaps in knowledge associated with the SE worldview.

Organizations don’t seem to see the need for professionals of having a formal education in SE. On the other side is expected from employees to have by default some of the skillset required, which is not the case. Besides understanding and implementing a technology there are a set of knowledge areas in which the IT professional must have at least a high-level knowledge if it’s expected from him/her to think critically about the respective areas. Unfortunately, the lack of such knowledge leads sometimes to situations which can impact negatively projects.

Almost each important word from the agile manifesto pulls with it a set of concepts from a SE’ worldview – customer satisfaction, software delivery, working software, requirements management, change management, cooperation, teamwork, trust, motivation, communication, metrics, stakeholders’ management, good design, good architecture, lessons learned, performance management, etc. The manifesto needs to be regarded from a SE’s eyeglasses if one expects value from it.

Previous Post <<||>> Next Post

04 March 2021

💼Project Management: Project Execution (Part IV: Projects' Dynamics II - Motion)

Motion is the action or process of moving or being moved between an initial and a final or intermediate point. From the tinniest endeavors to the movement of the planets and beyond, everything is governed by motion. If the laws of nature seem to reveal an inner structural perfection, the activities people perform are quite often far from perfect, which is acceptable if we consider that (almost) everything is a learning process. What is probably less acceptable is the volume of inefficient motion we can easily categorize sometimes as waste.

The waste associated with motion can take many forms: sorting through a pile of tools to find the right one, searching for information, moving back and forth to reach a destination or achieve a goal, etc. Suboptimal motion can have important effects for an organization resulting in reduced productivity, respectively higher costs.

If for repetitive activities that involve a certain degree of similarity can be found typically a way to optimize the motion, the higher the uncertainty of the steps involved, the more difficult it becomes to optimize it. It’s the case of discovery endeavors in which the path between start and destination can’t be traced beforehand, respectively when the destination or path in between can’t be depicted to the needed level of detail. A strategy’s implementation, ERP implementations and other complex projects, especially the ones dealing with new technologies and/or incomplete knowledge, tend to be exploratory in nature and thus fall under this latter type a motion.

In other words, one must know at minimum the starting point, the destination, how to reach it and what it takes to reach it – resources, knowledge, skillset. When one has all this information one can go on and estimate how long it will take to reach the destination, though the estimate reflects the information available as well estimator’s skills in translating the information into a realistic roadmap. Each new information has the potential of impacting considerably the whole process, in extremis to the degree that one must start the journey anew. The complexity of such projects and the volume of uncertainty can make estimation difficult if not impossible, no matter how good estimators' skills are. At best an estimator can come with a best- and worst-case estimation, both however dependent on the assumptions made.

Moreover, complex projects are sensitive to the initial conditions or auspices under which they start. This sensitivity can turn a project in a totally different direction or pace, that can be reinforced positively or negatively as the project progresses. It’s a continuous interplay between internal and external factors and components that can create synergies or have adverse effects with the potential of reaching tipping points.

Related to the initial conditions, as the praxis sometimes shows, for entities found in continuous movement (like organizations) it’s also important to know from where one’s coming (and at what speed), as the previous impulse (driving force) can be further used or stirred as needed. Metaphorically, a project will need a certain time to find the right pace if it lacks the proper impulse.

Unless the team is trained to play and plays like an orchestra, the impact of deviations from expectations can be hardly quantified. To minimize the waste, ideally a project’s journey should minimally deviate from the optimal path, which can be challenging to achieve as a project’s mass can pull the project in one direction or the other. The more the project advances the bigger the mass, fact which can make a project unstoppable. When such high-mass projects are stopped, their impulse can continue to haunt the organization years after.

Previous Post <<||>> Next Post

💼Project Management: Project Execution (Part III: Projects' Dynamics - An Introduction)

Despite the considerable collection of books on Project Management (PM) and related methodologies, and the fact that projects are inherent endeavors in professional as well personal life (setups that would give in theory people the environment and exposure to different project types), people’s understanding on what it takes to plan and execute a project seems to be narrow and questionable sometimes. Moreover, their understanding diverges considerably from common sense. It’s also true that knowledge and common sense are relative when considering any human endeavor in which there are multiple roads to the same destination, or when learning requires time, effort, skills, and implies certain prerequisites, however the lack of such knowledge can hurt when endeavor’s success is a must and a team effort.

Even if the lack of understanding about PM can be considered as minor when compared with other challenges/problems faced by a project, when one’s running fast to finish a race, even a small pebble in one’s running shoes can hurt a lot, especially when one doesn’t have the luxury to stop and remove the stone, as it would make sense to do.

It resides in the human nature to resist change, to seek for information that only confirm own opinions, to follow the same approach in handling challenges, even if the attempts are far from optimal, even if people who walked the same path tell you that there’s a better way and even sketch the path and provide information about what it takes to reach there. As it seems, there’s the predisposition to learn on the hard way, if there’s significant learning involved at all. Unfortunately, such situations occur in projects and the solutions often overrun the boundaries of PM, where social and communication skills must be brought into play.

On the other side, there’s still hope that change can be managed optimally once the facts are explained to a certain level that facilitates understanding. However, such an attempt can prove to be quite a challenge, given the various setups in which PM takes place. The intersection between technologies and organizational setups lead to complex scenarios which make such work more difficult, even if projects’ challenges are of organizational rather than technological nature.

When the knowledge we have about the world doesn’t fit our expectation, a simple heuristic is to return to the basics. A solid edifice can be built only on a solid foundation and the best foundation in coping with reality is to establish common ground with other people. One can achieve this by identifying their suppositions and expectations, by closing the gap in perception and understanding, by establishing a basis for communication, in which feedback is a must if one wants to make significant progress.

Despite of being explorative and time-consuming, establishing common ground can be challenging when addressing to an imaginary audience, which is quite often the situation. The practice shows however that progress can be made by starting with a set of well-formulated definitions, simple models, principles, and heuristics that have the potential of helping in sense-making.

The goal is thus to identify first the definitions that reflect the basic concepts that need to be considered. Once the concepts defined, they can be related to each other with the help of a few models. Even if fictitious, as simplifications of the reality, the models should allow playing with the concepts, facilitating concepts’ understanding. Principles (set of rules for reasoning) can be used together with heuristics (rules of thumb methods or techniques) for explaining the ‘known’ and approaching the ‘unknown’. Even maybe not perfect, these tools can help building theories or explanatory constructs.

||>>Next Post

27 February 2021

🐍Python: PySpark and GraphFrames (Test Drive)

Besides the challenges met during configuring the PySpark & GraphFrames environment, also running my first example in Spyder IDE proved to be a bit more challenging than expected. Starting from an example provided by the DataBricks documentation on GraphFrames, I had to add 3 more lines to establish the connection of the Spark cluster, respectively to deactivate the context (only one SparkContext can be active per Java VM).

The following code displays the vertices and edges, respectively the in and out degrees for a basic graph.

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

Output:

id	name	age
a	Alice	34
b	Bob	36
c	Charlie	30
d	David	29
e	Esther	32
f	Fanny	36
g	Gabby	60

src	dst	relationship
a	b	friend
b	c	follow
c	b	follow
f	c	follow
e	f	follow
e	d	friend
d	a	friend
a	e	friend

id	inDegree
f	1
e	1
d	1
c	2
b	2
a	1

id	outDegree
f	1
e	2
d	1
c	1
b	1
a	2

Notes:
Without the last line, running a second time the code will halt with the following error:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at D:\Work\Python\untitled0.py:4

Loading the same data from a csv file involves a small overhead as the schema needs to be defined explicitly. The same output from above should be provided by the following code:

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import * 

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

nodes = [
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
]
edges = [
    StructField("src", StringType(), True),
    StructField("dst", StringType(), True),
    StructField("relationship", StringType(), True)
    ]

v = spark.read.csv(r"D:\data\nodes.csv", header=True, schema=StructType(nodes))

e = spark.read.csv(r"D:\data\edges.csv", header=True, schema=StructType(edges))

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

The 'nodes.csv' file has the following content:
id,name,age
"a","Alice",34
"b","Bob",36
"c","Charlie",30
"d","David",29
"e","Esther",32
"f","Fanny",36
"g","Gabby",60

The 'edges.csv' file has the following content:
src,dst,relationship
"a","b","friend"
"b","c","follow"
"c","b","follow"
"f","c","follow"
"e","f","follow"
"e","d","friend"
"d","a","friend"
"a","e","friend"

Note:
There should be no spaces between values (e.g. "a", "b"), otherwise the results might deviate from expectations.

Now, one can go and test further operations on the graph thus created:

#filtering edges 
gl = g.edges.filter("relationship = 'follow'").sort("src")
gl.show()
print("number edges: ", gl.count())

#filtering vertices
#gl = g.vertices.filter("age >= 30 and age<40").sort("id")
#gl.show()
#print("number vertices: ", gl.count())

# relationships involving edges and vertices
#motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
#motifs.show()

Happy coding!

🐍Python: Installing PySpark and GraphFrames on a Windows 10 Machine

One of the To-Dos for this week was to set up the environment so I can start learning PySpark and GraphFrames based on the examples from Needham & Hodler’s free book on Graph Algorithms. Therefore, I downloaded and installed the Java SDK 8 from the Oracle website (requires an Oracle account) and the latest stable version of Python (Python 3.9.2), downloaded and unzipped the Apache Spark package locally on a Windows 10 machine, respectively the Winutils tool as described here.

The setup requires several environment variables that need to be created, respectively the Path variable needs to be extended with further values (delimited by ";"). In the end I added the following values:

Variable	Value
HADOOP_HOME	D:\Programs\spark-3.0.2-bin-hadoop2.7
SPARK_HOME	D:\Programs\spark-3.0.2-bin-hadoop2.7
JAVA_HOME	D:\Programs\Java\jdk1.8.0_281
PYTHONPATH	D:\Programs\Python\Python39\
PYTHONPATH	;%SPARK_HOME%\python
PYTHONPATH	%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip
PATH	%HADOOP_HOME%\bin
PATH	%SPARK_HOME%\bin
PATH	%PYTHONPATH%
PATH	%PYTHONPATH%\DLLs
PATH	%PYTHONPATH%\Lib
PATH	%JAVA_HOME%\bin

I tried then running the first example from Chapter 3 using the Spyder IDE, though the environment didn’t seem to recognize the 'graphframes' library. As long it's not already available, the graphframes .jar file (e.g. graphframes-0.8.1-spark3.0-s_2.12.jar) corresponding to the installed Spark version must be downloaded and copied in the Spark folder where the other .jar files are available (e.g. .\spark-3.0.2-bin-hadoop2.7\jars). With this change I could finally run my example, though it took me several tries to get this right.

During Python's installation I had to change the value for the LongPathsEnabled setting from 0 to 1 via regedit to allow path lengths longer than 260 characters, as mentioned in the documentation. The setting is available via the following path:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem

In the process I also tried installing ‘pyspark’ and ‘graphframes’ via the Anaconda tool with the following commands:

pip3 install --user pyspark
pip3 install --user graphframes

From Anaconda’s point of view the installation was correct, fact which pointed me to the missing 'graphframe' library.

It took me 4-5 hours of troubleshooting and searching until I got my environment setup. I still have two more warnings to solve, though I will look into this later:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

Notes:
Spaces in the folder's names might creates issues. Therefore, I used 'Programs' instead of 'Program Files' as main folder.
There seem to be some confusion what environment variables are needed and how they need to be configured.
Unfortunately, the troubleshooting involved in setting up an environment and getting a simple example to work seems to be a recurring story over the years. Same situation was with the programming languages from 15-20 years ago.

22 February 2021

𖣯Strategic Management: The Impact of New Technologies (Part I: A Nail Keeps the Shoe)

Probably one of the most misunderstood aspects for businesses is the implications the adoption of a new technology have in terms of effort, resources, infrastructure and changes, these considered before, during and post-implementation. Unfortunately, getting a new BI tool or ERP system is not like buying a new car, even if customers’ desires might revolve around such expectations. After all, the customer has been using a BI tool or ERP system for ages, the employees should be able to do the same job as before, right?

In theory adopting a new system is supposed to bring organizations a competitive advantage or other advantages - allow them reduce costs, improve their agility and decision-making, etc. However, the advantages brought by new technologies remain only as potentials unless their capabilities aren’t harnessed adequately. Keeping the car metaphor, besides looking good in the car, having a better mileage or having x years of service, buying a highly technologically-advanced car more likely will bring little benefit for the customer unless he needs, is able to use, and uses the additional features.

Both types of systems mentioned above can be quite expensive when considering the benefits associated with them. Therefore, looking at the features and the further requirements is critical for better understanding the fit. In the end one doesn’t need to buy a luxurious or sport car when one just needs to move from point A to B on small distances. In some occasions a bike or a rental car might do as well. Moreover, besides the acquisition costs, the additional features might involve considerable investments as long the warranty is broken and something needs to be fixed. In extremis, after a few years it might be even cheaper to 'replace' the whole car. Unfortunately, one can’t change systems yet, as if they were cars.

Implementing a new BI tool can take a few weeks if it doesn’t involve architecture changes within the BI infrastructure. Otherwise replacing a BI infrastructure can take from months to one year until having a stable environment. Similarly, an ERP solution can take from six months to years to implement and typically this has impact also on the BI infrastructure. Moreover, the implementation is only the top of the iceberg as further optimizations and changes are needed. It can take even more time until seeing the benefits for the investment.

A new technology can easily have the impact of dominoes within the organization. This effect is best reflected in sayings of the type: 'the wise tell us that a nail keeps a shoe, a shoe a horse, a horse a man, a man a castle, that can fight' and which reflect the impact tools technologies have within organizations when regarded within the broader context. Buying a big car, might involve extending the garage or eventually buying a new house with a bigger garage, or of replacing other devices just for the sake of using them with the new car. Even if not always perceptible, such dependencies are there, and even if the further investments might be acceptable and make sense, the implications can be a bigger shoe that one can wear. Then, the reversed saying can hold: 'for want of a nail, the shoe was lost; for want of a shoe the horse was lost; and for want of a horse the rider was lost'.

For IT technologies the impact is multidimensional as the change of a technology has impact on the IT infrastructure, on the processes associated with them, on the resources required and their skillset, respectively on the various types of flows (data, information, knowledge, materials, money).

Previous Post <<||>> Next Post

17 February 2021

📊🐍Python: Plotting Data with the Radar Chart

Today's task was to display a set of data using the radar chart available with the matplotlib.pyplot library. For this I considered the iris dataset available with the sklearn learning library. The dataset is stored as an array, therefore for further manipulation was converted into a data frame. As the radar chart allows comparing only a small set of numerical values, I considered displaying only the mean values for each type of iris (setosas versicolor, virginica).

Unfortunately, the radar chart doesn't seem to complete the polygons based on the available dataset, therefore as workaround I had to duplicate the first column within the result data frame. (It seems that the Ploty library does a better job at displaying radar charts, see example).

Here's the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets  
import pandas as pd

#preparing the data frame 
iris = datasets.load_iris()

ds = pd.DataFrame(data = iris.data, columns = iris.feature_names)
dr = ds.assign(target = iris.target) #iris type

group_by_iris = dr.groupby('target').mean()
group_by_iris[''] = group_by_iris[iris.feature_names[0]] #duplicating the first column

# creating the graph
fig = plt.subplots()

angle = np.linspace(start=0, stop=2 * np.pi, num=len(group_by_iris.columns))

plt.figure(figsize=(5, 5))
plt.subplot(polar=True)

values = group_by_iris[:1].values[0]
plt.plot(angle, values, label='Iris-setosa', color='r')
plt.fill(angle, values, 'r', alpha=0.2)

values = group_by_iris[1:2].values[0]
plt.plot(angle, values, label='Iris-versicolor', color='g')
plt.fill(angle, values, 'g', alpha=0.2)

values = group_by_iris[2:3].values[0]
plt.plot(angle, values, label='Iris-virginica', color='b')
plt.fill(angle, values, 'b', alpha=0.2)

#labels
plt.title('Iris comparison', size=15)
labels = plt.thetagrids(np.degrees(angle), labels=group_by_iris.columns.values)
plt.legend()

plt.show()

Happy coding!

16 February 2021

📊🐍Python: Drawing Concentric Circles with matplotlib.pyplot

Today I tried using for the first time the matplotlib library for drawing a few concentric circles, though it proved to be a bit more challenging than expected, as the circles were distorted given the scale differences between x and y axis. Because of this the circles displayed via the Circle class (in blue) seem to be displayed as ellipses. To show the difference I used trigonometric functions to draw the circles (in green) by applying a 5/7.5 multiplication factor for the x axis:

And here's the code:

import numpy as np
import math as m
import matplotlib.pyplot as plt

axis_dimensions = [-100,100, -100,100] #dimensions axis
dx=10       #distance between ticks on x axis 
dy=10       #distance between ticks on y axis
sfx = 5/7.5 #scale factor for x axis
r= 50       #radius

#drawing the grid
plt.axis(axis_dimensions)
plt.axis('on')
plt.grid(True, color='gray')
plt.xticks(np.arange(axis_dimensions[0], axis_dimensions[1], dx))
plt.yticks(np.arange(axis_dimensions[2], axis_dimensions[3], dy))

#adding labels
plt.title('Circles')
plt.xlabel('x axis')
plt.ylabel('y axis')

#drawing the geometric figures
for i in range(0,51,10):
    for angle in np.arange(m.radians(0),m.radians(360),m.radians(2)):
        #drawing circles via trigonometric functions
        x = (r+i)*m.cos(angle)*sfx
        y = (r+i)*m.sin(angle)
        plt.scatter(x,y,s=2,color ='g')
        
    #drawing with circles
    circle = plt.Circle((0,0),r+i,color='b', fill=False)
    plt.gca().add_patch(circle)

plt.show()

Happy coding!

04 February 2021

📦Data Migrations (DM): Conceptualization (Part VII: Data Import Layer)

Data Migrations Series

The data requirements for the Data Migration (DM) and Data Quality (DQ) are driven by the processes implemented in the target system(s). Therefore, a good knowledge of these requirements can decrease the effort needed for these two subprojects considerably. The needed knowledge basis starts with the entities and their attributes, the dependencies existing between them and the various rules that apply, and ends with the parametrization requirements, respectively the architecture(s) that can be used to import the data.

The DM process starts with defining the entities in scope and their attributes, respectively identifying the corresponding entities and attributes from the legacy systems. The attributes not having a correspondent in the legacy system need to be provided by the business and integrated in the DM logic. In addition, it’s needed to consider also the attributes needed by the business and not available in the target system, some of them more likely available in the legacy systems. For such attributes is needed either to misuse an attribute from the target or to extend the target system.

For each entity is created a data mapping that basically documents the data transformations needed for migrating the data. In the process is needed to consider also attributes’ data types, the (standard) formatting, their domain of definition, as well the various rules that apply. Their implementation belongs into the DM layer from which the data are exported in a standard format as needed by the target system.

Exporting the data from the DM layer directly into the target system’s tables has in theory the lowest overhead even if the rejected records are difficult to track, the rejections resulting only from records’ ‘validation against database’s schema. For this approach to work, one must have a good knowledge of the database schema and of the business rules implemented into the target system.

To solve the issue with errors’ logging, systems have a further layer on top of the database model, which also allow running data validation against target system’s business rules. Modern import frameworks allow loading the data via a set of standard files with a predefined structure. The data can be thus imported manually or via load jobs into the system a log with the issues being generated in the process. Some frameworks allow even the manual editing of failed records, respectively to import the data. Unfortunately, calling the layer from the DM layer is not possible from a database, though this would bring seldom a benefit. Some third-party tools attempt to improve the import functionality by calling the target system’s import layer.

The import files must be generated from the DM layer in the required structure with the appropriate formatting. The challenge however resides in identifying all the attributes that should make scope of the load. It’s an iterative process which sometimes is backed by try-and-error heuristics. Unless target system’s validation rules are known beforehand, the rules need to be discovered in this process, which can prove time-consuming. The discoveries need to be integrated also in the DM and from here results the big number of changes that need to be performed.

Given the dependencies existing between entities the files need to be generated and loaded in a predefined order. These dependencies are reflected also in the data processing and the validation rules considered in the DM layer.

A quality checkpoint can be implemented between the export from the DM layer and import to enforce the four-eyes principle. It’s normally the last opportunity for trapping the eventual issues. A further quality check is performed after import by validating on whether the data were imported as expected.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

Data Migrations Series

Besides migrating the master and transactional data from the legacy systems there are usually three additional important business requirements for a Data Migration (DM) – migrate the data within expected timeline, with minimal disruption for the business, respectively within expected quality levels. Hence, DM’ timeline must match and synchronize with main project’s timeline in terms of main milestones, though the DM needs to be executed typically within a small timeframe of a few days during the Go-Live. In what concerns the third requirement, even if the data have high quality as available in the source systems or provided by the business, there are aspects like integration and consistency that rely primarily on the DM logic.

To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From project’s beginning until UAT the DM team will integrate the various information iteratively, will need to test the changes several times, troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It’s not only important for the whole solution to be performant but each step must be designed so that besides fast execution, the changes and troubleshooting must involve a minimum of overhead.

For better understanding the importance, imagine a quest game in which the character has to go through a labyrinth with traps. If the player made a mistake he’ll need to restart from a certain distant point in time or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back try a new option and move forward. For some it may look like cheating though in this way one can finish the game relatively quickly. It would be great if executing a DM could allow the same flexibility.

Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn’t provide the flexibility of changing the code, moving one step behind, rerunning the step and performing troubleshooting, and this over and over again like in the quest game. To better illustrate the impact of such approach let’s consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards probably each change will take about a few minutes to execute the code. Otherwise rerunning a whole package can take 5-10 times or even more as this can depend on packages’ size and data volume. For 800 changes only an additional minute per change equates with 800 minutes (about 13 hours).

In exchange, storing the data for an entity in a database for the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.

To assure that the consistency of the data is kept one needs to build for each entity a set of validation queries that check for duplicates, for special cases, for data integrity, incorrect format, etc. The queries can be included in the sequence of logic used for the DM. Thus, one can react promptly to each unexpected value. When required, the validation rules can be built within reports and used in the data cleaning process by users, or even logged periodically per entity for tracking the progress.

Previous Post <<||>> Next Post

03 February 2021

📦Data Migrations (DM): Conceptualization (Part V: Data Extraction Layer)

ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage legacy systems’ data access layers or own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box in the detriment of performance loss, and thus should be used with caution.

Besides that, ETL tools provide also rich visual functionality that allow users building complex pipelines with transformations that process the data as data go through the pipeline. Further features like data profiling or cleansing bring additional benefits.

As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope as filtering and other logic can be used in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches are usually recommended.

As alternative to an ETL tool is building own extraction layer based for example on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative especially when ETL capabilities aren’t needed. This depends also on the overall architectural approach. Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach especially when dealing with a considerable volume of data. Moreover, it would be needed to build features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).

When the volume of data exceeds the capabilities of ETL tools one can consider ELT tools which load first the data before applying any transformations on them. Such tools are designed for the processing of what is known as big data (data having high volume, high velocity, high variety and different veracity).

When considering the best data extraction approach, it’s important to know where the data will be stored for processing. Given that DMs are data processing intensive the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency, failover mechanism some databases offer the possibility to connect directly to other servers via server links functionality. Despite this latter feature an ETL tool can still have considerable advantages for data extraction.

On the other side the DM logic can be in theory built entirely in the ETL tool without storing the data within a database, though this adds a high overhead on the server resources on which the solution runs as all the data needed for processing need to be loaded in memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make challenging implementing any optimization techniques directly into the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and change as the requirements change.

To address the high resources’ consumption of the ETL tools one can store the intermediary results into database tables on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step in the detriment of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic the ETL tool can be used only for data extraction.

Previous Post <<||>> Next Post

SQL Troubles

Pages

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

07 March 2021

💼Project Management: Methodologies (Part II: Agile Manifesto Reloaded II - Requirements Management)

💼Project Management: Methodologies (Part I: Agile Manifesto Reloaded I - An Introduction)

04 March 2021

💼Project Management: Project Execution (Part IV: Projects' Dynamics II - Motion)

💼Project Management: Project Execution (Part III: Projects' Dynamics - An Introduction)

27 February 2021

🐍Python: PySpark and GraphFrames (Test Drive)

🐍Python: Installing PySpark and GraphFrames on a Windows 10 Machine

22 February 2021

𖣯Strategic Management: The Impact of New Technologies (Part I: A Nail Keeps the Shoe)

17 February 2021

📊🐍Python: Plotting Data with the Radar Chart

16 February 2021

📊🐍Python: Drawing Concentric Circles with matplotlib.pyplot

04 February 2021

📦Data Migrations (DM): Conceptualization (Part VII: Data Import Layer)

📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

03 February 2021

📦Data Migrations (DM): Conceptualization (Part V: Data Extraction Layer)

About Me