SQL Troubles

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)

an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently [1]
{goal} help customers fully realize the benefits of their analytics program [1]
{component} data science lifecycle definition
- {description} a framework to structure the development of data science projects [1]
- {goal} designed for data science projects that ship as part of intelligent applications that deploy ML & AI models for predictive analytics [1]
- {benefit} can be used in the context of other DM methodologies as they have common ground [1]
  - e.g. CRISP-DM, KDD
- {benefit} exploratory data science projects or improvised analytics projects can also benefit from using this process [1]
{component} standardized project structure
- {description} a directory structure that includes templates for project documents
  - ⇒makes it easy for team members to find information [1]
  - ⇐templates for the folder structure and required documents are provided in standard locations [1]
  - all code and documents are stored in an agile VCS tracking repository [1]
    - {recommendation} create a separate repository for each project on the VCS for versioning, information security, and collaboration [1]
- {benefit} organizes the code for the various activities [1]
- {benefit} allows tracking the progress [1]
- {benefit} provides a checklist with key questions for each project to ensure the quality of the process and deliverables [1]
- {benefit} enables team collaboration [1]
- {benefit} allows closer tracking of the code for individual features [1]
- {benefit} enables teams to obtain better cost estimates [1]
- {benefit} helps build institutional knowledge across the organization [1]
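
As a rough illustration of the standardized project structure described above, the sketch below scaffolds a project directory with empty document templates. The folder and file names are hypothetical stand-ins chosen for this example, not the official TDSP template layout.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: folder and file names are illustrative assumptions,
# not the official TDSP template.
TEMPLATE = {
    "Code/DataPrep": [],
    "Code/Model": [],
    "Code/Deployment": [],
    "Docs/Project": ["Charter.md", "ExitReport.md"],
    "Docs/Data": ["DataQualityReport.md"],
    "Docs/Model": ["ModelReport.md"],
}


def scaffold(root):
    """Create the folders and empty document templates under root.

    Returns the list of template files that were created; files that
    already exist are left untouched.
    """
    created = []
    for folder, documents in TEMPLATE.items():
        target = Path(root) / folder
        target.mkdir(parents=True, exist_ok=True)
        for name in documents:
            doc = target / name
            if not doc.exists():
                doc.write_text(f"# {doc.stem}\n\n(to be filled in)\n", encoding="utf-8")
                created.append(doc)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for doc in scaffold(Path(tmp) / "my-project"):
            print(doc.relative_to(tmp))
```

Keeping the layout in one dictionary means every new project starts from the same places, which is the point of the component: team members know where to look.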
{component} recommended infrastructure
- {description} a set of recommendations for the infrastructure and resources needed for analytics and storage [1]
- {benefit} addresses cloud and/or on-premises requirements [1]
- {benefit} enables reproducible analysis [1]
- {benefit} avoids infrastructure duplication [1]
  - ⇒minimizes inconsistencies and unnecessary infrastructure costs [1]
- {tools} tools are provided to provision the shared resources, track them, and allow each team member to connect to those resources securely [1]
- {good practice} create a consistent compute environment [1]
  - ⇐allows team members to replicate and validate experiments [1]
{component} recommended tools and utilities
- {description} a set of recommendations for the tools and utilities needed for the project’s execution [1]
- {benefit} helps lower the barriers and increase the consistency of their adoption [1]
- {benefit} provides an initial set of tools and scripts to jump-start the methodology’s adoption [1]
- {benefit} helps automate some of the common tasks in the data science lifecycle [1]
  - e.g. data exploration and baseline modeling [1]
- {benefit} well-defined structure provided for individuals to contribute shared tools and utilities into their team's shared code repository [1]
  - ⇐ resources can then be leveraged by other projects [1]
{phase} 1: business understanding
- {goal} define and document the business problem, its objectives, the needed attributes, and the metric(s) used to determine project’s success
- {goal} identify and document the relevant data sources
- {step} 1.1: define project’s objectives
  - together with the stakeholders, elicit the requirements, then define and document the problem, its objectives, and the metric(s) used to determine the project’s success
    - requires a good understanding of the business processes, data and further characteristics
- {step} 1.2: identify data sources
  - identify the attributes and the data sources relevant to the problem under study
- {step} 1.3: define project plan and team*
  - develop a high-level milestone plan and identify the resources needed for executing it
- {tool} project charter
  - standard template that documents the business problem, the scope of the project, the business objectives and metric(s) used to determine project’s success
{phase} 2: data acquisition & understanding
- {goal} prepare the base dataset(s) needed by the modeling phase in the target repository
- {goal} build the data ETL/ELT architecture and processes needed for provisioning the base data
- {step} 2.1: ingest data
  - make the required data available for the team in the repository where the analytics operations take place
- {step} 2.2: explore data
  - understand data’s characteristics by leveraging specific tools (visualization, analysis)
  - prepare the data as needed for further processing
- {step} 2.3: set up pipelines
  - build the pipelines needed for data refresh and quality assessment [3]
  - set up a process to score new data or refresh the data regularly [3]
- {step} 2.4: feasibility analysis*
  - reevaluate the project to determine whether the value expected is sufficient to continue pursuing it
- {tool} data quality report
  - report that includes data summaries, data mappings, variable ranking, data quality assessment(s) and further information [3]
- {tool} solution architecture
  - diagram and/or textual-based description of the data pipeline(s), technical assumptions and further aspects
- {tool} data reports
  - document the structure and statistics of the raw data
- {tool} checkpoint decision
  - decision template document that
    - summarizes the findings of the feasibility analysis step
    - includes a set of choices and recommendations for the next steps
    - serves as the basis for deciding whether to continue the project and what the next steps are
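
The data summaries that feed a data quality report can be sketched in a few lines. The example below is only a minimal illustration: the sample rows and the chosen indicators (row count, missing values, min, max, mean) are assumptions made for this sketch, not the content of the TDSP report template.

```python
import statistics

# Hypothetical sample rows; None marks a missing value.
ROWS = [
    {"age": 34, "income": 52000.0},
    {"age": 41, "income": None},
    {"age": None, "income": 61000.0},
    {"age": 29, "income": 48000.0},
]


def summarize(rows, column):
    """Return basic quality indicators for one numeric column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "column": column,
        "rows": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": statistics.fmean(present) if present else None,
    }


def quality_report(rows):
    """Summarize every column that appears in at least one row."""
    columns = sorted({key for row in rows for key in row})
    return [summarize(rows, column) for column in columns]


if __name__ == "__main__":
    for line in quality_report(ROWS):
        print(line)
```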
{phase} 3: modeling
- {goal} create a machine-learning model that addresses the prediction requirements and that's suitable for production
- {step} 3.1: feature engineering
  - the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis [4]
    - ⇐requires a good understanding of how the features relate to each other and how the ML algorithms use those features [4]
- {step} 3.2: model selection*
  - choose one or more modeling algorithms that best address the problem’s characteristics
- {step} 3.3: model training
  - involves the following steps:
    - split the input data into training and test datasets
    - build the models by using the training dataset
    - evaluate the training and the test data set
    - determine the optimal setup and methods
- {step} 3.4: model evaluation
  - evaluate the performance of the model(s)
- {step} 3.5: feasibility analysis*
  - evaluate whether the models are ready for use in production and whether they fulfill the project’s objectives
- {tool} feature sets
  - describe the features developed for the modeling and how they were generated
  - contains pointers to the code used to generate the features
- {tool} model report
  - a standard, template-based report that provides details on each experiment’s outcomes
  - created for each model tried
- {tool} checkpoint decision
- {tool} model performance metrics
  - e.g. ROC curves or MSE
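
The training steps listed under 3.3 (split, build, evaluate on both sets) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the data is synthetic, the model is a one-variable least-squares line, and MSE is used as the performance metric; a real project would pick algorithms and metrics that match the problem.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, rows):
    """Mean squared error of the fitted line on the given rows."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
    noise = random.Random(42)
    data = [(x, 2 * x + 1 + noise.gauss(0, 0.5)) for x in range(40)]
    train, test = train_test_split(data)
    model = fit_line(train)
    print("train MSE:", mse(model, train))
    print("test MSE:", mse(model, test))
```

Comparing the error on the training and the test set is what reveals overfitting: a model that does much better on the data it was built from than on held-out data is not ready for production.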
{phase} 4: deployment
- {goal} deploy the models and the data pipelines to the environment used for final user acceptance
- {step} 4.1: operationalize architecture
  - prepare the models and data pipelines for use in production
  - {best practice} expose the models over an open API interface
    - enables models’ consumption from various applications
  - {best practice} build telemetry and monitoring into the models and the data pipelines [5]
    - helps in monitoring and troubleshooting [5]
- {step} 4.2: deploy solution*
  - deploy the architecture into production
- {tool} status dashboard
  - displays data on system’s health and key metrics
- {tool} model report
  - the report in its final form with deployment information
- {tool} solution architecture
  - the document in its final form
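
The two best practices from step 4.1 can be sketched together: a scoring entry point that accepts and returns JSON, so it can sit behind any API layer, with telemetry built into it. Everything named here is hypothetical: the weights stand in for a trained model, the counters stand in for a monitoring system, and the HTTP layer is left out on purpose.

```python
import json
import time

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = {"age": 0.02, "income": 0.00001}

# In-process telemetry counters; a real deployment would export these
# to a monitoring system instead of keeping them in a dict.
TELEMETRY = {"requests": 0, "errors": 0, "total_seconds": 0.0}


def handle_score_request(body):
    """Score one JSON request body and return (status_code, json_response)."""
    started = time.perf_counter()
    TELEMETRY["requests"] += 1
    try:
        features = json.loads(body)
        score = sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)
        return 200, json.dumps({"score": round(score, 4)})
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing features and wrong types all count as errors.
        TELEMETRY["errors"] += 1
        return 400, json.dumps({"error": type(exc).__name__})
    finally:
        TELEMETRY["total_seconds"] += time.perf_counter() - started


if __name__ == "__main__":
    print(handle_score_request('{"age": 50, "income": 100000}'))
    print(handle_score_request("not json"))
    print(TELEMETRY)
```

Because the handler only deals in strings, the same function can be wired into a web framework, a queue consumer or a batch job, and the counters provide the figures a status dashboard would display.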
{phase} 5: customer acceptance
- {goal} confirm that project’s objectives were fulfilled and get customer’s acceptance
- {step} 5.1: system validation
  - validate system’s performance and outcomes and confirm that it fulfills customer’s needs
- {step} 5.2: project signoff*
  - finalize and review documentation
  - hand over the solution and the related documentation to the customer
  - evaluate the project against the defined objectives and get the customer’s signoff
- {tool} exit report
- {tool} technical report
  - contains all the details of the project that are useful for learning about how to operate the system [6]

Acronyms:

Artificial Intelligence (AI)

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Data Mining (DM)

Knowledge Discovery in Databases (KDD)

Team Data Science Process (TDSP)

Version Control System (VCS)

Visual Studio Team Services (VSTS)

Resources:

[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]

[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]

[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]

[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source]

[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]

[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

21 March 2021

𖣯Strategic Management: The Impact of New Technologies (Part III: Checking the Vital Signs)

An organization that has gone through a major change, like the replacement of a strategic system (e.g. ERP/BI implementations), needs a period of attentive supervision so that the inherent issues are handled as they arise and their future effects are minimized. Some organizations might even go through a convalescence period, which risks prolonging itself if the appropriate remedies aren’t found. Therefore, one needs an entity that has the skills to recognize the symptoms, understand what’s happening and why, and identify the appropriate actions.

Given technologies’ multi-layered complexity and the volume of knowledge needed to understand them, the role of the doctor can seldom be taken by one person. Moreover, the patient is an organization, and each person in the organization usually has only local knowledge about the patient. The needed knowledge is dispersed through the organization, and one needs to tap into that knowledge, identify the people close to the technologies and business areas, and allow such people to exchange information on a regular basis.

The people who should know the organization best are, in theory, the managers; however, they are usually too far away from the technologies and often too busy with management topics. IT professionals are close to the technologies, though sometimes too far away from the patient. The users have too narrow an overview, while for logistical and economic reasons the number of people involved should be kept to a minimum. A compromise is to designate one person from each business area who works with any of the strategic systems, and to ensure that they have the technical and business knowledge required. It’s nothing but the key-user concept, though for it to work the key-users need not only knowledge but also the empowerment to act when the symptoms appear.

Big organizations also have a product owner for each application who supervises the application through its entire lifecycle, and who needs to coordinate with IT, the business and the service providers. This is probably a good idea to ensure that the ROI is reached over time and that the needs of the system are considered within the IT operations context. In small organizations, the role can be taken by a technical or a business resource with deeper skills than the average user, usually a key-user. However, unless joined with the key-user role, the product owner’s focus will be the product and seldom the business themes.

The issues that need to be overcome after major changes are usually cross-functional, which makes it imperative for people to work together and find solutions. Unfortunately, it’s also in human nature to wait until the issues are big enough to get the proper attention. Unless the key-users already have time allocated for such topics, the issues will be lost in the heap of operational and tactical activities. This time must be allocated for all key-users and the technical resources needed to support them.

Some organizations build temporary working parties (groups of experts working together to achieve specific goals) or similar groups. However, the status of such a group needs to be permanent if the organization wants to keep its health continuously in check and to build the needed expertise and awareness about existing or potential issues. Centers of excellence/expertise (CoE) or competency centers (CC) are such working groups with permanent status, having defined roles, responsibilities, and processes for supporting and promoting the effective use of technologies within the organization, and for monitoring and systematically addressing the risks and opportunities associated with them.

There’s also the null hypothesis, doing nothing, relying solely on employees’ professionalism, though without defined responsibility, accountability and empowerment, it can get messy.


𖣯Strategic Management: The Impact of New Technologies (Part II - The Technology-oriented Patient)

Looking at the way data, information and knowledge flow through an organization, with a little imagination one can see the resemblance between an organization and the human body, in which the networks created by the respective flows spread through the organization as nervous, circulatory or lymphatic braids do, each with its own role in the good functioning of the organization. Each technology adopted by an organization taps into these flows, creating a structure that can be compared with a nerve plexus, as the various flows intersect in such points, creating an agglomeration of nerves and braids.

The size of each plexus can be considered proportional to the importance of the technology with respect to the overall structure. Strategic technologies like ERP, BI or planning systems, given their importance (gravity), resemble the organs of the human body, with complex networks of braids in their vicinity. Maybe the metaphor is too far-fetched, though it allows stressing the importance of each technology with respect to its role and the good functioning of the organization. Moreover, each such structure functions as a pressure point that can in extremis block any of the flows considered, a long-term block having important effects.

The human organism is a marvelous piece of work reflecting the grand design; however, in time, especially when neglected or driven by external agents, diseases can take hold of any part of the human body, with all the consequences deriving from this. On the other side, an organization is a man-made structure in continuous expansion as new technologies or resources are added. Even if the technologies are at the periphery of the system, their good or bad functioning can have a ripple effect through the various networks.

Replacing any of the above-mentioned strategic systems can be compared with the replacement of an organ in the human body: it has a higher rate of failure than other operations and is complex in nature, the organism needs long periods to recover, and in extreme situations the convalescence lasts till the end. Fortunately, organizations seem to be more resilient to such operations, though that’s not necessarily a rule. Sometimes all it takes is one small mistake to make the operation fail.

The general feeling is that ERP and BI implementations are taken too lightly by management, employees and implementers. During the replacement operation one must make sure not only that the organ fits and functions as expected, but also that the vital networks regained their vitality and function as expected, and the latter is a process that spans over the years to come. One needs to check the important (health) signs regularly and take the appropriate countermeasures. There must be an entity having the role of the doctor, who/which has the skills to address adequately the issues.

Moreover, when the physical structure of an organization is affected, a series of micro-operations might be needed to address the deformities. Unfortunately, these areas are seldom seen in time, and can require a sustained effort to fix, while in some cases a total reconstruction might be needed. One also works with an amorphous and ever-changing structure that requires many attempts until a remedy is found, if a remedy is possible at all.

Even if such operations are pretty well documented, what organizations often lack are the skilled resources needed during and after the implementation, resources that must also know the patient, and ideally its medical history and further preconditions. Each patient is different and quite often needs its own treatment/medication. With such changes, the organization embarks on a discovery journey in which the appropriate path can easily deviate from the well-trodden paths.


20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on the data. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies use to handle analytical and data integration workloads, which is probably recommendable for the mentioned cloud-based contexts. However, ELT technologies are especially relevant when one needs to handle data with high velocity, variance, validity or different value of truth (aka big data). This is because they allow processing the workloads over architectures that can be scaled with the workloads’ demands.

This is probably the most important aspect, even if there can be further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. The ETL (Extract, Transform, Load) tools have the same capabilities, maybe reduced to certain data sources, though their newer versions seem to bridge the gap.

One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though this is not a technological advantage. The same can be obtained via ETL tools, even if this may in some cases involve a bigger effort, depending on the functionality available in each tool. It’s true that ETL solutions have a narrower scope by loading a subset of the available data, or that transformations are made before loading the data, though this depends on the scope considered while building the data warehouse or data mart, and on the design of the ETL packages. Both are a matter of choice, choices that can be traced back to business requirements or technical best practices.

Some of the advantages seen are context-dependent: they depend on the context in which the technologies are used and in which the problems are solved. ETL solutions are often criticized because the available data are already prepared (aggregated, converted) and new requirements will drive additional effort. On the other side, in ELT-based solutions all the data are made available and eventually further transformed, but here too the level of transformation depends on specific requirements. Independently of the approach used, the data are still available if needed, and they involve a certain effort for further processing.
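
To make the difference concrete, here is a minimal sketch in which an in-memory SQLite database stands in for the target repository. The source records, table names and the aggregation are assumptions made for illustration. In the ETL variant the data is transformed before loading and only the result lands in the repository; in the ELT variant the raw rows are loaded first and the transformation runs inside the repository.

```python
import sqlite3

# Hypothetical source records: (order_id, amount as text, currency).
SOURCE = [(1, "10.50", "EUR"), (2, "3.25", "EUR"), (3, "7.00", "USD")]


def etl(conn):
    """ETL: transform in the integration layer, load only the result."""
    totals = {}
    for _, amount, currency in SOURCE:
        totals[currency] = totals.get(currency, 0.0) + float(amount)
    conn.execute("CREATE TABLE etl_totals (currency TEXT, total REAL)")
    conn.executemany("INSERT INTO etl_totals VALUES (?, ?)", sorted(totals.items()))


def elt(conn):
    """ELT: load the raw rows first, transform inside the repository."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, currency TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", SOURCE)
    conn.execute(
        "CREATE TABLE elt_totals AS "
        "SELECT currency, SUM(CAST(amount AS REAL)) AS total "
        "FROM raw_orders GROUP BY currency"
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    etl(conn)
    elt(conn)
    for table in ("etl_totals", "elt_totals"):
        query = f"SELECT currency, total FROM {table} ORDER BY currency"
        print(table, conn.execute(query).fetchall())
```

Both variants produce the same totals. What differs is that after the ELT run the raw rows remain queryable in the repository, while after the ETL run a new requirement means going back to the source.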

Building usable and reliable data models is dependent on good design, and in the design process reside the most important challenges. In theory, some think that in ETL scenarios the design is done beforehand though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.

Data conversion and cleaning are needed under both approaches. In some scenarios it is ideal to do this upfront, minimizing the effect these processes have on data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should ideally be addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. enforcing best practices).

Advancing that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth data has depends on the way data reflects business’ processes and the way data are maintained, while their quality is judged entirely on their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing can be neglected only under the consequences that follow from this.

Looking at the cloud-based analytics and data integration technologies, they seem to allow both approaches, so building optimal solutions relies on professionals’ wisdom in making the appropriate choices.


🧭Business Intelligence: New Technologies, Old Challenges (Part I: An Introduction)

Each important technology has the potential of creating divides between the specialists from a given field. This aspect is more evident in data-driven fields like BI/Analytics or Data Warehousing. The data professionals (engineers, scientists, analysts, developers) skilled only in the new wave of technologies tend to disregard the former technologies and the role they play in the data landscape. The argument for such behavior is rooted in the belief that a new technology is better and can solve any problem better than previous technologies did. It’s a kind of mirage professionals and customers can easily fall for.

Being bigger, faster, having new functionality, doesn’t make a tool the best choice by default. The choice must be rooted in the problem to be solved and the set of requirements it comes with. Just because a vibratory rammer is a new technology, is faster and has more power in applying pressure, this doesn’t mean that it will replace a hammer. Where a certain type of power is needed the vibratory rammer might be the best tool, while for situations in which a minimum of power and probably more precision is needed, like driving in a nail, then an adequately sized hammer will prove to be a better choice.

A technology is to be used in certain (business/technological) contexts, and even if contexts often overlap, the further details (aka requirements) should lead to the proper use of tools. It’s among a professional’s duties to be able to differentiate between contexts, requirements and the capabilities of the tools appropriate for each context. In this resides partially a professional’s mastery over their field of work and their ability to provide adequate solutions for customers’ needs. Especially in IT, it’s not enough to master the new tools; one also needs an understanding of preceding tools, usage contexts, capabilities and challenges.

From a historical perspective each tool appeared to fill a demand, and even if maybe it didn’t manage to fill it adequately, the experience obtained can prove to be valuable in one way or another. Otherwise, one risks reinventing the wheel, or more dangerously, repeating the failures of the past. Each new technology seems to provide a deja-vu from this perspective.

Moreover, a new technology provides new opportunities and may require changing our way of thinking about how the technology is used and about the processes or techniques associated with it. Knowledge of past technologies helps identify such opportunities more easily. How a tool is used is also a matter of skills, while its appropriate use and adoption implies an inherent learning curve. Having previous experience with similar tools tends to reduce the learning curve considerably, though hands-on learning is still necessary, and appropriate learning materials or tutoring are sometimes needed for a smoother transition.

As far as the implementation of mature technologies is concerned, most of the challenges were seldom the technologies themselves but were of a non-technical nature, ranging from poor understanding/knowledge of the tools, their role and the implications they have for an organization, to an organization’s maturity in leading projects. Even the most advanced technology can fail in the hands of non-experts. Experience can’t be judged based only on the years spent in the field or the number of projects one worked on, but on the understanding acquired about the challenges of implementation and usage. These latter aspects seem to be widely ignored, even if they can make the difference between success and failure in a technology’s implementation.

Ultimately, each technology is appropriate in certain contexts and a new technology doesn’t necessarily make another obsolete, at least not until the old contexts become obsolete.


14 March 2021

Performance Management: Self-Organizing Teams (Definitions)

"A team that has the flexibility and authority to find its own methods for achieving its goals. Team members are motivated to take work without waiting for it to be assigned. They take responsibility for their work and track their own progress." (Rod Stephens, "Beginning Software Engineering", 2015)

"A team formation where the team functions with an absence of centralized control." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide)", 2017)

"Teams that carry out their work without having a centralized point of control; an agile concept." (Cate McCoy & James L Haner, "CAPM Certified Associate in Project Management Practice Exams", 2018)

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Microsoft Azure: Azure Data Factory (ADF)

{definition} pay-per-use serverless cloud-based data integration service that orchestrates and automates the movement and transformation of both cloud-based and on-premises data sources [1]
- ⇐ a hybrid and scalable data integration service for Big Data and advanced end-to-end analytics solutions [11]
- ⇐ Microsoft Azure PaaS offering for ETL/ELT workloads found at its second generation [11]
- allows creating data-driven flows to orchestrate movement of data between supported data stores and processing of data using compute services in other regions or in an on-premises environment
{benefit} easy-to-use
- {feature} allows creating code-free pipelines with drag-and-drop functionality [2]
- {feature} uses JSON to describe each of its entities
{benefit} cost-effective
- pay-as-you-go model against the Azure subscription with no up-front costs
- low price-to-performance ratio
  - ⇐ cost effective and performant at the same time
- fully managed serverless cloud service that scales on demand [2]
  - ⇒requires zero hardware maintenance [1]
  - ⇒can easily scale beyond what was originally anticipated [1]
- does not store any data [1]
- provides additional cost-saving functionality [11]
  - {feature} it takes care of the provisioning and teardown of the cluster once the job has executed [11]
{benefit} powerful
- allows ingesting on-premises and cloud-based data sources
- high-performance hybrid connectivity
  - over 90 built-in connectors make it easy to interact with all kinds of technologies [11]
- orchestrate at scale
  - on-demand compute
  - Big Data workloads are scaled over multiple nodes to chunk data in parallel [11]
- {feature} [ADFv2] monitoring
  - richer monitoring, natively integrated with Azure Monitor and OMS [11]
    - includes feature-rich monitoring and management tools to visualize the current state of data pipelines, data lineage and pipeline dependencies [1]
- {feature} [ADFv2] control flow functionality
  - allows defining complex workflows using programmatic or UI mechanisms
    - allows defining parameters at pipeline level [11]
    - includes custom state passing and looping containers [11]
    - pipelines can be authored via additional tools
      - e.g. PowerShell, .NET, Python, REST APIs
      - ⇒ helps ISVs build SaaS-based analytics solutions on top of ADF app models
{benefit} intelligent
- autonomous ETL allows unlocking operational efficiencies and enables citizen integrators [2]
{benefit} enterprise-grade security
- provides same security standards as any other Microsoft service [11]
{benefit} monthly release cycle
- {feature} via auto-update
- improvements may include support for new connectors, bug fixes, security patches, and performance improvements [11]
{benefit} backwards compatibility
- {feature} [ADFv2] allows rehosting SSIS solutions [2]
  - ⇒ helpful for modernizing data warehouse solutions
{prerequisite} an Azure subscription with the contributor role assigned to at least one resource group
{limitation} availability
- the service isn’t available in all regions
  - an instance can be made available in another region to trigger the job on the customer’s compute environment [1]
    - ⇐ the time for executing the job on the compute environment doesn’t change [1]
{concept} activity
- the unit of orchestration in ADF [1]
- defines the actions to perform on data [1]
- takes zero or more datasets as inputs and produces one or more datasets as outputs [1]
- activity types
  - data movement activities
  - data transformation activities
  - control activities
    - control how the pipeline works and interacts with the data [10]
    - allow executing pipelines [10]
    - allow running ForEach or Lookup activities [10]
{concept} pipeline
- logical grouping of activities that together perform a task [1]
  - the sequence can have a complex schedule and dependencies that need to be orchestrated and automated [1]
  - two activities can be chained by setting the output data set of one activity as the input dataset of the other activity
- allows building ETL/ELT workloads
- scheduled by scheduler triggers [10]
- data in a pipeline is referred to by different names
  - ⇐ based on the amount of modification that has been performed
  - raw data
    - data with no processing applied [10]
      - ⇒does not yet have a schema applied
    - stored in the message encoding format used to send tracking events, such as JSON
    - can be organized into meaningful data stores and data lakes [10]
      - ⇐ further used in decision-making
    - it's common to send all tracking events as raw events
      - ⇐ because all events can be sent to a single endpoint and schemas can be applied later in the pipeline [10]
  - processed data
    - raw data that has been decoded in the event-specific formats with the schema applied
      - e.g. JSON tracking events that have been translated into a session start event with a fixed schema [10]
    - usually stored in different event tables and destination in a data pipeline [10]
  - cooked data
    - processed data that has been aggregated or summarized [10]
- {concept} pipeline parameters
  - similar to SSIS package parameters
    - ⇐ need to be set from outside packages
  - can be passed from the parent pipeline
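
The chaining rule and the three data stages can be sketched together in plain Python. This is a toy model, not the ADF SDK or its JSON format: the events, the activity names and the `run_pipeline` helper are all hypothetical. Two activities are chained because the output dataset of the first is the input dataset of the second, and the data moves from raw to processed to cooked along the way.

```python
import json

# Hypothetical raw tracking events, still in their JSON wire format.
RAW_EVENTS = [
    '{"type": "session_start", "user": "u1"}',
    '{"type": "session_start", "user": "u2"}',
    '{"type": "purchase", "user": "u1", "amount": 9.5}',
]


def decode(raw):
    """Raw -> processed: parse each event and apply a fixed schema."""
    processed = []
    for line in raw:
        event = json.loads(line)
        processed.append({
            "type": event["type"],
            "user": event["user"],
            "amount": float(event.get("amount", 0.0)),
        })
    return processed


def aggregate(processed):
    """Processed -> cooked: count events per type."""
    counts = {}
    for event in processed:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts


# Each activity names its input and output dataset; the output of
# "Decode" is the input of "Aggregate", which chains the two.
ACTIVITIES = [
    {"name": "Decode", "input": "raw", "output": "processed", "run": decode},
    {"name": "Aggregate", "input": "processed", "output": "cooked", "run": aggregate},
]


def run_pipeline(activities, datasets):
    """Run every activity once its input dataset is available."""
    pending = list(activities)
    while pending:
        ready = [a for a in pending if a["input"] in datasets]
        if not ready:
            raise ValueError("unresolvable dataset dependency")
        for activity in ready:
            datasets[activity["output"]] = activity["run"](datasets[activity["input"]])
            pending.remove(activity)
    return datasets


if __name__ == "__main__":
    result = run_pipeline(ACTIVITIES, {"raw": RAW_EVENTS})
    print(result["cooked"])
```

The execution order is derived from the dataset dependencies rather than from the order in which the activities are listed, which is why the list can be declared in any order.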
{concept} dataset
- named references/pointers to the data used as an input or an output of an activity [1]
- identifies data structures within different (linked) data stores [1]
  - ⇐ before creating a dataset, a linked service must be created to link the data store to ADF [10]
  - once created, it can be used with activities in a pipeline [10]
    - e.g. a dataset can be an input or output dataset of a copy activity
{concept} linked service
- defines the information needed by ADF to connect to external resources at runtime
  - much like connection strings which define the connection information [10]
- used to represent
  - {concept} data store
    - holds the input/output data for ADF
    - e.g. tables, files, folders, and documents
  - {concept} compute resource
    - can host the execution of an activity [1]
{concept} scheduler triggers
- allow pipelines to be triggered on a wall-clock schedule [10]
  - pipelines and triggers have an n:m (many-to-many) relationship
    - multiple triggers can kick off a single pipeline
    - the same trigger can kick off multiple pipelines
  - manual triggers trigger pipelines on demand [10]
- once defined, it must be started to begin triggering the pipeline [10]
- comes into effect only after publishing the solution to ADF [10]
  - ⇐ not when saving the trigger in the UI [10]
- to run a pipeline, a pipeline reference must be included in the trigger definition [10]
- there is a cost associated with each pipeline run
  - {recommendation} when testing, make sure that the pipeline is triggered only a couple of times [10]
  - {recommendation} ensure that there is enough time for the pipeline to run between the published time and the end time [10]
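
The raw, processed and cooked stages described earlier in these notes can be illustrated with a few lines of plain Python. This is only a minimal sketch and not ADF code: the event payloads, the field names and the helper functions `to_processed` and `to_cooked` are all hypothetical and serve just to show how data changes its name as more modification is applied.

```python
import json

# Hypothetical raw tracking events: JSON strings, no schema applied yet.
raw_events = [
    '{"type": "session_start", "user": "u1", "ts": "2021-03-11T08:00:00"}',
    '{"type": "session_start", "user": "u2", "ts": "2021-03-11T08:05:00"}',
    '{"type": "purchase", "user": "u1", "ts": "2021-03-11T08:10:00", "amount": 12.5}',
]

def to_processed(raw):
    """Decode the raw events and route them into one 'table' per event type."""
    tables = {}
    for line in raw:
        event = json.loads(line)
        tables.setdefault(event["type"], []).append(event)
    return tables

def to_cooked(tables):
    """Aggregate the processed events: number of events per event type."""
    return {event_type: len(rows) for event_type, rows in tables.items()}

processed = to_processed(raw_events)
cooked = to_cooked(processed)
print(cooked)
```

In a real pipeline each stage would typically be written to its own data store; here the stages are kept in memory only to keep the example short.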

Acronyms:

Azure Data Factory (ADF)

Continuous Integration/Continuous Deployment (CI/CD)

Extract Load Transform (ELT)

Extract Transform Load (ETL)

Independent Software Vendors (ISVs)

Operations Management Suite (OMS)

pay-as-you-go (PAYG)

SQL Server Integration Services (SSIS)

Resources:

[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College

[2] Microsoft (2021) Azure Data Factory

[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud

[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline

[10] Coursera (2021) Data Processing with Azure

[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

07 March 2021

💼Project Management: Methodologies (Part II: Agile Manifesto Reloaded II - Requirements Management)

Independently of its scope and the methodology used, each software development project is made of the same blocks/phases, possibly arranged differently. It starts with the Requirements Management (RM) subprocesses, in which the functional and non-functional requirements are gathered, consolidated, prioritized and brought to a form that facilitates their understanding and estimation. It’s an iterative process, as functionality can overlap, some requirements don’t bring any significant benefit when compared with the investment, and new aspects are discovered during internal discussions or in discussions with the implementer.

As output of this phase, it’s important to have a list of requirements that reflects the customer’s needs with respect to the product(s) to be implemented. Once frozen, the list defines the project’s scope and is used for estimating the costs, sketching a draft of the final solution, and reaching a contractual agreement with the implementer. Ideally, the set of requirements should be complete and coherent while reflecting the customer’s needs. In theory, this allows agreeing upon the costs as well as upon an architecture and other important aspects (responsibilities/accountability).

Typically, each new requirement considered after this stage needs to go through a Change Management (CM) process in which it gets formulated to the needed level of detail, a cost, effort and impact analysis is performed, and the budget for it is approved or the change gets rejected. Ideally, small changes can be covered by a buffer budget planned upfront; however, in the end each change comes with a cost and project delays.

Some changes can come late in the project and can have an important impact on the whole architecture when important aspects were missed upfront. Moreover, when the number of changes goes beyond a certain limit, it can lead to what is known as scope creep, with important consequences for the project’s costs, timeline and quality. Therefore, to minimize the impact on the project, the number of changes needs to be kept to a minimum, typically considering only the critical changes, while the others can still be implemented after the project’s end.

The agile manifesto’s principles impose on the requirements an important constraint (changing requirements is a good practice even late in the process), an assumption (the best requirements emerge from self-organizing teams), and probably one implication (the requirements need to be defined together with the implementer).

The way changing requirements are handled seems to provide more flexibility, though it’s actually a constraint imposed on the CM process, which interfaces with the RM processes. Without a proper CM process in place, any requirement might end up being implemented, independently of whether it is feasible or not. This can easily make the project’s costs explode, sometimes unnecessarily, while accommodating extreme behavior like changing the same functionality frequently, handling exceptions extensively, etc.

It’s usually helpful to define the requirements together with the implementer, as this can bring more quality into the process, even if more time needs to be invested. However, starting from a solid set of requirements is a critical factor for the project’s success. The manifesto makes no direct statement about this. It just states that good requirements emerge from self-organizing teams, which is not necessarily the case.

The users who in theory can define the requirements best are the ones who have the deepest knowledge about an organization’s processes and IT architecture, typically the key users and/or IT experts. Self-organization revolves around how a team organizes itself and handles the various activities, though there’s no guarantee that it will address the important aspects, no matter how motivated the team is, how constant the pace, how excellently the technical details were handled or how well the final product works.

💼Project Management: Methodologies (Part I: Agile Manifesto Reloaded I - An Introduction)

There are so many books written on agile methodologies, each attempting to depict the realities of software development projects. There are many truths considered in them, though they seem to blend into a complex texture in which the writer usually takes the position of a preacher who contrasts the sins of the traditional technologies with the agile principles. In extremis, everything done in the past seems to be wrong, while the agile methods seem to be a panacea, which is seldom the case.

It’s already 20 years since the agile manifesto was published, and the methodologies adhering to the respective principles don’t seem to provide the expected success, suffering from the same chronic symptoms as their predecessors: they are poorly understood and implemented, tend to function after the hammer’s principle, and software development projects still deliver poor results. Moreover, there are more and more professionals who raise their voices against agile practices.

Frankly, the principles behind the agile manifesto make sense. A project should by definition satisfy stakeholders’ requirements, ideally through regular deliveries that incorporate the needed functionality while seeking early feedback from customers and involving the customer throughout the project’s duration, working together to deliver a feasible product. Moreover, self-organizing teams, face-to-face meetings, a constant pace and technical excellence should allow minimizing the waste and maximizing the efficiency in the project. Further aspects like simplicity, good design and architecture should establish a basis for success.

Re-reading the agile manifesto, even if each read pulls from experience more and more pros and cons, the manifesto continues to look like a Christmas wish list. Even if the represented ideas make sense and satisfy a specific need, they are difficult to achieve in a project’s context and setup. Each wish introduces a constraint that brings with it its own limitations. Unfortunately, each policy introduced by a methodology follows the same pattern, no matter the methodology considered. Moreover, the wishes cover only a small subset of a project’s texture, are general and leave a lot of room for interpretation and implementation, though the same can be said about any principles that don’t provide a coherent worldview or a conceptual model.

The software development industry needs a coherent worldview that reflects its assumptions, models, characteristics, laws and challenges. Software Engineering (SE) attempts to provide such a worldview, though unfortunately it is too complex for many, and there seems to be a big divide between it and the worldviews introduced by the various Project Management (PM) methodologies. Studying one or two PM methodologies, learning a few programming languages and even hands-on experience in a few projects won’t fill the gaps in knowledge associated with the SE worldview.

Organizations don’t seem to see the need for professionals to have a formal education in SE. On the other hand, employees are expected to have by default some of the required skillset, which is not the case. Besides understanding and implementing a technology, there is a set of knowledge areas in which IT professionals must have at least high-level knowledge if they are expected to think critically about the respective areas. Unfortunately, the lack of such knowledge sometimes leads to situations that can negatively impact projects.

Almost each important word from the agile manifesto pulls with it a set of concepts from an SE worldview: customer satisfaction, software delivery, working software, requirements management, change management, cooperation, teamwork, trust, motivation, communication, metrics, stakeholders’ management, good design, good architecture, lessons learned, performance management, etc. The manifesto needs to be regarded through SE eyeglasses if one expects value from it.

04 March 2021

💼Project Management: Project Execution (Part IV: Projects' Dynamics II - Motion)

Motion is the action or process of moving or being moved between an initial and a final or intermediate point. From the tiniest endeavors to the movement of the planets and beyond, everything is governed by motion. If the laws of nature seem to reveal an inner structural perfection, the activities people perform are quite often far from perfect, which is acceptable if we consider that (almost) everything is a learning process. What is probably less acceptable is the volume of inefficient motion that we can sometimes easily categorize as waste.

The waste associated with motion can take many forms: sorting through a pile of tools to find the right one, searching for information, moving back and forth to reach a destination or achieve a goal, etc. Suboptimal motion can have important effects for an organization resulting in reduced productivity, respectively higher costs.

While for repetitive activities that involve a certain degree of similarity one can typically find a way to optimize the motion, the higher the uncertainty of the steps involved, the more difficult it becomes to optimize it. It’s the case of discovery endeavors in which the path between start and destination can’t be traced beforehand, or in which the destination or the path in between can’t be depicted to the needed level of detail. A strategy’s implementation, ERP implementations and other complex projects, especially the ones dealing with new technologies and/or incomplete knowledge, tend to be exploratory in nature and thus fall under this latter type of motion.

In other words, one must know at minimum the starting point, the destination, how to reach it and what it takes to reach it (resources, knowledge, skillset). When one has all this information, one can go on and estimate how long it will take to reach the destination, though the estimate reflects the information available as well as the estimator’s skills in translating the information into a realistic roadmap. Each new piece of information has the potential to considerably impact the whole process, in extremis to the degree that one must start the journey anew. The complexity of such projects and the volume of uncertainty can make estimation difficult if not impossible, no matter how good the estimators’ skills are. At best, an estimator can come up with a best- and a worst-case estimate, both however dependent on the assumptions made.

Moreover, complex projects are sensitive to the initial conditions or auspices under which they start. This sensitivity can turn a project in a totally different direction or pace, that can be reinforced positively or negatively as the project progresses. It’s a continuous interplay between internal and external factors and components that can create synergies or have adverse effects with the potential of reaching tipping points.

Related to the initial conditions, as practice sometimes shows, for entities found in continuous movement (like organizations) it’s also important to know where one is coming from (and at what speed), as the previous impulse (driving force) can be further used or steered as needed. Metaphorically, a project will need a certain time to find the right pace if it lacks the proper impulse.

Unless the team is trained to play and plays like an orchestra, the impact of deviations from expectations can hardly be quantified. To minimize the waste, a project’s journey should ideally deviate as little as possible from the optimal path, which can be challenging to achieve as a project’s mass can pull the project in one direction or the other. The more the project advances, the bigger the mass, a fact which can make a project unstoppable. When such high-mass projects are stopped, their impulse can continue to haunt the organization for years afterwards.

💼Project Management: Project Execution (Part III: Projects' Dynamics - An Introduction)

Despite the considerable collection of books on Project Management (PM) and related methodologies, and the fact that projects are inherent endeavors in professional as well as personal life (setups that would in theory give people the environment and exposure to different project types), people’s understanding of what it takes to plan and execute a project sometimes seems to be narrow and questionable. Moreover, their understanding diverges considerably from common sense. It’s also true that knowledge and common sense are relative when considering any human endeavor in which there are multiple roads to the same destination, or when learning requires time, effort, skills, and implies certain prerequisites; however, the lack of such knowledge can hurt when the endeavor’s success is a must and a team effort.

Even if the lack of understanding about PM can be considered as minor when compared with other challenges/problems faced by a project, when one’s running fast to finish a race, even a small pebble in one’s running shoes can hurt a lot, especially when one doesn’t have the luxury to stop and remove the stone, as it would make sense to do.

It lies in human nature to resist change, to seek information that only confirms one’s own opinions, to follow the same approach in handling challenges, even if the attempts are far from optimal, even if people who walked the same path tell you that there’s a better way and even sketch the path and provide information about what it takes to get there. As it seems, there’s a predisposition to learn the hard way, if there’s significant learning involved at all. Unfortunately, such situations occur in projects and the solutions often go beyond the boundaries of PM, where social and communication skills must be brought into play.

On the other hand, there’s still hope that change can be managed optimally once the facts are explained to a level that facilitates understanding. However, such an attempt can prove to be quite a challenge, given the various setups in which PM takes place. The intersection between technologies and organizational setups leads to complex scenarios that make such work more difficult, even if projects’ challenges are of an organizational rather than technological nature.

When the knowledge we have about the world doesn’t fit our expectation, a simple heuristic is to return to the basics. A solid edifice can be built only on a solid foundation and the best foundation in coping with reality is to establish common ground with other people. One can achieve this by identifying their suppositions and expectations, by closing the gap in perception and understanding, by establishing a basis for communication, in which feedback is a must if one wants to make significant progress.

Besides being explorative and time-consuming, establishing common ground can be challenging when addressing an imaginary audience, which is quite often the situation. Practice shows, however, that progress can be made by starting with a set of well-formulated definitions, simple models, principles, and heuristics that have the potential of helping in sense-making.

The goal is thus to first identify the definitions that reflect the basic concepts that need to be considered. Once the concepts are defined, they can be related to each other with the help of a few models. Even if fictitious, as simplifications of reality, the models should allow playing with the concepts, facilitating their understanding. Principles (sets of rules for reasoning) can be used together with heuristics (rule-of-thumb methods or techniques) for explaining the ‘known’ and approaching the ‘unknown’. Even if not perfect, these tools can help in building theories or explanatory constructs.

27 February 2021

🐍Python: PySpark and GraphFrames (Test Drive)

Besides the challenges met while configuring the PySpark & GraphFrames environment, running my first example in the Spyder IDE also proved to be a bit more challenging than expected. Starting from an example provided by the Databricks documentation on GraphFrames, I had to add 3 more lines to establish the connection to the Spark cluster, respectively to stop the context (only one SparkContext can be active per Java VM).

The following code displays the vertices and edges, respectively the in and out degrees for a basic graph.

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

Output:

id	name	age
a	Alice	34
b	Bob	36
c	Charlie	30
d	David	29
e	Esther	32
f	Fanny	36
g	Gabby	60

src	dst	relationship
a	b	friend
b	c	follow
c	b	follow
f	c	follow
e	f	follow
e	d	friend
d	a	friend
a	e	friend

id	inDegree
f	1
e	1
d	1
c	2
b	2
a	1

id	outDegree
f	1
e	2
d	1
c	1
b	1
a	2
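
As a quick sanity check on the degree tables above, the same counts can be reproduced without Spark. The sketch below is only an illustration in plain Python (it does not use the GraphFrames API): it counts how often each id appears as destination, respectively as source, in the edge list from the example. The variable names are my own.

```python
from collections import Counter

# Same edges as in the GraphFrames example: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
]

# In-degree: number of edges arriving at a vertex
in_degrees = Counter(dst for _, dst, _ in edges)
# Out-degree: number of edges leaving a vertex
out_degrees = Counter(src for src, _, _ in edges)

for vertex in sorted(in_degrees):
    print(vertex, in_degrees[vertex], out_degrees[vertex])
```

Vertex "g" has no edges at all, which is why it appears in neither of the two degree tables.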

Notes:
Without the last line, running the code a second time halts with the following error, because SparkContext('local') calls the constructor again while the previous context is still active:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at D:\Work\Python\untitled0.py:4

Loading the same data from a CSV file involves a small overhead, as the schema needs to be defined explicitly. The following code should provide the same output as above:

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import * 

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

nodes = [
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
]
edges = [
    StructField("src", StringType(), True),
    StructField("dst", StringType(), True),
    StructField("relationship", StringType(), True)
    ]

v = spark.read.csv(r"D:\data\nodes.csv", header=True, schema=StructType(nodes))

e = spark.read.csv(r"D:\data\edges.csv", header=True, schema=StructType(edges))

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

The 'nodes.csv' file has the following content:
id,name,age
"a","Alice",34
"b","Bob",36
"c","Charlie",30
"d","David",29
"e","Esther",32
"f","Fanny",36
"g","Gabby",60

The 'edges.csv' file has the following content:
src,dst,relationship
"a","b","friend"
"b","c","follow"
"c","b","follow"
"f","c","follow"
"e","f","follow"
"e","d","friend"
"d","a","friend"
"a","e","friend"

Note:
There should be no spaces between values (e.g. "a", "b"), otherwise the results might deviate from expectations.
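
The effect of such spaces can be illustrated with Python's built-in csv module. This is only an analogy, as Spark's CSV reader is a different implementation and may behave differently; the sample line is made up for the illustration. By default, a quote that follows a space is not recognized as the start of a quoted value, so the space and the quotes end up in the data.

```python
import csv

# Hypothetical line with a space after each delimiter
line = '"a", "b", "friend"'

# Default settings: the space and the quotes become part of the value
default_row = next(csv.reader([line]))

# skipinitialspace=True ignores the whitespace that follows a delimiter
trimmed_row = next(csv.reader([line], skipinitialspace=True))

print(default_row)
print(trimmed_row)
```

With values like these, the "src" and "dst" ids would no longer match the vertex ids, which would explain deviating results.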

Now, one can go and test further operations on the graph thus created:

#filtering edges 
gl = g.edges.filter("relationship = 'follow'").sort("src")
gl.show()
print("number edges: ", gl.count())

#filtering vertices
#gl = g.vertices.filter("age >= 30 and age<40").sort("id")
#gl.show()
#print("number vertices: ", gl.count())

# relationships involving edges and vertices
#motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
#motifs.show()
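
To know what to expect from the three operations above, the results can be worked out in plain Python on the same data. This is only a sketch without Spark and without the GraphFrames API: the list comprehensions mimic the two filters, and the "mutual" pairs mimic what the motif "(a)-[e]->(b); (b)-[e2]->(a)" looks for, namely pairs of vertices connected in both directions. The variable names are my own.

```python
vertices = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30),
    ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36),
    ("g", "Gabby", 60),
]
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"),
]

# Equivalent of: relationship = 'follow', sorted by src
follow_edges = sorted(e for e in edges if e[2] == "follow")

# Equivalent of: age >= 30 and age < 40, sorted by id
vertices_30s = sorted(v for v in vertices if 30 <= v[2] < 40)

# Pairs (x, y) for which both x -> y and y -> x exist
edge_pairs = {(src, dst) for src, dst, _ in edges}
mutual = sorted(pair for pair in edge_pairs if (pair[1], pair[0]) in edge_pairs)

print("number edges: ", len(follow_edges))
print("number vertices: ", len(vertices_30s))
print("mutual pairs: ", mutual)
```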

Happy coding!

🐍Python: Installing PySpark and GraphFrames on a Windows 10 Machine

One of the To-Dos for this week was to set up the environment so I can start learning PySpark and GraphFrames based on the examples from Needham & Hodler’s free book on Graph Algorithms. Therefore, I downloaded and installed the Java SDK 8 from the Oracle website (requires an Oracle account) and the latest stable version of Python (Python 3.9.2), downloaded and unzipped the Apache Spark package locally on a Windows 10 machine, respectively the Winutils tool as described here.

The setup requires several environment variables that need to be created, respectively the Path variable needs to be extended with further values (delimited by ";"). In the end I added the following values:

Variable	Value
HADOOP_HOME	D:\Programs\spark-3.0.2-bin-hadoop2.7
SPARK_HOME	D:\Programs\spark-3.0.2-bin-hadoop2.7
JAVA_HOME	D:\Programs\Java\jdk1.8.0_281
PYTHONPATH	D:\Programs\Python\Python39\
PYTHONPATH	%SPARK_HOME%\python
PYTHONPATH	%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip
PATH	%HADOOP_HOME%\bin
PATH	%SPARK_HOME%\bin
PATH	%PYTHONPATH%
PATH	%PYTHONPATH%\DLLs
PATH	%PYTHONPATH%\Lib
PATH	%JAVA_HOME%\bin

I then tried running the first example from Chapter 3 using the Spyder IDE, though the environment didn’t seem to recognize the 'graphframes' library. As long as it's not already available, the graphframes .jar file (e.g. graphframes-0.8.1-spark3.0-s_2.12.jar) corresponding to the installed Spark version must be downloaded and copied into the Spark folder where the other .jar files are available (e.g. .\spark-3.0.2-bin-hadoop2.7\jars). With this change I could finally run my example, though it took me several tries to get this right.
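
Given the number of variables and files involved, a small script can help to spot the most common omissions. The following is a minimal sketch under a few assumptions of mine: the function name check_spark_setup is hypothetical, winutils.exe is expected under %HADOOP_HOME%\bin, and the graphframes .jar is expected in the 'jars' subfolder of %SPARK_HOME%. It only checks that the variables are set and the files exist, not that the versions match.

```python
import os
from pathlib import Path

def check_spark_setup(env=os.environ):
    """Return a list of problems found in the given environment mapping."""
    problems = []

    # The three home variables must be set and point to existing folders
    for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing folder: {value}")

    # Assumption: the graphframes jar sits next to the other Spark jars
    spark_home = env.get("SPARK_HOME")
    if spark_home and Path(spark_home).is_dir():
        jars = Path(spark_home) / "jars"
        if not list(jars.glob("graphframes-*.jar")):
            problems.append(f"no graphframes-*.jar found in {jars}")

    # Assumption: winutils.exe was copied to the bin subfolder
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and Path(hadoop_home).is_dir():
        winutils = Path(hadoop_home) / "bin" / "winutils.exe"
        if not winutils.is_file():
            problems.append(f"winutils.exe not found at {winutils}")

    return problems

# Example usage: list the problems found in the current environment
for problem in check_spark_setup():
    print("WARNING:", problem)
```

An empty result doesn't guarantee a working setup, though it rules out the omissions described above.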

During Python's installation I had to change the value for the LongPathsEnabled setting from 0 to 1 via regedit to allow path lengths longer than 260 characters, as mentioned in the documentation. The setting is available via the following path:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem

In the process I also tried installing ‘pyspark’ and ‘graphframes’ via the Anaconda tool with the following commands:

pip3 install --user pyspark
pip3 install --user graphframes

From Anaconda’s point of view the installation was correct, a fact which pointed me to the missing 'graphframes' .jar file.

It took me 4-5 hours of troubleshooting and searching until I got my environment set up. I still have two more warnings to solve, though I will look into this later:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

Notes:
Spaces in folder names might create issues. Therefore, I used 'Programs' instead of 'Program Files' as main folder.
There seems to be some confusion about which environment variables are needed and how they need to be configured.
Unfortunately, the troubleshooting involved in setting up an environment and getting a simple example to work seems to be a recurring story over the years. The situation was the same with the programming languages from 15-20 years ago.

22 February 2021

𖣯Strategic Management: The Impact of New Technologies (Part I: A Nail Keeps the Shoe)

Probably one of the most misunderstood aspects for businesses is the implications the adoption of a new technology has in terms of effort, resources, infrastructure and changes, considered before, during and after the implementation. Unfortunately, getting a new BI tool or ERP system is not like buying a new car, even if customers’ desires might revolve around such expectations. After all, the customer has been using a BI tool or ERP system for ages, so the employees should be able to do the same job as before, right?

In theory, adopting a new system is supposed to bring organizations a competitive advantage or other advantages: allow them to reduce costs, improve their agility and decision-making, etc. However, the advantages brought by new technologies remain only potential unless their capabilities are harnessed adequately. Keeping the car metaphor, besides looking good in the car, having a better mileage or having x years of service, buying a technologically advanced car will most likely bring little benefit for the customer unless he needs, is able to use, and uses the additional features.

Both types of systems mentioned above can be quite expensive when considering the benefits associated with them. Therefore, looking at the features and the further requirements is critical for better understanding the fit. In the end, one doesn’t need to buy a luxury or sports car when one just needs to move from point A to B over short distances. On some occasions a bike or a rental car might do as well. Moreover, besides the acquisition costs, the additional features might involve considerable investments once the warranty is void and something needs to be fixed. In extremis, after a few years it might be even cheaper to 'replace' the whole car. Unfortunately, one can’t yet change systems as if they were cars.

Implementing a new BI tool can take a few weeks if it doesn’t involve architecture changes within the BI infrastructure. Otherwise, replacing a BI infrastructure can take from months to one year until a stable environment is in place. Similarly, an ERP solution can take from six months to years to implement, and typically this also has an impact on the BI infrastructure. Moreover, the implementation is only the tip of the iceberg, as further optimizations and changes are needed. It can take even more time until the benefits of the investment become visible.

A new technology can easily have a domino effect within the organization. This effect is best reflected in sayings of the type 'the wise tell us that a nail keeps a shoe, a shoe a horse, a horse a man, a man a castle, that can fight', which reflect the impact tools and technologies have within organizations when regarded within the broader context. Buying a big car might involve extending the garage or eventually buying a new house with a bigger garage, or replacing other devices just for the sake of using them with the new car. Even if not always perceptible, such dependencies are there, and even if the further investments might be acceptable and make sense, the implications can be a bigger shoe than one can wear. Then, the reversed saying can hold: 'for want of a nail, the shoe was lost; for want of a shoe the horse was lost; and for want of a horse the rider was lost'.

For IT technologies the impact is multidimensional, as a change of technology has an impact on the IT infrastructure, on the associated processes, on the resources required and their skillset, and on the various types of flows (data, information, knowledge, materials, money).
