27 February 2021

Python: Test Drive PySpark and GraphFrames

Besides the challenges met during configuring the PySpark & GraphFrames environment, also running my first example in Spyder IDE proved to be a bit more challenging than expected. Starting from an example provided by the DataBricks documentation on GraphFrames, I had to add 3 more lines to establish the connection of the Spark cluster, respectively to deactivate the context (only one SparkContext can be active per Java VM).

The following code displays the vertices and edges, respectively the in and out degrees for a basic graph. 

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

Output:
id nameage
a Alice34
b Bob36
cCharlie30
d David29
e Esther32
f Fanny36
g Gabby60
srcdstrelationship
a b friend
b c follow
c b follow
f c follow
e f follow
e d friend
d a friend
a e friend
idinDegree
f1
e1
d1
c2
b2
a1
idoutDegree
f1
e2
d1
c1
b1
a2

Notes:
Without the last line, running a second time the code will halt with the following error: 
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at D:\Work\Python\untitled0.py:4

Loading the same data from a csv file involves a small overhead as the schema needs to be defined explicitly. The same output from above should be provided by the following code:

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import * 

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

nodes = [
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
]
edges = [
    StructField("src", StringType(), True),
    StructField("dst", StringType(), True),
    StructField("relationship", StringType(), True)
    ]

v = spark.read.csv(r"D:\data\nodes.csv", header=True, schema=StructType(nodes))

e = spark.read.csv(r"D:\data\edges.csv", header=True, schema=StructType(edges))

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

The 'nodes.csv' file has the following content:
id,name,age
"a","Alice",34
"b","Bob",36
"c","Charlie",30
"d","David",29
"e","Esther",32
"f","Fanny",36
"g","Gabby",60

The 'edges.csv' file has the following content:
src,dst,relationship
"a","b","friend"
"b","c","follow"
"c","b","follow"
"f","c","follow"
"e","f","follow"
"e","d","friend"
"d","a","friend"
"a","e","friend"

Note:
There should be no spaces between values (e.g. "a", "b"), otherwise the results might deviate from expectations. 

Now, one can go and test further operations on the graph thus created:

#filtering edges 
gl = g.edges.filter("relationship = 'follow'").sort("src")
gl.show()
print("number edges: ", gl.count())

#filtering vertices
#gl = g.vertices.filter("age >= 30 and age<40").sort("id")
#gl.show()
#print("number vertices: ", gl.count())

# relationships involving edges and vertices
#motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
#motifs.show()

Happy coding!

Python: Installing PySpark and GraphFrames on a Windows 10 Machine

One of the To-Dos for this week was to set up the environment so I can start learning PySpark and GraphFrames based on the examples from Needham & Hodler’s free book on Graph Algorithms. Therefore, I downloaded and installed the Java SDK 8 from the Oracle website (requires an Oracle account) and the latest stable version of Python (Python 3.9.2), downloaded and unzipped the Apache Spark package locally on a Windows 10 machine, respectively the Winutils tool as described here.

The setup requires several environment variables that need to be created, respectively the Path variable needs to be extended with further values (delimited by ";"). In the end I added the following values:

VariableValue
HADOOP_HOMED:\Programs\spark-3.0.2-bin-hadoop2.7
SPARK_HOMED:\Programs\spark-3.0.2-bin-hadoop2.7
JAVA_HOMED:\Programs\Java\jdk1.8.0_281
PYTHONPATHD:\Programs\Python\Python39\
PYTHONPATH;%SPARK_HOME%\python
PYTHONPATH%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip
PATH%HADOOP_HOME%\bin
PATH%SPARK_HOME%\bin
PATH%PYTHONPATH%
PATH%PYTHONPATH%\DLLs
PATH%PYTHONPATH%\Lib
PATH%JAVA_HOME%\bin

I tried then running the first example from Chapter 3 using the Spyder IDE, though the environment didn’t seem to recognize the 'graphframes' library. As long it's not already available, the graphframes .jar file (e.g. graphframes-0.8.1-spark3.0-s_2.12.jar) corresponding to the installed Spark version must be downloaded and copied in the Spark folder where the other .jar files are available (e.g. .\spark-3.0.2-bin-hadoop2.7\jars). With this change I could finally run my example, though it took me several tries to get this right. 

During Python's installation I had to change the value for the LongPathsEnabled setting from 0 to 1 via regedit to allow path lengths longer than 260 characters, as mentioned in the documentation. The setting is available via the following path:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem

In the process I also tried installing ‘pyspark’ and ‘graphframes’ via the Anaconda tool with the following commands:

pip3 install --user pyspark
pip3 install --user graphframes

From Anaconda’s point of view the installation was correct, fact which pointed me to the missing 'graphframe' library.

It took me 4-5 hours of troubleshooting and searching until I got my environment setup. I still have two more warnings to solve, though I will look into this later:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

Notes:
Spaces in the folder's names might creates issues. Therefore, I used 'Programs' instead of 'Program Files' as main folder. 
There seem to be some confusion what environment variables are needed and how they need to be configured.
Unfortunately, the troubleshooting involved in setting up an environment and getting a simple example to work seems to be a recurring story over the years. Same situation was with the programming languages from 15-20 years ago. 

22 February 2021

Strategic Management: The Impact of New Technologies I (A nail keeps the shoe)

Strategic Management

Probably one of the most misunderstood aspects for businesses is the implications the adoption of a new technology have in terms of effort, resources, infrastructure and changes, these considered before, during and post-implementation. Unfortunately, getting a new BI tool or ERP system is not like buying a new car, even if customers’ desires might revolve around such expectations. After all, the customer has been using a BI tool or ERP system for ages, the employees should be able to do the same job as before, right?

In theory adopting a new system is supposed to bring organizations a competitive advantage or other advantages - allow them reduce costs, improve their agility and decision-making, etc. However, the advantages brought by new technologies remain only as potentials unless their capabilities aren’t harnessed adequately. Keeping the car metaphor, besides looking good in the car, having a better mileage or having x years of service, buying a highly technologically-advanced car more likely will bring little benefit for the customer unless he needs, is able to use, and uses the additional features.

Both types of systems mentioned above can be quite expensive when considering the benefits associated with them. Therefore, looking at the features and the further requirements is critical for better understanding the fit. In the end one doesn’t need to buy a luxurious or sport car when one just needs to move from point A to B on small distances. In some occasions a bike or a rental car might do as well. Moreover, besides the acquisition costs, the additional features might involve considerable investments as long the warranty is broken and something needs to be fixed. In extremis, after a few years it might be even cheaper to 'replace' the whole car. Unfortunately, one can’t change systems yet, as if they were cars.

Implementing a new BI tool can take a few weeks if it doesn’t involve architecture changes within the BI infrastructure. Otherwise replacing a BI infrastructure can take from months to one year until having a stable environment. Similarly, an ERP solution can take from six months to years to implement and typically this has impact also on the BI infrastructure. Moreover, the implementation is only the top of the iceberg as further optimizations and changes are needed. It can take even more time until seeing the benefits for the investment.

A new technology can easily have the impact of dominoes within the organization. This effect is best reflected in sayings of the type: 'the wise tell us that a nail keeps a shoe, a shoe a horse, a horse a man, a man a castle, that can fight' and which reflect the impact tools technologies have within organizations when regarded within the broader context. Buying a big car, might involve extending the garage or eventually buying a new house with a bigger garage, or of replacing other devices just for the sake of using them with the new car. Even if not always perceptible, such dependencies are there, and even if the further investments might be acceptable and make sense, the implications can be a bigger shoe that one can wear. Then, the reversed saying can hold: 'for want of a nail, the shoe was lost; for want of a shoe the horse was lost; and for want of a horse the rider was lost'.

For IT technologies the impact is multidimensional as the change of a technology has impact on the IT infrastructure, on the processes associated with them, on the resources required and their skillset, respectively on the various types of flows (data, information, knowledge, materials, money).

Previous Post <<||>> Next Post

17 February 2021

Python: Plotting Data with the Radar Chart

Today's task was to display a set of data using the radar chart available with the matplotlib.pyplot library. For this I considered the iris dataset available with the sklearn learning library. The dataset is stored as an array, therefore for further manipulation was converted into a data frame. As the radar chart allows comparing only a small set of numerical values, I considered displaying only the mean values for each type of iris (setosas versicolor, virginica). 

Unfortunately, the radar chart doesn't seem to complete the polygons based on the available dataset, therefore as workaround I had to duplicate the first column within the result data frame. (It seems that the Ploty library does a better job at displaying radar charts, see example).

Radar Chart

Here's the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets  
import pandas as pd

#preparing the data frame 
iris = datasets.load_iris()

ds = pd.DataFrame(data = iris.data, columns = iris.feature_names)
dr = ds.assign(target = iris.target) #iris type

group_by_iris = dr.groupby('target').mean()
group_by_iris[''] = group_by_iris[iris.feature_names[0]] #duplicating the first column

# creating the graph
fig = plt.subplots()

angle = np.linspace(start=0, stop=2 * np.pi, num=len(group_by_iris.columns))

plt.figure(figsize=(5, 5))
plt.subplot(polar=True)

values = group_by_iris[:1].values[0]
plt.plot(angle, values, label='Iris-setosa', color='r')
plt.fill(angle, values, 'r', alpha=0.2)

values = group_by_iris[1:2].values[0]
plt.plot(angle, values, label='Iris-versicolor', color='g')
plt.fill(angle, values, 'g', alpha=0.2)

values = group_by_iris[2:3].values[0]
plt.plot(angle, values, label='Iris-virginica', color='b')
plt.fill(angle, values, 'b', alpha=0.2)

#labels
plt.title('Iris comparison', size=15)
labels = plt.thetagrids(np.degrees(angle), labels=group_by_iris.columns.values)
plt.legend()

plt.show()

Happy coding!

16 February 2021

Python: Drawing Concentric Circles with matplotlib.pyplot

Today I tried using for the first time the matplotlib library for drawing a few concentric circles, though it proved to be a bit more challenging than expected, as the circles were distorted given the scale differences between x and y axis. Because of this the circles displayed via the Circle class (in blue) seem to be displayed as ellipses. To show the difference I used  trigonometric functions to draw the circles (in green) by applying a 5/7.5 multiplication factor for the x axis:

And here's the code:

import numpy as np
import math as m
import matplotlib.pyplot as plt

axis_dimensions = [-100,100, -100,100] #dimensions axis
dx=10       #distance between ticks on x axis 
dy=10       #distance between ticks on y axis
sfx = 5/7.5 #scale factor for x axis
r= 50       #radius

#drawing the grid
plt.axis(axis_dimensions)
plt.axis('on')
plt.grid(True, color='gray')
plt.xticks(np.arange(axis_dimensions[0], axis_dimensions[1], dx))
plt.yticks(np.arange(axis_dimensions[2], axis_dimensions[3], dy))

#adding labels
plt.title('Circles')
plt.xlabel('x axis')
plt.ylabel('y axis')

#drawing the geometric figures
for i in range(0,51,10):
    for angle in np.arange(m.radians(0),m.radians(360),m.radians(2)):
        #drawing circles via trigonometric functions
        x = (r+i)*m.cos(angle)*sfx
        y = (r+i)*m.sin(angle)
        plt.scatter(x,y,s=2,color ='g')
        
    #drawing with circles
    circle = plt.Circle((0,0),r+i,color='b', fill=False)
    plt.gca().add_patch(circle)

plt.show()

Happy coding!

04 February 2021

Data Migrations (DM): Conceptualization VII (Data Import Layer)

Data Migration

The data requirements for the Data Migration (DM) and Data Quality (DQ) are driven by the processes implemented in the target system(s). Therefore, a good knowledge of these requirements can decrease the effort needed for these two subprojects considerably. The needed knowledge basis starts with the entities and their attributes, the dependencies existing between them and the various rules that apply, and ends with the parametrization requirements, respectively the architecture(s) that can be used to import the data.

The DM process starts with defining the entities in scope and their attributes, respectively identifying the corresponding entities and attributes from the legacy systems. The attributes not having a correspondent in the legacy system need to be provided by the business and integrated in the DM logic. In addition, it’s needed to consider also the attributes needed by the business and not available in the target system, some of them more likely available in the legacy systems. For such attributes is needed either to misuse an attribute from the target or to extend the target system.

For each entity is created a data mapping that basically documents the data transformations needed for migrating the data. In the process is needed to consider also attributes’ data types, the (standard) formatting, their domain of definition, as well the various rules that apply. Their implementation belongs into the DM layer from which the data are exported in a standard format as needed by the target system.

Exporting the data from the DM layer directly into the target system’s tables has in theory the lowest overhead even if the rejected records are difficult to track, the rejections resulting only from records’ ‘validation against database’s schema. For this approach to work, one must have a good knowledge of the database schema and of the business rules implemented into the target system.

To solve the issue with errors’ logging, systems have a further layer on top of the database model, which also allow running data validation against target system’s business rules. Modern import frameworks allow loading the data via a set of standard files with a predefined structure. The data can be thus imported manually or via load jobs into the system a log with the issues being generated in the process. Some frameworks allow even the manual editing of failed records, respectively to import the data. Unfortunately, calling the layer from the DM layer is not possible from a database, though this would bring seldom a benefit. Some third-party tools attempt to improve the import functionality by calling the target system’s import layer.

The import files must be generated from the DM layer in the required structure with the appropriate formatting. The challenge however resides in identifying all the attributes that should make scope of the load. It’s an iterative process which sometimes is backed by try-and-error heuristics. Unless target system’s validation rules are known beforehand, the rules need to be discovered in this process, which can prove time-consuming. The discoveries need to be integrated also in the DM and from here results the big number of changes that need to be performed.

Given the dependencies existing between entities the files need to be generated and loaded in a predefined order. These dependencies are reflected also in the data processing and the validation rules considered in the DM layer.

A quality checkpoint can be implemented between the export from the DM layer and import to enforce the four-eyes principle. It’s normally the last opportunity for trapping the eventual issues. A further quality check is performed after import by validating on whether the data were imported as expected.

Previous Post <<||>> Next Post

Data Migrations (DM): Conceptualization VI (Data Migration Layer)

Data Migration

Besides migrating the master and transactional data from the legacy systems there are usually three additional important business requirements for a Data Migration (DM) – migrate the data within expected timeline, with minimal disruption for the business, respectively within expected quality levels. Hence, DM’ timeline must match and synchronize with main project’s timeline in terms of main milestones, though the DM needs to be executed typically within a small timeframe of a few days during the Go-Live. In what concerns the third requirement, even if the data have high quality as available in the source systems or provided by the business, there are aspects like integration and consistency that rely primarily on the DM logic.

To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From project’s beginning until UAT the DM team will integrate the various information iteratively, will need to test the changes several times, troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It’s not only important for the whole solution to be performant but each step must be designed so that besides fast execution, the changes and troubleshooting must involve a minimum of overhead.

For better understanding the importance, imagine a quest game in which the character has to go through a labyrinth with traps. If the player made a mistake he’ll need to restart from a certain distant point in time or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back try a new option and move forward. For some it may look like cheating though in this way one can finish the game relatively quickly. It would be great if executing a DM could allow the same flexibility.

Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn’t provide the flexibility of changing the code, moving one step behind, rerunning the step and performing troubleshooting, and this over and over again like in the quest game. To better illustrate the impact of such approach let’s consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards probably each change will take about a few minutes to execute the code. Otherwise rerunning a whole package can take 5-10 times or even more as this can depend on packages’ size and data volume. For 800 changes only an additional minute per change equates with 800 minutes (about 13 hours).

In exchange, storing the data for an entity in a database for the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.

To assure that the consistency of the data is kept one needs to build for each entity a set of validation queries that check for duplicates, for special cases, for data integrity, incorrect format, etc. The queries can be included in the sequence of logic used for the DM. Thus, one can react promptly to each unexpected value. When required, the validation rules can be built within reports and used in the data cleaning process by users, or even logged periodically per entity for tracking the progress.

Previous Post <<||>> Next Post

03 February 2021

Data Migrations (DM): Conceptualization V (Data Extraction Layer)

Data Migration

ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage legacy systems’ data access layers or own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box in the detriment of performance loss, and thus should be used with caution.

Besides that, ETL tools provide also rich visual functionality that allow users building complex pipelines with transformations that process the data as data go through the pipeline. Further features like data profiling or cleansing bring additional benefits.

As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope as filtering and other logic can be used in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches are usually recommended.

As alternative to an ETL tool is building own extraction layer based for example on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative especially when ETL capabilities aren’t needed. This depends also on the overall architectural approach. Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach especially when dealing with a considerable volume of data. Moreover, it would be needed to build features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).

When the volume of data exceeds the capabilities of ETL tools one can consider ELT tools which load first the data before applying any transformations on them. Such tools are designed for the processing of what is known as big data (data having high volume, high velocity, high variety and different veracity).

When considering the best data extraction approach, it’s important to know where the data will be stored for processing. Given that DMs are data processing intensive the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency, failover mechanism some databases offer the possibility to connect directly to other servers via server links functionality. Despite this latter feature an ETL tool can still have considerable advantages for data extraction.

On the other side the DM logic can be in theory built entirely in the ETL tool without storing the data within a database, though this adds a high overhead on the server resources on which the solution runs as all the data needed for processing need to be loaded in memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make challenging implementing any optimization techniques directly into the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and change as the requirements change.

To address the high resources’ consumption of the ETL tools one can store the intermediary results into database tables on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step in the detriment of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic the ETL tool can be used only for data extraction.

Previous Post <<||>> Next Post

Data Migrations (DM): Conceptualization IV (Data Access)

Data Migration

Once the data sources for a Data Migration (DM) were identified the first question is how the data can be accessed. The legacy systems relying on ODBC-based databases are in theory relatively easy to access as long they allow the direct access to their data, which would enable thus a pull strategy. Despite this, there are organizations that don’t allow the direct access to the data even for read-only operations, being preferred to push the data directly to the consumers (aka push strategy) or push the data to a given location from where the consumer can use the data as needed (aka hybrid strategy). 

The direct access to the data allows in theory the best flexibility as the solution can extract the data when needed and this especially important during the initial phases of the project when the data need to be pulled more frequently until the requirements and logic is stabilized. A push strategy tends to add additional overhead as usually somebody else oversees the data exports, respectively the data need to be prepared in the expected format. On the other side, it would make sense to make an exception for a DM and allow the direct access to the data. 

 Hybrid strategies tend to be more complex and require additional resources or overhead as the data are stored temporarily at a separate location. Unfortunately, in certain scenarios this is the only approach can be used. Are preferred data files that allow keeping the integrity of the data and facilitate data consumption. Therefore, tabular text files or JSON files are preferred in the detriment of XML or Excel files. It’s preferable to export one data structure individually then storing parent-child solutions even if the latter can prove to be useful in certain scenarios. When there’s no other solution one can use also the standard reports available in the legacy systems.

When storing data outside the legacy systems for further processing it’s recommended to follow organization’s best practices, respectively to address the data security and privacy requirements. ETL tools allows accessing data from password protected areas like FTP, OneDrive or SharePoint. The fewer security layers in between the lower is in theory the overhead. Therefore, given its stability and simplicity FTP might prove to be a better storage solution than OneDrive, SharePoint or other similar technologies.

Ideally the extraction/export mechanisms should use the database objects that encapsulate already the logic in the legacy systems otherwise the team will need to reengineer the logic – for master data this can prove to be easy, though the logic of transactional data like on-hand or open invoices can be relatively complex to reengineer. Otherwise, the logic can be implemented directly in the extraction/export mechanisms or sometimes is more advisable to create database objects (usually on a different schema) on the legacy systems and just call the respective objects. 

When connecting directly to the data source it’s advisable using the data provider which allows the best performance and flexibility, however several tests might be needed to determine the best fit. It would be useful to check the limitations of each provider and find a stable driver version.  OLEDB and ADO.Net data providers provide in general a good performance, though native drivers of the legacy systems can be a better option upon case. 

Some legacy systems allow the access to their data only via service-based technologies like OData. OData tends to have poor performance for large data exports than standard access methods and therefore not indicated in such scenarios. In such cases might be a good idea to export the data directly from the legacy system. 

Previous Post <<||>> Next Post

Data Migrations (DM): Conceptualization III (Heuristics)

Data Migration

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, this mainly because one’s inclined using the tools one knows best. Moreover, technologies’ overlapping makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps resumes in understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of a technology over the other. This is true especially about competing technologies, including the ones that replace other technologies.

There are simple heuristics, that can allow approaching such challenges. For example, heavy data processing belongs usually in databases, while import/export functionality belongs in an ETL tool.  Therefore, one can start looking at the problems from these two perspectives. Would the solution benefit from these two approaches or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would involve building the solution? 

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can be used in theory as basis for further research. Even if COTS tend to be more expensive and can have some important functionality gaps, as long they are extensible, they can prove a good starting point for developing a solution. 

Sometimes it helps researching on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding over the work ahead.

On the other side sometimes it’s advisable to use the tools one knows best, however this can lead also to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove to be far than optimal. 

When one doesn’t know whether a technology or mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that would allow addressing most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implement several data transformations based on the defined mappings, consider building a few integrity rules for validation, respectively attempt importing the data into the target system. Once this accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available. 

As soon the skeleton was built one can consider all the mappings, respectively add several entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore consider the problems as they arise. For example, if the volume of data seems to cause problems then attempt splitting the data during processing in batches or considering specific optimization techniques like indexing or scaling techniques like increasing computing resources. 

Previous Post <<||>> Next Post

Data Migrations (DM): Conceptualization II (Plan vs. Concept vs. Strategy)

Data Migration

A concept is a document that describes at high level the set of necessary steps and their implications to achieve a desired result, typically making the object of a project. A concept is usually needed to provide more technical and nontechnical information about the desired solution, the context in which a set of steps are conducted, respectively the changes considered, how the changes will be implemented and the further aspects that need to be considered. It can include a high-level plan and sometimes also information that typically belong in a Business Case – goals,objectives, required resources, estimated effort and costs, risks and opportunities.

A concept is used primarily as basis for sign-off as well for establishing common ground and understanding. When approved, it’s used for the actual implementation and solution’s validation. The concept should be updated as the project progresses, respectively as new information are discovered.

Creating a concept for a DM can be considered as best practice because it allows documenting the context, the technical and organizational requirements and dependencies existing between the DM and other projects, how they will be addressed. The concept can include also a high-level plan of the main activities (following to be detailed in a separate document).

Especially when the concept has an exploratory nature (due to incomplete knowledge or other considerations), it can be validated with the help of a proof-of-concept (PoC), the realization of a high-level-design prototype that focuses on the main characteristics of the solution and allows thus identifying the challenges. Once the PoC implemented, the feedback can be used to round out the concept.

Building a PoC for a DM should be considered as objective even when the project doesn’t seem to meet any major challenges. The PoC should resume in addressing the most important DM requirements, ideally by implementing the whole or most important aspects of functionality (e.g. data extraction, data transformations, integrity validation, respectively the import into the target system) for one or two data entities. Once the PoC built, the team can use it as basis for the evolutive development of the solution during the iterations considered.

A strategy is a set of coordinated and sustainable actions following a set of well-defined goals, actions devised into a plan and designed to create value and overcome further challenges. A strategy has the character of a concept though it has a broader scope being usually considered when multiple projects or initiatives compete for the same resources to provide a broader context and handle the challenges, risks and opportunities. Moreover, the strategy takes an inventory of the current issues and architecture – the 'AS-IS' perspective and sketches the to 'TO-BE' perspective by devising a roadmap that bridges the gap between the two.

In the case of a DM a strategy might be required when multiple DM projects need to be performed in parallel or sequentially, as it can help the organization to better manage the migrations.

A plan is a high-level document that describes the tasks, schedule and resources required to carry on an activity. Even if it typically refers to the work or product breakdown structure, it can cover other information usually available in a Business Case. A project plan is used to guide both project execution and project control, while in the context of Strategic Management the (strategic) plan provides a high-level roadmap on how the defined goals and objectives will be achieved during the period covered by the strategy.

For small DM projects a plan can be in theory enough. As both a strategy and a concept can include a high-level plan, the names are in praxis interchangeable.

Previous Post <<||>> Next Post

02 February 2021

Data Migrations (DM): Conceptualization I (Goals, Objectives & Requirements)

Data Migration

One of the nowadays’ challenges is finding the right mix of technologies that allows building a solution for a business need. There are so many choices and the responsible person is easily tempted to use one of the trending technologies just because he wants to learn something new or the technologies seem to fit into the bigger picture, which probably in many cases it would be acceptable. Unfortunately, there’s also the tendency of picking a technology without looking at what functionality it provides, respectively whether the functionality meets intended solutions’ requirements. Moreover, the requirements are sometimes barely defined at the appropriate level of detail, fact that makes from the implementation project a candidate for failure. Sometimes even the goals and objectives aren’t clearly stated, fact that can make a project’s success easily questionable from the beginning. 

A goal is a general statement that reflects the desired result toward which an organization’s effort needs to be directed. For example, a Data Migration (DM)’s primary goal can be formulated as 'to make available all the master and transactional data needed by the business from the legacy systems to the target system(s) within expected timeline and quality with a minimal disruption for the business'. 

An objective is a break down of the goal into several components that should foster a clear understanding on how the goal will be achieved. Ideally the objectives should be SMART (specific, measurable, attainable, relevant, time-bound), even if measurable objectives are sometimes hard to define properly. One can consider them as the tactics used in achieving the goal. For example, the above formulated goal can be broken down into the following objectives:

  • Build a DM concept/strategy
  • Build a flexible and performant infrastructure for DM that can be adapted to further requirements
  • Provide a basis for further DMs
  • Align DM and main project’s requirements and activities
  • Provide an interface and support for the Data Management areas
  • Foster trust, transparency and awareness 
  • Address internal/external compliance requirements
  • Document and communicate accountability for the various activities
  • Cleanse and enrich the data needed by the target system 
  • Archive the DM and project data 

One can attempt defining the objectives directly from the goal(s), though unless one is aware of all the implication a DM has, more likely one will be forced to define and evaluate the individual functional and nonfunctional requirements for the DM first, and attempt consolidating the requirements into a set of objectives. In the end it can be a combination of both, in which some objectives are first formulated, the requirements are defined and evaluated, respectively the objectives are refined to accommodate the requirements. 

ISO 9126, an international standard for the evaluation of software quality, defines about 45-50 attributes that can be used for addressing the requirements of software solutions, attributes that reflect functionality, reliability, usability, efficiency, and maintainability characteristics. One can start with such a list and identify how important are the respective attributes for the solution.  The next step would be to document the requirements into a consolidated list by providing a short argumentation for their use, respectively how they will be addressed as part of the solution. The process can prove to be time-consuming, however it is a useful exercise that usually needs to be done only once and be reviewed occasionally.

The list can be created independently of any other documentation or be included directly into a concept or strategy. The latter will assure in theory that the document provides a unitary view of the migration, considering that each new or obsolete requirement can impact the concept. 

Previous Post <<||>> Next Post

01 February 2021

Data Migrations (DM): Quality Acceptance Criteria V

Data Migration

Efficiency 

Efficiency is the degree to which a solution uses the hardware (storage, network) and other organizational resources to fulfill a given task. Data characterized by high volume, velocity, variety and veracity can be challenging to process, requiring upon case more processing power. Therefore, the DM solutions need to consider these aspects as well. However, efficiency refers on whether the available resources are used efficiently – the waste in terms of resource utilization is minimal. 

On the other side the waste of resources can be acceptable when there are other benefits or requirements that need to be considered, respectively when the ratio between resources utilization and effort to built more efficient processes is acceptable.

A DM solution involves iterative and exploratory processes in which knowledge and feedback is integrated in each iteration, therefore it might look like resources are not used efficiently. However, this is a way to handle complexity and uncertainty by breaking the effort in manageable chunks.

Learnability

Learnability is the degree to which a person can become familiar with a solution’s use, the data and the processes associated with it. A DM can be challenging for many technical and non-technical resources as it requires a certain level of skillset and understanding of the requirements, needs and deliverables. The complexity of the data and requirements can be overwhelming, however with appropriate communication and awareness established, the challenges can be overcome. 

Stability

Stability is the degree to which a solution is sensitive to environment changes (e.g. overuse of resource, hardware or software failures, updates), respectively on whether it performs with no performance defects or it does not crash under defined levels of stress. Stability can be monitored during the various phases and countermeasures need to be considered in case the solution is not stable enough (e.g. redesigning the solution, breaking the data in smaller chunks)

Suitability 

Suitability is the degree to which a solutions provides functions that meet the stated and implied needs. No matter how performant and technologically advanced a solution is, it brings less value as long it doesn’t perform what it was intended to do.

Transparency 

Transparency is the degree to which a solution’s stakeholders have access to the requirements, processes, data, documentation, or other information required by them. In a DM transparency is important especially important in respect to the data, logic and rules used in data processing, respectively the number of records processed. 

Trustability

Trustability is the degree to which a solution can be trusted to provide the expected results. Even if the technical team assures that the solution can deliver what was indented, the success of a DM is a matter of perception from stakeholders’ perspective. Providing transparency into the data, rules and processes can improve the level of trust however, special attention need to be given to the issues raised by stakeholders during and after Go-Live, as differences need to be mitigated. 

Understandability 

Understandability is the degree to which the requirements of a solution were understood by the resources involved in terms of what needs to be performed. For the average project resource it might be challenging to understand the implications of a DM, and this can apply to technical as well non-technical resources. Making people aware of the implications is probably one of the most important criteria for success, as the success of a migration is often a matter of perception. 

Usability 

Usability is the degree to which a solution can be used by the targeted users within the agreed context of usage. Ideally DM solutions need to be easy to use, though there are always trade-offs. In the end, a DM must fit the purpose it was built for. 

Previous Post <

Data Migrations (DM): Quality Acceptance Criteria IV

Data Migration

Reliability

Reliability is the degree to which a solution performs its intended functions under stated conditions without failure. In other words, a DM is reliable if it performs what was intended by design. The data should be migrated only when migration’s reliability was confirmed by the users as part of the sign-off process. The dry-runs as well the final iteration for the UAT have the objective of confirming solution’s reliability.

Reversibility

Reversibility is the degree to which a solution can return to a previous state without starting the process from the beginning. For example, it should be possible to reverse the changes made to a table by returning to the previous state. This can involve having a copy of the data stored respectively deleting and reloading the data when necessary. 

Considering that the sequence in which the various activities is fix, in theory it’s possible to address reversibility by design, e.g. by allowing to repeat individual steps or by creating rollback points. Rollback points are especially important when loading the data into the target system. 

Robustness

Robustness is the degree to which the solution can accommodate invalid input or environmental conditions that might affect data’s processing or other requirements (e,g. performance). If the logic can be stabilized over the various iterations, the variance in data quality can have an important impact on a solutions robustness. One can accommodate erroneous input by relaxing schema’s rules and adding further quality checks.

Security 

Security is the degree to which the DM solution protects the data so that only authorized people have access to the respective data to the defined level of authorization as data are moved through the solution. The security provided by a solution needs to be considered against the standards and further requirements defined within the organization. In case no such standards are available, one can in theory consider the industry best practices.

Scalability

Scalability is the degree to which the solution is able to respond to an increased workload.  Given that the number of data considered during the various iterations vary in volume, a solution’s scalability needs to be considered in respect to the volume of data to be migrated.  

Standardization

Standardization is the degree to which technical standards were implemented for a solution to guarantee certain level of performance or other aspects considered as import. There can be standards for data storage, processing, access, transportation, or other aspects associated with the migration processes. Moreover, especially when multiple DMs are in scope, organizations can define a set of standards and guidelines that should be further considered.  

Testability

Testability is the degree to which a solution can be tested in the respect to the set of functional and data-related requirements. Even if for the success of a migration are important the data in their final form, to achieve that is needed to validate the logic and test thoroughly the transformations performed on the data. As the data go trough the data pipelines, they need to be tested in the critical points – points where the data suffer important transformations. Moreover, one can consider record counters for the records processed in each such critical point, to assure that no record was lost in the process.  

Traceability

Traceability is the degree to which the changes performed on the data can be traced from the target to the source systems as record, respectively at entity level. In theory, it’s enough to document the changes at attribute level, though upon case it might needed to document also the changes performed on individual values. 

Mappings at attribute level allow tracing the data flow, while mappings at value level allow tracing the changes occurrent within values. 

Data Migrations (DM): Quality Acceptance Criteria III

Data Migration

Repeatability

Repeatability is the degree with which a DM can be repeated and obtain consistent results between repetitions. Even if a DM is supposed to be a one-time activity for a project, to guarantee a certain level of quality it’s important to consider several iterations in which the data requirements are refined and made sure that the data can be imported as needed into the target system(s). Considered as a process, as long the data and the rules haven’t changed, the results should be the same or have the expected level of deviation from expectations. 

This requirement is important especially for the data migrated during UAT and Go-Live, time during which the input data and rules need to remain frozen (even if small changes in the data can still occur). In fact, that’s the role of UAT – to assure that the data have the expected quality and when compared to the previous dry-run, that it attains the expected level of consistency. 

Reusability

Reusability is the degree to which the whole solution, parts of the logic or data can be reused for multiple purposes. Master data and the logic associated with them have high reusability potential as they tend to be referenced by multiple entities. 

Modularity

Modularity is the degree to which a solution is composed of discrete components such that a change to one component has minimal impact on other components. It applies to the solution itself but also to the degree to which the logic for the various entities is partitioned so to assure a minimal impact. 

Partitionability

Partitionability is the degree to which data or logic can be partitioned to address the various requirements. Despite the assurance that the data will be migrated only once, in practice this assumption can be easily invalidated. It’s enough to increase the system freeze by a few days and/or to have transaction data that suddenly requires master data not considered. Even if the deltas can be migrated in system manually, it’s probably recommended to migrate them using the same logic. Moreover, the performing of incremental loads can be a project requirement. 

Data might need to be partitioned into batches to improve processing’s performance. Partitioning the logic based on certain parameters (e.g. business unit, categorical values) allows more flexibility in handling other requirements (e.g. reversibility, performance, testability, reusability). 

Performance

Performance refers to the degree a piece of software can process data into an amount of time considered as acceptable for the business. It can vary with the architecture and methods used, respectively data volume, veracity, variance, variability, or quality.

Performance is a critical requirement for a DM, especially when considering the amount of time spent on executing the logic during development, tests and troubleshooting, as well for other activities. Performance is important during dry-runs but more important during Go-Live, as it equates with a period during which the system(s) are not available for the users. Upon case, a few hours of delays can have an important impact on the business. In extremis, the delays can sum up to days. 

Predictability

Predictability is the degree to which the results and behavior of a solution, respectively the processes involve are predictable based on the design, implementation or other factors considered (e.g. best practices, methodology used, experience, procedures and processes). Highly predictable solutions are desirable, though reaching the required level of performance and quality can be challenging. 

The results from the dry-runs can offer an indication on whether the data migrated during UAT and Go-Live provide a certain level of assurance that the DM will be a success. Otherwise, an additional dry-run should be planned during UAT, if the schedule allows it.

Previous Post <> Nest Post 

Related Posts Plugin for WordPress, Blogger...