22 February 2021

𖣯Strategic Management: The Impact of New Technologies (Part I: A Nail Keeps the Shoe)

Strategic Management

Probably one of the most misunderstood aspects for businesses is the implications the adoption of a new technology has in terms of effort, resources, infrastructure and changes, considered before, during and after implementation. Unfortunately, getting a new BI tool or ERP system is not like buying a new car, even if customers’ expectations might revolve around such an experience. After all, the customer has been using a BI tool or ERP system for ages, so the employees should be able to do the same job as before, right?

In theory adopting a new system is supposed to bring organizations a competitive advantage or other advantages - allow them to reduce costs, improve their agility and decision-making, etc. However, the advantages brought by new technologies remain only potential unless their capabilities are harnessed adequately. Keeping the car metaphor, besides looking good in the car, getting better mileage or having x years of service, buying a technologically advanced car will more likely bring little benefit for the customer unless he needs, is able to use, and actually uses the additional features.

Both types of systems mentioned above can be quite expensive when considering the benefits associated with them. Therefore, looking at the features and the further requirements is critical for better understanding the fit. In the end one doesn’t need to buy a luxury or sports car when one just needs to move from point A to B over small distances. On some occasions a bike or a rental car might do as well. Moreover, besides the acquisition costs, the additional features might involve considerable investments once the warranty has expired and something needs to be fixed. In extremis, after a few years it might even be cheaper to 'replace' the whole car. Unfortunately, one can’t change systems yet as if they were cars.

Implementing a new BI tool can take a few weeks if it doesn’t involve architecture changes within the BI infrastructure. Otherwise replacing a BI infrastructure can take from months to a year until a stable environment is reached. Similarly, an ERP solution can take from six months to years to implement, and typically this also has an impact on the BI infrastructure. Moreover, the implementation is only the tip of the iceberg, as further optimizations and changes are needed. It can take even more time until the benefits of the investment become visible.

A new technology can easily have the impact of dominoes within the organization. This effect is best reflected in sayings of the type: 'the wise tell us that a nail keeps a shoe, a shoe a horse, a horse a man, a man a castle, that can fight', which reflect the impact technologies have within organizations when regarded within the broader context. Buying a big car might involve extending the garage or eventually buying a new house with a bigger garage, or replacing other devices just to be able to use them with the new car. Even if not always perceptible, such dependencies are there, and even if the further investments might be acceptable and make sense, the implications can be a bigger shoe than one can wear. Then the reversed saying can hold: 'for want of a nail, the shoe was lost; for want of a shoe the horse was lost; and for want of a horse the rider was lost'.

For IT technologies the impact is multidimensional, as the change of a technology has an impact on the IT infrastructure, on the processes associated with it, on the resources required and their skillset, respectively on the various types of flows (data, information, knowledge, materials, money).

Previous Post <<||>> Next Post

17 February 2021

📊🐍Python: Plotting Data with the Radar Chart

Today's task was to display a set of data using the radar chart available with the matplotlib.pyplot library. For this I considered the iris dataset available with the sklearn library. The dataset is stored as an array, therefore for further manipulation it was converted into a data frame. As the radar chart allows comparing only a small set of numerical values, I considered displaying only the mean values for each type of iris (setosa, versicolor, virginica).

Unfortunately, the radar chart doesn't seem to close the polygons based on the available dataset, therefore as a workaround I had to duplicate the first column within the resulting data frame. (It seems that the Plotly library does a better job at displaying radar charts - see also the short sketch after the code below.)

Radar Chart

Here's the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets  
import pandas as pd

#preparing the data frame 
iris = datasets.load_iris()

ds = pd.DataFrame(data = iris.data, columns = iris.feature_names)
dr = ds.assign(target = iris.target) #iris type

group_by_iris = dr.groupby('target').mean()
group_by_iris[''] = group_by_iris[iris.feature_names[0]] #duplicating the first column to close the polygon

# creating the graph
angle = np.linspace(start=0, stop=2 * np.pi, num=len(group_by_iris.columns))

plt.figure(figsize=(5, 5))
plt.subplot(polar=True)

values = group_by_iris[:1].values[0]
plt.plot(angle, values, label='Iris-setosa', color='r')
plt.fill(angle, values, 'r', alpha=0.2)

values = group_by_iris[1:2].values[0]
plt.plot(angle, values, label='Iris-versicolor', color='g')
plt.fill(angle, values, 'g', alpha=0.2)

values = group_by_iris[2:3].values[0]
plt.plot(angle, values, label='Iris-virginica', color='b')
plt.fill(angle, values, 'b', alpha=0.2)

#labels
plt.title('Iris comparison', size=15)
labels = plt.thetagrids(np.degrees(angle), labels=group_by_iris.columns.values)
plt.legend()

plt.show()
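As mentioned above, the Plotly library seems to handle radar charts more gracefully. A minimal sketch with Plotly Express, assuming the plotly package is installed and reusing the dr data frame built above (line_close=True closes the polygons without duplicating a column):

import plotly.express as px

# reshape the grouped means (without the duplicated column) into long format
dd = dr.groupby('target').mean().reset_index().melt(id_vars='target')
dd['species'] = dd['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

fig = px.line_polar(dd, r='value', theta='variable', color='species', line_close=True)
fig.show()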

Happy coding!

16 February 2021

📊🐍Python: Drawing Concentric Circles with matplotlib.pyplot

Today I tried using the matplotlib library for the first time for drawing a few concentric circles, though it proved to be a bit more challenging than expected, as the circles were distorted given the scale differences between the x and y axes. Because of this, the circles displayed via the Circle class (in blue) seem to be displayed as ellipses. To show the difference I used trigonometric functions to draw the circles (in green) by applying a 5/7.5 multiplication factor for the x axis:

And here's the code:

import numpy as np
import math as m
import matplotlib.pyplot as plt

axis_dimensions = [-100,100, -100,100] #dimensions axis
dx=10       #distance between ticks on x axis 
dy=10       #distance between ticks on y axis
sfx = 5/7.5 #scale factor for x axis
r= 50       #radius

#drawing the grid
plt.axis(axis_dimensions)
plt.axis('on')
plt.grid(True, color='gray')
plt.xticks(np.arange(axis_dimensions[0], axis_dimensions[1], dx))
plt.yticks(np.arange(axis_dimensions[2], axis_dimensions[3], dy))

#adding labels
plt.title('Circles')
plt.xlabel('x axis')
plt.ylabel('y axis')

#drawing the geometric figures
for i in range(0,51,10):
    for angle in np.arange(m.radians(0),m.radians(360),m.radians(2)):
        #drawing circles via trigonometric functions
        x = (r+i)*m.cos(angle)*sfx
        y = (r+i)*m.sin(angle)
        plt.scatter(x,y,s=2,color ='g')
        
    #drawing with circles
    circle = plt.Circle((0,0),r+i,color='b', fill=False)
    plt.gca().add_patch(circle)

plt.show()

Happy coding!

04 February 2021

📦Data Migrations (DM): Conceptualization (Part VII: Data Import Layer)

Data Migration
Data Migrations Series

The data requirements for the Data Migration (DM) and Data Quality (DQ) are driven by the processes implemented in the target system(s). Therefore, a good knowledge of these requirements can decrease the effort needed for these two subprojects considerably. The needed knowledge basis starts with the entities and their attributes, the dependencies existing between them and the various rules that apply, and ends with the parametrization requirements, respectively the architecture(s) that can be used to import the data.

The DM process starts with defining the entities in scope and their attributes, respectively identifying the corresponding entities and attributes from the legacy systems. The attributes not having a correspondent in the legacy system need to be provided by the business and integrated into the DM logic. In addition, one also needs to consider the attributes needed by the business but not available in the target system, some of which are more likely available in the legacy systems. For such attributes one needs either to misuse an attribute from the target system or to extend it.

For each entity a data mapping is created that basically documents the data transformations needed for migrating the data. In the process one also needs to consider the attributes’ data types, the (standard) formatting, their domain of definition, as well as the various rules that apply. Their implementation belongs in the DM layer, from which the data are exported in a standard format as needed by the target system.
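As an illustration, such a data mapping can be applied in the DM layer with a few lines of Python; the attribute names, formats and default value below are hypothetical placeholders, not the actual target system's schema:

import pandas as pd

# hypothetical mapping: legacy attribute -> target attribute
mapping = {
    'CUSTNO': 'CustomerAccount',
    'NAME1': 'CustomerName',
    'LAND': 'CountryRegionId',
}

legacy = pd.read_csv('legacy_customers.csv')            # data prepared in the DM layer
target = legacy.rename(columns=mapping)[list(mapping.values())]

# enforce data types, formatting and domains required by the target system
target['CustomerAccount'] = target['CustomerAccount'].astype(str).str.strip().str.zfill(10)
target['CountryRegionId'] = target['CountryRegionId'].str.upper()
target['CustomerGroup'] = 'DEFAULT'                     # attribute without a legacy correspondent

target.to_csv('customers_import.csv', index=False)      # standard format expected by the target system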

Exporting the data from the DM layer directly into the target system’s tables has in theory the lowest overhead, even if the rejected records are difficult to track, the rejections resulting only from the records’ validation against the database’s schema. For this approach to work, one must have a good knowledge of the database schema and of the business rules implemented in the target system.

To solve the issue with error logging, systems have a further layer on top of the database model, which also allows running data validations against the target system’s business rules. Modern import frameworks allow loading the data via a set of standard files with a predefined structure. The data can thus be imported manually or via load jobs into the system, a log with the issues being generated in the process. Some frameworks even allow the manual editing of failed records, respectively reimporting them. Unfortunately, calling this layer from the DM layer is not possible from a database, though this would seldom bring a benefit. Some third-party tools attempt to improve the import functionality by calling the target system’s import layer.

The import files must be generated from the DM layer in the required structure and with the appropriate formatting. The challenge however resides in identifying all the attributes that should be in the scope of the load. It’s an iterative process which is sometimes backed by trial-and-error heuristics. Unless the target system’s validation rules are known beforehand, they need to be discovered in this process, which can prove time-consuming. The discoveries need to be integrated into the DM as well, and from here results the large number of changes that need to be performed.

Given the dependencies existing between entities the files need to be generated and loaded in a predefined order. These dependencies are reflected also in the data processing and the validation rules considered in the DM layer.

A quality checkpoint can be implemented between the export from the DM layer and the import, to enforce the four-eyes principle. It’s normally the last opportunity for catching eventual issues. A further quality check is performed after the import by validating whether the data were imported as expected.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

Data Migration
Data Migrations Series

Besides migrating the master and transactional data from the legacy systems, there are usually three additional important business requirements for a Data Migration (DM) – migrate the data within the expected timeline, with minimal disruption for the business, respectively within the expected quality levels. Hence, the DM’s timeline must match and synchronize with the main project’s timeline in terms of main milestones, though the DM typically needs to be executed within a small timeframe of a few days during the Go-Live. As for the third requirement, even if the data have high quality as available in the source systems or provided by the business, there are aspects like integration and consistency that rely primarily on the DM logic.

To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From the project’s beginning until UAT, the DM team will integrate the various pieces of information iteratively, test the changes several times, and troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It’s not only important for the whole solution to be performant; each step must be designed so that, besides fast execution, changes and troubleshooting involve a minimum of overhead.

For a better understanding of the importance, imagine a quest game in which the character has to go through a labyrinth with traps. If the player makes a mistake, he needs to restart from a distant checkpoint or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back, trying a new option and moving forward. For some it may look like cheating, though in this way one can finish the game relatively quickly. It would be great if executing a DM could allow the same flexibility.

Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn’t provide the flexibility of changing the code, moving one step back, rerunning the step and performing troubleshooting, and this over and over again like in the quest game. To better illustrate the impact of such an approach, let’s consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards, each change will probably take a few minutes to execute the code. Otherwise rerunning a whole package can take 5-10 times longer or even more, depending on the package’s size and the data volume. For the resulting 800 changes, just one additional minute per change equates to 800 minutes (about 13 hours).

By contrast, storing the data for an entity in a database at the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually, though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.
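A minimal sketch of this approach (sqlite3 merely stands in here for the actual DM database, and the table names and logic are hypothetical; a legacy_customers table is assumed to exist):

import sqlite3

conn = sqlite3.connect('dm_staging.db')   # stands in for the DM database

# each important point of the processing persists its result into a staging table,
# so a single step can be rerun and inspected in isolation while troubleshooting
steps = [
    ('cust_step1_raw',
     "CREATE TABLE cust_step1_raw AS SELECT * FROM legacy_customers"),
    ('cust_step2_cleaned',
     """CREATE TABLE cust_step2_cleaned AS
        SELECT TRIM(cust_no) AS cust_no, UPPER(country) AS country
        FROM cust_step1_raw
        WHERE cust_no IS NOT NULL"""),
]

for table, script in steps:
    conn.execute(f"DROP TABLE IF EXISTS {table}")   # rerun only the step in question
    conn.execute(script)
    conn.commit()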

To assure that the consistency of the data is kept, one needs to build for each entity a set of validation queries that check for duplicates, special cases, data integrity, incorrect formats, etc. The queries can be included in the sequence of logic used for the DM; thus, one can react promptly to each unexpected value. When required, the validation rules can be built into reports and used by the users in the data cleaning process, or even logged periodically per entity to track the progress.
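For illustration, such validation queries can be collected and executed within the same sequence of logic; the entity and attribute names below are hypothetical, and the connection from the previous sketch is reused:

# each query should ideally return no records
validations = {
    'duplicate customers': """SELECT cust_no FROM cust_step2_cleaned
                              GROUP BY cust_no HAVING COUNT(*) > 1""",
    'missing country': """SELECT cust_no FROM cust_step2_cleaned
                          WHERE country IS NULL OR country = ''""",
    'orphaned invoices': """SELECT i.invoice_no FROM open_invoices i
                            LEFT JOIN cust_step2_cleaned c ON c.cust_no = i.cust_no
                            WHERE c.cust_no IS NULL""",
}

for name, query in validations.items():
    issues = conn.execute(query).fetchall()
    if issues:
        print(f'{name}: {len(issues)} record(s) to review')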

Previous Post <<||>> Next Post

03 February 2021

📦Data Migrations (DM): Conceptualization (Part V: Data Extraction Layer)

Data Migration

ETL tools are ideal for extracting the needed data from the legacy system(s). They offer a considerable number of connectors to standard databases that leverage the legacy systems’ data access layers or their own frameworks, both categories providing acceptable performance for a wide range of solutions. Otherwise, third-party connectors can be considered as well, though their advantage might reside in the extra features they bring out-of-the-box at the cost of some performance loss, and thus they should be used with caution.

Besides that, ETL tools also provide rich visual functionality that allows users to build complex pipelines with transformations that process the data as they go through the pipeline. Further features like data profiling or cleansing bring additional benefits.

As usually only a subset of the legacy data is needed for the migration, an ETL solution allows extracting only the data in scope as filtering and other logic can be used in the extraction mechanism. Whether one loads the tables or entities 1:1 or aggregates the data from multiple tables is a matter of choice, even if the former two approaches are usually recommended.
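As a sketch of this idea, the filters can be pushed directly into the extraction query; the table, attributes and filter values below are placeholders, and sqlite3 merely stands in for the legacy system's connectivity:

import sqlite3
import pandas as pd

conn = sqlite3.connect('legacy_copy.db')   # stands in for the legacy system's data access layer

# extract only the records in scope, e.g. active customers of one company
query = """SELECT cust_no, name, country
           FROM customers
           WHERE company = ? AND status = 'ACTIVE'"""
customers = pd.read_sql(query, conn, params=('100',))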

An alternative to an ETL tool is building one’s own extraction layer, based for example on a powerful data access layer like ADO.Net. This might prove to be a cheaper alternative, especially when ETL capabilities aren’t needed. This depends also on the overall architectural approach. Attempting to build a desktop-based application for a DM can prove to be a foolhardy approach, especially when dealing with a considerable volume of data. Moreover, one would need to build features that are already available in ETL tools (transformations, workflows) or databases (indexes for performance optimization, join-based logic).

When the volume of data exceeds the capabilities of ETL tools, one can consider ELT tools, which load the data first before applying any transformations on them. Such tools are designed for the processing of what is known as big data (data having high volume, high velocity, high variety and different veracity).

When considering the best data extraction approach, it’s important to know where the data will be stored for processing. Given that DMs are data-processing intensive, the best data storage solution for processing would be a modern relational database. Besides performance, scalability, security, concurrency and failover mechanisms, some databases offer the possibility to connect directly to other servers via linked-server functionality. Despite this latter feature, an ETL tool can still have considerable advantages for data extraction.

On the other side, the DM logic can in theory be built entirely in the ETL tool without storing the data within a database, though this adds a high overhead on the server resources on which the solution runs, as all the data needed for processing need to be loaded in memory. Even if the data are loaded in batches and processed as the batches go through the pipeline, the complexity of the processing can make it challenging to implement any optimization techniques directly in the ETL tool. Moreover, fully ETL-based solutions are difficult to troubleshoot and change as the requirements change.

To address the ETL tools’ high resource consumption, one can store the intermediary results in database tables on which indexes can be created for performance optimization. Moreover, the logic can be encapsulated in database objects and used in the processing. This approach enables troubleshooting, performing validations and restarting the processing from a given step, at the cost of splitting the logic between multiple ETL packages. This can be an acceptable price to pay for more flexibility. Given that most ETL transformations can be replaced with SQL-based logic, the ETL tool can be used only for data extraction.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part IV: Data Access)

Data Migration
Data Migrations Series

Once the data sources for a Data Migration (DM) have been identified, the first question is how the data can be accessed. Legacy systems relying on ODBC-based databases are in theory relatively easy to access as long as they allow direct access to their data, which would thus enable a pull strategy. Despite this, there are organizations that don’t allow direct access to the data even for read-only operations, preferring instead to push the data directly to the consumers (aka push strategy) or to push the data to a given location from where the consumer can use the data as needed (aka hybrid strategy).

Direct access to the data allows in theory the best flexibility, as the solution can extract the data when needed, and this is especially important during the initial phases of the project, when the data need to be pulled more frequently until the requirements and logic are stabilized. A push strategy tends to add additional overhead, as usually somebody else oversees the data exports, respectively the data need to be prepared in the expected format. On the other side, it would make sense to make an exception for a DM and allow direct access to the data.

Hybrid strategies tend to be more complex and require additional resources or overhead, as the data are stored temporarily at a separate location. Unfortunately, in certain scenarios this is the only approach that can be used. Data files that preserve the integrity of the data and facilitate data consumption are preferred. Therefore, tabular text files or JSON files are preferred over XML or Excel files. It’s preferable to export each data structure individually rather than storing parent-child structures, even if the latter can prove useful in certain scenarios. When there’s no other solution, one can also use the standard reports available in the legacy systems.

When storing data outside the legacy systems for further processing, it’s recommended to follow the organization’s best practices, respectively to address the data security and privacy requirements. ETL tools allow accessing data from password-protected areas like FTP, OneDrive or SharePoint. The fewer security layers in between, the lower is in theory the overhead. Therefore, given its stability and simplicity, FTP might prove to be a better storage solution than OneDrive, SharePoint or other similar technologies.

Ideally the extraction/export mechanisms should use the database objects that already encapsulate the logic in the legacy systems, otherwise the team will need to reengineer the logic – for master data this can prove to be easy, though the logic of transactional data like on-hand inventory or open invoices can be relatively complex to reengineer. Otherwise, the logic can be implemented directly in the extraction/export mechanisms, or sometimes it is more advisable to create database objects (usually on a different schema) on the legacy systems and just call the respective objects.

When connecting directly to the data source, it’s advisable to use the data provider that offers the best performance and flexibility, though several tests might be needed to determine the best fit. It would be useful to check the limitations of each provider and find a stable driver version. OLEDB and ADO.Net data providers provide in general good performance, though the legacy systems’ native drivers can be a better option upon case.
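Such tests can be as simple as timing the same extraction over each candidate provider or driver. A rough sketch, in which the connection strings, driver names and query are placeholders:

import time
import pandas as pd
import pyodbc

# hypothetical candidate connections to the same legacy database
connections = {
    'native ODBC driver': 'DRIVER={Legacy ERP Driver};SERVER=legacy01;DATABASE=erp;Trusted_Connection=yes',
    'generic ODBC DSN': 'DSN=LegacyERP;UID=dm_reader;PWD=...',
}

query = "SELECT cust_no, name, country FROM customers"

for name, conn_str in connections.items():
    start = time.time()
    conn = pyodbc.connect(conn_str)
    df = pd.read_sql(query, conn)      # same extraction, different provider
    conn.close()
    print(f'{name}: {len(df)} records in {time.time() - start:.1f}s')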

Some legacy systems allow access to their data only via service-based technologies like OData. OData tends to have poorer performance for large data exports than standard access methods and is therefore not indicated in such scenarios. In such cases it might be a good idea to export the data directly from the legacy system.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part III: Heuristics)

Data Migration

Probably one of the most difficult things to learn as a technical person is using the right technology for a given purpose, mainly because one is inclined to use the tools one knows best. Moreover, the overlapping of technologies makes the task more and more challenging, the difference between competing technologies often residing in the details. Thus, identifying the gaps comes down to understanding the details of the problem(s) or need(s), respectively the advantages or disadvantages of one technology over the other. This is especially true for competing technologies, including the ones that replace other technologies.

There are simple heuristics that can help in approaching such challenges. For example, heavy data processing usually belongs in databases, while import/export functionality belongs in an ETL tool. Therefore, one can start looking at the problems from these two perspectives: would the solution benefit from these two approaches, or are there more appropriate technologies (e.g. data streaming, ELT, non-relational databases)? How much effort would building the solution involve?

Commercial Off-The-Shelf (COTS) tools provided by third-party vendors usually offer specialized functionality in each area. Gartner and Forrester provide regular analyses of the main players in the important areas, analyses which can in theory be used as a basis for further research. Even if COTS tend to be more expensive and can have some important functionality gaps, as long as they are extensible, they can prove a good starting point for developing a solution.

Sometimes it helps to research on the web what other people or organizations did, how they approached the same aspects, what technologies, techniques and best practices they used to overcome the challenges. One doesn’t need to reinvent the wheel even if it’s sometimes fun to do so. Moreover, a few hours of research can give one a basis of useful information and a better understanding of the work ahead.

On the other side, sometimes it’s advisable to use the tools one knows best, though this can also lead to unusable and less performant solutions. For example, MS Excel and Access have been for years the tools of choice for building personal solutions that later grew into maintenance nightmares for the IT team. Ideally, they can still be used for data entry or data cleaning, though building solutions exclusively based on (one of) them can prove to be far from optimal.

When one doesn’t know whether a technology or mix of technologies can be used to provide a solution, it’s recommended to start a proof-of-concept (PoC) that allows addressing the most important aspects of the needed solution. One can start small by focusing on the minimal functionality needed to check the main aspects and evolve the PoC during several iterations as needed.

For example, in the case of a Data Migration (DM) this would involve building the data extraction layer for an entity, implementing several data transformations based on the defined mappings, building a few integrity rules for validation, respectively attempting to import the data into the target system. Once this is accomplished, one can start increasing the volume of data to check how the solution behaves under stress. The volume of data can be increased incrementally or by considering all the data available.
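A minimal PoC skeleton for one entity could look as follows; the file names, attributes and rules are placeholders chosen for illustration:

import pandas as pd

# 1. data extraction layer: pull one entity from the legacy system (file or query)
customers = pd.read_csv('legacy_customers.csv')

# 2. data transformations based on the defined mappings
customers = customers.rename(columns={'CUSTNO': 'CustomerAccount', 'NAME1': 'CustomerName'})
customers['CustomerAccount'] = customers['CustomerAccount'].astype(str).str.strip()

# 3. a few integrity rules for validation
assert customers['CustomerAccount'].is_unique, 'duplicate customer accounts'
assert customers['CustomerName'].notna().all(), 'missing customer names'

# 4. export in the structure expected by the target system's import functionality
customers[['CustomerAccount', 'CustomerName']].to_csv('customers_import.csv', index=False)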

As soon as the skeleton is built, one can consider all the mappings, respectively add further entities to build the dependencies existing between them and other functionality. The prototype might not address all the requirements from the beginning, therefore one should consider the problems as they arise. For example, if the volume of data seems to cause problems, then attempt splitting the data into batches during processing, or consider specific optimization techniques like indexing, respectively scaling techniques like increasing the computing resources.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Conceptualization (Part II: Plan vs. Concept vs. Strategy)

Data Migration
Data Migrations Series

A concept is a document that describes at a high level the set of necessary steps and their implications to achieve a desired result, typically the subject of a project. A concept is usually needed to provide more technical and nontechnical information about the desired solution, the context in which a set of steps are conducted, respectively the changes considered, how the changes will be implemented and the further aspects that need to be considered. It can include a high-level plan and sometimes also information that typically belongs in a Business Case – goals, objectives, required resources, estimated effort and costs, risks and opportunities.

A concept is used primarily as a basis for sign-off, as well as for establishing common ground and understanding. When approved, it’s used for the actual implementation and the solution’s validation. The concept should be updated as the project progresses, respectively as new information is discovered.

Creating a concept for a DM can be considered a best practice because it allows documenting the context, the technical and organizational requirements, the dependencies existing between the DM and other projects, and how they will be addressed. The concept can also include a high-level plan of the main activities (to be detailed in a separate document).

Especially when the concept has an exploratory nature (due to incomplete knowledge or other considerations), it can be validated with the help of a proof-of-concept (PoC), the realization of a high-level-design prototype that focuses on the main characteristics of the solution and thus allows identifying the challenges. Once the PoC is implemented, the feedback can be used to round out the concept.

Building a PoC for a DM should be considered as an objective even when the project doesn’t seem to face any major challenges. The PoC should come down to addressing the most important DM requirements, ideally by implementing all or the most important aspects of the functionality (e.g. data extraction, data transformations, integrity validation, respectively the import into the target system) for one or two data entities. Once the PoC is built, the team can use it as the basis for the evolutive development of the solution during the iterations considered.

A strategy is a set of coordinated and sustainable actions following a set of well-defined goals, actions devised into a plan and designed to create value and overcome further challenges. A strategy has the character of a concept, though it has a broader scope, being usually considered when multiple projects or initiatives compete for the same resources, to provide a broader context and handle the challenges, risks and opportunities. Moreover, the strategy takes an inventory of the current issues and architecture – the 'AS-IS' perspective – and sketches the 'TO-BE' perspective by devising a roadmap that bridges the gap between the two.

In the case of a DM a strategy might be required when multiple DM projects need to be performed in parallel or sequentially, as it can help the organization to better manage the migrations.

A plan is a high-level document that describes the tasks, schedule and resources required to carry out an activity. Even if it typically refers to the work or product breakdown structure, it can cover other information usually available in a Business Case. A project plan is used to guide both project execution and project control, while in the context of Strategic Management the (strategic) plan provides a high-level roadmap on how the defined goals and objectives will be achieved during the period covered by the strategy.

For small DM projects a plan can in theory be enough. As both a strategy and a concept can include a high-level plan, in practice the names are used interchangeably.

Previous Post <<||>> Next Post

02 February 2021

📦Data Migrations (DM): Conceptualization I (Goals, Objectives & Requirements)

Data Migration
Data Migrations Series

One of today’s challenges is finding the right mix of technologies that allows building a solution for a business need. There are so many choices, and the responsible person is easily tempted to use one of the trending technologies just because he wants to learn something new or because the technologies seem to fit into the bigger picture, which in many cases would probably be acceptable. Unfortunately, there’s also the tendency of picking a technology without looking at what functionality it provides, respectively whether the functionality meets the intended solution’s requirements. Moreover, the requirements are sometimes barely defined at the appropriate level of detail, which makes the implementation project a candidate for failure. Sometimes even the goals and objectives aren’t clearly stated, which can make a project’s success questionable from the beginning.

A goal is a general statement that reflects the desired result toward which an organization’s effort needs to be directed. For example, a Data Migration (DM)’s primary goal can be formulated as 'to make available all the master and transactional data needed by the business from the legacy systems to the target system(s) within expected timeline and quality with a minimal disruption for the business'. 

An objective is a breakdown of the goal into several components that should foster a clear understanding of how the goal will be achieved. Ideally the objectives should be SMART (specific, measurable, attainable, relevant, time-bound), even if measurable objectives are sometimes hard to define properly. One can consider them as the tactics used in achieving the goal. For example, the above formulated goal can be broken down into the following objectives:

  • Build a DM concept/strategy
  • Build a flexible and performant infrastructure for DM that can be adapted to further requirements
  • Provide a basis for further DMs
  • Align DM and main project’s requirements and activities
  • Provide an interface and support for the Data Management areas
  • Foster trust, transparency and awareness 
  • Address internal/external compliance requirements
  • Document and communicate accountability for the various activities
  • Cleanse and enrich the data needed by the target system 
  • Archive the DM and project data 

One can attempt defining the objectives directly from the goal(s), though unless one is aware of all the implications a DM has, more likely one will be forced to first define and evaluate the individual functional and nonfunctional requirements for the DM, and attempt consolidating the requirements into a set of objectives. In the end it can be a combination of both, in which some objectives are formulated first, the requirements are defined and evaluated, respectively the objectives are refined to accommodate the requirements.

ISO 9126, an international standard for the evaluation of software quality, defines about 45-50 attributes that can be used for addressing the requirements of software solutions, attributes that reflect functionality, reliability, usability, efficiency, and maintainability characteristics. One can start with such a list and identify how important the respective attributes are for the solution. The next step would be to document the requirements in a consolidated list by providing a short argumentation for their use, respectively how they will be addressed as part of the solution. The process can prove to be time-consuming, however it is a useful exercise that usually needs to be done only once and reviewed occasionally.

The list can be created independently of any other documentation or be included directly into a concept or strategy. The latter will assure in theory that the document provides a unitary view of the migration, considering that each new or obsolete requirement can impact the concept. 

Previous Post <<||>> Next Post

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part V: Quality Acceptance Criteria V)

Data Migration

Efficiency 

Efficiency is the degree to which a solution uses the hardware (storage, network) and other organizational resources to fulfill a given task. Data characterized by high volume, velocity, variety and veracity can be challenging to process, requiring upon case more processing power. Therefore, the DM solutions need to consider these aspects as well. However, efficiency refers to whether the available resources are used efficiently – whether the waste in terms of resource utilization is minimal.

On the other side, the waste of resources can be acceptable when there are other benefits or requirements that need to be considered, respectively when the ratio between resource utilization and the effort to build more efficient processes is acceptable.

A DM solution involves iterative and exploratory processes in which knowledge and feedback are integrated in each iteration, therefore it might look like resources are not used efficiently. However, this is a way to handle complexity and uncertainty by breaking the effort into manageable chunks.

Learnability

Learnability is the degree to which a person can become familiar with a solution’s use, the data and the processes associated with it. A DM can be challenging for many technical and non-technical resources as it requires a certain level of skillset and understanding of the requirements, needs and deliverables. The complexity of the data and requirements can be overwhelming, however with appropriate communication and awareness established, the challenges can be overcome. 

Stability

Stability is the degree to which a solution is sensitive to environment changes (e.g. overuse of resources, hardware or software failures, updates), respectively whether it performs with no performance defects or does not crash under defined levels of stress. Stability can be monitored during the various phases, and countermeasures need to be considered in case the solution is not stable enough (e.g. redesigning the solution, breaking the data into smaller chunks).

Suitability 

Suitability is the degree to which a solution provides functions that meet the stated and implied needs. No matter how performant and technologically advanced a solution is, it brings little value as long as it doesn’t perform what it was intended to do.

Transparency 

Transparency is the degree to which a solution’s stakeholders have access to the requirements, processes, data, documentation, or other information required by them. In a DM, transparency is especially important in respect to the data, logic and rules used in data processing, respectively the number of records processed.

Trustability

Trustability is the degree to which a solution can be trusted to provide the expected results. Even if the technical team assures that the solution can deliver what was intended, the success of a DM is a matter of perception from the stakeholders’ perspective. Providing transparency into the data, rules and processes can improve the level of trust; however, special attention needs to be given to the issues raised by stakeholders during and after Go-Live, as differences need to be mitigated.

Understandability 

Understandability is the degree to which the requirements of a solution were understood by the resources involved in terms of what needs to be performed. For the average project resource it might be challenging to understand the implications of a DM, and this can apply to technical as well as non-technical resources. Making people aware of the implications is probably one of the most important criteria for success, as the success of a migration is often a matter of perception.

Usability 

Usability is the degree to which a solution can be used by the targeted users within the agreed context of usage. Ideally DM solutions need to be easy to use, though there are always trade-offs. In the end, a DM must fit the purpose it was built for. 

Previous Post <<||>> Next Post

📦Data Migrations (DM): Quality Assurance (Part IV: Quality Acceptance Criteria IV)

Data Migration
Data Migrations Series

Reliability

Reliability is the degree to which a solution performs its intended functions under stated conditions without failure. In other words, a DM is reliable if it performs what was intended by design. The data should be migrated only when the migration’s reliability has been confirmed by the users as part of the sign-off process. The dry-runs as well as the final iteration for the UAT have the objective of confirming the solution’s reliability.

Reversibility

Reversibility is the degree to which a solution can return to a previous state without starting the process from the beginning. For example, it should be possible to reverse the changes made to a table by returning to the previous state. This can involve keeping a copy of the data, respectively deleting and reloading the data when necessary.

Considering that the sequence in which the various activities are performed is fixed, in theory it’s possible to address reversibility by design, e.g. by allowing individual steps to be repeated or by creating rollback points. Rollback points are especially important when loading the data into the target system.
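As an illustration, a rollback point can be as simple as a snapshot copy of a staging table taken before a critical step (sqlite3 stands in here for the DM database; the table names are hypothetical):

import sqlite3

conn = sqlite3.connect('dm_staging.db')   # stands in for the DM database

# rollback point: snapshot the staging table before a critical step ...
conn.execute("DROP TABLE IF EXISTS customers_before_step3")
conn.execute("CREATE TABLE customers_before_step3 AS SELECT * FROM customers_staging")
conn.commit()

# ... and, if the step needs to be reversed, return to the previous state
conn.execute("DELETE FROM customers_staging")
conn.execute("INSERT INTO customers_staging SELECT * FROM customers_before_step3")
conn.commit()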

Robustness

Robustness is the degree to which the solution can accommodate invalid input or environmental conditions that might affect the data’s processing or other requirements (e.g. performance). If the logic can be stabilized over the various iterations, the variance in data quality can have an important impact on a solution’s robustness. One can accommodate erroneous input by relaxing the schema’s rules and adding further quality checks.

Security 

Security is the degree to which the DM solution protects the data so that only authorized people have access to the respective data to the defined level of authorization as data are moved through the solution. The security provided by a solution needs to be considered against the standards and further requirements defined within the organization. In case no such standards are available, one can in theory consider the industry best practices.

Scalability

Scalability is the degree to which the solution is able to respond to an increased workload. Given that the data considered during the various iterations vary in volume, a solution’s scalability needs to be considered with respect to the volume of data to be migrated.

Standardization

Standardization is the degree to which technical standards were implemented for a solution to guarantee a certain level of performance or other aspects considered as important. There can be standards for data storage, processing, access, transportation, or other aspects associated with the migration processes. Moreover, especially when multiple DMs are in scope, organizations can define a set of standards and guidelines that should be further considered.

Testability

Testability is the degree to which a solution can be tested with respect to the set of functional and data-related requirements. Even if what matters for the success of a migration is the data in their final form, achieving that requires validating the logic and thoroughly testing the transformations performed on the data. As the data go through the data pipelines, they need to be tested at the critical points – points where the data suffer important transformations. Moreover, one can consider record counters for the records processed at each such critical point, to assure that no records were lost in the process.
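For illustration, the record counters for the critical points can be collected with a few lines of code (the table names are hypothetical and sqlite3 stands in for the DM database):

import sqlite3

conn = sqlite3.connect('dm_staging.db')   # stands in for the DM database

# record counters taken at the critical points of the pipeline
checkpoints = ['customers_extracted', 'customers_transformed', 'customers_export']
counts = {t: conn.execute(f'SELECT COUNT(*) FROM {t}').fetchone()[0] for t in checkpoints}
print(counts)   # any unexplained drop between checkpoints needs to be investigated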

Traceability

Traceability is the degree to which the changes performed on the data can be traced from the target to the source systems at record, respectively entity level. In theory, it’s enough to document the changes at attribute level, though upon case it might be needed to also document the changes performed on individual values.

Mappings at attribute level allow tracing the data flow, while mappings at value level allow tracing the changes occurring within values.

📦Data Migrations (DM): Quality Assurance (Part III: Quality Acceptance Criteria III)

Data Migration
Data Migrations Series

Repeatability

Repeatability is the degree to which a DM can be repeated and obtain consistent results between repetitions. Even if a DM is supposed to be a one-time activity for a project, to guarantee a certain level of quality it’s important to consider several iterations in which the data requirements are refined and it is made sure that the data can be imported as needed into the target system(s). Considered as a process, as long as the data and the rules haven’t changed, the results should be the same or have the expected level of deviation from expectations.

This requirement is important especially for the data migrated during UAT and Go-Live, time during which the input data and rules need to remain frozen (even if small changes in the data can still occur). In fact, that’s the role of UAT – to assure that the data have the expected quality and, when compared to the previous dry-run, that they attain the expected level of consistency.

Reusability

Reusability is the degree to which the whole solution, parts of the logic or data can be reused for multiple purposes. Master data and the logic associated with them have high reusability potential as they tend to be referenced by multiple entities. 

Modularity

Modularity is the degree to which a solution is composed of discrete components such that a change to one component has minimal impact on other components. It applies to the solution itself but also to the degree to which the logic for the various entities is partitioned so to assure a minimal impact. 

Partitionability

Partitionability is the degree to which data or logic can be partitioned to address the various requirements. Despite the assurance that the data will be migrated only once, in practice this assumption can be easily invalidated. It’s enough to increase the system freeze by a few days and/or to have transactional data that suddenly requires master data not considered. Even if the deltas can be migrated into the system manually, it’s probably recommended to migrate them using the same logic. Moreover, performing incremental loads can be a project requirement.

Data might need to be partitioned into batches to improve the processing performance. Partitioning the logic based on certain parameters (e.g. business unit, categorical values) allows more flexibility in handling other requirements (e.g. reversibility, performance, testability, reusability).
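As a sketch, the data can be partitioned into fixed-size batches or by a business parameter before the load; the file names, batch size and business_unit attribute are placeholders:

import pandas as pd

customers = pd.read_csv('customers_export.csv')

# partition into fixed-size batches
batch_size = 10000
for nr, start in enumerate(range(0, len(customers), batch_size), start=1):
    customers.iloc[start:start + batch_size].to_csv(f'customers_import_{nr:03d}.csv', index=False)

# or partition by a business parameter, e.g. one file per business unit
for unit, batch in customers.groupby('business_unit'):
    batch.to_csv(f'customers_import_{unit}.csv', index=False)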

Performance

Performance refers to the degree to which a piece of software can process data in an amount of time considered acceptable by the business. It can vary with the architecture and methods used, respectively with the data volume, veracity, variance, variability, or quality.

Performance is a critical requirement for a DM, especially when considering the amount of time spent on executing the logic during development, tests and troubleshooting, as well as for other activities. Performance is important during dry-runs but even more important during Go-Live, as it equates with a period during which the system(s) are not available to the users. Upon case, a few hours of delays can have an important impact on the business. In extremis, the delays can sum up to days.

Predictability

Predictability is the degree to which the results and behavior of a solution, respectively the processes involved, are predictable based on the design, implementation or other factors considered (e.g. best practices, methodology used, experience, procedures and processes). Highly predictable solutions are desirable, though reaching the required level of performance and quality can be challenging.

The results from the dry-runs can offer an indication on whether the data migrated during UAT and Go-Live provide a certain level of assurance that the DM will be a success. Otherwise, an additional dry-run should be planned during UAT, if the schedule allows it.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Quality Assurance (Part II: Quality Acceptance Criteria II)

Data Migration
Data Migrations Series

Auditability

Auditability is the degree to which the solution allows checking the data for their accuracy, or for their quality in general, respectively the degree to which the DM solution and processes allow being audited regarding compliance, security and other types of requirements. All these aspects are important in case an external sign-off from an auditor is mandatory.

Automation

Automation is the degree to which the activities within a DM can be automated. Ideally all the processes or activities should be automated, though other requirements might be impacted negatively. Therefore, one needs to find the right balance between the various requirements.

Cohesion

Cohesion is the degree to which the tasks performed by the solution, respectively during the migration, are related to each other. Given the dependencies existing between data, their processing and further project-related activities, DMs imply a high degree of cohesion that needs to be addressed by design.

Complexity 

Complexity is the degree to which a solution is difficult to understand given the various processing layers and dependencies existing within the data. The complexity of a DM revolves mainly around the data structures and the transformations needed to translate the data between the various data models.

Compliance 

Compliance is the degree to which a solution is compliant with the internal or external regulations that apply. One should differentiate between mandatory requirements, respectively recommendations and other requirements.

Consistency 

Consistency is the degree to which data conform to an equivalent set of data, in this case the entities considered for the DM need to be consistent with each other. A record referenced in any entity of the migration needs to be considered, respectively made available in the target system(s), either by parametrization or migration.

During each iteration, the data need to remain consistent, so that troubleshooting is facilitated. The data are usually reimported between iterations or during the same iteration, typically to reflect the changes occurred in the source systems or for other purposes.

Coupling 

Data coupling is the degree to which different processing areas within a DM share the same data, typically a reflection of the dependencies existing between the data. Ideally, the areas should be decoupled as much as possible. 

Extensibility

Extensibility is the degree to which the solution or parts of the logic can be extended to accommodate further requirements. Typically, this involves changes that deviate from the standard functionality. Extensibility positively impacts flexibility.

Flexibility 

Flexibility is the degree to which a solution can handle new requirements or ad-hoc changes to the logic. No matter how well everything was planned, there’s always something forgotten or new information identified. Having the flexibility to change code or data on the fly can make an important difference.

Integrity 

Integrity is the degree to which a solution prevents changes to the data besides the ones considered by design. Users and processes should not be able to modify the data outside the agreed procedures. This means that the data need to be processed in the agreed sequence. All aspects related to data integrity need to be documented accordingly.

Interoperability 

Interoperability is the degree to which a solution’s components can exchange data and use the respective data. The various layers of a DM solution must be able to process the data, and this should be possible by design.

Maintainability

Maintainability is the degree to which a solution can be modified to add minor features, change existing code, correct issues, refactor code, improve performance or address changes in the environment. The data required and the transformation rules are seldom fully known in advance. The data requirements are finalized during the various iterations, the changes needing to be implemented as the iterations progress. Thus, maintainability is a critical requirement.

Previous Post <<||>> Next Post

📦Data Migrations (DM): Quality Assurance (Part I: Quality Acceptance Criteria I)

Data Migration
Data Migrations Series

Introduction

When designing a Data Migration (DM), respectively any software solution, it’s important to take inventory of the project’s requirements, and to evaluate, document, communicate and monitor them accordingly. Each of them can have an important impact on the solution, as the solution’s success will be validated and judged against them. Therefore, the identified requirements must be considered as the baseline for conceptualization, design, implementation and sign-off, and should go through the same procedures and rigor as other project requirements. The existence of a standardized Requirements Management process can facilitate their management through the project’s lifecycle.

The requirements are usually driven by the source and target systems (e.g. data import/export features, data models and their constraints), the environments they are hosted on (e.g. cloud vs. on-premise), respectively the layers in between (e.g. network, firewalls), and the project and business aspects that need to be considered (e.g. freeze window for the Go-Live, data availability dates, data quality, external dependencies, etc.). They refer to the solution itself as well as to the data and processes involved, and are reflected in, but not limited to, the following important aspects, which upon case can also be considered as quality acceptance criteria:

Accessibility

Accessibility is the degree to which the data are available for a solution so they can be processed when needed, in the intended form, by the intended resources or means. It’s critical for a DM solution to access or have available the master, transactional, parameter and further data when needed. The team must make sure that the data become easily accessible.

Unavailability of data can impact the DM and can easily lead to delays in the project. This also means that the various project activities (parametrization, cleansing, enrichment, development) need to be synchronized with the migration activities. 

Upon case, accessibility can involve the solution itself, expressed as the degree to which it’s available to the resources supposed to use it. Certain architectural decisions can have an impact on the carried-out activities. As the solution is usually deployed on a server, it can happen that only a limited number of people are able to access it concurrently. Moreover, a DM’s complexity makes the involvement of multiple developers challenging.

Accountability

Accountability is the degree to which accountability is enforced for the various resources involved in the DM processes and related activities. As multiple resources are involved in parametrization, cleaning, processing, validation and software development, each resource needs to be aware of the extent to which they are accountable. Without accountability made explicit, there’s the danger that the activities are neglected, with all the implications deriving from it – quality deviations, delays, data unavailability, etc.

Adaptability

Adaptability is the degree to which a solution can be adapted to environment or requirement changes. Even if the environments typically don’t change, it doesn’t mean that this will not happen, as the IT infrastructure goes through continuous changes that can affect the migration directly or indirectly. The same can be said about requirements, which however have a higher probability to change even late in the process, as new knowledge is acquired and needs to be integrated into the solution.

Atomicity 

Atomicity is the degree to which data entities can be processed at the required level of abstraction in an atomic manner. Even if transformations occur during the various stages, the data belonging to an entity need to be kept and processed together (e.g. Customers and their Addresses). This can involve processing attributes in advance even if the data might be required later. There can be situations in which the data belonging to the same entity need to be processed on different paths, though in the end it’s important to keep the data together, when the processing logic allows it.

Next Post


About Me

Koeln, NRW, Germany
IT Professional with more than 24 years of experience in IT in the areas of full life-cycle Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.