SQL Troubles: iterations

Showing posts with label iterations. Show all posts

15 February 2025

🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

Business Intelligence Series

There’s a saying that applies to many contexts ranging from software engineering to data analysis and visualization related solutions: "fools rush in where angels fear to tread" [1]. Much earlier, an adage attributed to Confucius provides a similar perspective: "do not try to rush things; ignore matters of minor advantage". Ignoring these advices, there's the drive in rapid prototyping to jump in with both feet forward without checking first how solid the ground is, often even without having adequate experience in the field. That’s understandable to some degree – people want to see progress and value fast, without building a foundation or getting an understanding of what’s happening, respectively possible, often ignoring the full extent of the problems.

A prototype helps to bring the requirements closer to what’s intended to achieve, though, as the practice often shows, the gap between the initial steps and the final solutions require many iterations, sometimes even too many for making a solution cost-effective. There’s almost always a tradeoff between costs and quality, respectively time and scope. Sooner or later, one must compromise somewhere in between even if the solution is not optimal. The fuzzier the requirements and what’s achievable with a set of data, the harder it gets to find the sweet spot.

Even if people understand the steps, constraints and further aspects of a process relatively easily, making sense of the data generated by it, respectively using the respective data to optimize the process can take a considerable effort. There’s a chain of tradeoffs and constraints that apply to a certain situation in each context, that makes it challenging to always find optimal solutions. Moreover, optimal local solutions don’t necessarily provide the optimum effect when one looks at the broader context of the problems. Further on, even if one brought a process under control, it doesn’t necessarily mean that the process works efficiently.

This is the broader context in which data analysis and visualization topics need to be placed to build useful solutions, to make a sensible difference in one’s job. Especially when the data and processes look numb, one needs to find the perspectives that lead to useful information, respectively knowledge. It’s not realistic to expect to find new insight in any set of data. As experience often proves, insight is rarer than finding gold nuggets. Probably, the most important aspect in gold mining is to know where to look, though it also requires luck, research, the proper use of tools, effort, and probably much more.

One of the problems in working with data is that usually data is analyzed and visualized in aggregates at different levels, often without identifying and depicting the factors that determine why data take certain shapes. Even if a well-suited set of dimensions is defined for data analysis, data are usually still considered in aggregate. Having the possibility to change between aggregates and details is quintessential for data’s understanding, or at least for getting an understanding of what's happening in the various processes.

There is one aspect of data modeling, respectively analysis and visualization that’s typically ignored in BI initiatives – process-wise there is usually data which is not available and approximating the respective values to some degree is often far from the optimal solution. Of course, there’s often a tradeoff between effort and value, though the actual value can be quantified only when gathering enough data for a thorough first analysis. It may also happen that the only benefit is getting a deeper understanding of certain aspects of the processes, respectively business. Occasionally, this price may look high, though searching for cost-effective solutions is part of the job!

Previous Post <<||>> Next Post

References:
[1] Alexander Pope (cca. 1711) An Essay on Criticism

17 February 2024

🧭Business Intelligence: A Software Engineer's Perspective (Part II: Major Knowledge Gaps)

Business Intelligence Series

Solving a problem requires a certain degree of knowledge in the areas affected by the problem, degree that varies exponentially with problem's complexity. This requirement applies to scientific fields with low allowance for errors, as well as to business scenarios where the allowance for errors is in theory more relaxed. Building a report or any other data artifact is closely connected with problem solving as the data artifacts are supposed to model the whole or parts of what is needed for solving the problem(s) in scope.

In general, creating data artifacts requires: (1) domain knowledge - knowledge of the concepts, processes, systems, data, data structures and data flows as available in the organization; (2) technical knowledge - knowledge about the tools, techniques, processes and methodologies used to produce the artifacts; (3) data literacy - critical thinking, the ability to understand and explore the implications of data, respectively communicating data in context; (4) activity management - managing the activities involved.

At minimum, creating a report may require only narrower subsets from the areas mentioned above, depending on the complexity of the problem and the tasks involved. Ideally, a single person should be knowledgeable enough to handle all this alone, though that's seldom the case. Commonly, two or more parties are involved, though let's consider the two-parties scenario: on one side is the customer who has (in theory) a deep understanding of the domain, respectively on the other side is the data professional who has (in theory) a deep understanding of the technical aspects. Ideally, both parties should be data literates and have some basic knowledge of the other party's domain.

To attack a business problem that requires one or more data artifacts both parties need to have a common understanding of the problem to be solved, of the requirements, constraints, assumptions, expectations, risks, and other important aspects associated with it. It's critical for the data professional to acquire the domain knowledge required by the problem, otherwise the solution has high chances to deviate from the expectations. The general issue is that there are multiple interactions that are iterative. Firstly, the interactions for building the needed common ground. Secondly, the interaction between the problem and reality. Thirdly, the interaction between the problem and parties’ mental models und understanding about the problem.

The outcome of these interactions is that the problem and its requirements go through several iterations in which knowledge from the previous iterations are incorporated successively. With each important piece of knowledge gained, it's important to revise and refine the question(s), respectively the problem. If in each iteration there are also programming and further technical activities involved, the effort and costs resulted in the process can explode, while the timeline expands accordingly.

There are several heuristics that could be devised to address these challenges: (1) build all the required knowledge in one person, either on the business or the technical side; (2) make sure that the parties have the required knowledge for approaching the problems in scope; (3) make sure that the gaps between reality and parties' mental models is minimal; (4) make sure that the requirements are complete and understood before starting the development; (5) adhere to methodologies that accommodate the necessary iterations and endeavor's particularities; (6) make sure that there's a halt condition for regularly reviewing the progress, respectively halting the work; (7) build an organizational culture to support all this.

The list is open, and the heuristics aren't exclusive, so in theory any combination of them can be considered. Ideally, an organization should reflect all these heuristics in one form or another. The higher the coverage, the more mature the organization is. The question is how organizations with a suboptimal setup can change the status quo?

Previous Post <<||>> Next Post

04 February 2021

📦Data Migrations (DM): Conceptualization (Part VI: Data Migration Layer)

Data Migrations Series

Besides migrating the master and transactional data from the legacy systems there are usually three additional important business requirements for a Data Migration (DM) – migrate the data within expected timeline, with minimal disruption for the business, respectively within expected quality levels. Hence, DM’ timeline must match and synchronize with main project’s timeline in terms of main milestones, though the DM needs to be executed typically within a small timeframe of a few days during the Go-Live. In what concerns the third requirement, even if the data have high quality as available in the source systems or provided by the business, there are aspects like integration and consistency that rely primarily on the DM logic.

To address these requirements the DM logic must reach a certain level of performance and quality that allows importing the data as expected. From project’s beginning until UAT the DM team will integrate the various information iteratively, will need to test the changes several times, troubleshoot the deviations from expectations. The volume of effort required for these activities can be overwhelming. It’s not only important for the whole solution to be performant but each step must be designed so that besides fast execution, the changes and troubleshooting must involve a minimum of overhead.

For better understanding the importance, imagine a quest game in which the character has to go through a labyrinth with traps. If the player made a mistake he’ll need to restart from a certain distant point in time or even from the beginning. Now imagine that for each mistake he has the possibility of going one step back try a new option and move forward. For some it may look like cheating though in this way one can finish the game relatively quickly. It would be great if executing a DM could allow the same flexibility.

Unfortunately, unless the data are stored between steps or each step is a different package, an ETL solution doesn’t provide the flexibility of changing the code, moving one step behind, rerunning the step and performing troubleshooting, and this over and over again like in the quest game. To better illustrate the impact of such approach let’s consider that the DM has about 40 entities and one needs to perform on average 20 changes per entity. If one is able to move forwards and backwards probably each change will take about a few minutes to execute the code. Otherwise rerunning a whole package can take 5-10 times or even more as this can depend on packages’ size and data volume. For 800 changes only an additional minute per change equates with 800 minutes (about 13 hours).

In exchange, storing the data for an entity in a database for the important points of the processing and implementing the logic as a succession of SQL scripts allows this flexibility. The most important downside is that the steps need to be executed manually though this is a small price to pay for the flexibility and control gained. Moreover, with a few tricks one can load deltas as in the case of a phased DM.

To assure that the consistency of the data is kept one needs to build for each entity a set of validation queries that check for duplicates, for special cases, for data integrity, incorrect format, etc. The queries can be included in the sequence of logic used for the DM. Thus, one can react promptly to each unexpected value. When required, the validation rules can be built within reports and used in the data cleaning process by users, or even logged periodically per entity for tracking the progress.

Previous Post <<||>> Next Post

05 January 2021

🧮ERP: Implementations (Part III: It’s all about Scope I - Functional Requirements)

ERP Implementations Series

Introduction

ERP (Enterprise Resource Planning) Implementations tend to be expensive projects, often the actual costs overrunning the expectations by an important factor. The causes for this are multiple, the most important ones ranging from the completeness and complexity of the requirements and the impact they have on the organization to the availability of internal and external skilled resources, project methodology, project implementation, organization’s maturity in running projects, etc.

The most important decision in an ERP implementation is deciding what one needs, respectively what will be considered for the implementation, aspects reflected in a set of functional and nonfunctional requirements.

Functional Requirements

The functional requirements (FRs) reflect the expected behavior of the system in respect to the inputs and outputs – what the system must do. Typically, they encompass end-users’ requirements in the area of processes, interfaces and data processing, though are not limited to them.

The FRs are important because they reflect the future behavior of the system as perceived by the business, serving further as basis for identifying project’s scope, the gaps between end-users’ requirements and system’s functionality, respectively for estimating project’s duration and areas of focus. Further they are used as basis for validating system’s behavior and getting the sign-off for the system. Therefore, the FRs need to have the adequate level of detail, be complete, clear, comprehensible and implementable, otherwise any gaps in requirements can impact the project in adverse ways. To achieve this state of art they need to go through several iterations in which the requirements are reevaluated, enhanced, checked for duplication, relevance or any other important aspect. In the process it makes sense to categorize the requirements and provide further metadata needed for their appraisal (e.g. process, procedure, owner, status, priority).

Once brought close to a final form, the FRs are checked against the functionality available in the targeted system, or systems when more systems are considered for evaluation. Ideally all the requirements can be implemented with the proper parametrization of the systems, though it’s seldom the case as each business has certain specifics. The gaps need to be understood, their impact evaluated and decided whether the gaps need to be implemented. In general, it’s recommended to remain close to the standard functionality, as each further gap requires further changes to the system, gaps that in time can generate further quality-related and maintenance costs.

It can become a tedious effort, as in the process an impact and cost-benefit analysis need to be performed for each gap. Therefore, gaps’ estimation needs to occur earlier or intermixed with their justification. Once the list of the FRs is finalized and frozen, they will be used for estimating the final costs of the project, identifying the work packages, respectively planning the further work. Once the FRs frozen, any new requirements or changes to requirements (including taking out a requirement) need to go through the Change Management process and all the consequences deriving from it – additional effort, costs, delays, etc. This can trigger again an impact and cost-benefit analysis.

The FRs are documented in a specification document (aka functional requirement specification), which is supposed to track all the FRs through their lifetime. When evaluating the FRs against system’s functionality it’s recommended to provide general information on how they will be implemented, respectively which system function(s) will be used for that purpose. Besides the fact that it provides transparence, the information can be used as basic ground for further discussions.

Seldom all the FRs will be defined upfront or complete. Moreover, some requirements will become obsolete during project’s execution, or gaps will be downgraded as standard and vice-versa. Therefore, it’s important to recollect the unexpected.

Previous Post <<||>> Next Post

11 May 2018

🔬Data Science: K-Means Algorithm (Definitions)

"A top-down grouping method where the number of clusters is defined prior to grouping." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"An algorithm used to assign K centers to represent the clustering of N points (K< N). The points are iteratively adjusted so that each of the N points is assigned to one of the K clusters, and each of the K clusters is the mean of its assigned points." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k = n. The algorithm minimizes the total intra-cluster variance or the squared error function." (Dimitrios G Tsalikakis et al, "Segmentation of Cardiac Magnetic Resonance Images", 2009)

"The k-means algorithm assigns any number of data objects to one of k clusters." (Jules H Berman, "Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information", 2013)

"The clustering algorithm that divides a dataset into k groups such that the members in each group are as similar as possible, that is, closest to one another." (David Natingga, "Data Science Algorithms in a Week" 2nd Ed., 2018)

"K-Means is a technique for clustering. It works by randomly placing K points, called centroids, and iteratively moving them to minimize the squared distance of elements of a cluster to their centroid." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"It is an iterative algorithm that partition the hole data set into K non overlaping subsets (Clusters). Each data point belongs to only one subset." (Aman Tyagi, "Healthcare-Internet of Things and Its Components: Technologies, Benefits, Algorithms, Security, and Challenges", 2021)

[Non-scalable K-means:] "A Microsoft Clustering algorithm method that uses a distance measure to assign a data point to its closest cluster." (Microsoft Technet)

"An algorithm that places each value in the cluster with the nearest mean, and in which clusters are formed by minimizing the within-cluster deviation from the mean." (Microsoft, "SSAS Glossary")

31 December 2007

🏗️Software Engineering: Iterative Development (Just the Quotes)

"Systems with unknown behavioral properties require the implementation of iterations which are intrinsic to the design process but which are normally hidden from view. Certainly when a solution to a well-understood problem is synthesized, weak designs are mentally rejected by a competent designer in a matter of moments. On larger or more complicated efforts, alternative designs must be explicitly and iteratively implemented. The designers perhaps out of vanity, often are at pains to hide the many versions which were abandoned and if absolute failure occurs, of course one hears nothing. Thus the topic of design iteration is rarely discussed. Perhaps we should not be surprised to see this phenomenon with software, for it is a rare author indeed who publicizes the amount of editing or the number of drafts he took to produce a manuscript." (Fernando J Corbató, "A Managerial View of the Multics System Development", 1977)

"One of the purposes of planning is we always want to work on the most valuable thing possible at any given time. We can’t pick features at random and expect them to be most valuable. We have to begin development by taking a quick look at everything that might be valuable, putting all our cards on the table. At the beginning of each iteration the business (remember the balance of power) will pick the most valuable features for the next iteration." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"It is a myth that we can get systems 'right the first time'. Instead, we should implement only today’s stories, then refactor and expand the system to implement new stories tomorrow. This is the essence of iterative and incremental agility. Test-driven development, refactoring, and the clean code they produce make this work at the code level." (Robert C Martin, "Clean Code: A Handbook of Agile Software Craftsmanship", 2008)

"Agile methods universally rely on an incremental approach to software specification, development, and delivery. They are best suited to application development where the system requirements usually change rapidly during the development process. They are intended to deliver working software quickly to customers, who can then propose new and changed requirements to be included in later iterations of the system. They aim to cut down on process bureaucracy by avoiding work that has dubious long-term value and eliminating documentation that will probably never be used." (Ian Sommerville, "Software Engineering" 9th Ed., 2011)

"When you write a computer program you've got to not just list things out and sort of take an algorithm and translate it into a set of instructions. But when there's a bug - and all programs have bugs - you've got to debug it. You've got to go in, change it, and then re-execute … and you iterate. And that iteration is really a very, very good approximation of learning." (Nicholas Negroponte, "A 30-year history of the future", [Ted Talk] 2014)

"Feedback is what makes it iterative; otherwise, it is just mini-waterfall. Merely splitting use cases into stories does not make for iterative development if we wait until all stories are developed before we seek feedback. The point of splitting is to get feedback faster so that it can be incorporated into ongoing development. However, seeking stakeholder/user feedback for small batches of functionality (stories) is often not feasible with formal stage-gate processes. They were conceived with linear flows of large batches in mind." (Sriram Narayan, "Agile IT Organization Design: For Digital Transformation and Continuous Delivery", 2015)

"This is what the Agile Manifesto means when it says responding to change over following a plan. To maximize adaptability, it is essential to have good, fast feedback loops. This is why there is so much emphasis on iterative development." (Sriram Narayan, "Agile IT Organization Design: For Digital Transformation and Continuous Delivery", 2015)

SQL Troubles

Pages