24 April 2010

Troubleshooting – Part I: The Problem Solving Approach


    On several occasions I have observed that there are SQL and non-SQL developers who don’t know how to troubleshoot a programming problem in general, or a SQL-related issue in particular. I’m not referring here to the complex problems that typically require the expertise of a specialist, but to simple day-to-day situations: troubleshooting an error thrown by the database engine, an error in the logic, a performance issue, unavailability of resources, etc. Nor am I necessarily talking about the people posting questions on forums, professional networks or blogs, even if in many situations they could have found an answer by doing a little research; I’m talking about what I have seen watching developers at work. It’s true that there are also many cases in which the software throws an error message that gives you no idea where to start, or in which the error is reported at a line other than the one where it actually occurs, leading the developer down a false trail.

    Before going into detail, let’s take a short look at troubleshooting and what it means. Paraphrasing Wikipedia’s general definition, troubleshooting in IT is a type of problem solving applied to software and infrastructure-related issues. Software issues include not only the various types of errors thrown by software applications, but also functional, rendering or configuration errors, performance issues, data quality issues, etc. Infrastructure-related issues could refer to the IT infrastructure: network, information systems, processes, methods or methodologies used. In this post I will refer only to software issues, even if the techniques for troubleshooting them could also be applied to infrastructure issues.

Polya’s Approach to Problem Solving

    In his book ‘How To Solve It’, G. Polya, a well-known Hungarian mathematician, advances a natural four-step approach to solving a problem: 1. understanding the problem, 2. devising a plan, 3. carrying out the plan, and 4. looking back [1]. Polya’s approach could be used for all types of problems, including IT problems, and even if I find it too high-level for solving this type of problem, it’s actually a cornerstone on which more detailed approaches could be built. Let’s look briefly at each of Polya’s four steps!

1. Understanding the problem
    Understanding the problem comes down to identifying what is known (the data, the actual facts) and what is not known (what causes the issue and how it will be solved). Somebody said that a problem well understood is half solved, and there is quite a good chance of arriving at the wrong solution if the problem is not well understood. While in mathematics the problem is defined beforehand together with the whole range of constraints, in IT troubleshooting the problem first needs to be defined. In the context of this post the problem revolves around a technical or business issue appearing in the form of an error message, unexpected or wrong application behavior, a wrong process, etc. Thus the actual facts could come down to the error message, current vs. expected behavior, tools used, high/low-level design, business logic, affected objects, specific constraints, etc.
    Defining the issue might not be as simple as it seems, especially when the issue is reported by other people in informal, non-technical terminology, with fuzzy formulations like “there is an error in the XYZ screen” that don’t detail what the issue is about, the steps followed, the input that resulted in the issue, and other such aspects that need to be addressed in order to understand the problem. All these aspects are initially unknown to the developer, though with a little investigation they are transformed into known information; this involves communicating with the users, looking in the documentation, and gathering any other facts. Actually we could group all these actions under the label “gathering the facts”, and consider them part of this step because they are intrinsic to understanding the problem.

2. Devising a plan
    In this step we attempt to find the connection between the data and the unknown, looking at the problem from different angles in order to obtain an idea of the solution, to make a plan [1]. We have a plan when we know which steps we have to follow in order to identify the issue (solve the problem). The steps don’t have to be too detailed, only actionable; the plan need not be complete, only a base that can evolve over time, for example when new information or results are found. There could be multiple directions to look into, based for example on a list of possible causes, the constraints the various features come with, different features for implementing the same thing, etc.
    Naturally, the first questions a developer should ask are: have I seen this issue before, in the same or a slightly modified form? Could the problem be broken down into smaller (known) problems? Could anything useful be derived from the data? Have all the essential notions involved in the problem been considered [1]? Essential notions are always worth looking into, mainly because I would say that many issues derive from feature constraints or from misuse of features. Tools such as brainstorming, checklists, root-cause analysis and conceptual mapping could be used here, in fact any tool that helps us track the essential notions and the relations between them.

3. Carrying out the plan
    Once the plan is sketched, we can go on and approach each of its branches, performing the successive steps in one branch until we reach an end-point (a point from which we can’t go further). There could be branches going nowhere, multiple solutions, or no apparent solution to the problem. Everything is possible… Most likely, while carrying out the plan we will discover other intermediary steps and other branches (alternative ways of arriving at the same result or of approaching different constraints).
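    To make the idea of branches and end-points a bit more tangible, here is a minimal Python sketch of a plan walked branch by branch. It is only an illustration: the `Step` structure, the `carry_out` function and the example plan about a query returning wrong totals are all hypothetical names and data of my own, and the `check` callables stand in for whatever real investigation a step would require.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One step of a troubleshooting plan (hypothetical structure)."""
    description: str
    # Returns True when this step reveals a cause of the issue.
    check: Callable[[], bool]
    # Follow-up steps to try when this step does not reveal a cause.
    children: List["Step"] = field(default_factory=list)


def carry_out(step: Step, path: Optional[List[str]] = None) -> List[List[str]]:
    """Walk the plan depth-first; return every path that ends in a cause."""
    path = (path or []) + [step.description]
    if step.check():
        return [path]  # end-point: a cause was found on this branch
    found: List[List[str]] = []
    for child in step.children:
        found.extend(carry_out(child, path))
    return found  # an empty list means this branch went nowhere


# Example plan; the lambdas are placeholders for real checks.
plan = Step("query returns wrong totals", lambda: False, [
    Step("check join conditions", lambda: False, [
        Step("duplicate rows from one-to-many join", lambda: True),
    ]),
    Step("check filter on dates", lambda: False),
])

solutions = carry_out(plan)
for solution in solutions:
    print(" -> ".join(solution))
```

    The function returns a list of paths rather than stopping at the first hit, because a plan can end in several solutions, or in none when every branch goes nowhere.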

4. Looking back
    According to Polya, this step comes down to examining the solution [1]: reviewing the argumentation used and the solution’s construction, asking whether the solution is optimal, whether it could be reused to solve other types of problems, and whether it could be improved or refactored. Actually this is a step many developers completely ignore: they found a solution, it’s working, so their work is done! No, even when pressed for time these aspects of problem solving should be considered too, and from my point of view this step also includes documenting the issue and, in special cases, communicating the solution to the circle of professionals (e.g. as best practices or lessons learned, why not as a blog post, etc.). Topics like optimality and refactoring are quite complex and deserve a post of their own, therefore I will limit myself to mentioning that they typically consider the solution from the point of view of performance, complexity, (re)usability and design, the developer having to trade off between these and other similar (quality) dimensions.

Beyond Polya’s Approach

    A natural question: do we really have to follow this approach?! Come on, there will be cases when you’ll have the solution without explicitly attempting to define the problem or devise a plan, or only by listing the scope and the constraints! Unconsciously we are actually following the first three steps, but forget or completely ignore the fourth, and I feel that applying Polya’s approach brings some “conscious thought” into this process that could help us make the most of it.
    In many cases the solution will be there in the documentation, which gives developers explicit or implicit hints about the areas in which to search. For example, in the case of an error related to a query, the first input is the error message returned by the database engine. Fortunately RDBMS vendors like Microsoft and Oracle also provide a longer description for each error, which helps in understanding what the error message is about. This is the happiest case; there are also many software tools that, after running for half an hour, return a fuzzy error message (e.g. ‘an error occurred!’) and nothing more.
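    As a small illustration of treating the engine’s message as the first input, here is a sketch in Python. Note the assumptions: it uses SQLite through Python’s built-in `sqlite3` module rather than SQL Server or Oracle, simply because it runs anywhere, and the `run_and_capture` helper, the `orders` table and the misspelled column are made up for the example.

```python
import sqlite3


def run_and_capture(conn, sql):
    """Run a statement; return (rows, None) on success or (None, error text)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        # Keep the exception type and the engine's own wording: this is
        # the first fact to record, and later to look up or search for.
        return None, f"{type(exc).__name__}: {exc}"


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# Deliberate typo in the column name ("amout" instead of "amount").
rows, error = run_and_capture(conn, "SELECT amout FROM orders")
print(error)
```

    The point is not the helper itself but the habit: capture the message exactly as the engine returned it, instead of swallowing it or replacing it with a generic ‘an error occurred!’.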

    Thank God for the internet, a dynamic knowledge repository in which a lot of knowledge can be found with just a simple click, though the results are sensitive to the input. In many cases I could find one or more solutions or hints for an error I had, usually just by copy-pasting the error number or the error description or, when the description is too long, only its most important part. I have observed that quite a large number of professionals prefer to post their issue in a forum or professional group instead of doing some research by themselves, and this lack of effort increases the volume of redundant information on the web, with negative but also positive implications. When we perform such a search on the internet we actually rely on the solution provided by other users, shortcutting the troubleshooting process and, at the risk of repeating the same phrase, this too comes with negative but also positive implications. A negative aspect is that people don’t learn how to troubleshoot by themselves, relying instead on readily available solutions, while a positive aspect is that less time is spent on troubleshooting, at least in theory. Actually, considering that positive aspect, that’s also why I consider the “looking back” step important, and I’m referring especially to documenting the issue.

[1] G. Polya (1973) How To Solve It: A New Aspect of Mathematical Method, 2nd Ed. Stanford University. ISBN: 0-691-08097-6.
