Systematic Debugging

Photo Credit: Thomas Altfather Good

During my first few years as a team lead, I worked with many new developers who struggled debugging large systems.  This motivated me to pay attention to people who could get to the bottom of almost anything: what did they do differently?

The difference I discovered was having systematic strategy: an algorithm. I documented this strategy and found it helped numerous developers get to the root of tough problems.

Since then, I’ve found many experienced developers also do not have an explicit algorithm for debugging.  If you don’t have one, or don’t have a good one, there are two reasons you need to learn:

  1. Approaching debugging systematically is the only way to get consistent results
  2. Understanding your algorithm explicitly is critical to teaching the skill to others, which grows in importance as you become a senior developer yourself.

I hope you’ll love this article, but you can also jump to the cheat sheet here.

Debugging as a Search Algorithm

Let’s start by defining debugging in terms of a class of algorithms you are already familiar with: search.

We can view the code as a tree or graph (functions) containing ordered lists (lines of code). We can additionally think of any given execution of the program as one big ordered list: all of the lines of code executed (at least for single-threaded software).

Defining the Search Space

As in any algorithm, the first step is to define the problem precisely.

First, define the problem clearly 

  • What is the expected behaviour?
  • How does this differ from what happens?

Second, restate that problem in terms of a search space:

  • What is the latest point in execution before things have gone bad*?
  • What is the earliest point in execution after things have gone bad*?

Ideally you can isolate the scope of the problem to a single module or method. Worst case, your program state was good before it started to execute and is bad at the first point where you observe the bug. 

*I’m using “gone bad” as short-had for “violated a relevant program invariant.” If you aren’t familiar with invariants, any good programming fundamentals book should cover it, or check out Wikipedia. The core idea is that we’re looking for the earliest point where something about the program state related to our bug is incorrect.

Binary Search

Given an ordered list, the one of the first algorithms that should come to mind is a binary search.

Binary search is simple and it converges in log(n) time. This means that binary search will allow you to find problems in a large code base almost as quickly as in a small one. 

Binary search does require that you have a sorted list and an efficient way to determine at any point whether you are before, at, or after the point you are looking for. It’s not obvious that this applies to most debugging problems, so let’s dig in and see how it works:

Firstly, you need to define your search space, as described above.

Now, bisect the problem. Identify a point in the execution approximately in the middle of your search space, where you will be able to determine if the bug has occurred yet or not. It is more important that you be able to confidently determine if the bug has occurred than that you make the decision exactly at the mid point. In familiar code bases you should be able to find such a point with a little thought, most of the time. 

Now, debug to the mid-point identified and determine if the bug has already occurred: has a state invariant been violated? If so, the problem is in the first half of your search space, otherwise it is in the second half.

Repeat a few times and you have solved the problem. 

Of course, you may not always know the code well enough to bisect the search space.

Breadth-First Search

When you aren’t familiar enough with the code to jump directly to a binary search, you can still avoid just tracing through the whole application by applying a breadth-first search. 

All but the absolute worst spaghetti code has at least some structure: functions (methods/procedures) that start at a high level and call into deeper levels.

Start from the beginning of your search space.

Read and debug/trace through that function to learn how it works. Do not trace into called functions at this point. As you understand the outermost function better, either:

  1. Identify the line/block where execution goes off the rails. In this case you have narrowed your search space to the code behind that function call. Or
  2. Develop a test for correctness near the middle of that function, possibly by drilling down just a little deeper near the middle. 

In either case you have now reduced your search space by at least half while probably debugging only a few dozen lines of code. Repeat this procedure iteratively until you have identified the root cause.

In some programming languages, or styles, the structure can vary a little. The key idea here is to first focus on the high level and identify, with confidence, at what point things go wrong. Then drill down.

Iterate (work the system)

Each of the techniques above will reduce your search space by at least half. The time to complete each iteration depends on the complexity and modularity of the code, but it’s usually 30-120 minutes. After a few iterations you’ll be narrowing in on problems in even the largest code bases.

Feel free to mix the breadth-first search (BFS) and binary searches. I’ve often gone back and forth. Perhaps I’m familiar with the overall workings of a module, so I’m able to binary search a few times before getting into a specific function I’m not familiar with, where I begin a BFS. Other times I may not be confident where the problem starts, so I use BFS to narrow it down, then learn it’s in familiar territory and switch back a binary search.

If you’re new to this system, or working on a really tough problem, get a notebook or piece of paper and write down each iteration. This can also be helpful if you’re frequently interrupted. 

Iteration Start TimeSearch Space StartSearch Space EndNotes
1
2
3

Edge cases

Here are a few questions I’ve answered about this process in the past.

Reality Check: Hunches

If you are familiar with a code base, you may have a good idea where the problem is. This can save a lot of time, and you should take advantage of your knowledge.  However, it can also be a trap. 

Jumping to the most likely problematic line of code, fix it, then test. This is satisfying when it works, but it’s less effective than it seems. When you don’t take the time to fully understand the problem before fixing it, two things can happen:

  1. Your first guess was wrong, so you try again, and again. This is more obviously not getting great results, but most people will attribute this to a tough problem, rather than their own poor skill.
  2. You fix a symptom rather than the root cause, or miss an edge case. This creates more work for the future. This is even more dangerous, since it has the superficial appearance of productivity.

Two techniques that can help mitigate these risks, without slowing you down much, are:

  1. Test Assumptions – whenever you complete a “quick fix” spend an extra 10 minutes to double-check that you really understand the problem and look for edge cases. Carefully read the surrounding code or review state in your debugger.
  2. Timebox – When debugging, always timebox your progress. If you haven’t unambiguously narrowed your problem scope in the last 1-2 hours, stop what you’re doing and take a more systematic approach.

Non Reproducible Bugs

The approach above assumes that you can reproduce the problem or bug at will.  However the same ideas apply to problems that are difficult or impossible to reproduce. 

In these cases, it becomes even more important to accurately define the problem and the expected behavior.  Without the ability to see the problem yourself, the risk of a misunderstanding is magnified. 

The breadth-first search technique extends to debugging by inspection. Understand the code and develop hypotheses for what could have solved the problem. In some cases you will find only one possibility – congratulations, you just solved the problem. If not, test them. 

Binary search also remains effective.  However, since your iteration time is likely measured in days or weeks rather than minutes, you may want to test more than one hypothesis at a time. In this case, it’s important that you have hypotheses and tests covering all possibilities. In particular, it is good to have tests that are guaranteed to narrow your search space even if your primary theory is wrong. 

Debugging tools

Setup your debugger and get familiar with its strengths and its quirks. There are very few environments where a good debugger is not available. However, there are a lot of developers who seem to think they don’t need one.

Debuggers don’t change the overall equation, the Big-O efficiency, of debugging, but they can be a huge multiplier. Variable inspection alone is at least twice as fast as adding log statements. It allows you to drill down in one pass if you realise you need a bit more detail. Many debuggers also allow you to change variables or even modify the code state or execution position live. These interactive tools can enable you to run tests in minutes that might have taken hours otherwise. 

Systematic Debugging Cheat Sheet

You can download a free cheat-sheet for this process here.

Put it into action

  1. Download the Cheat Sheet (print it if you’re old-school)
  2. Try the process next time you have a tough bug, and
  3. If you’re a pro already: try using this next time you’re helping a junior developer with tough problem

Discuss

Use the comments section below to share your successes, and let me know what you think. Do you have some great ideas I didn’t mention above? 

The forum is moderated and your first post may not appear until it has been manually approved. 

3 thoughts on “Systematic Debugging”

  1. If something is broken and you know it use to work, git can help you use the bisect approach to debug as well. You can use “git bisect” and specify the last working revision, and it will start a binary search on commit history and help you narrow down exactly where things went wrong.

    • Great point Rob!
      Revision-history-search is a good way to track down recent regressions. It also doesn’t depend on a deep knowledge of the code and narrows the problem down to one commit in O(log(n)) time, if done correctly. Perhaps I’ll have to update the cheat sheet soon!

      That said, it does have a dark side, both powerful and seductive: It’s the easiest debugging technique there is, but it’s not always the fastest.

      I admit to joining the dark side myself a time or two: checkout, build, grab a coffee, test, repeat. I needed a full build since incremental ones weren’t 100% reliable when rolling back…. more time for coffee.

      The two big factors are how long the “checkout-build-test” loop takes, and how long you think it would take to converge using the direct debugging approach. This approach definitely has it’s place in the arsenal.

  2. The checkout-build-test loop is definitely slow, so it’s not the first technique I’d use. The last time I resorted to git bisect, I was staring at some broken code and it seemed nothing had changed since the last working revision. Both copies of the relevant code looked exactly the same, so what gives?

    I eventually narrowed it down to a version change on a dependent libraries that was sharing with another component.

Leave a Comment