When was the last time you had a bug and you were clueless on how to fix the issue ? Recently, I was in the same situation and ran out of ideas. I believe that the reason for this situation was my lack of knowledge in the problem domain that I was working on. Let me explain you the problem and the way I found out the solution. It will give you more idea on my sufferings. Btw it took me 4 days to figure out the solution :D.

 

Problem background:

I was implementing an algorithm [1] on a big data platform Apache Hama. The domain of the algorithm is Energy Informatics and uses Convex Optimization and ADMM [2] to make the problem distributed. Let’s leave the complexity a side and move directly to the main points.

I had to implement 2 main equations from [1].

Aggregator equaton

Electric vehicle equation

The goal of equation 20 is find the min value of x_n.  p^T, delta T, x_n^k, x mean, u^k are vectors of length 96. x_n is also a vector of length 96.

The goal of equation 26 is to find the min value of x_i. gamma, rho and alpha are constants. Everything else is a vector of length 96.

The distributed algorithm looks something like this. Image taken from [1].

EV ADMM algorithm

In the above diagram, Aggregator and EV’s (Electric Vehicle) represent a separate machine. The algorithm works like

  1. Aggregator initializes some vectors (u,x) and sends them to different EV machines.  We call this incentive signal.
  2. Each EV machine solves its equation (Equation 26 shown above).
  3. While EV’s are solving their equations, Aggregator solves its equation (shown in equation 20).
  4. Each EV returns the result back to the Aggregator.
  5. Aggregator picks up all the results and aggregates them.
  6. It updates the incentive signals (u,x)
  7. Aggregator sends the updated values to each machine and process continues until the optimal solution is achieved.

Now, you have a good idea of the problem. Lets see what issues I got that took me 4 days to fix.

Problem:

Initially, I ran the algorithm on 1 iteration only. I got a few logical errors and I fixed them in no time. Then I ran the algorithm for 2 iterations and my output was correct. To verify my output, I was comparing my output with the MATLAB implementation of the same algorithm but MATLAB code was serial but answers were pretty close.

Then I ran it for 10 iterations and my answer was way off. I was not able to understand why. I started to increase my iterations one by one. The output of 3rd iteration was also correct. But the output of 4th iteration was wrong. I was confused that how can it happen that the output of first 3 iterations is correct but the 4th is wrong. I thought maybe the output of last iteration is always wrong. Ran the algorithm to 5th iteration but the answer of 4th was still wrong.

Note: Output of each iteration depends the output of previous iteration. So, 5th iteration output was automatically wrong.

Clue 1: So, I knew that something goes wrong in 4th iteration.

Since, the output of each iteration depends on others, I thought that something must be going wrong in previous iterations. I did the following steps

  1. I checked all of my utility functions like vectorAdd, vectorScalerMultiply, vectorScalerAdd, vectorNorm, vectorMean and others. Everything looked fine.
  2. I checked my algorithm logic and everything also looked fine.

At this point, I was annoyed that everything looks fine and yet my answers are wrong. I went on console printing spree and printed everything that was going on the console in eclipse. The problem was I had too much data in console to understand everything. For example, each vector had 96 double values and in each iteration, I had like 6-7 different vectors. So, it was a mess. Further to complicate things, I was in a distributed mode and each machine was printing to the same console. (I was mimicking the distributed mode by running multiple separate tasks in the same machine). Check out part of my console.

EV ADMM Java console output

EV ADMM Java console output

So instead of printing the vectors on the console, I printed the sum of all the values in a vector. So, I reduced my output from 96 double values to a single double value for each vector. So, I printed everything including output and input to different equations. The combined output looked something like

EV ADMM averaged console output

EV ADMM averaged console output

local:X –> Represents an EV machine output. In my case I had 3 EV machines

M:=>local:0 –> Represents aggregator machine output.

Even though this approach is not the best but it gave me a very good overall understanding of all the data in my program. I generated a similar aggregated output from MATLAB code also. After comparison I got my second clue.

Clue #2: The input data to equation 20 was similar to MATLAB in 4th iteration but the output was wrong. So, something is going wrong in solving the equation 20. 

4th Iteration data comparison

4th Iteration data comparison

You can see in the above picture that the there is a difference between outputs of MATLAB and JAVA code. So, I focused just on this.

The problem here was that to solve the equations, I was using a different solver for MATLAB and JAVA. My first guess was that the solvers are buggy but which one ??

For JAVA -> CPLEX [3]

 

For MATLAB -> CVX [4]

 

Clue 3: Solver is buggy. Somehow verify the output

The only solution to that was to introduce a third solver. Generate the output and compare it with the output of other 2. So, I solved the equation using YALMIP solver [5] in MATLAB.

The output of YALMIP and CVX were similar. This means that something is wrong with JAVA solver CPLEX.

Clue#4: JAVA solver CPLEX is giving the wrong output.

Ok the output of JAVA solver is wrong but I just pass on some parameters and what CPLEX does internally is not in my control. So, what to do ?? I was clueless because the input was similar in MATLAB and JAVA and only the output was different. I had no idea what to do and was going to ask my professor for help. But an idea struck me. Maybe, just maybe, if I can somehow output the models generated by MATLAB and JAVA and compare them, I might find something. But how ?

Clue#5: Models generated by the solvers might be different even though everything else looks similar

I did some research online and luckily found some methods in both CPLEX and YALMIP. So from YALMIP, I generated a model for CPLEX. And from CPLEX, I generated a model for CPLEX. Both should be similar ideally.

YALMIP export method:

CPLEX export method:

So, I generated the models in textual format. They look something like

Pretty cool ??. Now, my job was to compare the files and look for a bug. I noticed something that YALMIP solver was defining some bounds on the equation like

but the bounds in CPLEX looked very different like

Note: Bounds are different from constraints. I was setting the -60 and 100000 constraints in JAVA. Bounds tell the solver that what is the min and max value of the variables that we are trying to find.

Clue#6: Something is wrong with bounds

The bound generated by CPLEX looked really familiar because it was Double.MIN_VALUE. So, where I have used it in my code ??. I found it, I used it in my optimize() method to define the bounds of my vector.

OH CRAP !

Changed it to the logically correct bounds.

 

VOILA !!!!

My code was fixed. This was the bug.

Who would have thought that a simple boundary condition can mess up the code so bad that it took multiple days just to figure out what was happening. I was really happy to finally figure out the issue because it was really important to me.

Btw why the output of 4th iteration was wrong and not the others ? 

I do not have a clear answer to this but my guess is that the output was also wrong for the first 3 iterations but the difference was not noticeable.

 

So, I being a computer scientist sometimes requires detective level debugging. It might cost you days but the journey is worth it. I learned so many extra stuff that otherwise I might not have tried.

Drop a comment, if you like this or you have been in a similar situation.

References:

[1] – https://mediatum.ub.tum.de/doc/1187583/1187583.pdf

[2] – https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

[3] – http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

[4] – http://cvxr.com/cvx/

[5] – http://users.isy.liu.se/johanl/yalmip/