My favorite bug: segfaults in Java (redux)

Two years ago, I wrote about one of my favorite bugs that I’d squashed two years before that. About a year after that, someone posted it on Hacker News.

There was some fun discussion about it, but also some confusion. After finishing a season of mentoring team 4272, I’ve decided that it would be fun to revisit the article and dig up the actual old code instead of pseudo-code, hopefully improving the clarity (and providing a light introduction for anyone wanting to get into modifying the current SmartDashboard).

The context

In 2012, I was a high school senior, and lead programmer on FIRST Robotics Competition team 1024. For the unfamiliar, the relevant part of the setup is that matches last 2 minutes and 15 seconds, during which your 120 pound robot sometimes runs autonomously, and is sometimes controlled over WiFi by a person at a laptop running stock “driver station” software and modifiable “dashboard” software.

That year, we mostly used the dashboard software to allow the human driver and operator to monitor sensors on the robot, one of them being a video feed from a web-cam mounted on it. This was really easy because the new standard dashboard program had a click-and-drag interface to add stock widgets; you just had to make sure the code on the robot was actually sending the data.

That was great, until, while debugging things, the dashboard would suddenly vanish. If it was run manually from a terminal (instead of letting the driver station software launch it), you would see a core dump indicating a segmentation fault.

This wasn’t just us either; I spoke with people on other teams, and everyone who was streaming video had this issue. But because it only happened every couple of minutes, and the dashboard only had to survive a 2:15 match, they just crossed their fingers and hoped it didn’t happen during a match.

The dashboard was written in Java, and the source was available (under a 3-clause BSD license) via read-only SVN at http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk (which is unfortunately no longer online; fortunately, I’d posted some snapshots on the web). So I dove in, hunting for the bug.

The repository was divided into several NetBeans projects (not exhaustively listed):

client/smartdashboard: The main dashboard program, has a plugin architecture.
WPIJavaCV: A higher-level wrapper around JavaCV, itself a Java Native Interface (JNI) wrapper to talk to OpenCV (C and C++).
extensions/camera/WPICameraExtension: The standard camera feed plugin, processes the video through WPIJavaCV.

I figured that the bug must be somewhere in the C or C++ code that was being called by JavaCV, because those are the languages where segfaults happen. It was especially a pain to track down the pointers that were causing the issue, because it was hard with native debuggers to see through all of the JVM stuff to the OpenCV code, and the OpenCV stuff is opaque to Java debuggers.

Eventually the issue led me back into the WPICameraExtension, then into WPIJavaCV. There was a native pointer being stored in a Java variable; Java code called the native routine to free() the structure, but then tried to feed it to another routine later. This led to difficulty again: tracking objects with Java debuggers was hard because they don’t expect the program to suddenly segfault; it’s Java code, Java doesn’t segfault, it throws exceptions!

With the help of println() I was eventually able to see that some code was executing in an order that straight didn’t make sense.

The bug

The basic flow of WPIJavaCV is you have a WPICamera, and you call .getNewImage() on it, which gives you a WPIImage, which you could do all kinds of fancy OpenCV things on, but then ultimately call .getBufferedImage(), which gives you a java.awt.image.BufferedImage that you can pass to Swing to draw on the screen. You do this for every frame. Which is exactly what WPICameraExtension.java did, except that “all kinds of fancy OpenCV things” consisted only of:

public WPIImage processImage(WPIColorImage rawImage) {
    return rawImage;
}

The idea was that you would extend the class, overriding that one method, if you wanted to do anything fancy.

One of the neat things about WPIJavaCV was that every class wrapping an OpenCV object had a finalize() method (inherited from the abstract class WPIDisposable) that freed the underlying C/C++ memory, so you didn’t have to worry about memory leaks like in plain JavaCV. To inherit from WPIDisposable, you had to write a disposed() method that actually freed the memory. This was better than writing finalize() directly, because WPIDisposable added some safety checks around NULL pointers, and made freeing idempotent in case you wanted to manually free something early.
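To make that pattern concrete, here is a minimal sketch of how such a base class might look. This is not the WPIJavaCV source: DisposableSketch, FakeImage, the dispose() method, and the isDisposed field are names I made up for illustration, and the "native memory" is just a counter. Only finalize(), disposed(), and validateDisposed() correspond to names mentioned in this article.

```java
// Hypothetical sketch of a WPIDisposable-style base class.
// Not the real WPIJavaCV code; names other than finalize(),
// disposed(), and validateDisposed() are assumptions.
abstract class DisposableSketch {
    private boolean isDisposed = false;

    /** Subclasses free their native memory here. */
    protected abstract void disposed();

    /** Idempotent: only the first call actually frees anything. */
    public synchronized void dispose() {
        if (!isDisposed) {
            isDisposed = true;
            disposed();
        }
    }

    /** Guard for methods that must not touch freed memory. */
    protected synchronized void validateDisposed() {
        if (isDisposed) {
            throw new IllegalStateException("already disposed");
        }
    }

    // finalize() is deprecated in modern Java; it is shown here only
    // because it is the mechanism this article is about.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            dispose();
        } finally {
            super.finalize();
        }
    }
}

/** Stand-in for an image class; the counter replaces a native free(). */
class FakeImage extends DisposableSketch {
    int releaseCount = 0;

    @Override
    protected void disposed() {
        releaseCount++; // a real class would call image.release() here
    }

    int getWidth() {
        validateDisposed();
        return 640;
    }
}
```

With this shape, freeing early by hand and then letting the finalizer run later is harmless, because the second dispose() is a no-op.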

Now, edu.wpi.first.WPIImage.disposed() called com.googlecode.javacv.cpp.opencv_core.IplImage.release(), which called (via JNI) IplImage::release(), which called libc free():

@Override
protected void disposed() {
    image.release();
}

Elsewhere, the C buffer for the image was copied into a Java buffer via a similar chain kicked off by edu.wpi.first.WPIImage.getBufferedImage():

/**
 * Copies this {@link WPIImage} into a {@link BufferedImage}.
 * This method will always generate a new image.
 * @return a copy of the image
 */
public BufferedImage getBufferedImage() {
    validateDisposed();

    return image.getBufferedImage();
}

The println() output I saw that didn’t make sense was that someFrame.finalize() was running before someFrame.getBufferedImage() had returned!

You see, if the JVM is waiting for the return value of a method m() of object a, and the code in m() that is yet to be executed doesn’t access any other methods or properties of a, then the JVM will go ahead and consider a eligible for garbage collection before m() has finished running.

Put another way, this is passed to a method just like any other argument. If a method is done accessing this, then it’s “safe” for the JVM to go ahead and garbage collect it.

That is normally a safe “optimization” to make… except for when a destructor method (finalize()) is defined for the object; the destructor can have side effects, and Java has no way to know whether it is safe for them to happen before m() has finished running.

I’m not entirely sure if this is a “bug” in the compiler or the language specification, but I do believe that it’s broken behavior.

Anyway, in this case it’s unsafe with WPI’s code.

My work-around

My work-around was to change this function in WPIImage:

public BufferedImage getBufferedImage() {
    validateDisposed();

    return image.getBufferedImage(); // `this` may get garbage collected before it returns!
}

In the above code, this is a WPIImage, and it may get garbage collected between the time that image.getBufferedImage() is dispatched, and the time that image.getBufferedImage() accesses native memory. When it is garbage collected, it calls image.release(), which free()s that native memory. That seems pretty unlikely to happen; that’s a very small gap of time. However, running 30 times a second, eventually bad luck with the garbage collector happens, and the program crashes.

The work-around was to insert a bogus call on this, to keep this around until after we were also done with image. I changed the function to this:

public BufferedImage getBufferedImage() {
    validateDisposed();
    BufferedImage ret = image.getBufferedImage();
    getWidth(); // bogus call to keep `this` around
    return ret;
}

Yeah. After spending weeks wading through thousands of lines of Java, C, and C++, a bogus call to a method I didn’t care about was the fix.

TheLoneWolfling on Hacker News noted that they’d be worried about the JVM optimizing out the call to getWidth(). I’m not, because WPIImage.getWidth() calls IplImage.width(), which is declared as native; the JVM must run it because it might have side effects. On the other hand, looking back, I think I just shrank the window for things to go wrong: it may be possible for garbage collection to trigger in the time between getWidth() being dispatched and width() running. Perhaps there was something in the C/C++ code that made it safe; I don’t recall, and don’t care quite enough to dig into OpenCV internals again. Or perhaps I’m mis-remembering the fix (which I don’t actually have a copy of), and I called some other method that could get optimized out (though I do believe that it was either getWidth() or getHeight()).
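As an aside for anyone hitting the same problem today: Java 9 added java.lang.ref.Reference.reachabilityFence(), which is meant for this situation and did not exist in 2012. It tells the JVM that an object must be treated as reachable at least until that point, so no bogus method call is needed. Below is a rough sketch of how the method might look with it. FencedImageSketch is a hypothetical stand-in, and a plain BufferedImage takes the place of the native IplImage; none of this is WPIJavaCV code.

```java
import java.awt.image.BufferedImage;
import java.lang.ref.Reference;

// Hypothetical stand-in for WPIImage; requires Java 9 or later.
class FencedImageSketch {
    // Takes the place of the IplImage that wraps native memory.
    private final BufferedImage image =
            new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);

    public BufferedImage getBufferedImage() {
        try {
            // Copy the pixels, as the real getBufferedImage() copied
            // the native buffer into a Java one.
            BufferedImage copy = new BufferedImage(
                    image.getWidth(), image.getHeight(), image.getType());
            copy.setData(image.getData());
            return copy;
        } finally {
            // Keep `this` reachable (and so not finalizable) until the
            // copy above has finished.
            Reference.reachabilityFence(this);
        }
    }
}
```

The try/finally shape follows the usage suggested in the reachabilityFence() documentation: the fence sits after the last use of the resource that the finalizer would free.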

WPI’s fix

Four years later, the SmartDashboard is still being used! But it no longer has this bug, and it’s not using my workaround. So, how did the WPILib developers fix it?

Well, the code now lives in git at collab.net, so I decided to take a look.

They stripped out WPIJavaCV from the main video feed widget, and now use a purely Java implementation of MPJPEG streaming.

However, the old video feed widget is still available as an extension (so that you can still do cool things with processImage), and it also no longer has this bug. Their fix was to put a mutex around all accesses to image, which should have been the obvious solution to me.
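I haven’t reproduced their code here, but the idea can be sketched roughly as follows, assuming the simplest form of a mutex in Java: synchronized methods. LockedImageSketch and its int[] "native buffer" are hypothetical stand-ins, not the actual WPILib change. The point is that the finalizer has to take the same lock as the reader, so it can’t free the memory while a read is in progress.

```java
// Hypothetical sketch of the "mutex around image" idea.
// Not the actual WPILib fix; all names are assumptions.
class LockedImageSketch {
    // Stands in for native memory; null means "freed".
    private int[] pixels = {1, 2, 3};

    /** Reads the buffer while holding the lock on `this`. */
    public synchronized int[] getPixelsCopy() {
        if (pixels == null) {
            throw new IllegalStateException("already disposed");
        }
        return pixels.clone();
    }

    /** Frees the buffer; must wait for any in-progress read. */
    public synchronized void dispose() {
        pixels = null; // a real class would call image.release() here
    }

    // Deprecated in modern Java; shown only to mirror the old design.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() {
        dispose();
    }
}
```

Even if the garbage collector decides the object is unreachable in the middle of getPixelsCopy(), the finalizer blocks on the lock until the copy is done.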