My favorite bug: segfaults in Java (redux)
Two years ago, I wrote about one of my favorite bugs that I’d squashed two years before that. About a year after that, someone posted it on Hacker News.
There was some fun discussion about it, but also some confusion. After finishing a season of mentoring team 4272, I’ve decided that it would be fun to re-visit the article, and dig up the old actual code, instead of pseudo-code, hopefully improving the clarity (and providing a light introduction for anyone wanting to get into modifying the current SmartDashbaord).
The context
In 2012, I was a high school senior, and lead programmer programmer on the FIRST Robotics Competition team 1024. For the unfamiliar, the relevant part of the setup is that there are 2 minute and 15 second matches in which you have a 120 pound robot that sometimes runs autonomously, and sometimes is controlled over WiFi from a person at a laptop running stock “driver station” software and modifiable “dashboard” software.
That year, we mostly used the dashboard software to allow the human driver and operator to monitor sensors on the robot, one of them being a video feed from a web-cam mounted on it. This was really easy because the new standard dashboard program had a click-and drag interface to add stock widgets; you just had to make sure the code on the robot was actually sending the data.
That’s great, until when debugging things, the dashboard would suddenly vanish. If it was run manually from a terminal (instead of letting the driver station software launch it), you would see a core dump indicating a segmentation fault.
This wasn’t just us either; I spoke with people on other teams, everyone who was streaming video had this issue. But, because it only happened every couple of minutes, and a match is only 2:15, it didn’t need to run very long, they just crossed their fingers and hoped it didn’t happen during a match.
The dashboard was written in Java, and the source was available
(under a 3-clause BSD license) via read-only SVN at
http://firstforge.wpi.edu/svn/repos/smart_dashboard/trunk
(which is unfortunately no longer online, fortunately I’d posted some
snapshots on the web). So I dove in, hunting for the bug.
The repository was divided into several NetBeans projects (not exhaustively listed):
client/smartdashboard
: The main dashboard program, has a plugin architecture.WPIJavaCV
: A higher-level wrapper around JavaCV, itself a Java Native Interface (JNI) wrapper to talk to OpenCV (C and C++).extensions/camera/WPICameraExtension
: The standard camera feed plugin, processes the video through WPIJavaCV.
I figured that the bug must be somewhere in the C or C++ code that was being called by JavaCV, because that’s the language where segfaults happen. It was especially a pain to track down the pointers that were causing the issue, because it was hard with native debuggers to see through all of the JVM stuff to the OpenCV code, and the OpenCV stuff is opaque to Java debuggers.
Eventually the issue lead me back into the WPICameraExtension, then
into WPIJavaCV—there was a native pointer being stored in a Java
variable; Java code called the native routine to free()
the
structure, but then tried to feed it to another routine later. This lead
to difficulty again—tracking objects with Java debuggers was hard
because they don’t expect the program to suddenly segfault; it’s Java
code, Java doesn’t segfault, it throws exceptions!
With the help of println()
I was eventually able to see
that some code was executing in an order that straight didn’t make
sense.
The bug
The basic flow of WPIJavaCV is you have a WPICamera
, and
you call .getNewImage()
on it, which gives you a
WPIImage
, which you could do all kinds of fancy OpenCV
things on, but then ultimately call .getBufferedImage()
,
which gives you a java.awt.image.BufferedImage
that you can
pass to Swing to draw on the screen. You do this every for frame. Which
is exactly what WPICameraExtension.java
did, except that
“all kinds of fancy OpenCV things” consisted only of:
public WPIImage processImage(WPIColorImage rawImage) {
return rawImage;
}
The idea was that you would extend the class, overriding that one method, if you wanted to do anything fancy.
One of the neat things about WPIJavaCV was that every OpenCV object
class extended had a finalize()
method (via inheriting from
the abstract class WPIDisposable
) that freed the underlying
C/C++ memory, so you didn’t have to worry about memory leaks like in
plain JavaCV. To inherit from WPIDisposable
, you had to
write a disposed()
method that actually freed the memory.
This was better than writing finalize()
directly, because
it did some safety with NULL pointers and idempotency if you wanted to
manually free something early.
Now, edu.wpi.first.WPIImage.disposed()
called com.googlecode.javacv.cpp.opencv_core.IplImage.release()
,
which called (via JNI) IplImage:::release()
, which called
libc free()
:
@Override
protected void disposed() {
image.release();
}
Elsewhere, the C buffer for the image was copied into a Java buffer
via a similar chain kicked off by
edu.wpi.first.WPIImage.getBufferedImage()
:
/**
* Copies this {@link WPIImage} into a {@link BufferedImage}.
* This method will always generate a new image.
* @return a copy of the image
*/
public BufferedImage getBufferedImage() {
validateDisposed();
return image.getBufferedImage();
}
The println()
output I saw that didn’t make sense was
that someFrame.finalize()
was running before
someFrame.getBuffereImage()
had returned!
You see, if it is waiting for the return value of a method
m()
of object a
, and code in m()
that is yet to be executed doesn’t access any other methods or
properties of a
, then it will go ahead and consider
a
eligible for garbage collection before m()
has finished running.
Put another way, this
is passed to a method just like
any other argument. If a method is done accessing this
,
then it’s “safe” for the JVM to go ahead and garbage collect it.
That is normally a safe “optimization” to make… except for when a
destructor method (finalize()
) is defined for the object;
the destructor can have side effects, and Java has no way to know
whether it is safe for them to happen before m()
has
finished running.
I’m not entirely sure if this is a “bug” in the compiler or the language specification, but I do believe that it’s broken behavior.
Anyway, in this case it’s unsafe with WPI’s code.
My work-around
My work-around was to change this function in
WPIImage
:
public BufferedImage getBufferedImage() {
validateDisposed();
return image.getBufferedImage(); // `this` may get garbage collected before it returns!
}
In the above code, this
is a WPIImage
, and
it may get garbage collected between the time that
image.getBufferedImage()
is dispatched, and the time that
image.getBufferedImage()
accesses native memory. When it is
garbage collected, it calls image.release()
, which
free()
s that native memory. That seems pretty unlikely to
happen; that’s a very small gap of time. However, running 30 times a
second, eventually bad luck with the garbage collector happens, and the
program crashes.
The work-around was to insert a bogus call to this to keep
this
around until after we were also done with
image
:
to this:
public BufferedImage getBufferedImage() {
validateDisposed();
BufferedImage ret = image.getBufferedImage();
getWidth(); // bogus call to keep `this` around
return ret;
}
Yeah. After spending weeks wading through though thousands of lines of Java, C, and C++, a bogus call to a method I didn’t care about was the fix.
TheLoneWolfling on Hacker News noted that they’d be worried about the
JVM optimizing out the call to getWidth()
. I’m not, because
WPIImage.getWidth()
calls IplImage.width()
,
which is declared as native
; the JVM must run it because it
might have side effects. On the other hand, looking back, I think I just
shrunk the window for things to go wrong: it may be possible for the
garbage collection to trigger in the time between
getWidth()
being dispatched and width()
running. Perhaps there was something in the C/C++ code that made it
safe, I don’t recall, and don’t care quite enough to dig into OpenCV
internals again. Or perhaps I’m mis-remembering the fix (which I don’t
actually have a file of), and I called some other method that
could get optimized out (though I do believe that it
was either getWidth()
or getHeight()
).
WPI’s fix
Four years later, the SmartDashboard is still being used! But it no longer has this bug, and it’s not using my workaround. So, how did the WPILib developers fix it?
Well, the code now lives in git at collab.net, so I decided to take a look.
The stripped out WPIJavaCV from the main video feed widget, and now use a purely Java implementation of MPJPEG streaming.
However, the old video feed widget is still available as an extension
(so that you can still do cool things with processImage
),
and it also no longer has this bug. Their fix was to put a mutex around
all accesses to image
, which should have been the obvious
solution to me.