My favorite bug: segfaults in Java
==================================
---
date: "2014-01-13"
---

> Update: Two years later, I wrote a more detailed version of this
> article:
> [My favorite bug: segfaults in Java (redux)](./java-segfault-redux.html).

I've told this story orally a number of times, but realized that I
have never written it down.  This is my favorite bug story; it might
not be my hardest bug, but it is the one I most like to tell.

The context
-----------

In 2012, I was a Senior programmer on the FIRST Robotics Competition
team 1024.  For the unfamiliar, the relevant part of the setup is that
matches last 2 minutes and 15 seconds, and in them you have a 120-pound
robot that sometimes runs autonomously, and sometimes is controlled
over WiFi by a person at a laptop running stock "driver station"
software and modifiable "dashboard" software.

That year, we mostly used the dashboard software to allow the human
driver and operator to monitor sensors on the robot, one of them being
a video feed from a web-cam mounted on it.  This was really easy
because the new standard dashboard program had a click-and-drag
interface to add stock widgets; you just had to make sure the code on
the robot was actually sending the data.

That was great until, while we were debugging things, the dashboard would
suddenly vanish.  If it was run manually from a terminal (instead of
letting the driver station software launch it), you would see a core
dump indicating a segmentation fault.

This wasn't just us either; I spoke with people on other teams, and
everyone who was streaming video had this issue.  But because it only
happened every couple of minutes, and a match is only 2:15, the
dashboard didn't need to run very long; they just crossed their
fingers and hoped it didn't happen during a match.

The dashboard was written in Java, and the source was available (under
a 3-clause BSD license), so I dove in, hunting for the bug.  Now, the
program did use Java Native Interface to talk to OpenCV, which the
video ran through; so I figured that it must be a bug in the C/C++
code that was being called.  It was especially a pain to track down
the pointers that were causing the issue, because it was hard with
native debuggers to see through all of the JVM stuff to the OpenCV
code, and the OpenCV stuff is opaque to Java debuggers.

Eventually the issue led me back into the Java code: there was a
native pointer being stored in a Java variable; Java code called the
native routine to `free()` the structure, but then tried to feed it to
another routine later.  This led to difficulty again: tracking
objects with Java debuggers was hard because they don't expect the
program to suddenly segfault; it's Java code, Java doesn't segfault,
it throws exceptions!

With the help of `println()` I was eventually able to see that some
code was executing in an order that straight didn't make sense.

The bug
-------

The issue was that Java was making an unsafe optimization (I never
bothered to figure out whether it was the compiler or the JVM making
the mistake; I was satisfied once I had a work-around).

Java was doing something similar to tail-call optimization with regard
to garbage collection.  You see, if it is waiting for the return value
of a method `m()` of object `o`, and code in `m()` that is yet to be
executed doesn't access any other methods or properties of `o`, then
it will go ahead and consider `o` eligible for garbage collection
before `m()` has finished running.

That is normally a safe optimization to make… except for when a
destructor method (`finalize()`) is defined for the object; the
destructor can have side effects, and Java has no way to know whether
it is safe for them to happen before `m()` has finished running.
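
Here is a minimal sketch of how one might probe for this behavior.  The
class and method names (`EarlyFinalizeProbe`, `Resource`, `work()`) are
made up for illustration and are not from the dashboard source.
Whether `finalize()` is ever observed while `work()` is still running
depends on the JVM, the JIT, and the garbage collector, so treat it as
an experiment rather than a guaranteed reproduction.  Note also that
`finalize()` is deprecated in recent Java releases.

```java
// Hypothetical probe; all names here are illustrative only.
class EarlyFinalizeProbe {
    // Set by finalize(); static so that reading it never touches `this`.
    static volatile boolean finalized = false;

    static final class Resource {
        @Override
        @SuppressWarnings({"deprecation", "removal"})
        protected void finalize() {
            finalized = true;
        }

        // Returns how many GC requests were made before finalize() was
        // observed, or -1 if it was never observed (or the thread was
        // interrupted) while work() was still running.
        int work() {
            // Nothing below reads a field or calls a method of `this`.
            for (int i = 1; i <= 50; i++) {
                System.gc(); // only a request; the JVM may ignore it
                try {
                    Thread.sleep(10); // let the finalizer thread run
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
                if (finalized) {
                    return i;
                }
            }
            return -1;
        }
    }

    static int probe() {
        finalized = false;
        return new Resource().work();
    }

    public static void main(String[] args) {
        int n = probe();
        System.out.println(n < 0
            ? "finalize() was not observed during work()"
            : "finalize() ran during work() after " + n + " GC request(s)");
    }
}
```

If the probe reports that `finalize()` ran during `work()`, the object
was treated as unreachable while one of its own methods was still on
the stack, which is the situation described above.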

The work-around
---------------

The routine that the segmentation fault was occurring in was something
like:

	public type1 getFrame() {
		type2 child = this.getChild();
		type3 var = this.something();
		// `this` may now be garbage collected
		return child.somethingElse(var); // segfault comes here
	}

Where the destructor method of `this` calls a method that will
`free()` native memory that is also accessed by `child`; if `this` is
garbage collected before `child.somethingElse()` runs, the backing
native code will try to access memory that has been `free()`ed, and
receive a segmentation fault.  That usually didn't happen, as the
routines were pretty fast.  However, running 30 times a second,
eventually bad luck with the garbage collector happens, and the
program crashes.
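
The ownership relationship can be modeled without any native code.  The
sketch below is a simulation under loud assumptions: `FakeNativeBuffer`,
`Parent`, `Child`, and `cleanup()` are invented names, the "native
memory" is a plain Java object, and an exception stands in for the
segmentation fault.  The `collectEarly` flag plays the part of the
garbage collector's bad luck.

```java
// Simulation only; none of these names come from the dashboard source.
class UseAfterFreeSketch {
    // Stand-in for native memory; reading after free() throws
    // instead of segfaulting.
    static final class FakeNativeBuffer {
        private boolean freed = false;

        void free() { freed = true; }

        int read() {
            if (freed) {
                throw new IllegalStateException("use after free");
            }
            return 42;
        }
    }

    // Borrows the parent's buffer without owning it.
    static final class Child {
        private final FakeNativeBuffer buf;

        Child(FakeNativeBuffer buf) { this.buf = buf; }

        int somethingElse() { return buf.read(); }
    }

    // Owns the buffer; cleanup() plays the role of finalize().
    static final class Parent {
        private final FakeNativeBuffer buf = new FakeNativeBuffer();

        Child getChild() { return new Child(buf); }

        void cleanup() { buf.free(); }
    }

    // collectEarly == true models the parent being finalized between
    // its last use and the child's call.
    static String getFrame(boolean collectEarly) {
        Parent parent = new Parent();
        Child child = parent.getChild();
        if (collectEarly) {
            parent.cleanup();
        }
        try {
            return "ok: " + child.somethingElse();
        } catch (IllegalStateException e) {
            return "crash: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(getFrame(false)); // prints "ok: 42"
        System.out.println(getFrame(true));  // prints "crash: use after free"
    }
}
```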

The work-around was to insert a bogus call on `this` to keep it
around until after we were also done with `child`:

	public type1 getFrame() {
		type2 child = this.getChild();
		type3 var = this.something();
		type1 ret = child.somethingElse(var);
		this.getSize(); // bogus call to keep `this` around
		return ret;
	}

Yeah.  After spending weeks wading through thousands of lines
of Java, C, and C++, a bogus call to a method I didn't care about was
the fix.
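
On Java 9 and later, the standard library has a method meant for this
situation, `java.lang.ref.Reference.reachabilityFence(Object)`.  It was
not available in 2012, so the following is only a sketch of how the
same work-around might look today.  `Parent`, `Child`, and the `int`
return types are placeholders, not the dashboard's real types.

```java
import java.lang.ref.Reference;

// Illustrative sketch; the class names and types are placeholders.
class FenceSketch {
    static final class Child {
        int somethingElse(int var) { return var * 2; }
    }

    static final class Parent {
        private final Child child = new Child();

        Child getChild() { return child; }

        int something() { return 21; }

        int getFrame() {
            Child child = this.getChild();
            int var = this.something();
            try {
                return child.somethingElse(var);
            } finally {
                // Keeps `this` strongly reachable until this point,
                // replacing the bogus method call (Java 9+).
                Reference.reachabilityFence(this);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new Parent().getFrame()); // prints 42
    }
}
```

The `try`/`finally` shape keeps the fence in place on every exit path
from the method, including when an exception is thrown.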