2009-05-18

Spirit Sol 133

They found out what caused yestersol's unexpected reboot. It resulted from a bug in the flight software, a type of bug called a "race condition." It happens when two tasks try to use a shared resource in an order the programmer didn't expect. Every once in a while, they interact in an undesirable order -- "racing" for the prize, as it were -- and something goes wrong. In this case, one task running on the rover tried to make a note that the morning bootup should be considered successful. At the same time, another task tried to log a routine note about the spacecraft's operation.

The problem is that both tasks write to a memory area that's normally write-protected, and one of them doesn't do it correctly. The first task does it properly -- politely waiting for exclusive access to the shared memory area, unlocking the area, making its note, relocking the area, and then releasing the memory area for use by other tasks. But the second task isn't so polite; it wrongly goes ahead and unlocks/relocks the shared area without first waiting for exclusive access. Normally, this just happens to work -- the vulnerability window is narrow, and we get away with it. But not this time. This time, our luck ran out, and the bug was exposed: the second task relocked the memory area just before the first task could write to it. Trying to write to a locked (write-protected) memory area causes a fatal error that results in a spacecraft reset.
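For readers who'd like to see that ordering spelled out, here is a minimal sketch in Python. It is not the flight software; every name in it (ProtectedArea, polite_write, replay_bad_interleaving, run_polite_tasks) is made up for illustration, and the "fatal error" is just an exception. The bad interleaving is replayed step by step in a single thread so that it fails every time, rather than once in a thousand boots.

```python
import threading


class WriteProtectedError(Exception):
    """Stands in for the fatal error that resets the spacecraft."""


class ProtectedArea:
    """Hypothetical model of the write-protected shared memory area."""

    def __init__(self):
        self.mutex = threading.Lock()  # grants exclusive access to one task
        self.writable = False          # write protection is on by default
        self.notes = []

    def unprotect(self):
        self.writable = True

    def protect(self):
        self.writable = False

    def write(self, note):
        if not self.writable:
            raise WriteProtectedError(note)
        self.notes.append(note)


def polite_write(area, note):
    """Wait for exclusive access, then unlock, write, and relock."""
    with area.mutex:
        area.unprotect()
        try:
            area.write(note)
        finally:
            area.protect()


def replay_bad_interleaving(area):
    """Replay the unlucky ordering described above, one step at a time."""
    area.mutex.acquire()           # task 1: waits for exclusive access
    area.unprotect()               # task 1: unlocks the area
    area.unprotect()               # task 2: skips the wait, unlocks anyway
    area.write("routine log")      # task 2: makes its note
    area.protect()                 # task 2: relocks the area
    try:
        area.write("boot ok")      # task 1: writes to a relocked area
        return "no fault"
    except WriteProtectedError:
        return "fatal error"
    finally:
        area.protect()
        area.mutex.release()


def run_polite_tasks(area):
    """Run both notes through polite_write on separate threads."""
    tasks = [threading.Thread(target=polite_write, args=(area, note))
             for note in ("boot ok", "routine log")]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return area.notes


print(replay_bad_interleaving(ProtectedArea()))   # fatal error
print(sorted(run_polite_tasks(ProtectedArea())))  # ['boot ok', 'routine log']
```

When both tasks wait for exclusive access, neither one can relock the area out from under the other, so both notes always land. That is presumably the shape a patch would take, though I'm only guessing at that.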

And what are we going to do about it? Maybe nothing. As Bob Denise, who's filling me in on the details, points out, the two rovers have now been booted a combined total of about 1000 times, and this bug has been exposed only once. To a first approximation, this suggests we'll see this bug once every thousand times we boot the rovers -- so we might well just never see it again.[1] We might create and upload a software patch (which can always introduce unforeseen problems of its own), or we might just live with the apparently small risk of occasionally losing a sol.
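That first approximation can be pushed one step further. The sketch below assumes each boot is an independent trial with a failure chance of 1 in 1000; both the independence and the rate are assumptions (one event in about 1000 boots is a very thin basis for an estimate), and the function name is made up for illustration.

```python
def chance_of_recurrence(boots, per_boot_rate=1 / 1000):
    """Chance of hitting the bug at least once in `boots` further boots,
    assuming independent boots and a fixed per-boot rate (both assumed)."""
    return 1 - (1 - per_boot_rate) ** boots


for boots in (100, 1000, 5000):
    print(boots, round(chance_of_recurrence(boots), 3))
```

Under those assumptions the chance of a repeat is roughly 10% over the next 100 boots and roughly 63% over the next 1000, so how small the risk is depends a lot on how long the rovers keep going.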

In any case, we're proceeding normally thisol. A drive. A good long one, too -- they end up throwing out some science to make more time for it, and it ends up totaling about 3.5 hours of driving. In theory, we could make about 150m, but we'll probably cover more like 100m, possibly less. Still, we'll cover a good chunk of ground, which will help make up for the two days we effectively lost to the anomaly.

The planning for this makes a hash of the SOWG meeting. I'm not really tracking it, because the details they're worrying over don't concern me. Larry Soderblom's the SOWG chair thisol, and he seems to be thrashing a bit as he tries to make sense of all of the science requests, so that he can discard just the right set. I presume he makes the right calls in the end, though I wouldn't want to be in his shoes today.

I haven't been RP-1 in a long time, and today I remember some of the things I haven't missed about it. The main thing is the interruptions -- oh, Lord, the interruptions. I'm trying to evaluate the terrain, choose a path that keeps all parts of the rover's body safe and respects all of its mechanical limits while making efficient progress toward our goal, compensate for limitations in our tools, document my work as I go, make sure I've thought of everything that can possibly go wrong, all of this under time pressure ... and every thirty seconds, someone has an urgent question for me. Are we driving backward again thisol? Is there a parameter we can tweak to cut down on roverdancing? What azimuth should we use for the drive-direction PANCAMs? Have you finished editing the slice file yet? How long are we going to spend on the blind drive? Daddy, why is the sky blue?

I miss a lot of stuff about being an RP-1. But not that part.[2]

I pick out a simple path that takes us more or less where we were going before we were so rudely interrupted. We'll scoot straight ahead about 25m, climbing up onto the next ridge. After that, we'll turn on autonav. But while I'm planning that, Ashitey works out another solution, zig-zagging around some local hollows and cresting the ridge at a lower point off to the right. He shows me his idea and I like it. His solution is better because we can see the terrain beyond the ridge at that point a little more clearly -- with my approach, autonav might get stuck trying to descend the ridge, just as it did a few sols ago, but with Ashitey's, we can have some confidence that autonav will make real progress. The only downside is that his blind-drive segment is shorter, only about 16m, and not as direct. I think about it for a minute and then adopt his path; given the timings, since it improves our chances of a successful autonav, we should make more progress for the day with his solution, even if the first part is less efficient. After looking at it a while, I find a way to extend the blind drive to 34m, making that approach a winner all around.

We were originally planning to do the blind drive backward again. But while reviewing the sequence as part of our handover, Ashitey and I notice an error (which went undetected last time we tried this -- oops). Normally, at the end of the blind drive, we tell the rover to take a couple of pictures of the region in front of it; this helps ensure that it will understand the area it's in at the end of the sol even if the autonav doesn't go anywhere. The problem is, when driving backward, we should tell the rover to take pictures of the area behind the rover, not in front. It's an easy fix, and not a big deal even if we don't fix it, but it makes us nervous about what else we might have overlooked. So we turn the rover around and have it drive forward instead, as it normally does. We already proved our point about driving backward anyway; we made about 14m of progress (regress?) before the anomaly stopped us the other sol. We have little to gain by doing it again, and too much to lose. So we don't.

I also simplify our day in another way. Julie raises a concern about wasting energy roverdancing when the rover isn't going to make any progress anyway. Is there something we can change to make it give up sooner? I suggest replacing the single long-distance autonav waypoint with a series of shorter autonav waypoints, but we'd have to revisit the timeout-handling logic. Maybe there are flight software parameters we could tweak to help out as well, but we'd have to ask Mark Maimone and he's already gone home. And you know what -- we just don't have time to deal with it today. If we've learned anything, it's that we shouldn't try to come up with this stuff under time pressure unless we really have to. So I tell her we'll think about it and not do it today, and she's cool with that. So we don't.

I need to blow off a little steam after this day, so I put something in the uplink report that I've been thinking about for a while:


Driving a Mars rover ain't like dustin' crops, boy. Without
precise calculations, we'd drive into a ditch or bounce too
close to a 60cm rock, and that would end your trip real quick,
wouldn't it?!


Anybody who doesn't get it has no business working here anyway, I figure.





Footnotes:

[1] Of course, we have seen that bug again, a few times. We continue to just live with it; it's not a big enough problem to be worth the risks involved in fixing it. I'd like to add, though, that I'm permanently amazed by the team's ability to diagnose these problems -- in this case, by the next day -- from a hundred million miles away. Amazed.

[2] Sometimes it still feels like that, although the protocol now is much more that it's OK to tell people to bug off and you'll get back to them when you're good and ready. (It helps to not phrase it like that, although some days ....) This is a good change. For a long time, it felt like I was the dad in Vonnegut's story "Harrison Bergeron" -- the guy who has to wear a device that makes a loud noise in his ear every few seconds, so that he won't be able to unfairly exercise his higher-than-average intelligence.
