2009-05-27

Spirit Sol 142

Of course, they've corrected the anomaly. Once again, it was not my fault. Surprisingly, it was another race condition, this one in the imaging subsystem: a command to take an image was dispatched at almost the same instant as a command to stop imaging. This caused the spacecraft to try to write to the camera interface hardware when the interface hardware was turned off, which triggers a self-protective reboot. The window of vulnerability was all of an eighth of a second wide, which helps explain why we hadn't seen this before -- it's another one of those one-in-a-million (well, maybe one-in-a-few-hundred) things.
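For readers who want to see the shape of this kind of bug, here is a minimal Python sketch. It is not the flight software, and nothing in it is taken from the real code: `CameraInterface`, `take_image`, `stop_imaging`, and `run_commands` are hypothetical names invented for illustration. The idea is that the "is the hardware on?" check and the write must happen as one step under a lock; otherwise a stop command can slip in between them.

```python
import threading


class CameraInterface:
    """Hypothetical stand-in for the camera interface hardware."""

    def __init__(self):
        self.powered = True
        self.frames_written = 0
        self._lock = threading.Lock()

    def take_image(self):
        # The power check and the write happen under one lock, so a
        # concurrent stop_imaging() cannot power down between them.
        with self._lock:
            if not self.powered:
                return False  # refuse the write instead of touching dead hardware
            self.frames_written += 1
            return True

    def stop_imaging(self):
        # Powering down takes the same lock, so it waits for any write in progress.
        with self._lock:
            self.powered = False


def run_commands(n_images=1000):
    """Dispatch image commands and a stop command at almost the same instant."""
    cam = CameraInterface()
    results = []

    def imager():
        for _ in range(n_images):
            results.append(cam.take_image())

    t_image = threading.Thread(target=imager)
    t_stop = threading.Thread(target=cam.stop_imaging)
    t_image.start()
    t_stop.start()
    t_image.join()
    t_stop.join()
    return cam, results
```

Without the lock in `take_image`, the stop could land after the power check but before the write, which is the narrow window described above. With it, every write either completes before the power-down or is cleanly refused.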

In his Mission Manager Report, summarizing the anomaly and the resulting investigation, Mark Adler wrote:


"This was the second 'very low probability' fatal in less than
a week. After verifying the theory, the flight software team
then went out to buy lottery tickets."


Then, in the "Concerns" section of his report, Adler wrote:


"Concerns: I'm not sure that the flight software guys are
clear on the 'very low probability' concept."


Well, the rover drivers on shift (John and Ashitey) had a few easy days. Due to the anomaly, they were off for at least one of those days, and then spent yesterday and the day before (sols 140 and 141) just re-running sequences that had already been built for sols 137 and 138 -- doing the stuff that didn't happen because of the anomaly. Today we're all caught up from the anomaly and are driving away -- a simple drive, if a short one.

Art, as usual, is spending his time kidding around. With a mix of rue and pride, he describes the Pathfinder-era Thompson Loop -- the combination of a sequencing error (in a sequence Art wrote) and flight software behavior that seemed as if it might cause us to lose the Sojourner rover. Since both the last Spirit anomaly and this one happened to occur on sols Art planned -- and, in each case, on his last day on-shift before going off-shift (just like today) -- he tells Mark Adler to expect the TWA (Thompson Weekly Anomaly) to occur tomorrow. "Keep that in mind," Adler says, "someday HQ is going to ask us how do we kill these things." "Put Thompson on it," several people say at once.

6 comments:

george said...

Ok, now that Spirit & Opportunity's cousin has a good name - Curiosity - I'm curious to know if you are going to participate and drive the next rover mission ... just to ensure the longest life for the great "Mars and Me" blog ...

Ermanno said...

Was it too difficult to implement a lock mechanism in the software?
I don't know (but you do ;-)) if there were other race conditions hidden in the software, but the burden of implementing a lock surely would have been worth it.
IMO, of course.
Ciao,
Ermanno

Scott Maxwell said...

@george The rover drivers for MSL (Curiosity) haven't been selected yet. They have bigger problems at the moment! However that ends up working out, I appreciate the compliment on my blog.

@ermanno I didn't work on the flight software, but I'll pass along my understanding. There is a locking mechanism, it's just that sometimes a developer would misuse it and introduce a bug like the one that bit us here. The locking was explicit (that is, not implemented automatically), and there was no automated code checker I'm aware of, making this sort of bug inevitable -- happily uncommon, thanks to the talent level of the developers, but still inevitable.

I don't know what, if anything, is being done differently on MSL flight software development.

Ermanno said...

@Scott > There is a locking mechanism, it's just that sometimes a developer would misuse [snip]
there was no automated code checker I'm aware of [snip]

I understand.
It's just like knocking at a door and waiting for an answer vs. smashing it and entering.

Ciao,
Ermanno

Ed Davies said...

I'm a little surprised that there wasn't an automated code checker. Normally throughout the software industry we think, well, yeah, there could be problems in this software, but really mission-critical software is better checked than this. Isn't it?

In fact, the whole way you do software development surprises me a little. You talk of making fairly quick changes to software which is directly involved in generating sequences sent to the spacecraft. I'd have expected changes to that sort of thing to go through layers of review before it's used operationally. Am I kidding myself?

Scott Maxwell said...

@Ed Well, as I said, I wasn't on the flight software team and am not certain what their practices were. I don't think they used a static analyzer, but I don't really know.

I should also point out, in case this isn't clear, that the flight software (software that runs on the rover itself) and ground software (which runs on our computers here on Earth) are completely separate efforts, subject to appropriately different levels of review.

On the ground software, we go through phases where we have both an "engineering version" (which might have badly needed new features, but is not as well tested) and an official version we can fall back to if there's something wrong with the engineering version. After a while, the engineering version becomes the official version.

To further enable quick turnaround, the sequencing software I wrote (RoSE) is extensively self-tested -- it sounds like you're already familiar with techniques like this, but essentially, it's filled with self-checking code that I can run after each change. (The self-tests are also automatically run every night, with any failures emailed to me while the offending changes are still fresh in my mind.) This was something of an investment, but it has paid off beautifully, helping me make extensive changes with (justified) confidence, even on relatively short notice. In terms of this conversation, I more or less automated the review. Incidentally, this was possibly the single best decision I made in RoSE's development; on my next project, whatever that might be, I will do it even better.
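To make the self-testing idea concrete, here is a tiny Python sketch of the general technique. It is not RoSE code: `drive_duration_s` and `self_test` are hypothetical names, and the drive-time calculation is just a placeholder for whatever the real software computes. The point is only that the checks live next to the code and return a list of failures that a nightly job could report.

```python
def drive_duration_s(distance_m, speed_m_per_s):
    """Hypothetical helper: time needed to cover a distance at a given speed."""
    if speed_m_per_s <= 0:
        raise ValueError("speed must be positive")
    return distance_m / speed_m_per_s


def self_test():
    """Run the built-in checks and return the names of any that failed."""
    failures = []

    def check(name, condition):
        if not condition:
            failures.append(name)

    check("simple drive", drive_duration_s(10.0, 0.5) == 20.0)
    check("zero distance", drive_duration_s(0.0, 0.05) == 0.0)

    # A bad input must be rejected rather than silently accepted.
    try:
        drive_duration_s(1.0, 0.0)
        check("rejects zero speed", False)
    except ValueError:
        pass

    return failures
```

An empty list from `self_test()` means every check passed; anything else is a list of names to look into while the change is still fresh.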