It’s time for a segment I call Storytime with Uncle Eric, in which I regale you with tales of woe and triumph from my past and present programming responsibilities.
This is the story of one badass little printer.
To be quite honest, I’m not sure how I got assigned this bug. I was working for a company that made big, as in taller than this writer, optical routers that folks like AT&T and L3 used to run the backbone of the internet. I had interned/contracted there for four years in the testing and automation department (someday I’ll write about that period, it was great), but they wouldn’t hire me until I graduated from college. I’m reasonably sure that’s the only reason I graduated college, so hat tip to them. Once I graduated I managed to snag a permanent position in another department after a particularly awful interview (again, another funny story), writing embedded C++ to run on the real-time OS these machines used.
So I’m a juniorish SW engineer, but with four years of experience at the company. I know how to write Java, C, TCL, Squeak (yeah, we learned some real-life job skills in college, I swear). It’s cool though, I mean C++ is basically C plus Java, right? Oh god, young Eric, you are so precious. Side note: someone finally took pity and handed me Scott Meyers’ Effective C++, and for that I am eternally thankful. I did things like fix random bugs that my boss tossed my way, took over maintenance of our Win32 simulators (these machines were expensive, so simulation was a big deal), expanded our smoke test infrastructure, etc. Cool enough things, but not a ton of work contributing code that ran on the hardware.
And then the bug arrived: machines were randomly rebooting in one of our clients’ labs. That’s bad.
A bit of background: these machines were about the size of a fridge with a bunch of slots, and each slot took a rather large network card, aka the line module. In my mind’s eye these things are like 2’ x 3’. Odds are they were a bit smaller, but you get the idea: it was going in the fridge, not your desktop PC. Each line module could take several smaller cards that handled different rates of traffic, and those cards could be configured to take over for one another if one died. There was a ton of work done on redundancy in our systems. You could set up line modules to switch over to another if one failed. If a fiber was cut you could almost instantaneously reroute. You could set up redundant routing so duplicate data was flowing in case one line went down. Then there was the control module, the heart of the beast, and if that went down the whole system rebooted.
Let’s take a moment to imagine a room full of refrigerator-sized-blinky-light-fiber-optic-future-is-now machines whirring like crazy as they reboot. Like hurricane-force whirring, Huey chopper landing forces. That’s unsettling.
Some work had been done before I got tagged in on the bug, and they figured out it only happened when this printer was plugged into the lab network. On the plus side there was a solution: don’t plug that printer into the lab network. After this experience, I personally would have just gone Office Space on that printer, but to each their own. On the downside, I got tasked with figuring out what the heck was going on, all within the cozy confines of my suburban Atlanta cubicle. For there was a glorious thing called Ethereal (you kids probably call it Wireshark now), and someone had recorded the network traffic during one of these glorious events, attached it to a defect report, and walked away.
Here’s where I come in. The youngin’. The one that wasn’t frantically writing code for the hot new tech, Gigabit Ethernet. That’s right, we were thinking about that over a decade ago. And that’s how I spent the next week. Staring at a network dump, slowly losing my grip on reality as I became those goddamned packets. Be. The. Packet. Flow like the packet. Ask yourself, am I a good packet? Deep in one of these sessions something clicked. What the heck was an address of 0.0.0.0 doing in there? That’s not how IP works. You come from somewhere, packet. Well, at least that’s not how it works on a good day. I am a bad packet.
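For the curious, here is roughly what "spotting the bad packet" boils down to, as a small Python sketch rather than anything resembling the original tooling. It assumes you already have the raw bytes of an IPv4 header in hand; the function names and the sample addresses (including the broadcast destination) are made up for illustration.

```python
import socket
import struct

# Layout of the fixed 20-byte IPv4 header: version/IHL, TOS, total length,
# ID, flags/fragment offset, TTL, protocol, checksum, source, destination.
IPV4_HEADER = "!BBHHHBBH4s4s"


def ipv4_addresses(header):
    """Return (source, destination) as dotted strings from a raw IPv4 header."""
    if len(header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes")
    fields = struct.unpack(IPV4_HEADER, header[:20])
    return socket.inet_ntoa(fields[8]), socket.inet_ntoa(fields[9])


def has_zero_source(header):
    """True when the packet claims to come from 0.0.0.0."""
    return ipv4_addresses(header)[0] == "0.0.0.0"


def make_header(src, dst):
    """Hypothetical helper: build a sample UDP-over-IPv4 header (checksum left at 0)."""
    return struct.pack(IPV4_HEADER, 0x45, 0, 20, 0, 0, 64, 17, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))


if __name__ == "__main__":
    suspect = make_header("0.0.0.0", "255.255.255.255")
    print(ipv4_addresses(suspect), has_zero_source(suspect))
```

Run over every packet in a capture, a check like `has_zero_source` is the automated version of a week of squinting.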
So what kind of shenanigans was this packet getting up to? It was an NTP packet. The printer wanted to know what time it was. It said:
Hi good sirs, might you know the time?
But it didn’t ask one good sir in particular, it asked everyone. And it did it in such a way that it was more like:
Hi good sirs, might you know the time? Also why don’t you contemplate Zeno’s paradox for a while and just go ahead and crash.
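If you have never looked inside one, an NTP "what time is it?" request is tiny: a 48-byte UDP payload sent to port 123, where the first byte packs the leap indicator, version, and mode, and a minimal client request can leave everything else zeroed. The sketch below builds one. Which NTP version the printer actually spoke is unknown, so the version 4 default here is purely an assumption, and `build_ntp_request` is a name invented for this example.

```python
import struct

NTP_PACKET_SIZE = 48   # size of the basic NTP header, in bytes
MODE_CLIENT = 3        # mode 3 means "client request"


def build_ntp_request(version=4):
    """Build a minimal NTP client request: one flags byte, the rest zeroed."""
    leap_indicator = 0  # no leap second warning
    first_byte = (leap_indicator << 6) | (version << 3) | MODE_CLIENT
    # "B" is the flags byte; "47x" pads with 47 zero bytes to reach 48 total.
    return struct.pack("!B47x", first_byte)


if __name__ == "__main__":
    request = build_ntp_request()
    print(len(request), hex(request[0]))
```

Nothing in those 48 bytes is dangerous on its own. The trouble in this story lives one layer down, in the IP header that carried them.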
Okay, I found a bad packet. But why would we crash?
Some more background: when you want to send out a network message, the standard way is to say: gimme a sendin’ thing on address zero. That means: I want to send stuff from this computer, just plug in its address as return-to-sender because I’m too lazy to figure it out. Well, on this unique snowflake of a machine, saying gimme the sendin’ thing on address zero actually plugged in a return address of zero, literally 0, which shouldn’t happen, but it was only a printer and we shouldn’t put such high hopes on them.
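Here is the well-behaved version of "gimme a sendin’ thing on address zero", sketched with Python’s standard socket module. The socket is bound to the wildcard address 0.0.0.0, but when a packet actually goes out the OS fills in a real source address. The demo stays on the loopback interface so it needs no network; `wildcard_send_demo` is a hypothetical name, and this shows normal behavior, not the printer’s firmware.

```python
import socket


def wildcard_send_demo():
    """Bind a sender to 0.0.0.0, send over loopback, report what each side sees."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)

        sender.bind(("0.0.0.0", 0))       # "address zero": any local interface
        bound_as = sender.getsockname()[0]

        sender.sendto(b"hello", receiver.getsockname())
        _, (seen_source, _) = receiver.recvfrom(64)
        return bound_as, seen_source


if __name__ == "__main__":
    bound_as, seen_source = wildcard_send_demo()
    print("sender bound as:", bound_as)
    print("receiver saw source:", seen_source)
```

The sender’s own view of its address is still 0.0.0.0, while the receiver sees a real one. The theory in this story is that the printer skipped that substitution.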
That’s my theory at least. I whip up a program in Java that sends a raw, hand-crafted packet (thus getting around the rules that say: “For goodness sake, no, you can’t send a packet with a return address of zero”), punt it at one of our test machines, and scamper into the lab to see that bad boy whirring like crazy. Hells yeah, I figured it out. But now what do we do?
As part of our contract with our real-time OS vendor we had the source code. This means I can dig into the underlying code of the network stack, the piece that was most likely choking on that little rapscallion of a packet. It should be noted I am not a kernel hacker at this point. I’m not even a kernel hacker now. I mean, I guess technically I’ve done it once so now I am, but don’t hold me to that. I spelunk through the networking code, which is about as exciting as you think, and finally find a spot where, lo and behold, processing a packet with a return address of zero is going to give you a very bad no good time.
So I add an if (0) goto the great trashbin in the sky. Seriously, basically one line. I wrote one line to fix my two-week journey through madness. Not to worry, aspiring programmers! You’ll almost always get something worse: later on in life I got to play the week-staring-at-code-fixed-by-removing-exactly-1-character game. That was super fun!
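The real fix lived in the vendor’s network stack and is not reproduced here. The shape of it, though, is a plain early-exit guard, and a rough Python rendering looks like the sketch below. `handle_packet` and `process` are invented names, and the packets are assumed to be raw IPv4 bytes. One caveat for anyone tempted to copy this: a blanket drop of source 0.0.0.0 is blunt, because DHCP clients legitimately send from 0.0.0.0 before they have an address, so a stack that needs to hear those requests would want a narrower check.

```python
ZERO_ADDR = b"\x00\x00\x00\x00"
MIN_IPV4_HEADER = 20


def handle_packet(packet, process):
    """Pass a raw IPv4 packet to `process` unless its source address is 0.0.0.0.

    Returns True if the packet was processed, False if it was dropped.
    """
    if len(packet) < MIN_IPV4_HEADER:
        return False                  # too short to be a valid IPv4 packet
    if packet[12:16] == ZERO_ADDR:    # bytes 12-15 hold the source address
        return False                  # the great trashbin in the sky
    process(packet)
    return True


if __name__ == "__main__":
    seen = []
    bad = bytes(12) + ZERO_ADDR + bytes([10, 0, 0, 1])
    good = bytes(12) + bytes([10, 0, 0, 5]) + bytes([10, 0, 0, 1])
    print(handle_packet(bad, seen.append), handle_packet(good, seen.append), len(seen))
```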
And then it’s pretty simple. I compile that one file, replace it in our OS library, get it into our core build and move on with life. That’s it. Just walk away.
Hopefully that vendor fixed things along the way. It’s been over a decade and our snapshot of the code was pretty old, so I don’t feel too bad telling the story.
Coda: Now that I think about it, that poor printer must never have known the actual time, tirelessly sending out NTP requests and never getting a response. A miserable, unloved existence. And for that I am thankful.