The Worst Typo I Ever Made

"When 'undo' won't do..."
StukaFox says...

The worst DevOps mistake I ever made:

Assignment: On ~1,000 -physical- RHEL systems, change the default run level from command line to GUI (don't ask).

Solution: Hey, all our config files are controlled by Puppet, so this'll be easy!

(If you don't know what Puppet does, it enforces file configurations, so if you change a single file on the Puppetmaster, that change is pushed out to all servers running Puppet)
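For the curious, the Puppet end of it is roughly the sketch below -- from memory, with the module name and file layout made up, and I don't remember the exact reboot plumbing, but it was something along the lines of that exec:

    # keep /etc/inittab identical to the master copy in the module
    file { '/etc/inittab':
      ensure => file,
      owner  => 'root',
      group  => 'root',
      mode   => '0644',
      source => 'puppet:///modules/base/inittab',
    }

    # and bounce the box whenever that file changes
    exec { 'reboot for new runlevel':
      command     => '/sbin/shutdown -r now',
      refreshonly => true,
      subscribe   => File['/etc/inittab'],
    }

So the only file I actually had to touch was the master copy of inittab that 'source' points at.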

Ok, all I need to do is edit a single file, change a single number in said file, and issue a single command: reboot. Easy-fuckin'-peasy.

The file I need to change is /etc/inittab -- this file tells a Linux system which "run level" it should initiate upon booting up. runlevel 3 is command line and runlevel 5 is a GUI like Gnome or some other tragic perversion of the whole reason you run Linux in the first place. All I had to do was change from runlevel 3 to runlevel 5. And reboot.
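For anyone who's never had the pleasure, the whole change is one character on one line of that file (paraphrasing from memory, but this is the standard SysV format):

    id:3:initdefault:     <- what every box had (boot to the command line)
    id:5:initdefault:     <- what every box was supposed to get (boot to the GUI)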

So simple; so stupidly simple.

So stupidly simple at 3:00am. When I hadn't slept all night. On a production network. When I'm working from home away from the office. On a Saturday when no one is in said office.

I make my change and save it, then push it to the version control system. Puppet picks it up and pushes the change to ~1,000 physical computers.

Done and done!

Remember I mentioned that I had to change a single file AND execute a single command: reboot?

Here's where things go tragically wrong.

My changes worked PERFECTLY. Everything did exactly what I told it to: Puppet changed the file, and rebooted the servers.

Only they keep rebooting. They keep rebooting over and over and over and over. I can't access any server on the network. Worse, while I'm trying to figure out WTF I did wrong, the 30-minute time-out I'd set on our alerting system, Nagios, expires.

Did I mention that I pushed this change to ~1,000 servers? ~1,000 servers that won't stop rebooting and aren't reporting into Nagios, thus being marked as down?

At 3:31am, on Saturday morning, the pages to ALL the on-call engineers began. One page per engineer per machine. About one every two seconds. And I'm getting paged, too -- except some of the pages are Nagios and some are utterly irate engineers who want to know exactly WTF is going on and I can't tell which is which because I'm getting text-spammed like crazy.

And those servers? They just keep right on rebooting.

At that point, I felt the kind of existential dread that only people who work in IT know -- the kind of dread that arises a picosecond after you've hit ENTER and realized you've typed 'rm -rf /' or some such -- because I knew at that very second exactly what I'd done wrong.

I'd typo'd "5" and made it "6" in the runlevel. And pushed it to ~1,000 -physical- servers. And then rebooted them ALL.

"So," you're asking, "Whyfor is runlevel 6 a big deal?"

Because of this:

runlevel 3: command line.
runlevel 5: GUI.
runlevel 6: REBOOT THE FUCKING COMPUTER.

What I'd done was told every production server on our network to reboot as soon as it rebooted, which leads to another reboot, which leads to another reboot, lather rinse repeat.
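For the record, this is the line that actually went out to ~1,000 machines:

    id:6:initdefault:

The extra-bitter part: the stock RHEL /etc/inittab carries a comment right above that line that reads something like "6 - reboot (Do NOT set initdefault to this)". It was sitting there the whole time.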

At 3:45am on Saturday morning, I knew that every person in IT would have to drive into the office, visit every production server with a bootable USB key, change the BIOS to boot off the key, boot the server into Single User Mode, change the damned file by hand, then reboot the server. This takes about 10 minutes per server -- times ~1,000.
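From the rescue shell, the per-box fix is roughly this (the device name is a stand-in -- it varied from box to box):

    mount /dev/sda1 /mnt
    vi /mnt/etc/inittab       <- change id:6:initdefault: back to id:5:initdefault:
    reboot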

I learned a number of valuable lessons that day:

1. DOUBLE CHECK YOUR FUCKING WORK.
2. See lesson #1
addendum: filing for unemployment insurance in Washington state is amazingly easy.

And that was the very last time I ever worked on physical hardware. To this day, if it's not in the cloud, I ain't fucking touching it.

Here endeth the lesson.

spawnflagger says...

I think for any automated management system, the prudent thing to do would be to test a small subset of servers before pushing the change to all of them. So you might have only had 10 or 100 servers in a reboot loop instead of all 1000.
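Even just a dry run on one box before the big push -- something like

    puppet agent --test --noop

(exact flags depend on your Puppet version, and you'd want show_diff on to see file contents) -- would have flagged that /etc/inittab was about to change before a single server rebooted.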

Also, any SysOp would have the cojones to push back on the initial change request to boot into GUI mode -- you said these are servers, right?

that said, never delete /dev/null.

ant says...

QA test it! Were you a newbie? I'm paranoid these days. I always make backups and test, in case something goes wrong. SOMETHING ALWAYS GOES WRONG, even if it is simple and easy. There are always hidden surprises! Also, I give myself a lot of time if something is critical, in case something does go wrong! I just don't trust anything these days.
