Fess up. You know it was you.
- Create a database,
- Have the organisation manually populate it with lots of records using a web app,
- Accidentally delete the database.
All in between backup windows.
Pretty run of the mill for me, so not that bad: Pushed a long-running migration during peak load hours that locked an important table for an extended period of time, effectively taking our site offline.
Also consider !ask_experienced_devs@programming.dev :)
Installed a flatpak app (can’t remember which one but it wasn’t obscure or shady) and somehow it broke the file system on one of my main machines :) (at least I think that’s what happened, because the machine started lagging, every app refused to launch, and after a reboot I got an fsck error or something like that)
One time I was deleting a user from our MySQL-backed RADIUS database.
DELETE FROM PASSWORDS;
And yeah, if you don’t have a WHERE clause? It just deletes everything. About 60,000 records for a decent-sized ISP.
That afternoon really, really sucked. We had only ad-hoc backups. It was not a well-run business.
Now when I interview sysadmins (or these days devops), I always ask about their worst cock-up. It tells you a lot about a candidate.
BEGIN TRAN
ROLLBACK TRAN
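For anyone who hasn’t picked up that habit yet, here’s a minimal sketch of the idea in T-SQL, using a hypothetical PASSWORDS table and username as stand-ins for the real schema: do the delete inside a transaction, check how many rows it actually touched, and only commit if the count matches what you expected.
BEGIN TRAN;
-- hypothetical single-user delete; table and username are stand-ins
DELETE FROM PASSWORDS WHERE username = 'jdoe';
-- capture how many rows the DELETE actually hit
DECLARE @deleted INT = @@ROWCOUNT;
IF @deleted = 1
    COMMIT TRAN;      -- exactly the one row we meant to remove
ELSE
    ROLLBACK TRAN;    -- anything else means the WHERE clause was wrong; put it all back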
I was a sysadmin in the US Air Force for 20 years. One of my assignments was working at the headquarters for AFCENT (Air Forces Central Command), which oversees every deployed base in the Middle East. Specifically, I worked on a tier 3 help desk, solving problems that the help desks at deployed bases couldn’t figure out.
Normally, we got our issues in tickets forwarded to us from the individual base’s Communications Squadron (the IT squadron at a base). But one day, we got a call from the commander of a base’s Comm Sq. Apparently, every user account on the base had disappeared and he needed our help restoring accounts!
The first thing we did was dig through server logs to determine what caused it. No sense fixing it if an automated process was the cause and would just undo our work, right?
We found one Technical Sergeant logged in who had run a command to delete every single user account in the directory tree. We sought him out and he claimed he was trying to remove one individual, but accidentally selected the tree instead of the individual. It just so happened to be the base’s tree, not an individual office or squadron.
As his rank implies, he’s supposed to be the technical expert in his field. But this guy was an idiot who shouldn’t have been touching user accounts in the first place. Managing user accounts is an Airman’s job: a simple job given to our lowest-ranking members as they’re learning how to be sysadmins. And he couldn’t even do that.
It was a very large base. It took 3 days to recover all accounts from backup. The Technical Sergeant had his admin privileges revoked and spent the rest of his deployment sitting in a corner, doing administrative paperwork.
I always write the WHERE clause first, since a fuck-up in my early 20s lost a loans company £40k of business.
My trick is writing it as a SELECT statement first, making sure it’s returning the right number of records, and then switching out the SELECT for DELETE. Hasn’t steered me wrong yet.
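Roughly what that workflow looks like, sketched against the same hypothetical PASSWORDS table as above: run the query as a SELECT first, eyeball the row count, then swap the verb and reuse the identical WHERE clause.
-- dry run: this should return exactly the rows you intend to remove
SELECT * FROM PASSWORDS WHERE username = 'jdoe';
-- once the count looks right, swap SELECT * for DELETE and keep the WHERE clause untouched
DELETE FROM PASSWORDS WHERE username = 'jdoe';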
This.
The hero we don’t deserve.
Always skeptical of people that don’t own up to mistakes. Would much rather they own it and speak to what they learned.
Exactly!
It’s difficult because you have a 50/50 chance of having a manager that doesn’t respect mistakes and will immediately get you fired for it (to the best of their abilities), versus one that considers such a mistake to be very expensive training.
I simply can’t blame people for self-defense. I interned at a ‘non-profit’ where there had apparently been a revolving door of employees fired for making entirely reasonable mistakes. Looking back at it a dozen years later, it’s no surprise that nobody was getting anything done in that environment.
Incredibly short-sighted, especially for a nonprofit. You just spent some huge amount of time and money training a person to never make that mistake again; why would you throw that investment away?
I worked for a company where the testing database was also the only backup.
“Stop” is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you’re done testing or developing something for the day but you’ll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it’s just like power cycling a computer.
“Terminate” is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It’s the “nuke it from orbit” option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you’re very, very lucky, you restore from backups (or an AMI).
Apparently Terminate means stop and destroy. Definitely something to use with care.
Maybe there should be some warning message… Maybe a question requiring you to manually type “yes I want it” or something.
Maybe an entire feature that disables it so you can’t do it accidentally, call it “termination protection” or something
deleted an entire column in a police department’s evidence database
Based and ACAB-pilled
And if you couldn’t reconstruct, you still had backups, right? … right?!
What the fuck is a “backups”?
He’s the guy that sits next to fuckups
Plugged a server back in after it had been repaired. The person whose responsibility it was insisted it would be fine, but they hadn’t released the FSMO roles from it, its clock was an hour out, it changed the time EVERYWHERE and broke ALL THE THINGS. Not technically my fault, but I should have pushed harder for them to demote it before I turned it back on.
Updated WordPress…
Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.
Fuck WordPress for static sites…
Two exhibitors, both alike in ~~dignity~~ naming. One needed a critical software update on their Doremi to fix an issue. The other was running The Force Awakens to a packed auditorium.
That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supports commits and diffing.
Flushed the entire AD, not realizing I had somehow gotten back into prod.
When I was still a wee IT technician, I was supposed to remove some cables from a patch panel. I pulled at least two cables that were carrying iSCSI traffic from the hypervisors to the storage bays. During production hours. Not my proudest memory.
Set off cascading event bus loops that ran out of control. Friends don’t let friends allow events to spawn more events.
Was troubleshooting a failed drive in a RAID array on a small business DC/file server/print/everything-else box. The replacement drive still showed as failed, so I moved it to another bay, thinking it was the slot and not the drive. Accidentally hit yes when asked to initialize the array and blew the whole thing away. It was an OLD server the customer was working on replacing, so I told them it had finally given up the ghost and I was taking it back to the office to keep working on it. I had been on the job for about 4 months and thought for SURE I was fired. Turns out we were already working on moving them to the cloud, so it ended up not being a big deal.