As we settle into the time of year when we reflect on what we’re thankful for, we tend to focus on important basic things like health, family, and friends.
But on a professional level, IT operations (ITOps) professionals are thankful when they can avoid catastrophic outages that cause confusion, frustration, lost revenue, and damaged reputations. The very last thing ITOps, Network Operations Center (NOC) or Site Reliability Engineering (SRE) teams want while eating their turkey and enjoying time with family is to be called in about an outage. These can be extremely expensive: $12,913 per minute, in fact, and up to $1.5 million per hour for larger organizations.
However, to appreciate the peace of mind that comes with avoiding downtime, you have to have experienced the pain and anxiety of downtime firsthand. Here are a handful of horror stories ITOps pros are eager to avoid this season.
A case of a janky command structure
An experienced IT professional was on duty with three others at 7 p.m. when the crew received an alert about a problem with the front-end user interface of the Global Traffic Manager appliance. Fortunately, there was a runbook for it in a database, so it looked like the problem would be solved soon. One of the team members saw two things to type in: a command and a secondary input. He typed in the command and, based on how the runbook looked, waited for the command line to prompt him for input, such as “what do you want to reboot?”
The way the command structure was set up, if you didn’t give any input, the command would reboot the device. He typed what he thought was the right command, “bigstart restart”, and the entire front-end global traffic manager went down.
As a reminder, this took place in the early evening. The client was a finance company and the system went down just around the time businesses were closing and trying to do their accounting and other financial tasks. Terrible timing, to say the least.
Five minutes into the outage, the ITOps team realized what had happened: the tool they were using for their runbook wrapped text by default, so what appeared to be two separate commands was actually just one. Although the outage was relatively brief, it came at a critical time and set off a chain reaction of headaches. Lesson learned? Make sure your command structure is optimized.
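One way to picture a safer command structure is a small guard that refuses to run a wide-reaching command when no explicit target follows it. The story does not say what the team changed afterward, so this is only a minimal sketch: the names REQUIRES_TARGET and check_step and the placeholder target some_service are assumptions made for illustration, and the only detail taken from the story is the command string itself.

```python
import shlex

# Hypothetical list of commands that act on everything when run with no target.
REQUIRES_TARGET = {("bigstart", "restart")}


def check_step(line: str) -> list[str]:
    """Split one runbook line into tokens.

    Raises ValueError if the line is exactly a command from REQUIRES_TARGET
    with nothing after it, so the operator has to name a target explicitly.
    """
    tokens = shlex.split(line)
    for prefix in REQUIRES_TARGET:
        n = len(prefix)
        if tuple(tokens[:n]) == prefix and len(tokens) == n:
            raise ValueError(f"{' '.join(prefix)!r} needs an explicit target")
    return tokens
```

With a guard like this, a line that reads only “bigstart restart” is rejected before anything runs, while the same command followed by a named target passes through unchanged.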
When Google is your best friend in the middle of the night
For a 15+ year IT veteran, what seemed like a quiet night shift quickly turned into a terrifying nightmare. “I never panicked as quickly as when the remote terminal I was in suddenly went out,” he said.
He was trying to restart a service while working on a remote computer, but accidentally disabled the network adapter in the process. Calling someone and waking them up in the middle of the night to say that he had “destroyed” a network adapter wasn’t ideal, so he and his teammates did some digging.
After what he calls “not an insignificant amount of Googling,” he was able to find his way to a Dell server and reboot the network adapter from there. It took longer than necessary, but the problem was eventually solved.
His pro tip: “Don’t disable the network adapter on a computer you’re remotely controlling in the middle of the night.” That may sound obvious, but the underlying lesson is to have a contingency plan in case something goes horribly wrong.
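One common form of contingency plan for remote changes is an automatic rollback: arm a timer that undoes the change unless the operator confirms they still have access. The sketch below is an illustration under assumptions, not what this team did. The function name arm_rollback and the idea of passing in a rollback callable are hypothetical; threading.Timer is part of Python's standard library.

```python
import threading
from typing import Callable


def arm_rollback(rollback: Callable[[], None], timeout_s: float) -> threading.Timer:
    """Schedule `rollback` to run after `timeout_s` seconds.

    Returns the timer. Calling .cancel() on it before the timeout confirms
    the change and prevents the rollback from running.
    """
    timer = threading.Timer(timeout_s, rollback)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

The operator arms the timer on the remote machine, makes the risky change, and cancels the timer only if the session survives. If the connection drops, nobody cancels, and the rollback runs on its own.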
ITOps: Leaning on email used to be great – until it wasn’t anymore
When email was the primary way NOC teams received alerts, a longtime IT pro recalls having a teammate whose only job was essentially dispatch: checking emails and creating tickets, some for incidents that needed immediate attention and others for those that could wait. The system worked well, but at a large multinational company it was a time bomb waiting to explode.
That fear was realized when the company’s entire data center went down.
That was a set of problems in itself, but the incident generated so many email alerts that it also crashed the corporate Outlook server. “At that point you are really blind,” recalls this IT hero.
The incident occurred in the middle of the night, so the team on duty reluctantly had to wake up fellow teammates. After the problem was finally solved, the team developed a sense of humor about it. As they recalled, “We used to joke that we DDoSed ourselves with our own alert noise. Good times!”
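A flood like that is easier to survive when repeated alerts are collapsed before they are sent. As a rough sketch only, and not a description of any particular product, the function below groups a burst of alerts into a short digest with counts. The name summarize_alerts and the alert fields source and message are assumptions made for illustration.

```python
from collections import Counter


def summarize_alerts(alerts: list[dict], max_lines: int = 20) -> list[str]:
    """Collapse a burst of alerts into one digest.

    Produces one line per distinct (source, message) pair, most frequent
    first, capped at max_lines, plus a trailing note if anything was cut.
    """
    counts = Counter((a["source"], a["message"]) for a in alerts)
    lines = [
        f"{n}x {src}: {msg}"
        for (src, msg), n in counts.most_common(max_lines)
    ]
    hidden = len(counts) - len(lines)
    if hidden > 0:
        lines.append(f"... and {hidden} more distinct alerts")
    return lines
```

Sending one digest instead of one email per alert keeps the volume bounded by the number of distinct problems rather than the number of times each one fires.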
Ultimately, the overarching moral of the story is this: every time a hand touches a keyboard, there’s a risk that something could go wrong. Of course, this is sometimes unavoidable, but teams that can automate and simplify their IT processes as much as possible are giving themselves the best chance of avoiding costly outages so they can enjoy their Thanksgiving feasts undisturbed.
Mohan Kompella is vice president of product marketing at BigPanda.