Six months ago, Hurricane Sandy posed a major challenge for data centre operators across America’s north-east coast, and the lessons learned by businesses in that area are useful for anyone trying to plan their disaster recovery (DR) strategies. Here are ten issues to keep in mind, covering everything from checking that your software licences are still valid in your DR site to making sure staff have access to essential medicines.
The opening panel at Data Center World in Las Vegas, which I’m attending as part of Lifehacker’s ongoing World Of Servers coverage, featured two speakers who were impacted by Sandy, which ripped through many of the mid-Atlantic coastal states in late October last year. Alex Delgado is the global operations and data centre manager for International Flavors & Fragrances, a scent and flavouring manufacturer whose Union Beach data centre provided access to over 50 core business systems. Donna Manley is the IT senior director at the University of Pennsylvania, and manages a data centre located above a food court in busy downtown Philadelphia. Here’s what they learned.
Lesson 1: It can always get worse
Any sane data centre manager will have a disaster recovery plan, but you can still get caught by surprise. “When we set out to create our DR plan, we tried to plan for the impossible but it turned out to be for the unbelievable,” Delgado said. “Nothing we could have prepared for would have ever been able to get close to where we needed to be when we got hit with the storm. At that point, as a data centre manager, I wanted to probably quit.”
Further complicating matters, IFF’s offshore operations team in Chennai was hit with a cyclone six hours after Sandy, which meant those staff weren’t available either. And the bad luck continued: “Only one pole came down in our area, but it carried half our Internet and all our power,” Delgado said.
Modern weather forecasts mean that awareness of Sandy was high well before the storm hit. Delgado began planning a week ahead, as did Manley. Yet that early warning can also give you more time to worry. “We had seen in prior heavy rainstorms and hurricane activity that moisture actually comes up through the floor where we have our substations and UPS batteries,” Manley said.
Lesson 2: Physical access can be difficult
Natural disasters don’t just cut power — they also cut roads, which can make delivering replacement equipment impossible and stop staff from being able to reach the site. “When the storm hit, our first responders were also impacted, so it was a challenge to get them all on site — some of them never got there,” Delgado said. “Thankfully we were able to have a lot of them work from home.”
Damaged roads are often tightly policed. “People not allowed on the roads caused a very big problem for us,” Manley said.
That also impacted the supply of parts for damaged equipment. “We couldn’t just drive an 18 wheeler into our parking lot, so we had to be a little creative about how we got our equipment in,” Delgado said. Manley began ordering backup equipment as soon as the storm warnings began.
Lesson 3: The business may change its mind
DR plans are usually agreed in advance, and business leaders are asked to identify what they see as the most critical platforms to restore when there’s an emergency. Those decisions won’t always gel with operational reality.
“We employ 150 of the world’s 500 perfumers,” Delgado said. “There’s not too many of them around, and they’re a finicky group. If we’re developing something for a major brand and they expect it in a week, if the system’s down, it’s a problem. We planned in our DR that we could support a week of outage. When we went down, after three days it was like ‘The client is calling us’. I had to tell them: ‘Talk to legal, I’m busy’.”
That attitude can become apparent even when performing routine pre-DR maintenance, as Manley discovered during a scheduled outage for the university’s power system. “The bigger issue for us is not the infrastructure power down, it’s getting the outage window and trying to convince people we need to have it.”
“Review all your non-critical systems,” Delgado advised. “If the business says ‘we don’t need it back right away’, go back and ask them ‘are you sure?’ You don’t want the business coming back and saying ‘that thing that wasn’t so important, we actually need it’.”
Lesson 4: Plan your backup windows carefully
When disaster is looming, regular backup plans may need to be changed. “We made a decision to terminate our backup storage and mirroring process,” Manley said. “This was a pre-emptive strike so that if we lost power, that mirroring process did not corrupt the data.”
Restoring from a recovery site can also take longer than expected. “We have over 50TB of data we replicate to our DR site,” Delgado said. “You can’t just copy 50TB and switch back after three weeks.” IFF didn’t fully fail back to its original centre until December.
Lesson 5: The challenges of staff
We’ve already alluded to the importance of knowing whether staff will be able to reach the site to perform DR tasks. “My big concern was about getting people to the site,” Manley said. “We have been in cost containment and have implemented a ton of automation. The good news is we have a lot of automation. The bad news is we don’t have a lot of bodies because of that automation, so understanding the geographic diversity of the staff was important to us.”
Ensuring teleworking systems are set up can help, but isn’t infallible. “Although a number of people can work from home, Internet access and electrical outages hit us pretty hard. All we can do is make sure that the telecommuting capabilities were there. Obviously people were at risk of losing power, but we gave them the tools necessary so they could do what we needed them to do.”
Technical issues aside, it can be difficult to ask people to put aside family concerns during a natural disaster. “We had a lot of newlyweds and new parents on staff and they weren’t going to leave their spouses and newborns at home with no power,” Delgado said.
“All you can do is ask and be understanding that obviously a person’s family is going to come first,” Manley said. For core staff, the idea that they’ll be available for DR work should be defined as part of the job. “Use those opportunities around job descriptions and performance appraisals to set expectations. Everybody in the business knows that’s kind of a given,” Manley said.
Lesson 6: Keep track of non-IT resources
Manley’s checklist for essential supplies includes changes of clothing and medication for any staff member who might need them. IT staff with medical needs are asked to keep a three-day supply in a sealed envelope in their desk to ensure that doesn’t become an issue during an extended stay.
Lesson 7: Check licensing and contracts carefully
“Not everything is always defined properly in a formal DR policy,” Manley noted. If you’re planning to shift to a secondary site, you need to ensure software licences allow it. “We’re a big Remedy shop,” Manley said. “All of my workflows are automated. I needed to make sure that Remedy and BMC proper were going to be available — not just technically but from a licensing standpoint.”
You also need to be aware of contractual arrangements with your backup service providers. IFF didn’t shift its provider to standby until the storm was close, since doing so would be expensive. “There’s a dollar cost to putting them on standby, so we didn’t want to go too early,” Delgado said.
Manley took a different approach. “I had our provider on alert four days prior. In our contract we do not have to pay to be put on alert. Read that fine print, because there are those kinds of differences you can negotiate.”
Lesson 8: Stay friendly with your vendors
Even if you can source replacement gear, paying for it can be challenging. “I spent close to $2 million in four days,” Delgado said. That’s tricky when your company card limit is $50,000 and your internal procurement systems are down. “If you don’t have a good relationship with your vendors, start shaking some hands.”
Lesson 9: The cloud is your friend
Cloud services can play a useful role in DR planning. “What we had done prior to the Sandy situation was to make documentation available on Box.net. We were glad we did that. For us, documentation was a big one. We didn’t have to worry if the servers went down. Software and software tools were automated, but we needed to know what we had to bring up and what we could get away with curtailing.” Post-Sandy, both sites foresee a bigger expansion into cloud services.
Lesson 10: You can overcome problems
Despite all those challenges, both IFF and the University of Pennsylvania weathered the crisis. The full crisis mode for the university only lasted for 28 hours. That served as proof that the basic DR plan was solid, but it also provided some unexpected lessons.
“From a process standpoint, one of the things we have done in the past was a planned power outage to do some maintenance,” Manley said. “During the crisis, we found that the power-down and power-up plan wasn’t current — we need to tie that into our change management process. That was a big process gap we never would have found without a situation such as this.”
IFF had its main centre working again within 12 hours of the disaster, and all key systems up within 72 hours.
“We didn’t lose a single order,” Delgado said. “We put some on standby, but we’ve earned the respect of our customers, and they gave us a little slack. The longest order was maybe delayed by four hours. We were lucky and we did well.”
Lifehacker’s World Of Servers sees me travelling to conferences around Australia and around the globe in search of fresh insights into how server and infrastructure deployment is changing in the cloud era. This week, I’m in Las Vegas for Data Center World, looking at how the role of the data centre is changing and evolving.