Microsoft’s Azure cloud service aims for near-perfect uptime, but that doesn’t mean it is immune to major disasters. Here are some of the mistakes Microsoft has made that caused Azure to temporarily break, and the lessons you can learn as a developer no matter which cloud platform you deploy on.
Microsoft technical lead Mark Russinovich gave a presentation at Microsoft’s recent Build developer conference in San Francisco discussing some of the problems Azure has run into. This is a selection of the issues he highlighted.
“Some of these rules, you might say ‘duh that’s obvious’ and some of them are obvious, but for some reason we keep making them,” he said. And that’s not a good thing: “The cloud is very unforgiving. It will remind you of every mistake you make.”
Mistakes are also inevitable. “It’s a world where you test in production,” Russinovich said.
Mistake: Not being strict about case
One extended outage of the Azure portal for some customers was eventually traced to a change which altered the case sensitivity of the service. The update forced resources to be referred to with names entirely in lower case, and broke if those names were in mixed or upper case.
“You need to be sensitive about case,” Russinovich noted. “Case has to be handled on a case-by-case basis.” [You may groan if you wish.]
A good rule of thumb? User-friendly text should be case-sensitive, but other elements shouldn’t be. And whatever rules you adopt, make sure you document the expected case mix at any service exit points.
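As a rough illustration of that rule of thumb, here is a minimal Python sketch of a lookup table that matches resource names without regard to case while keeping the name exactly as the user typed it for display. The `ResourceRegistry` class and its methods are hypothetical names invented for this example; nothing here reflects how Azure itself handles names.

```python
class ResourceRegistry:
    """Hypothetical registry: identifiers match case-insensitively,
    but the user's original spelling is preserved for display."""

    def __init__(self):
        self._by_key = {}

    @staticmethod
    def _key(name: str) -> str:
        # casefold() is the str method intended for caseless matching.
        return name.casefold()

    def add(self, name: str, value) -> None:
        key = self._key(name)
        if key in self._by_key:
            existing, _ = self._by_key[key]
            raise ValueError(f"{name!r} conflicts with existing resource {existing!r}")
        # Store the display name untouched alongside the value.
        self._by_key[key] = (name, value)

    def get(self, name: str):
        # Returns (display_name, value); raises KeyError if the name is unknown.
        return self._by_key[self._key(name)]
```

With this in place, `get("mystorage")` and `get("MYSTORAGE")` both find a resource added as `"MyStorage"`, and the stored display name never changes. The point of the design is that the normalisation lives in exactly one function, so the case rule is documented in one place.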
Mistake: Not logging everything in context
“We run into the same issues again and again with logging,” Russinovich said. The most common one is simply not logging enough. The more detail you have about how a service operates, the better your chance of working out what happened when something goes wrong. Error logs need to report names and specify exact features, not just basic activity.
“Log as if nobody was looking,” Russinovich said. “You never know when that extra piece of info you put in there can shave hours off a debugging session.”
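One way to get names and context into every log line is to attach them once rather than repeating them in each message. The sketch below uses Python’s standard `logging.LoggerAdapter` for that. The logger name, the `request_id` and `resource` fields, and the `provision` function are all assumptions made up for the example.

```python
import logging


def make_logger(stream=None) -> logging.Logger:
    """Build a logger whose format includes per-request context fields."""
    logger = logging.getLogger("provisioning")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    # request_id and resource are supplied by the LoggerAdapter below,
    # so every record sent through this logger must come via an adapter.
    handler.setFormatter(logging.Formatter(
        "%(levelname)s %(name)s request=%(request_id)s "
        "resource=%(resource)s %(message)s"
    ))
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def provision(logger: logging.Logger, request_id: str, resource: str, size_gb: int) -> bool:
    """Hypothetical operation that logs what it was asked to do, and why it failed."""
    ctx = logging.LoggerAdapter(logger, {"request_id": request_id, "resource": resource})
    ctx.info("provisioning started size_gb=%d", size_gb)
    if size_gb <= 0:
        # Name the exact value that was rejected, not just "invalid input".
        ctx.error("provisioning rejected: size_gb=%d is not positive", size_gb)
        return False
    ctx.info("provisioning finished")
    return True
```

Every line emitted through the adapter carries the request and resource names, so a single failing line is enough to identify what was being worked on.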
Mistake: Failing to respect code hygiene
Code hygiene always matters. A single extra comma once flooded the entire Azure audit logging system.
You should also resist the temptation to ignore what appear to be minor errors. “Don’t suppress compiler warnings, ever,” Russinovich said.
He advises a simple approach to dealing with exceptions: don’t fail if you can avoid it, but don’t pretend it didn’t happen either. “Crash if it’s not user-initiated and unrecoverable, otherwise log and return. Don’t throw exceptions for expected errors.”
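That policy could look something like the following Python sketch. Bad user input is treated as an expected error, so it is logged and returned as a value rather than raised, while a broken internal invariant is allowed to stop the process. The `Result` type and both functions are hypothetical, and this is only one reading of the advice.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("quota")


@dataclass
class Result:
    ok: bool
    value: Optional[int] = None
    error: Optional[str] = None


def parse_quota(text: str) -> Result:
    """User-initiated and expected: log and return, do not raise."""
    try:
        quota = int(text)
    except ValueError:
        log.warning("rejected quota input %r: not an integer", text)
        return Result(ok=False, error="quota must be an integer")
    if quota < 0:
        log.warning("rejected quota input %r: negative", text)
        return Result(ok=False, error="quota must not be negative")
    return Result(ok=True, value=quota)


def remaining_quota(used: int, quota: int) -> int:
    """Not user-initiated and not recoverable: fail loudly."""
    if used < 0:
        # A negative usage counter means internal state is corrupt.
        # Let this propagate and crash rather than carry on with bad data.
        raise RuntimeError(f"usage counter is negative ({used}); internal state is corrupt")
    return quota - used
```

Callers check `result.ok` for the expected failure cases, and nothing catches the `RuntimeError`, so it surfaces immediately instead of being silently swallowed.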
Mistake: Assuming input data is valid
One Azure customer who kept generating new system images couldn’t work out why they repeatedly failed to provision. The cause proved to be a corrupt VHD, which had slipped into the system because there was no integrity checking.
“In the cloud, assume data can get corrupted in flight, on disk, as you handle the data,” Russinovich said. “Perform integrity checks at every stage.”
That issue highlights the importance of automating as many processes as possible in a cloud environment. “Many of our incidents are caused by humans making mistakes in places where we hadn’t yet automated.”
You also need to ensure those input mechanisms are up to the task. “At the rate of data transmission we’ve got, we have to use CRC 64 for checksums because there’s that much volume that CRC 32 is going to let network errors slip through.”
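To make “integrity checks at every stage” concrete, here is a minimal Python sketch that verifies a digest after an image is received, after it is stored, and again before it is used. Note the substitution: Python’s standard library has no CRC-64, so this example uses SHA-256 from `hashlib` instead of the checksum Russinovich mentions. The in-memory `store` dictionary and all function names are assumptions for illustration only.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when data does not match its expected digest."""


def digest(data: bytes) -> str:
    # SHA-256 stands in here for the CRC-64 mentioned in the talk.
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str, stage: str) -> bytes:
    """Return data unchanged if its digest matches, else raise with the stage name."""
    actual = digest(data)
    if actual != expected:
        raise IntegrityError(
            f"{stage}: digest mismatch, expected {expected[:12]}..., got {actual[:12]}..."
        )
    return data


def upload_image(image: bytes, expected: str, store: dict, name: str) -> None:
    verify(image, expected, "received")      # corrupted in flight?
    store[name] = image
    verify(store[name], expected, "stored")  # corrupted while writing?


def load_for_provisioning(store: dict, name: str, expected: str) -> bytes:
    # Check again on the way out: data can also go bad at rest.
    return verify(store[name], expected, "read before provisioning")
```

Because each check names its stage, a failure says not only that an image is corrupt but roughly where the corruption was introduced.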
Mistake: Updating everything at once
Cloud computing requires updates to live environments, but those updates don’t have to be applied everywhere at once. Using a combination of a staged testing environment and a production environment is essential.
Russinovich suggests updating a single instance and directing a small percentage of traffic to that instance. Check the failure rate compared to the existing live version, and don’t update again unless those rates are comparable.
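The comparison step can be sketched in a few lines of Python. The routing rule, the `tolerance` margin and the `min_requests` threshold below are all assumed values picked for the example, not figures from the talk; in practice you would choose them to suit your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0


def route_to_canary(request_id: int, canary_percent: int) -> bool:
    """Deterministically send about canary_percent of requests to the updated instance."""
    return request_id % 100 < canary_percent


def safe_to_continue(baseline: Stats, canary: Stats,
                     tolerance: float = 0.005, min_requests: int = 1000) -> bool:
    """Allow the next update step only if the canary's failure rate is comparable."""
    if canary.requests < min_requests:
        # Too little traffic to judge; keep waiting rather than guess.
        return False
    return canary.failure_rate <= baseline.failure_rate + tolerance
```

For example, with a baseline of 100 failures in 100,000 requests, a canary with 6 failures in 5,000 requests passes, while one with 200 failures in 5,000 does not, and the rollout stops there.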