Tuesday 3 March 2015

How Microsoft is stamping out Azure failures and improving reliability

Introduction


In 2014, Microsoft's Azure cloud service was more popular than ever with businesses – but it also suffered some very visible outages. In the worst of them, the failure of Azure Storage took out many other Azure services, and details for companies trying to get their systems back online were few and far between.


We spoke to Azure CTO Mark Russinovich about what went wrong, what Microsoft can do to make Azure more reliable, and how it plans to keep customers better informed.


Major faux pas


We've known since November that the problem with Azure Storage wasn't just that a performance improvement actually locked up the database – it was that one of the Azure engineers rolled out the update not just to the Azure Table storage it had been tested on, but also to the Azure Blob storage it hadn't been tested with – and that's where the bug was. And because the update had already been tested, the engineer decided not to follow the usual process of deploying to just one Azure region at a time, and instead rolled it out to all the regions at once.
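

Microsoft hasn't published its deployment tooling, but the "one region at a time" process it skipped is easy to sketch. This is a hedged Python illustration only – the region names, the deploy callable and the healthy() probe are hypothetical stand-ins, not Azure's actual pipeline:

    import time

    REGIONS = ["region-1", "region-2", "region-3"]   # hypothetical region names

    def healthy(region):
        """Stand-in health probe; a real pipeline would query telemetry."""
        return True

    def staged_rollout(deploy, regions=REGIONS, soak_seconds=3600):
        """Push an update to one region at a time, watching health in between.

        The all-regions-at-once shortcut taken during the outage skips
        exactly this loop.
        """
        for region in regions:
            deploy(region)                 # deploy is a callable supplied by the caller
            time.sleep(soak_seconds)       # let the update soak before moving on
            if not healthy(region):
                raise RuntimeError(f"Update degraded {region}; halting rollout")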


That region-wide rollout caused other problems, because Azure Storage is a low-level piece of the system. "Azure is a platform that's designed in levels; there are lower levels, higher levels and apps on top," Russinovich explains. "There are dependencies all the way down, but we have principles like one component can't depend on another component that depends on it, because that would stop you recovering it.


"We've designed the system so the bottom-most layers are the most resilient because they're the most critical. They're designed so they're easy to diagnose and recover. Storage is one of those layers at the bottom of the system, so many other parts of Azure depend on Azure Storage."


Automation means fewer errors


Despite the complexity of Azure, problems are more likely to be caused by human error than by bugs, and the best way to make the service more reliable is to automate more and more, Russinovich explained. "We've found through the history of Azure that the places where we've got a policy that says a human should do something, a human should execute this instruction and command, that's just creating opportunities for failure. Humans make mistakes of judgement, and they make inadvertent mistakes as they try to follow a process. You have to automate to run at scale, for resiliency and for the safety of running at scale."


The teams developing Azure are responsible for testing their own code, running both basic code tests and tests of how well it works with the rest of Azure on a series of test systems. "We try to push tests as far upstream as possible," Russinovich told us, "because it's easier to fix problems. Once it's further along the path, it's harder to fix it – and it's harder to find a developer who understands the problem."
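

As a generic illustration of what "pushing tests upstream" means in practice – this is a pytest sketch with made-up names, not Microsoft's pipeline – fast unit tests run on every change, while slower end-to-end tests are marked so they run at a later, more expensive stage:

    import pytest

    def apply_update(rows):
        """Toy stand-in for the change under test."""
        return sorted(rows)

    # Upstream: a fast unit test that runs on every commit, long before deployment.
    def test_apply_update_orders_rows():
        assert apply_update([3, 1, 2]) == [1, 2, 3]

    # Downstream: an end-to-end test, marked so the pipeline defers it to a later
    # stage, where failures are slower to diagnose and fix.
    @pytest.mark.integration
    def test_apply_update_against_test_cluster():
        pytest.skip("needs a test cluster; runs in the integration stage")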


That difficulty in tracking down the right developer caused a problem in another outage, when performance improvements in SQL Azure hadn't been documented and it took time both to find the problem and to track down a developer who knew the code.


After testing, building and deploying the code is supposed to be fully automated, based on what's been tested. "When the developer put the code into the deployment system, the setting he gave it was enabled for blobs and tables, and that was a violation of policy," he told us. "The deployment system now takes care of that configuration step, rather than leaving it up to the developer."
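

Russinovich doesn't describe the new deployment system in detail, but the idea – derive the rollout configuration from what was actually tested instead of trusting a hand-entered setting – can be sketched like this (the service names and the tested-services record are hypothetical):

    TESTED_SERVICES = {"table"}   # hypothetically recorded by the test pipeline

    def build_flight_config(requested_services, tested_services=TESTED_SERVICES):
        """Only enable an update for the services it was actually tested against."""
        untested = set(requested_services) - set(tested_services)
        if untested:
            raise ValueError(
                f"Refusing to enable update for untested services: {sorted(untested)}"
            )
        return {service: True for service in requested_services}

    # The November scenario: asking for blobs as well as tables now fails fast
    # instead of silently widening the rollout.
    # build_flight_config({"table", "blob"})  -> ValueError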


That change should mean not only that this particular problem won't happen again, but that no-one else can make the same sort of error on a different Azure service.


Automatic rollback


Ideally, the system would also automatically roll back updates that cause problems; this already happens on Bing, but it's not always possible. "We have some metrics that allow us to tell if a build is doing well or not and to go backwards if we need to. But that's not always something we can do. For example, if we've made an update to a database schema, we can't roll it back because you'd lose data. If we've put out a feature that allows a customer to create something, we can't go back because they'd lose what they've built. So we have to do what we call roll forward – fix the bug and get it out."
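

A hedged sketch of that decision – the metric and threshold here are invented, but it captures the "roll back if you can, roll forward if you can't" logic Russinovich describes:

    BASELINE_ERROR_RATE = 0.01   # hypothetical acceptable error rate for a build

    def next_action(build_error_rate, reversible):
        """Decide what to do with a build based on its health metrics.

        `reversible` is False for changes such as schema migrations or new
        customer-facing features, where rolling back would destroy data or work.
        """
        if build_error_rate <= BASELINE_ERROR_RATE:
            return "keep"
        return "roll back" if reversible else "roll forward (ship a fix)"

    print(next_action(0.002, reversible=True))   # keep
    print(next_action(0.09, reversible=True))    # roll back
    print(next_action(0.09, reversible=False))   # roll forward (ship a fix)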


Many Azure failures happen without anyone ever noticing, he points out. "The system is designed to auto-recover from failure as much as possible. For example, with a service like Azure DB or Azure Storage there are machine failures all the time in the clusters because we're running at such large scale. We lose 2-3% of our servers every year, but it's completely transparent to anybody running on those services because of the way they're designed. We monitor it to see if there's a problem causing those failures, but a server failing is invisible to customers."
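

The monitoring question, then, isn't whether servers fail but whether they're failing faster than expected. A rough Python sketch, using the 2-3% annual loss rate quoted above as the baseline (the threshold itself is invented):

    EXPECTED_ANNUAL_FAILURE_RATE = 0.03   # upper end of the 2-3% a year quoted above

    def failures_look_anomalous(failed_servers, total_servers, days_observed):
        """True if a cluster is losing servers well above the background rate.

        Individual failures are absorbed by replication and invisible to
        customers; what matters is a pattern of too many of them.
        """
        expected = total_servers * EXPECTED_ANNUAL_FAILURE_RATE * (days_observed / 365)
        return failed_servers > 2 * max(expected, 1)   # crude 2x-baseline threshold

    print(failures_look_anomalous(failed_servers=4, total_servers=10_000, days_observed=7))   # False
    print(failures_look_anomalous(failed_servers=60, total_servers=10_000, days_observed=7))  # True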


Lessons in Azure


One lesson Microsoft has learned from the most recent Azure outages is that it needs a system outside Azure where customers can check on the status of the service. "The dashboard is a higher-level piece of the platform and it depends on parts of the system above Azure Storage and the core services. It needs to be higher level to be public facing, but it totally makes sense that even if Azure is down we need to communicate with customers. So we've come up with a system to fail that dashboard over to somewhere outside Azure if it runs into problems because it depends on another part of Azure."
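

How that failover is wired up isn't public, but the shape of it is simple enough to sketch – the URLs below are placeholders, with the fallback hosted entirely outside the platform whose status it reports:

    import urllib.request

    PRIMARY_STATUS_URL = "https://status.example-cloud.test/health"     # hosted on the platform
    FALLBACK_STATUS_URL = "https://status-fallback.example-cdn.test/"   # hosted elsewhere

    def status_page_url(timeout=3):
        """Point users at the in-platform dashboard when it responds, else fail over."""
        try:
            with urllib.request.urlopen(PRIMARY_STATUS_URL, timeout=timeout) as response:
                if response.status == 200:
                    return PRIMARY_STATUS_URL
        except OSError:
            pass   # the dashboard's own dependencies may be part of the outage
        return FALLBACK_STATUS_URL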


But Microsoft isn't going to try that with any other Azure features, he says. "It's just a communication interface – it's not a service. We can't fail over an Azure service out of Azure, because we'd need another Azure to run it on!"







from TechRadar: All latest feeds http://ift.tt/1wPdX22

via IFTTT
