Microsoft Service Outage

On the 16th of March, Microsoft suffered an outage of its identity management platform, Azure Active Directory. This service is used to authenticate almost all Microsoft services, including Office 365. The outage had a massive impact not only on our business, but also on those of our clients.

The outage raised several questions about the reliance we all have on Cloud services, how we can communicate effectively during an outage, and the value of Cloud versus on-premises solutions.

Our Global Incident Response

We have been working for some time to strengthen our Global Incident Response process, which covers any incident that involves more than one client. One of the challenges is how to communicate effectively when core services such as e-mail are down. Our website, Twitter and Facebook feeds are kept up to date with the latest information, and we encourage all clients to check these if they are experiencing problems.

Another method of communication we have for Managed Services clients is to push a notification directly to your computers. Our rule is to use this method sparingly, but in some circumstances we feel it is the best option. Many of you may have seen this alert:

Screenshot.jpg

The irony that the alert notification advised of an impact to our phone services whilst simultaneously advising you to call us on 0800 600 606 did not escape me. Unfortunately, the phone number is hard-coded into the application, and we are working on changing this for future messaging.

Why did Microsoft let such a critical service fail?

This is not the first time the Microsoft Identity Management platform has failed. In September last year, there was also a failure that impacted many businesses. On both occasions the cause has been identified as a software update process. Given the critical nature of the service, Microsoft is in the process of putting additional safeguards on updates to the platform. These safeguards had not yet reached the individual components affected during the recent outage; however, we are confident the changes will result in a more reliable service.

Why was the Yorb Phone System Down?

Yorb uses Microsoft Teams as its phone system. This lets us work seamlessly from anywhere, which was a massive advantage during the Covid-19 lockdowns. However, when Microsoft services go offline, the impact on us is doubled because we also lose access to our phone network. In terms of maturity, Teams is still relatively new in New Zealand. We have been working with our upstream providers to ensure an adequate Business Continuity Plan exists to minimise interruption during an outage, while keeping all the benefits we have come to rely on. There are still some technical challenges our providers are working through; we anticipate a more robust system over the coming months.

Our Reliance on Cloud Services

Following the outage yesterday, you may be asking yourself: is the risk of being so reliant on a single service worth it? Would an on-premises managed solution give me more reliability for services critical to my business?

While all businesses should continue to strengthen their business continuity processes, the rewards of a Cloud approach outweigh the risks in most situations. On the 3rd of March, the world became aware of a critical security vulnerability that was being actively used against more than 30,000 companies (see Zero Day attack article). On the evening of the 3rd, our team worked through the night to patch over 30 client servers, ensuring they were kept safe. Office 365 was not affected by this vulnerability, so if you were a purely Cloud-based business you remained safe with no intervention required.

Very few businesses in the world can match Microsoft’s investment in security, or the resources it puts into keeping the Cloud platform protected. The reality for most of us is that the security advantages of the Cloud alone outweigh the risks of unscheduled outages.

What happens now?

The rapid evolution of digital services, particularly cloud-native ones, presents new business challenges. We must be able to respond dynamically to new business opportunities, developments, and risks. This requires us to embrace new technologies while maintaining due caution, weighing your appetite for the benefits against the risk. Today’s agile world means that we should be prepared for the odd unexpected twist. As the events of the past month show, this is a delicate balance.

We welcome any feedback you may have on the recent events, and on how we can assist you in managing this balance.

 

Daniel.jpg  

Daniel Goymer

Technical Director of Yorb
