Cloudflare suffered a significant outage on June 21, 2022, affecting 19 data centers that are responsible for 50% of its global traffic.
Cloudflare revealed that the outage was the result of a network configuration change.
This change was part of a long-term project the company is undertaking, and all data centers were back online in under an hour. During the outage, many Cloudflare users were unable to access websites they rely on, while others were completely unaffected and experienced normal service.
Cloudflare currently handles more than 10% of all global HTTP and HTTPS traffic, so a substantial number of website users were affected by the outage. According to Downdetector, the affected sites included big-name brands such as Peloton, Shopify, Fitbit and Discord.
The company assured users that the outage was not the result of a malicious attack and was caused entirely by its own internal error.
What caused the outage?
For 18 months, Cloudflare has been working on upgrading some of the locations that handle the largest share of its global traffic. It is converting these data centers to an improved, more flexible architecture that Cloudflare refers to as “Multi-Colo PoP”, or MCP.
Which data centers were affected?
The data centers that experienced the outage were:
- Amsterdam
- Atlanta
- Ashburn
- Chicago
- Frankfurt
- London
- Los Angeles
- Madrid
- Manchester
- Miami
- Milan
- Mumbai
- Newark
- Osaka
- São Paulo
- San Jose
- Singapore
- Sydney
- Tokyo
What were the updates that caused the outage?
Cloudflare was updating the 19 locations with a new architecture that adds an extra layer of routing. This layer creates a mesh of connections intended to make the internal network more resilient. It will also allow more flexibility in disabling parts of the internal network for repair and maintenance without disrupting user traffic in the future.
The outage resulted from the withdrawal of crucial prefixes during a re-ordering of terms in the policy that controls which prefixes are advertised. Because those prefixes were no longer advertised, engineers had extra difficulty reaching the affected data centers to correct the issue. They used backup procedures to gain access to the centers, regain control and rectify the problem in under an hour.
This is not the first substantial outage Cloudflare has experienced in recent memory; similar problems occurred in July and August of 2020.
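To illustrate why the order of terms matters, here is a minimal Python sketch of a first-match policy, in which evaluation stops at the first term that matches a prefix. It is not Cloudflare's configuration: the term names, the `SITE_LOCAL` range and the example prefixes are hypothetical, and the address ranges are reserved documentation ranges. The sketch only shows that moving a catch-all "reject" term ahead of a more specific "accept" term causes the specific prefixes to stop being advertised.

```python
from ipaddress import ip_network

# Hypothetical range standing in for the prefixes a site must keep advertising.
SITE_LOCAL = ip_network("198.51.100.0/24")


def is_site_local(prefix):
    """True if the prefix falls inside the hypothetical site-local range."""
    return prefix.subnet_of(SITE_LOCAL)


def anything(prefix):
    """Catch-all matcher: matches every prefix."""
    return True


def evaluate(policy, prefix):
    """Return the action of the first term that matches (first-match wins)."""
    for _name, matches, action in policy:
        if matches(prefix):
            return action
    return "reject"  # nothing matched: treat as rejected


def advertised(policy, prefixes):
    """List the prefixes the policy would still advertise."""
    return [str(p) for p in prefixes if evaluate(policy, p) == "accept"]


# Intended order: the specific accept term comes before the catch-all reject.
GOOD_ORDER = [
    ("advertise-site-local", is_site_local, "accept"),
    ("reject-the-rest", anything, "reject"),
]

# Same two terms, re-ordered: the catch-all now shadows the accept term.
BAD_ORDER = [GOOD_ORDER[1], GOOD_ORDER[0]]

prefixes = [ip_network("198.51.100.0/25"), ip_network("203.0.113.0/24")]

print(advertised(GOOD_ORDER, prefixes))
print(advertised(BAD_ORDER, prefixes))
```

With the intended order, the site-local prefix is still advertised. With the terms swapped, nothing is advertised, which is the kind of unintended withdrawal described above.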
How will Cloudflare ensure this doesn’t happen in the future?
Following the outage, Cloudflare said it had identified several areas for improvement to ensure this problem does not occur again. These include:
- Improving procedures and automation so that they include MCP-specific features that prevent unintended consequences of architecture updates.
- Redesigning the routing advertisements to prevent incorrect ordering from causing prefixes to be withdrawn in the future.
- Taking a staggered approach to future rollouts of network configuration updates, with an automated “commit-confirm” rollback for when problems are detected.
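The staggered rollout with automatic rollback described in the last point can be sketched roughly as follows. This is an illustrative Python sketch, not Cloudflare's tooling: `staged_rollout`, the callbacks and the site names are all hypothetical, and a real commit-confirm mechanism typically reverts a change automatically unless it is confirmed within a time window, which is simplified here to an immediate health check.

```python
def staged_rollout(sites, apply_change, rollback, is_healthy, stage_size=1):
    """Apply a change stage by stage.

    A stage is confirmed only if every site in it passes the health check.
    Otherwise the whole stage is rolled back and the rollout stops.
    Returns (confirmed_sites, rolled_back_sites).
    """
    confirmed = []
    for start in range(0, len(sites), stage_size):
        stage = sites[start:start + stage_size]
        for site in stage:
            apply_change(site)
        if not all(is_healthy(site) for site in stage):
            for site in stage:
                rollback(site)
            return confirmed, stage  # stop before touching later stages
        confirmed.extend(stage)
    return confirmed, []


# Hypothetical demo: four sites, and the change breaks "site-c".
state = {}
sites = ["site-a", "site-b", "site-c", "site-d"]


def apply_change(site):
    state[site] = "new"


def rollback(site):
    state[site] = "old"


def is_healthy(site):
    return site != "site-c"


confirmed, rolled_back = staged_rollout(
    sites, apply_change, rollback, is_healthy, stage_size=2
)
print(confirmed, rolled_back, state)
```

In this demo the first stage is confirmed, while the second stage is rolled back as soon as one of its sites fails the check, so a bad change never reaches every location at once.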
Cloudflare apologized to the customers affected and provided assurance that the improvements and procedural changes it is making will ensure that the incident does not happen again.