Glitch in Cloudflare broke the internet
2019-07-05

Cloudflare is a content delivery network that services 16 million websites across the world. Among other security- and performance-related services, Cloudflare's network distributes website data to servers in nearly every major city, allowing website users in locations across the world to download this information at a lightning-fast pace, especially during peak traffic times. Each day, 1 billion unique IP addresses access websites serviced by Cloudflare, and many of those sites were affected in both of the network's recent outages.
Cloudflare enjoys a hefty market share in the cloud infrastructure niche, which, in 2018, exceeded $80 billion. According to Datanyze, Cloudflare holds the leading position within the industry at a 34.5% market share, with Amazon CloudFront a close second at 28.8%. Google Cloud and Microsoft Azure are other popular solutions.
Matthew Prince, CEO of Cloudflare, was quick to debunk rumors that the company had been attacked and come forward with the truth: the outage was triggered by an internal Cloudflare error, not an external issue. The cause was a massive spike in CPU utilization across the network; a bug in the Cloudflare firewall consumed a large portion of the system's resources that would normally be allocated toward content delivery. Prince apologized, recognizing the impact an outage can have on businesses worldwide and saying he takes personal responsibility for it.
A week earlier, Cloudflare clients faced a notable outage when Verizon unintentionally rerouted IP packets after it wrongly accepted a network misconfiguration (a BGP route leak) from an internet service provider in Pennsylvania, USA. This time, the outage was the result of a single misconfigured rule inside the Cloudflare Web Application Firewall (WAF), which drove up CPU usage on Cloudflare's network and then cascaded across its global infrastructure.
The incident began at 13:42 UTC on July 2 and lasted for about 30 minutes. Visitors to Cloudflare-proxied domains received 502 Bad Gateway errors due to the global outage across Cloudflare's network, affecting thousands of prominent websites, including some big tech brands. In a separate incident around the same time, Facebook and its group companies, including WhatsApp and Instagram, suffered outages relating to image display for the majority of a day; that issue was attributed to varying timestamp arguments being embedded in the same image URLs fed to the social media giant's CDN.
The misconfigured rule was created during a routine deployment of new Cloudflare WAF Managed Rules. Although the organization has automated test suites and a methodology for deploying configurations progressively to avert incidents, the new WAF rules were deployed globally in one go, causing the outage. The new rules were intended to improve the blocking of inline JavaScript used in malicious attacks.
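For readers unfamiliar with progressive deployment, the idea is to push a new rule to a small slice of the fleet first and watch health metrics before widening the rollout. The following is a minimal Python sketch of that pattern under assumed names; the stage percentages, error-budget threshold, and helper functions are illustrative placeholders, not Cloudflare's actual tooling.

    import time

    # Hypothetical staged rollout: push a rule to progressively larger fractions
    # of the fleet, checking an error-rate signal between stages.
    STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of servers per stage (assumed)
    ERROR_BUDGET = 0.02                  # abort if more than 2% of requests fail

    def deploy_to_fraction(rule_id: str, fraction: float) -> None:
        print(f"deploying {rule_id} to {fraction:.0%} of the fleet")
        # ...push configuration to the selected subset here...

    def observed_error_rate(rule_id: str) -> float:
        # Placeholder: in practice this would query 5xx-rate or CPU metrics.
        return 0.0

    def rollback(rule_id: str) -> None:
        print(f"rolling back {rule_id} everywhere")

    def progressive_rollout(rule_id: str) -> bool:
        for fraction in STAGES:
            deploy_to_fraction(rule_id, fraction)
            time.sleep(1)               # stand-in for a real soak period
            if observed_error_rate(rule_id) > ERROR_BUDGET:
                rollback(rule_id)
                return False
        return True

    if __name__ == "__main__":
        progressive_rollout("waf-rule-example")

Deploying to all regions at once, as happened here, skips every intermediate checkpoint at which a misbehaving rule could have been caught.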
Cloudflare guarantees 100% uptime and 100% delivery of content in its signed SLA, so the recent event put the company in breach of that SLA, even though the failure was unintentional and could have been prevented had the rollout of the new WAF Managed Rules been planned better. There was also speculation that the outage was caused by a DDoS attack from China, Iran, North Korea, or elsewhere, which CTO John Graham-Cumming clarified in a tweet was not true.
The rules were first deployed in a simulated mode, in which issues are identified and logged against the new rules but no customer traffic is actually blocked. This lets Cloudflare measure false-positive rates and ensure the new rules will not cause problems when they reach full production. But things didn't go according to plan. "Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%," wrote John Graham-Cumming, CTO of Cloudflare.
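To see how a single regular expression can exhaust CPU, consider the short Python illustration below. It is not the actual rule Cloudflare deployed; it simply uses a pattern with the same failure mode, where a backtracking regex engine must try every way of splitting the input between two leading wildcards before it can report a failed match, so the cost grows roughly quadratically with input length.

    import re
    import time

    # Illustrative pattern only (not Cloudflare's real WAF rule): with a
    # backtracking engine such as Python's re module, ".*.*=.*" forces the
    # engine to retry every split point between the two ".*" wildcards when
    # the input contains no "=", so a failed match costs roughly O(n^2) work.
    pattern = re.compile(r".*.*=.*")

    for n in (2_000, 4_000, 8_000):
        payload = "x" * n                        # no "=", so the match must fail
        start = time.perf_counter()
        pattern.match(payload)
        print(f"input length {n:>5}: {time.perf_counter() - start:.3f}s")

Running the loop shows the matching time climbing much faster than the input size, which is the same shape of problem that drove Cloudflare's servers to 100% CPU; non-backtracking regex engines avoid this class of blow-up.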
According to Cloudflare, the CPU exhaustion it witnessed was unprecedented; the company had never experienced global CPU exhaustion before. Once the real cause was identified, Cloudflare disabled the new WAF Managed Rules, which immediately brought CPU back to normal levels and restored regular web traffic.