“After years of IT experts telling us that we ‘can’t break the internet’ by pressing the wrong button, it turns out we can do it by updating our settings.”
Phil Coughlin, The Guardian, Jun 11, 2021
On June 8, an internet blackout was precipitated by one customer updating their settings through a “valid configuration change”. In short order, 85% of the network of the tech infrastructure company Fastly began returning errors, and a global outage ensued. “The downed sites,” according to Brian Barrett of Wired, “shared no obvious theme or geography; the outages were global, and they hit everything from Reddit to Spotify to The New York Times.”
Nick Rockwell, the Senior Vice President of engineering and infrastructure at the company, outlined the incident in a blog post. “We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.” The bug had been introduced in a software deployment on May 12 “that could be triggered by a specific customer configuration under specific circumstances.”
Fastly’s role is important, since it, along with such entities as Akamai and Cloudflare, operates a content delivery network (CDN), a layer essential to the internet’s infrastructure and to the speed with which information is relayed. CDNs are physical networks of servers, distributed so as to minimise download times. They supply a service that enables websites, notably those attracting heavy traffic, to retain copies of their pages “closer” to their customers.
Angelique Medina, director of product marketing at network monitoring firm Cisco ThousandEyes, offers an explanation of that function. “It basically enables really high performance for content, whether that’s streaming video or a site or all the little images that pop up when you go to an ecommerce site.” Reuters similarly describes this as offering “a better experience for users, enabling pages to load quicker and sites to manage high volumes of page requests better, for example in a breaking news situation.”
The drawback of having such an intertwined system populated by so few providers is that any modest hiccup in the services conveyed via a CDN can result in a global blackout. This stands to reason: a beast such as Akamai has 340,000 servers on its platform, deployed in 4,100 locations across 130 countries. This problem might be rectified by having websites host their own content exclusively, but that, as Paul Haskell-Dowland points out, would slow web browsing and undermine that fetish cyber cognoscenti call the “experience”.
Such incidents have become recurring features of shock in the tech landscape. Initially, they generate a flash of discussion, but are quickly submerged by the banality of technological acceptance. Cloudflare itself experienced problems in 2019 with an outage that disrupted Soundcloud, Medium and Dropbox. The explanation given then was similar to that of Fastly: the outage had resulted from a “bad software” deployment that caused a “massive spike in CPU utilization” on the company’s network. “Once [the software was rolled back] the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.”
A certain degree of error on the part of CDN providers tends to be tolerated, even readily exonerated. In the week of June 12, global internet outages rose by 43%, to 481 in total. Of these, 317 took place in the US. As this was happening, the stock market was busily rewarding the very agents behind such outages. Fastly’s stock price rose through June while Akamai’s share price rallied after June 18. Such a centralised market tends to deliver riches while ignoring, as Geoff Huston of the Asia Pacific Network Information Centre observes, “a minor inconvenient truth about the less-than-solid foundations of the technology, and incidents that impact operations that continue to happen.”
And just to cause more ripples of excitement, Akamai became the second CDN provider to suffer an outage later in June, this time in one of its Prolexic DDoS mitigation services. As the company mentioned in a statement, “A routing table value used by this particular service was inadvertently exceeded. The effect was an unanticipated disruption of service.” Outages were subsequently felt across banking services, many located in Australia, a number of airlines and the Hong Kong Stock Exchange. The public relations departments across Akamai’s client base scrambled to dampen any panic.
Reactions from the CDN high priests to these disruptions are rehearsals of apology followed by businesslike solutions. They know they are the titans with few contenders. Rockwell’s response served to mask the more critical issues of CDN concentration. “Even though there were specific conditions that triggered this outage, we should have anticipated it.” But he emphasised the speed of detection and rectification. The disruption was detected within one minute, “then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95 per cent of our network was operating as normal.” The company, he promised, would “figure out why we didn’t detect the bug during our software quality assurance and testing processes.”
The irony of such outages is that they defy the spirit of decentralisation that was meant to underlie the web. As David Warburton of cybersecurity company F5 Labs rightly notes, the past decade has borne witness to “the unintentional centralisation of many core services through large cloud solution providers like infrastructure vendors and CDNs.” Economies of scale have prevailed and competition has been all but quashed. The “comparative shopping list is not exactly large,” remarks the ever valuable Huston, if you wish to choose a CDN that optimises “service delivery yet leaves the customer in control of such critical aspects of the security and integrity of the service (such as private keys)”. Till that problem is addressed, the disruptive outage will become the tolerated manifestation of an unacceptably centralised market. Scoop