Will the Internet’s Increasing Speed, Size, and Complexity Lead to Catastrophe?
VGO associate Ted Belding talks about what he's excited about now. You can read his blog at www.beldingconsulting.com/blog.
There was a thought-provoking piece in the New York Times a while back:
As the Arista founders say, the promise of having access to mammoth amounts of data instantly, anywhere, is matched by the threat of catastrophe. People are creating more data and moving it ever faster on computer networks. The fast networks allow people to pour much more of civilization online, including not just Facebook posts and every book ever written, but all music, live video calls, and most of the information technology behind modern business, into a worldwide “cloud” of data centers. The networks are designed so it will always be available, via phone, tablet, personal computer or an increasing array of connected devices.
Statistics dictate that the vastly greater number of transactions among computers in a world 100 times faster than today will lead to a greater number of unpredictable accidents, with less time in between them. Already, Amazon’s cloud for businesses failed for several hours in April, when normal computer routines faltered and the system overloaded. Google’s cloud of e-mail and document collaboration software has been interrupted several times.
“We think of the Internet as always there. Just because we’ve become dependent on it, that doesn’t mean it’s true,” Mr. Cheriton says. Mr. Bechtolsheim says that because of the Internet’s complexity, the global network is impossible to design without bugs. Very dangerous bugs, as they describe them, capable of halting commerce, destroying financial information or enabling hostile attacks by foreign powers.
The Internet itself was designed to be decentralized, and therefore robust to failure. The problem often seems to be that centralized systems have been built on top of it over the years: the Domain Name System (DNS), for example, and services run by a single entity such as Google, Facebook, or Amazon. It’s incredibly annoying when your DNS server or Gmail goes down: the global Internet as a whole may still be working, but that’s not much of a consolation when you yourself can’t access the Web or your email! (This is a big reason why Google really needs to provide more offline editing and storage features for its apps such as Google Docs, Calendar, and Gmail. Apple’s iCloud is much more robust to failure in this regard.)
From a complex systems standpoint, there are two separate issues to consider here, each of them important:
- How do we keep the Internet decentralized, and prevent it from becoming so dependent on a single service, such as DNS, Amazon Web Services, or Facebook, that a failure of that service would be catastrophic?
- The issue raised in the NYT article: Given that these services are becoming complex enough and are used so much that bugs and failures are inevitable, how do we engineer them so that they’re robust to failure?
Google itself has done a lot of work on the second issue, pioneering robust computation techniques such as MapReduce and the use of large, distributed networks of unreliable but replaceable hardware, so that a computation can easily be restarted on a new server if it fails. Arista also seems to be tackling the problem. Again from the NYT article:
By building a new way to run networks in the cloud era, he says, “we have a path to having software that is more sophisticated, can be self-defending, and is able to detect more problems, quicker.”
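The restart-on-failure idea mentioned above, where a computation is simply rerun on a replacement server, can be sketched in a few lines of Python. This is a minimal illustration, not Google's or Arista's actual design: the `Worker` class, the `run_with_restart` function, and the word-count task are all hypothetical names invented for this example, and the sketch assumes the task has no side effects, so rerunning it is safe.

```python
class WorkerFailed(Exception):
    """Raised when a (simulated) worker dies before finishing a task."""


class Worker:
    """Hypothetical stand-in for one unreliable but replaceable server."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task, chunk):
        if not self.healthy:
            raise WorkerFailed(self.name)
        return task(chunk)


def run_with_restart(task, chunk, pool):
    """Try the task on each worker in turn, restarting it after a failure.

    Returns the result plus the names of the workers that failed.
    Assumes the task is side-effect free, so a rerun is harmless.
    """
    failed = []
    for worker in pool:
        try:
            return worker.run(task, chunk), failed
        except WorkerFailed:
            failed.append(worker.name)
    raise RuntimeError(f"no healthy worker left; failed: {failed}")


def word_count(lines):
    """Example task: count the words in a chunk of text."""
    return sum(len(line.split()) for line in lines)


pool = [Worker("w1", healthy=False), Worker("w2", healthy=False), Worker("w3")]
result, failed = run_with_restart(word_count, ["the internet is down", "again"], pool)
print(result, failed)
```

The caller never needs to know which machine did the work; two of the three servers can fail and the computation still completes.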
Another possible approach was pointed out by John Ousterhout in a recent article on how to turn scale from an enemy to a friend, based on an analysis of Microsoft’s crash reporting system for MS Windows:
I have noticed four common steps by which scale can be converted from enemy to friend. The first and most important step is automation: humans must be removed from the most important and common processes. … The second step in capitalizing on scale is to maintain records; this is usually easy once the processes have been automated. … The third step is to use the data to make better decisions. … The fourth and final step is that processes change in fundamental ways to capitalize on the level of automation and data analysis.
By collecting massive amounts of data on system failures, bugs, and computer intrusions, and by using automated systems to analyze this data and to actively and adaptively probe the system for points of failure, we may be able to use the very size and complexity of these systems to make them more robust.
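As a rough sketch of the "maintain records, then use the data" steps, the following Python groups failure reports by signature and ranks the most frequent ones. The report fields (`component`, `error`) and the sample data are assumptions made up for illustration; they are not taken from Microsoft's crash reporting system.

```python
from collections import Counter


def rank_failures(reports, top=3):
    """Count reports per (component, error) signature, most common first."""
    signatures = Counter((r["component"], r["error"]) for r in reports)
    return signatures.most_common(top)


# Hypothetical failure reports collected automatically from many machines.
reports = [
    {"component": "dns", "error": "timeout"},
    {"component": "mail", "error": "disk_full"},
    {"component": "dns", "error": "timeout"},
    {"component": "router", "error": "switch_failure"},
    {"component": "dns", "error": "timeout"},
    {"component": "mail", "error": "disk_full"},
]

ranked = rank_failures(reports, top=2)
print(ranked)
```

With only a handful of reports this tells you little, but the same counting becomes more useful as the volume grows: the larger the system, the more clearly the most damaging failure modes stand out.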
Since posting this, I’ve run across two more examples of fragility in centralized Internet services and infrastructure:
In October, the worldwide private BlackBerry network suffered three days of service outages and delays, sparked by the failure of a single core switch:
[T]he problem started when a core switch, a high-volume variation of a router, failed within the unique network that RIM operates to manage BlackBerry data. That immediately caused a shutdown in Europe, Africa and the Middle East.
This clearly demonstrates the risks inherent in relying on single service providers with their own, private infrastructure.
Another potential issue is that large server farms are increasingly replacing routing via TCP/IP and Ethernet (which are resilient but may not be optimally efficient) with software-defined networking (SDN), where routing rules are specified centrally and broadcast to each router.
This makes network routing more efficient under normal circumstances, but again increases centralization and potentially reduces the robustness of Internet services, especially if the central controller holding the routing rules were to go down:
In a software-defined network, a central controller maintains all the rules for the network and disseminates the appropriate instructions for each router or switch. That centralized controller breaks a fundamental precept of TCP/IP, which was designed not to rely on a central device that, if disconnected, could cause entire networks to go down. TCP/IP’s design has its roots in a day when hardware failures were much more common, and in fact part of the U.S. military’s Defense Advanced Research Projects Agency’s intent in sponsoring the original research behind the Internet was to develop Cold War-era systems that could continue to operate even when whole chunks of the network had been vaporized by a nuclear bomb.
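The single point of failure described above can be illustrated with a toy model. This is a deliberately simplified sketch, not a real SDN protocol or API: the `Controller` and `Switch` classes are hypothetical, the switch pulls rules on demand rather than having them pushed to it, and the local cache is an assumed mitigation added to show what still works once the controller is gone.

```python
class ControllerDown(Exception):
    """Raised when the (simulated) central controller is unreachable."""


class Controller:
    """Hypothetical central controller holding every routing rule."""

    def __init__(self, rules):
        self.rules = dict(rules)
        self.up = True

    def lookup(self, destination):
        if not self.up:
            raise ControllerDown()
        return self.rules[destination]


class Switch:
    """Hypothetical switch that asks the controller where to send traffic."""

    def __init__(self, controller):
        self.controller = controller
        self.cache = {}  # last-known rules, kept locally

    def next_hop(self, destination):
        try:
            hop = self.controller.lookup(destination)
            self.cache[destination] = hop
            return hop
        except ControllerDown:
            # Fall back to the last rule we saw; None means "cannot forward".
            return self.cache.get(destination)


controller = Controller({"10.0.1.0/24": "port1", "10.0.2.0/24": "port2"})
switch = Switch(controller)

before = switch.next_hop("10.0.1.0/24")   # controller is up
controller.up = False                     # the central controller fails
cached = switch.next_hop("10.0.1.0/24")   # served from the local cache
unknown = switch.next_hop("10.0.2.0/24")  # never seen before the failure
print(before, cached, unknown)
```

Traffic to destinations the switch has already seen keeps flowing, but anything new is stranded until the controller comes back, which is exactly the dependency that TCP/IP's original design avoided.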
Both examples lend weight to the points raised in the original post above.