What's New


Will the Internet’s Increasing Speed, Size, and Complexity Lead to Catastrophe?

VGO associate Ted Belding talks about what he's excited about now. You can read his blog at www.beldingconsulting.com/blog.


There was a thought-provoking piece in the New York Times a while back:

As the Arista founders say, the promise of having access to mammoth amounts of data instantly, anywhere, is matched by the threat of catastrophe. People are creating more data and moving it ever faster on computer networks. The fast networks allow people to pour much more of civilization online, including not just Facebook posts and every book ever written, but all music, live video calls, and most of the information technology behind modern business, into a worldwide “cloud” of data centers. The networks are designed so it will always be available, via phone, tablet, personal computer or an increasing array of connected devices.

Statistics dictate that the vastly greater number of transactions among computers in a world 100 times faster than today will lead to a greater number of unpredictable accidents, with less time in between them. Already, Amazon’s cloud for businesses failed for several hours in April, when normal computer routines faltered and the system overloaded. Google’s cloud of e-mail and document collaboration software has been interrupted several times.

“We think of the Internet as always there. Just because we’ve become dependent on it, that doesn’t mean it’s true,” Mr. Cheriton says. Mr. Bechtolsheim says that because of the Internet’s complexity, the global network is impossible to design without bugs. Very dangerous bugs, as they describe them, capable of halting commerce, destroying financial information or enabling hostile attacks by foreign powers.

It’s commonly known that the Internet itself was designed to be robust to failure because of its decentralized nature. The problem often seems to be that centralized systems have been built on top of it over the years: the Domain Name System (DNS), and services run by a single entity such as Google, Facebook, or Amazon. It’s incredibly annoying when your DNS server or Gmail goes down: the global Internet as a whole may still be working, but that’s not much consolation when you yourself can’t reach the Web or your email! (This is a big reason why Google really needs to provide more offline editing and storage features for apps such as Google Docs, Calendar, and Gmail. Apple’s iCloud is much more robust to this kind of failure.)
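As a small illustration of working around one of these single points of failure, here is a minimal sketch of client-side DNS fallback in Python, written against the third-party dnspython library; the resolver addresses and timeout below are example values I picked for illustration, not a recommendation.

    # Minimal sketch: try several DNS resolvers in turn instead of relying on one.
    # Requires the third-party dnspython package. Resolver IPs are examples only.
    import dns.resolver
    import dns.exception

    FALLBACK_RESOLVERS = ["8.8.8.8", "1.1.1.1", "9.9.9.9"]  # example public resolvers

    def resolve_with_fallback(hostname, record_type="A"):
        """Query each resolver in turn; return the first successful answer."""
        for nameserver in FALLBACK_RESOLVERS:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [nameserver]
            try:
                answer = resolver.resolve(hostname, record_type, lifetime=3.0)
                return [rr.to_text() for rr in answer]
            except (dns.exception.Timeout, dns.resolver.NXDOMAIN,
                    dns.resolver.NoNameservers):
                continue  # this resolver failed or had no answer; try the next
        raise RuntimeError(f"all resolvers failed for {hostname}")

    print(resolve_with_fallback("example.com"))

Of course, this only papers over one layer of the problem: if the authoritative servers for a domain, or the service behind it, are down, no amount of client-side fallback helps.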

From a complex systems standpoint, there are two separate issues to consider here, each of them important:

  1. How do we keep the Internet decentralized, and prevent it from becoming so dependent on a single service such as DNS, Amazon Web Services, or Facebook, that a failure of that service would be catastrophic?
  2. The issue raised in the NYT article: Given that these services are becoming complex enough and are used so much that bugs and failures are inevitable, how do we engineer them so that they’re robust to failure?

Google itself has done a lot of work on the second issue, pioneering robust computation techniques such as MapReduce and the use of large, distributed networks of unreliable but replaceable hardware, so that a computation can easily be restarted on a new server if it fails. Arista also seems to be tackling the problem. Again from the NYT article:

By building a new way to run networks in the cloud era, he says, “we have a path to having software that is more sophisticated, can be self-defending, and is able to detect more problems, quicker.”
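To make the restart-on-failure idea concrete, here is a toy sketch in Python (emphatically not Google’s actual infrastructure) of the pattern MapReduce-style systems rely on: because each map task is idempotent, a task that dies with its machine can simply be re-run on another worker, and the final reduce step never notices. The worker names and failure rate are invented for illustration.

    # Toy sketch of "restart the task on another worker". Each map task is
    # idempotent, so a failed attempt can be rescheduled on a different machine.
    import random

    WORKERS = ["worker-1", "worker-2", "worker-3"]  # hypothetical machine names
    FAILURE_RATE = 0.3                              # assumed, for illustration

    def run_map_task(worker, chunk):
        """Count words in one input chunk; randomly fail to simulate bad hardware."""
        if random.random() < FAILURE_RATE:
            raise RuntimeError(f"{worker} crashed")
        counts = {}
        for word in chunk.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    def schedule_with_retries(chunk, max_attempts=5):
        """Keep retrying the same idempotent task on (possibly) different workers."""
        for attempt in range(max_attempts):
            worker = random.choice(WORKERS)
            try:
                return run_map_task(worker, chunk)
            except RuntimeError as err:
                print(f"attempt {attempt + 1}: {err}; rescheduling")
        raise RuntimeError("task failed on every attempt")

    # Reduce step: merge the per-chunk counts into one result.
    chunks = ["the internet is robust", "the cloud is not the internet"]
    total = {}
    for partial in (schedule_with_retries(c) for c in chunks):
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    print(total)

The point is that robustness comes from the scheduler and the idempotence of the tasks, not from any individual machine being reliable.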

Another possible approach was pointed out by John Ousterhout in a recent article on how to turn scale from an enemy into a friend, based on an analysis of Microsoft’s crash-reporting system for Windows:

I have noticed four common steps by which scale can be converted from enemy to friend. The first and most important step is automation: humans must be removed from the most important and common processes. … The second step in capitalizing on scale is to maintain records; this is usually easy once the processes have been automated. … The third step is to use the data to make better decisions. … The fourth and final step is that processes change in fundamental ways to capitalize on the level of automation and data analysis.

By collecting massive amounts of data on system failures, bugs, and computer intrusions, and by using automated systems to analyze this data and to actively and adaptively probe the system for points of failure, we may be able to use the very size and complexity of these systems to make them more robust.
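In code, Ousterhout’s four steps reduce to a simple loop over failure data. The Python sketch below uses invented report records and component names purely for illustration: automated reports are aggregated, failure signatures are ranked by frequency, and probing effort (think automated fault injection or stress testing) is then weighted toward the components that fail most often.

    # Sketch of the "scale as a friend" loop: aggregate automated failure
    # reports, rank failure signatures, and probe the worst offenders hardest.
    # Report contents and component names are invented for illustration.
    import collections
    import random

    # Steps 1-2: automation + record-keeping -- reports collected automatically.
    reports = [
        {"component": "dns-frontend", "signature": "timeout in lookup()"},
        {"component": "storage",      "signature": "checksum mismatch"},
        {"component": "dns-frontend", "signature": "timeout in lookup()"},
        {"component": "auth",         "signature": "null session token"},
        {"component": "dns-frontend", "signature": "timeout in lookup()"},
    ]

    # Step 3: use the data -- rank failure signatures by how often they occur.
    counts = collections.Counter((r["component"], r["signature"]) for r in reports)
    for (component, signature), n in counts.most_common():
        print(f"{n:3d}x  {component}: {signature}")

    # Step 4: change the process -- spend probing effort in proportion to
    # observed failures, so the noisiest components get tested the hardest.
    component_weight = collections.Counter(r["component"] for r in reports)
    components = list(component_weight)
    weights = [component_weight[c] for c in components]

    def probe(component):
        """Stand-in for an automated fault-injection or stress test."""
        print(f"probing {component} ...")

    for _ in range(5):
        probe(random.choices(components, weights=weights, k=1)[0])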

UPDATE:

Since posting this, I’ve run across two more examples of fragility in centralized Internet services and infrastructure:

In October, the worldwide, private BlackBerry network suffered three days of service outages and delays, sparked by a problem with a single router:

[T]he problem started when a core switch, a high-volume variation of a router, failed within the unique network that RIM operates to manage BlackBerry data. That immediately caused a shutdown in Europe, Africa and the Middle East.

This clearly demonstrates the risks inherent in relying on single service providers with their own, private infrastructure.

Another potential issue is that large server farms are increasingly replacing routing via TCP/IP and Ethernet (which are resilient but may not be optimally efficient) with software-defined networking (SDN), where routing rules are specified centrally and distributed to each router.

This makes network routing more efficient under normal circumstances, but again increases centralization and potentially reduces the robustness of Internet services, especially if the central controller holding the routing rules were to go down:

In a software-defined network, a central controller maintains all the rules for the network and disseminates the appropriate instructions for each router or switch. That centralized controller breaks a fundamental precept of TCP/IP, which was designed not to rely on a central device that, if disconnected, could cause entire networks to go down. TCP/IP’s design has its roots in a day when hardware failures were much more common, and in fact part of the U.S. military’s Defense Advanced Research Projects Agency’s intent in sponsoring the original research behind the Internet was to develop Cold War-era systems that could continue to operate even when whole chunks of the network had been vaporized by a nuclear bomb.
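The architectural trade-off is easy to see in miniature. The toy Python sketch below is not any real SDN controller’s API; it just shows a central controller computing flow tables for a set of switches, and what the switches are left with when that controller becomes unreachable.

    # Toy sketch of the SDN control-plane split: a central controller computes
    # the flow table for every switch; switches just apply what they were given.
    class Controller:
        def __init__(self):
            self.alive = True

        def compute_rules(self, switch_name):
            """Centralized policy: in a real SDN this is the routing logic."""
            if not self.alive:
                raise ConnectionError("controller unreachable")
            return {"10.0.0.0/24": "port-1", "*": "port-2"}  # example rules

    class Switch:
        def __init__(self, name, controller):
            self.name = name
            self.controller = controller
            self.flow_table = {}

        def sync(self):
            """Pull fresh rules; on failure, keep forwarding with stale rules."""
            try:
                self.flow_table = self.controller.compute_rules(self.name)
                print(f"{self.name}: rules updated")
            except ConnectionError:
                print(f"{self.name}: controller down, using stale flow table")

    controller = Controller()
    switches = [Switch(f"sw{i}", controller) for i in range(3)]
    for sw in switches:
        sw.sync()            # normal operation: everyone gets the central policy

    controller.alive = False # the single point of failure goes away
    for sw in switches:
        sw.sync()            # switches can only fall back on whatever they had

Real deployments typically mitigate this with replicated controllers, but the control plane is still logically centralized.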

These examples both lend weight to the points raised in the original post, above.



Advanced Computation and Archaeology


Visualizing Ancient Caribou Hunts with 21st-Century Techniques

VGO associate Ted Belding talks about what he's excited about now. You can read his blog at www.beldingconsulting.com/blog.


I saw a great talk earlier this year at the University of Michigan. Prof. Robert Reynolds (Wayne State University CS Dept) has built a computer model of caribou migration from around 10,000 BC, when the animals moved between Michigan and Ontario along a land bridge now submerged beneath Lake Huron. Reynolds worked in collaboration with archaeologist Prof. John O’Shea (University of Michigan) and others. The purpose of the model is to help the archaeologists predict where ancient hunters placed hunting blinds along the land bridge; O’Shea and his underwater archaeology team will then use submersibles and divers to check whether those structures actually exist. Using the model’s predictions before exploring reduces guesswork and saves the project time and money.

Reynolds combined an agent-based model of individual caribou movement with a cultural algorithm (an optimization method based on genetic algorithms) to simulate how the herds moved. He and his team visualized the herds’ movement across a virtual reconstruction of the land bridge using 3D video-game technology. He plans to extend the model to incorporate more varied plant and animal life, add more individual variation within the herds, and model the activity of human hunters as well.
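To give a flavor of how the two pieces fit together, here is a highly simplified Python sketch. It is not Reynolds’s model: the land-bridge geometry, herd behavior, and parameters are all invented. Simulated caribou paths funnel toward a narrow crossing, and a cultural algorithm, a genetic-algorithm-style search whose “belief space” records where the best candidates cluster, looks for the blind location that would intercept the most animals.

    # Highly simplified sketch: a 1-D "land bridge" where simulated caribou
    # cross near a bottleneck, and a cultural algorithm searches for the blind
    # location that intercepts the most crossings. All numbers are invented.
    import random

    random.seed(1)
    BRIDGE_LENGTH = 100.0
    TRUE_BOTTLENECK = 62.0  # invented: where the terrain funnels the herd

    def simulate_caribou_paths(n=200):
        """Agent-based stand-in: each animal crosses near the bottleneck, with noise."""
        return [random.gauss(TRUE_BOTTLENECK, 8.0) for _ in range(n)]

    PATHS = simulate_caribou_paths()

    def fitness(blind_position):
        """How many simulated crossings pass within sight of a blind at this spot?"""
        return sum(1 for x in PATHS if abs(x - blind_position) < 5.0)

    # Cultural algorithm: a population of candidate blind locations, plus a
    # belief space (here, a normative interval learned from the best candidates).
    population = [random.uniform(0, BRIDGE_LENGTH) for _ in range(20)]
    belief = (0.0, BRIDGE_LENGTH)

    for generation in range(30):
        population.sort(key=fitness, reverse=True)
        elite = population[:5]
        # Acceptance + update: the belief space tightens around the best candidates.
        belief = (min(elite), max(elite))
        # Influence: new candidates are drawn mostly from the believed-good region.
        population = elite + [
            random.uniform(*belief) if random.random() < 0.8
            else random.uniform(0, BRIDGE_LENGTH)
            for _ in range(15)
        ]

    best = max(population, key=fitness)
    print(f"predicted blind location: {best:.1f} (intercepts {fitness(best)} paths)")

The real project’s agent-based model and terrain are far richer, but the division of labor is the same: the simulation generates plausible herd movement, and the optimization searches that simulated landscape for promising locations to investigate.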

I’ve been working with this kind of computation for years, and it still surprises me how creatively people adapt these techniques to new and unexpected purposes.

This project is a great example of how computer simulation and optimization can be combined to help answer difficult real-world questions. In archaeology, evidence from the past is rare and difficult to find, and these techniques provide suggestions on where to look, saving time, effort, and money. The computer simulation can then be updated based on the actual evidence that is found. The same methodology can be used in other areas of science, engineering, and business, to answer questions and make predictions in fields such as computer security, logistics, finance, and weather forecasting.

More information:

Detroit Free Press article
Article in Michigan Daily
Article in UM Anthropology Newsletter (pp. 10-12)
O’Shea’s PNAS paper
Robert Reynolds’s website