Rice computer scientist Eugene Ng and his team said their solution will keep data on the fast track when failures inevitably happen.
Ng introduced ShareBackup, a strategy that would allow shared backup switches in data centers to take over network traffic within a fraction of a second of a software or hardware switch failure.
He will present a peer-reviewed paper on the work at the SIGCOMM 2018 conference in Budapest, Hungary. The paper is online and available for download.
Ng said the idea would solve a common annoyance among data professionals, scientists and everyone who relies on a network to deliver results day in and day out.
"A data network consists of servers and network switches," said Ng, a professor of computer science and electrical and computer engineering. "Switches move data packets to where they need to go. But things fail, especially in large-scale data centers with thousands of pieces of hardware."
The usual response to a failed switch is to shunt the flow of data to another line. "Generally, the network has multiple paths for connecting servers so, just like if there's a closure on the highway, we'd drive around it. This is a conventional, natural approach that makes a lot of sense: You reroute around the failure to get where you need to go."
But sometimes that other road is congested and everything slows down. "Data centers aren't the internet; they're not about people surfing websites," Ng said. "They're about supporting data-intensive applications like data mining or machine learning. And a lot of these applications have stringent performance deadlines, so blindly rerouting traffic could be the wrong thing to do in a data center."
Rather than the expensive option of installing redundant switches throughout a network, the Ng lab's strategy would put fast switches and software in strategic locations that could pick up the traffic from a failed switch in under a millisecond. Once the failed switch is repaired, the team's software frees the backup switch to handle another failure.
The switchover is fast enough that most users would never know that part of the system had failed: the failure-recovery time is 0.73 milliseconds, including latency from hardware and control systems.
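The assign-and-release cycle described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the team's software: the class name `BackupPool`, its methods and the switch names are all hypothetical, and the sketch models only the bookkeeping, not the hardware that redirects traffic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BackupPool:
    """Hypothetical pool of backup switches shared by a group of switches."""
    free: List[str]                                         # idle backup switches
    active: Dict[str, str] = field(default_factory=dict)    # failed switch -> backup

    def fail_over(self, failed_switch: str) -> Optional[str]:
        """Assign a free backup to a failed switch; None if the pool is empty."""
        if failed_switch in self.active:
            return self.active[failed_switch]   # already covered
        if not self.free:
            return None                         # more failures than backups
        backup = self.free.pop()
        self.active[failed_switch] = backup
        return backup

    def repair(self, failed_switch: str) -> None:
        """Return the backup to the pool once the failed switch is fixed."""
        backup = self.active.pop(failed_switch, None)
        if backup is not None:
            self.free.append(backup)


pool = BackupPool(free=["backup-1"])
first = pool.fail_over("switch-A")    # backup-1 takes over for switch-A
second = pool.fail_over("switch-B")   # no backup left while switch-A is down
pool.repair("switch-A")               # backup-1 becomes available again
third = pool.fail_over("switch-B")    # backup-1 now covers switch-B
```

In this sketch a single backup serves two failures in turn, which is the sharing the strategy relies on.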
"The reality is that the fraction of devices that fail at any given time is very small, and most of these failures can be addressed by things like rebooting the device," Ng said. "Sometimes the software gets screwed up and a simple power cycle will bring it back. These failures may also not last long. These are the characteristics we're trying to exploit. Because of that, we can get away with having very few devices back up a large number of devices."
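A rough back-of-the-envelope calculation shows why a small failure rate lets a few backups cover many devices. The sketch below is an assumption-laden illustration rather than an analysis from the paper: it treats switch failures as independent, and the function name and parameters `n`, `k` and `p` are hypothetical. It computes the chance that more than `k` of `n` switches are down at the same moment, which is when a pool of `k` backups would run out.

```python
from math import comb


def prob_pool_exhausted(n: int, k: int, p: float) -> float:
    """P(more than k of n switches are down at once).

    Assumes each switch is down independently with probability p
    (a simplifying assumption, not a figure from the article).
    """
    # Binomial probability that at most k switches are down.
    covered = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1.0 - covered


# Illustrative numbers only: 48 switches, each down 0.1% of the time.
one_backup = prob_pool_exhausted(48, 1, 0.001)
two_backups = prob_pool_exhausted(48, 2, 0.001)
```

Under these assumptions, each added backup sharply reduces the chance that the pool is exhausted, which matches the intuition in Ng's remark.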
Ng said ShareBackup could save data centers time and money not only by maintaining full bandwidth but also by helping to diagnose problems, including the misconfigurations that commonly lead to network failure.
"Part of our work is to help data centers figure out what went wrong in the network," he said. "Once the backup is activated, you can take the failed device out of the production network and test it to identify which component caused the problem."
"Now, if we take two devices out and can't figure out which went bad, both need to be replaced," he said. "It's very likely only one of the devices is having the problem. Our software can diagnose these devices in a semiautomatic manner, and if one of the parts is good, it can be reinstated."
Lead authors of the paper are Rice graduate student Dingming Wu and alumnus Yiting Xia, now a computer scientist at Facebook. Co-authors are Rice graduate students Xiaoye Steven Sun, Xin Sunny Huang and Simbarashe Dzinamarira.
The National Science Foundation supported the research.