Friday, January 14, 2011

99.999% Reliable? Don’t Hold Your Breath


AT&T’s dial tone set the all-time standard for reliability. It was engineered so that 99.999 percent of the time, you could successfully make a phone call. Five 9s. That works out to being available all but 5.26 minutes a year.

Can we realistically expect that such availability will ever come to Internet services? In any given week, it seems, some well-known service suffers a shutdown. Recently, it was Hotmail and Skype. And Wikipedia, Facebook, Twitter, Foursquare and PayPal, among others, made the 2010 list of service interruptions compiled by Royal Pingdom, a company in Sweden that monitors the uptime of Web services worldwide.

Internet computing, however, isn’t as unreliable as it may seem. After all, when was the last time you got to Google’s home page but couldn’t complete your search?

As more and more Web services companies acquire years of experience, we’ll see more consistent reliability — it’s just a matter of time and learning. Attaining Four-9s availability will become routine. That means available all but 52.56 minutes a year.
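The downtime figures quoted for Four 9s and Five 9s follow from simple arithmetic on the number of minutes in a year. The short Python sketch below reproduces them; the function name is made up for illustration, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for label, pct in [("Four 9s", 99.99), ("Five 9s", 99.999)]:
    print(f"{label}: {downtime_minutes_per_year(pct):.2f} minutes per year")
```

Each additional 9 cuts the allowed downtime by a factor of 10, which is why the step from Four 9s to Five 9s is so much harder than it sounds.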

As for moving to 99.999, well, that may never come. “We don’t believe Five 9s is attainable in a commercial service, if measured correctly,” says Urs Hölzle, senior vice president for operations at Google. The company’s goal for its major services is Four 9s.

Google’s search service almost reaches Five 9s every year, Mr. Hölzle says. By its very nature, it is relatively easy to provide uninterrupted availability for search. There are many redundant copies of Google’s indexes of the Web, and they are spread across many data centers. A Web search does not require constant updating of a user’s personal information in one place and then instantly creating identical copies at other data centers.

Gmail has backup copies offline, but it normally uses two perfectly mirrored live copies — and that introduces the potential for trouble. Last year, Gmail’s availability was 99.984 percent. (This is the percentage of requested actions, such as sending off a message, that were successful.)

“Google doesn’t have the luxury of scheduled downtime for maintenance,” says Armando Fox, an adjunct associate professor in the College of Engineering at the University of California, Berkeley. Nor can it take down the service, he says, to install upgrades. “It is not uncommon for a place like Google to push out a major release every week,” he said, adding that such frequency is “unprecedented” for the software industry.

Computing services built for Internet scale have been pioneered by Amazon, too. It offers to other businesses Amazon Web Services, almost two dozen discrete categories of services, such as computing cycles or database software running on Amazon’s machines. These are the same behind-the-scenes computing services that the company uses to run Amazon.com.

One of those services, the Simple Storage Service, or S3, allows companies to store data on Amazon’s servers. “We talk of ‘durability’ of data — it’s designed for Eleven-9s durability,” says James Hamilton, a vice president for Amazon Web Services. That works out to a 0.000000001 percent chance of data being lost, at least theoretically.

As soon as a problem surfaces with an Internet service — anywhere — it will receive wide coverage in the technology media. But when a branch office of a nontech company has problems with its own e-mail server used for Microsoft Outlook, no one outside of that office is the wiser.

One thing that Google and other companies offering Web services have learned to do is to keep software problems at their end out of the user’s view. John Ciancutti, vice president for personalization technology at Netflix, wrote on the company’s blog in December about lessons learned in moving its systems from its own infrastructure to that of Amazon Web Services. He said Netflix had adopted a “Rambo architecture”: each part of its system is designed to fight its way through on its own, tolerating failure from other systems upon which it normally depends.

“If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond,” Mr. Ciancutti said. “We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.”
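The fallback Mr. Ciancutti describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Netflix's code: every name here (the exception, the functions, the placeholder titles) is hypothetical, and the `service_up` flag merely simulates an outage.

```python
# Placeholder data standing in for a precomputed list of popular titles.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class RecommendationsUnavailable(Exception):
    """Raised when the (hypothetical) recommendations subsystem cannot answer."""


def fetch_personalized_picks(user_id, service_up=True):
    # Stand-in for a call to a recommendations subsystem.
    # The service_up flag simulates that subsystem being down.
    if not service_up:
        raise RecommendationsUnavailable("recommendations system is down")
    return [f"Pick {n} for {user_id}" for n in (1, 2, 3)]


def titles_for_home_page(user_id, service_up=True):
    """Return personalized picks, degrading to popular titles on failure."""
    try:
        return fetch_personalized_picks(user_id, service_up)
    except RecommendationsUnavailable:
        # Lower-quality answer, but the customer still gets a response.
        return POPULAR_TITLES


print(titles_for_home_page("user-1"))
print(titles_for_home_page("user-1", service_up=False))
```

The point of the design is that the caller decides what a degraded answer looks like, so a failure in one subsystem never becomes a failure of the whole page.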

Netflix intentionally stresses its systems with software it calls its “Chaos Monkey.” It creates mischief like shutting down Netflix’s own subsystems randomly and challenging the other subsystems to adapt on the fly. Mr. Ciancutti writes, “If we weren’t constantly testing our ability to succeed” when experiencing subsystems’ failures, “then it isn’t likely to work when it matters most — in the event of an unexpected outage.”
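The idea behind this kind of testing can be shown with a toy Python simulation: pick a subsystem at random, shut it down, and check that the rest of the system still answers. It is a sketch of the general technique only. The class, the function names and the "ratings" subsystem are invented for illustration and say nothing about how Netflix's actual tool is built.

```python
import random


class Subsystem:
    """A toy subsystem that is either up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True


def handle_request(subsystems):
    """Respond using whichever subsystems are up; never fail outright."""
    degraded = sorted(s.name for s in subsystems if not s.up)
    return {"status": "ok", "degraded": degraded}


def chaos_round(subsystems, rng):
    """Shut down one randomly chosen subsystem, send a request, then restore it."""
    victim = rng.choice(subsystems)
    victim.up = False
    try:
        response = handle_request(subsystems)
    finally:
        victim.up = True  # always bring the subsystem back
    return victim.name, response


rng = random.Random(0)  # fixed seed so the run is repeatable
subsystems = [Subsystem("recommendations"), Subsystem("search"), Subsystem("ratings")]
for _ in range(5):
    name, response = chaos_round(subsystems, rng)
    # The check that matters: the system still responds with a part missing.
    assert response["status"] == "ok" and response["degraded"] == [name]
    print(f"shut down {name}: {response}")
```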

Most of the time, Internet users enjoy responsive service online, or a convincing illusion that all is well. And if they don’t, the problem is more likely to originate at their Internet service provider than in the Web service it connects with.

At my house, the Internet connection is flaky at times, so I really shouldn’t demand that my favorite Web sites have Five-9s availability. Perceived reliability can be no better than that of the least reliable link in the chain. A home user’s Internet connection, with a laptop using Wi-Fi, would be available about 99.8 percent of the time, estimates Mr. Hölzle at Google, which equates to about 18 hours of cumulative downtime a year. So, he says, “if Google provided Five 9s, you wouldn’t know.”
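Mr. Hölzle's point can be checked with a rough calculation. The Python sketch below assumes that the home connection and the Web service fail independently, so that their availabilities multiply. That assumption and the function name are mine, not the article's. It compares a Four-9s and a Five-9s service as seen through a 99.8 percent connection.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def chain_availability(*availabilities):
    """Availability of links in series, assuming they fail independently."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


home_connection = 0.998  # the 99.8 percent estimate quoted above

for label, service in [("Four 9s", 0.9999), ("Five 9s", 0.99999)]:
    combined = chain_availability(home_connection, service)
    downtime_hours = (1 - combined) * HOURS_PER_YEAR
    print(f"{label} service: {combined:.4%} available end to end, "
          f"about {downtime_hours:.1f} hours down per year")
```

Under these assumptions the connection alone accounts for roughly 17.5 hours a year, and the difference between a Four-9s and a Five-9s service amounts to less than an hour on top of that.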
