Tag: Outages

Apple Store Down for Updates

The Apple store at store.apple.com is down for updates. At first, there was no message appearing, but now Apple is saying:

“We’ll be back soon. We are busy updating the store for you and will be back shortly.”

Searching Twitter to see what people are saying about it, I came across this as the top result:

@SeanSmithSucks
Sean Smith: Walked into the Apple store to see a kid pull another kids trouser straight down in front of everyone! I laughed out loud. (23 hours ago via Twittelator)

Good job, New Twitter Search!

Then there were some comments about the outage:

@LiamRas
Liam: Oooo. The Apple store is down. New products on the way? #Wishfulthinking (1 hour ago via web)

@engadget
Engadget: Yeah, the Apple Store is down. It’s okay though — you can still start your Wednesday. (1 hour ago via web)

@microfun
Carsten Koep: Oh, Apple Store is down: http://t.co/0dBIqRo – Can’t wait to see what’s new this time! 🙂 (1 hour ago via Google+ agent)

@phranck
phranck: uhhhh the online Apple Store is down for updates! (2 hours ago via Twitter for iPad)

@IsTheStoreDown
IsTheAppleStoreDown: The Apple Store is down! The average Store downtime is about 2 hours and 39 minutes. http://t.co/5cQs7Tu (2 hours ago via IsTheAppleStoreDown.de)

@RobLoBue
Robert Lo Bue: Can’t seem to get into UK Apple Store. Down but no message? http://t.co/xuwjzW8 (2 hours ago via Twitter for Mac)

August 17, 2011
Facebook API Down, Affecting Farmville, Words With Friends and More

Facebook is currently dealing with some problems with API response time.

Around 4 a.m. today, the API response time shot up dramatically, and Facebook has been experiencing problems ever since. It appears that the worst part is over, but things have yet to return to normal.

The Facebook platform live status page lists the current status as “Network Latency” saying that “we’re currently investigating an issue with API response times. We will update the status when we have further information.”

Here are some graphs of the problem, the first showing response time as compared to average. The second shows error count as compared with yesterday.

Of course, this is affecting anything that runs on Facebook’s API, including popular social games like FarmVille and the brand new Facebook edition of Words with Friends. Some applications aren’t working; others are simply working very slowly. For instance, it just took around a minute and a half to load my current games list on Words with Friends.

August 11, 2011
Yahoo Mail Down (For Some) – Users Considering Gmail

Some users of Yahoo services, and Yahoo Mail in particular, are currently experiencing outages. Many have taken to Twitter to voice their complaints.

An apparently official statement from Yahoo on the matter is making the rounds on Twitter as well. It says, “Some Yahoo services are currently inaccessible. We are working to correct the issue.”

Mashable obtained another, slightly longer statement saying, “Some Yahoo services are currently inaccessible to some users in certain locations. We are working to correct the issue and restore all functionality immediately. We know that this may have caused some inconvenience and we apologize to our users who might be affected.”

@smorris777
smorris777: yes, Y!Mail is down .. except on iOS (I can access via my iPad) http://t.co/de9xygZ via @cnet (20 minutes ago via Tweet Button)

@MiddleSeatView
Christina Saull: Yahoo mail has been AWFUL lately RT @mashable: Yahoo Mail Suffers Outage for Some Users – http://on.mash.to/oQUdDq (21 minutes ago via TweetDeck)

@lauranav
LAURANAV: Thumbs down, New Yahoo Mail. You keep pushing me to gmail, which I’ve only avoided because I am lazy. (39 minutes ago via web)

@Petabites
Petabites: got a Yahoo! mail #fail for the last hour. Webmail site down at the authentication stage of login for mysef and many others. (53 minutes ago via web)

@rudiuyee
rudi kurniawan: Yahoo Mail is down for some users: What gives, YMail? Are you trying to lose users to Gmail? http://bit.ly/nqwAfo (54 minutes ago via twitterfeed)

Yahoo Mail users who haven’t been able to access the site have been getting a message telling them that the webpage isn’t available.

People seem to be searching Google, hoping to find the site when Yahoo can’t (I’m guessing unsuccessfully). Top Google Trends at the moment include: “yahoo.com sign in,” “yahoo.com mail,” and “www.yahoo.com mail.”

I have to assume Google is getting a kick out of this, considering its recent Gmail push (which is probably more a Google+ push in reality). Although, it’s not as if Gmail has been completely immune from outages itself.

August 3, 2011
Netflix User Woes Continue to Pile Up

The good times just keep rolling for Netflix users. The company recently launched a major redesign of its site, which has been met with an incredible amount of hostility from users, many of whom have found little if any improvement and are upset with the removal of some features. I’ve even seen quite a few complaints of motion sickness, due to the scrolling movie images that come with the new design.

Netflix was also recently named on a list of mobile apps with security vulnerabilities. More on that here.

On Friday, the company announced that it had lost its streaming Sony movies (temporarily). Pauline Fischer, VP of Content Acquisition, wrote on the company blog, “You may have noticed that Sony movies through StarzPlay are not currently available to watch instantly. This is the result of a temporary contract issue between Sony and Starz and, while these two valued partners work through their differences, we hope you are enjoying the wide variety of new movies and TV shows added daily. In the next few weeks, look for great movie titles like The Fighter, Skyline and Iron Man 2 as well as the first four seasons of Mad Men, all to watch instantly.”

Even if the titles come back soon, this does show how Netflix’s service is vulnerable to deals among third parties that are out of its control.

Last night, the site was hit with an outage. The website and streaming services of Netflix were down for at least three hours. They’re back up now, but things still aren’t quite back to normal for everyone. The last two tweets from the @Netflixhelps support Twitter account were:

@Netflixhelps
Netflix: Hi everyone, we’re working hard to bring the website and watch instantly back up. We’ll post again when everything is back online. (9 hours ago via TweetDeck)

@Netflixhelps
Netflix: The website and watch instantly are online, but some are still having issues with activating devices. We’ll have that fixed ASAP too. (6 hours ago via TweetDeck)

We’ve seen a lot of comments from Netflix users since the redesign, saying they would be canceling their accounts. The events that have taken place since have probably not done much to help retain these people. The good news for Netflix is that people are watching more streaming content and less TV.

June 20, 2011
The Return of Blogger

Fear not, those of you who haven’t migrated over to the WordPress environment, because Blogger.com and all of its services are back. After a substantial outage, Google’s blog service is operational once again, and deleted posts have begun returning.

While it may not be as robust as other blogging platforms, Blogger’s user base is still considerable. So much so, in fact, that Alexa ranks Blogger.com as the fifth most popular site in the world based on three-month observation. Granted, this includes traffic to Blogger.com blogs, but the point remains – there are a lot of Internet users who access Blogger.com content.

As for the outage, it lasted 20.5 hours and was apparently caused by data corruption. Some posts and comments were removed, and Blogger.com members were left without any editing or content creation capabilities. The Blogger Buzz post has further details:

…bloggers and readers may have experienced a variety of anomalies including intermittent outages, disappearing posts, and arriving at unintended blogs or error pages. A small subset of Blogger users (we estimate 0.16%) may have encountered additional problems specific to their accounts. Yesterday we returned Blogger to a pre-maintenance state and placed the service in read-only mode while we worked on restoring all content: that’s why you haven’t been able to publish.

Over at the Blogger Status blog, a post from 06:07 PDT reveals the process of restoring content was underway. At 10:32 PDT, Blogger.com was given the “all clear/all systems normal” status, meaning posting could resume as normal.

From my (very) brief interaction with the Blogger.com backend, the first thing I noticed was that it takes a few moments to connect. While this could be a connection issue on my end, nothing else is connecting slowly, except Twitter. Apparently, the maintenance is ongoing, but the service is indeed functional.

May 13, 2011
Amazon Talks Preventing Future Outages, Says It’s Sorry

Amazon has finally released a big statement regarding the recent server disruptions it experienced, which led to massive losses in service for some sites and led people to question the reliability of the cloud.

When I say that the statement is “big” I mean it. I will post a few choice snippets here. First, a quick summary at the beginning:

The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations. In this document, we will refer to these as “stuck” volumes. This caused instances trying to use these affected volumes to also get “stuck” when they attempted to read or write to them. In order to restore these volumes and stabilize the EBS cluster in that Availability Zone, we disabled all control APIs (e.g. Create Volume, Attach Volume, Detach Volume, and Create Snapshot) for EBS in the affected Availability Zone for much of the duration of the event. For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region. As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.

It then gets into an overview of the EBS system, and technical details of the outage and recovery, as well as the impact on the Amazon Relational Database Service (RDS). Then it talks about prevention, which is probably the most important takeaway, considering businesses rely on Amazon to stay up and running:

The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future. However, we focus on building software and services to survive failures. Much of the work that will come out of this event will be to further protect the EBS service in the face of a similar failure in the future.

We will be making a number of changes to prevent a cluster from getting into a re-mirroring storm in the future. With additional excess capacity, the degraded EBS cluster would have more quickly absorbed the large number of re-mirroring requests and avoided the re-mirroring storm. We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures. We have already increased our capacity buffer significantly, and expect to have the requisite new capacity in place in a few weeks. We will also modify our retry logic in the EBS server nodes to prevent a cluster from getting into a re-mirroring storm. When a large interruption occurs, our retry logic will back off more aggressively and focus on re-establishing connectivity with previous replicas rather than futilely searching for new nodes with which to re-mirror. We have begun working through these changes and are confident we can address the root cause of the re-mirroring storm by modifying this logic. Finally, we have identified the source of the race condition that led to EBS node failure. We have a fix and will be testing it and deploying it to our clusters in the next couple of weeks. These changes provide us with three separate protections against having a repeat of this event.
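The retry behavior Amazon describes, backing off more aggressively after a large interruption, is commonly implemented as exponential backoff. The following is a minimal Python sketch of that general technique, not Amazon’s actual EBS code; the function names, the default delays, and the use of jitter are all assumptions made for illustration.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return the maximum wait, in seconds, before each retry.

    Each wait grows by `factor` until it reaches `cap`. All defaults are
    illustrative assumptions, not values taken from Amazon's statement.
    """
    return [min(cap, base * factor ** n) for n in range(attempts)]


def retry_with_backoff(operation, delays, sleep=time.sleep, rng=random.random):
    """Call operation() until it returns a truthy value.

    After each failure, wait a random fraction of the next delay ("jitter")
    so that many nodes retrying at once do not all hit the cluster together.
    """
    for delay in delays:
        result = operation()
        if result:
            return result
        sleep(delay * rng())
    # One last attempt after the final wait.
    return operation()
```

With these defaults, the upper bound on each wait doubles from 1 second to 32 seconds, so a struggling cluster sees fewer and fewer retries the longer the interruption lasts.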

Then, there’s plenty more about the impact to multiple availability zones and recovery, before Amazon addresses another big element of this story, which has come under significant fire from the media: the company’s lack of communication on the whole matter (something that seems to be a trend in the tech world these days):

In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications. We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again. Most of the AWS team, including the entire senior leadership team, was directly involved in helping to coordinate, troubleshoot and resolve the event. Initially, our primary focus was on thinking through how to solve the operational problems for customers rather than on identifying root causes. We felt that focusing our efforts on a solution and not the problem was the right thing to do for our customers, and that it helped us to return the services and our customers back to health more quickly. We updated customers when we had new information that we felt confident was accurate and refrained from speculating, knowing that once we had returned the services back to health that we would quickly transition to the data collection and analysis stage that would drive this post mortem.

That said, we think we can improve in this area. We switched to more regular updates part of the way through this event and plan to continue with similar frequency of updates in the future. In addition, we are already working on how we can staff our developer support team more expansively in an event such as this, and organize to provide early and meaningful information, while still avoiding speculation.

We also can do a better job of making it easier for customers to tell if their resources have been impacted, and we are developing tools to allow you to see via the APIs if your instances are impaired.

Finally, the apology:

Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.

The company was good enough to give affected customers a 10-day credit (equal to 100% of usage of EBS volumes, EC2 instances and RDS database instances that were running in the affected availability zone).

April 29, 2011
Amazon Outage Casts Shadow Over Cloud Perception

Amazon recently suffered some problems with some of its servers, which left some sites with large hiccups in their services. Amazon’s Elastic Compute Cloud (EC2) service had some issues, primarily in Virginia. Among the sites affected were Foursquare, Quora, Reddit, and Hootsuite.

Amazon has said that 0.07% of the volumes in its US-East Region could not be fully recovered, according to several reports. “We have completed our remaining recovery efforts and though we’ve recovered nearly all of the stuck volumes, we’ve determined that a small number of volumes (0.07% of the volumes in our US-East Region) will not be fully recoverable,” Amazon is quoted as saying.

The company has been letting the companies affected know. It’s unclear exactly which companies are affected.

The whole incident hasn’t been good for the perception of cloud computing in general. After all, if something like this could happen, what’s to stop all kinds of similar incidents from happening in the future? Reliance on others for important data is a liability.

Look at Twitter’s inability to stay operational for all users at all times. What if Twitter were hosting a great deal of your company’s information? Businesses have come to rely on Twitter for various purposes, yet the site is often plagued with downtime. It’s just another example of reliance on third parties for business-critical functions.

The whole PlayStation Network debacle hasn’t done anything to help the perception of cloud computing either.

Amazon recently launched its cloud storage service for consumers. Amazon will have a lot more than just business data on its servers, as people store their music collections and other files.

As for businesses, it might be wise to have a backup plan in case you can’t rely 100% on a third party. InformationWeek has an interesting piece about the need for failover planning.

Bizo, a company that depended on Amazon, “resorted to a practice that many observers were left wondering why Amazon itself hadn’t adopted,” writes InformationWeek’s Charles Babcock, “the ability of a system in one data center to be shifted to another in a separate, geographic location.”
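The failover practice Babcock describes can be sketched in a few lines. This is a hypothetical illustration in Python, not Bizo’s or Amazon’s implementation; the region names and the health-check function are assumptions.

```python
def pick_region(regions, is_healthy):
    """Return the first region, in priority order, whose health check passes."""
    for region in regions:
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical example: the primary region is failing its health check,
# so traffic shifts to a geographically separate backup.
down = {"us-east"}
active = pick_region(["us-east", "us-west"], lambda region: region not in down)
print(active)  # us-west
```

The hard part in practice is not the selection logic but keeping the second location’s data and configuration current enough to take over, which is why failover needs planning ahead of time.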

Everything on Amazon’s status dashboard is currently listed as “operating normally”.

April 28, 2011
Amazon EC2 Server Issues Cause Web Havoc

Amazon’s Elastic Compute Cloud (EC2) service has caused a bit of disarray around the web as servers have failed.

Among the sites/services affected are Foursquare, Quora, Reddit, and Hootsuite (ht: The Next Web).

The issues appear to be coming out of Virginia. Amazon is providing updates on its Amazon Web Services Service Health Dashboard, and all of the reported issues come from that location. One sequence of updates attached to Amazon CloudWatch reads:

2:26 AM PDT We are working on restoring connectivity to a small number of EC2, EBS, and RDS resources in multiple availability zones in the US-EAST-1 region. While we restore connectivity, CloudWatch metrics for those resources will be delayed.

3:04 AM PDT We are continuing to see connectivity issues impacting EC2, EBS, and RDS resources in multiple availability zones in the US-EAST-1 region. While we restore connectivity, CloudWatch metrics for those resources will be delayed. We continue to work towards resolution.

4:47 AM PDT CloudWatch metrics are delayed for some EBS and RDS resources in the US-EAST-1 region. The delays began at 12:55AM PDT. We have isolated the impact to a single availability zone, and are working towards a full resolution.

Another on Amazon Relational Database Service says:

1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.

2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.

3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.

4:03 AM PDT We are making progress on failovers for Multi AZ instances and restore access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution.

5:06 AM PDT IO latency issues have recovered in one of the two impacted Availability Zones in US-EAST-1. We continue to make progress on restoring access and resolving IO latency issues for remaining affected RDS database instances.

On AWS CloudFormation, it says:

3:29 AM PDT We are experiencing delays in creating and deleting stacks that include EBS, EC2 and RDS resources in multiple availability zones in the US-EAST-1 region. Existing stacks are not impacted.

5:10 AM PDT CloudFormation stack creation and deletion is delayed for stacks containing EC2, EBS and RDS resources in the US-EAST-1 region. The delays began at 12:55AM PDT. We have isolated the impact to a single availability zone, and are working towards a resolution.

Finally, on AWS Elastic Beanstalk, it says:

3:16 AM PDT We can confirm increased error rates impacting Elastic Beanstalk APIs and console, and we continue to work towards resolution.

4:18 AM PDT We continue to see increased error rates impacting Elastic Beanstalk APIs and console, and we are working towards resolution.

The rest of the list comes with the “service is operating normally” status.

Foursquare and Reddit seem to be back on track, but Quora and Hootsuite are still down at the time of this writing.

I wonder how much money is being lost based on Amazon’s server issues.

April 21, 2011
Twitter Down For Some Users

Twitter users are having some problems accessing the site. What else is new, right? This time we’re not seeing the usual fail whale, but the robot pictured above.

Various Twitter apps appear to be working (we tested iPhone/Android/Mac), but Twitter.com itself is having issues.

Not much info has been made available in terms of details, but Twitter did have this to say on the Twitter Status blog:

You may experience some problems loading twitter.com and with Twitter clients. We are aware of the problem and are taking action.

That was about an hour ago. It hasn’t been that long since I was able to use Twitter.com, but apparently it’s been happening that long for some. If we receive additional info, we’ll update.

Twitter.com access did resurface once for me for just a second, before fading again.

I don’t think this robot is quite as charming as the Fail Whale. Anyone have t-shirts or tattoos of this guy yet?

Update: As of 4:46PM, Twitter.com appears to be back in business – at least for those of us around here. The status blog has not been updated, however, so I’m guessing the issue isn’t completely resolved.

Update 2: Twitter posted the following to its status blog: This issue is resolved, access to all features for all users is restored.

March 16, 2011
Google Loses Gmail Users’ Email, Says It Will Be Back in Hours

Some Gmail users have had some problems over the last day or so. Email messages have gone missing, along with labels, themes and other personalized settings. Google is working on fixing this.

While only a small percentage of Google users were affected, that still accounts for thousands of users – tens of thousands, according to ComputerWorld, which estimates the number to be about 35,000.

TheNextWeb shares this statement from Google:

"A very small number users are having difficulty accessing their Gmail accounts, and in some cases once they’re in, trouble viewing emails. This is affecting less than .08% of our Gmail user base, and we’ve already fixed the problem for some users. Our engineers are working as quickly as possible and we hope to have everything back to normal as soon as possible. We’re very sorry for the inconvenience."

Google’s Andrew Kovacs has since tweeted:

re Gmail issue: affected 0.02% of users not 0.08%, restored access for 1/3, remaining 0.013% being restored on ongoing basis,all w/in 12 hrs (Andrew Kovacs, @akovacs, via web)

According to Seth Weintraub at Fortune, some Gmail users have been without their messages for over 24 hours.

The latest message on the Apps Status Dashboard from 4:15PM Eastern, says:

Google Mail service has already been restored for some users, and we expect a resolution for all users within the next 10 hours. Please note this time frame is an estimate and may change.

The remaining 0.012% of accounts are being restored on an ongoing basis.

Google has often boasted of tremendous uptime for Gmail, and while only a small percentage of users appear to have been affected, the widely publicized flub is sure to leave an impression on users.

Of course, there’s barely a day that goes by when I don’t have some kind of issue with Twitter (and I know I’m not alone), yet many of us keep using that. As long as Google gets everything recovered, as they seem pretty confident that they will do, this will probably be largely forgotten in no time – especially considering similar incidents from competitors.

February 28, 2011
Google Commits to 99.99% Uptime for Google Apps

Google has announced that it has made some changes to its service level agreement (SLA) for Google Apps, to reduce the possibility that users will experience any downtime. The company says it has eliminated maintenance windows from the SLA, so Google will never plan for users to be down when they’re upgrading services or maintaining their systems.

"People expect email to be as reliable as their phone’s dial tone, and our goal is to deliver that kind of always-on availability with our applications," says Matthew Glotzbach, Google’s Enterprise Product Management Director. "Going forward, all downtime will be counted and applied towards the customer’s SLA."

Google has also made changes to the SLA so that any intermittent downtime is counted. "Previously, a period of less than ten minutes was not included," explains Glotzbach. "We believe any instance that causes our users to experience downtime should be avoided — period."
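To see what that change means in practice, here is a small Python sketch comparing the two counting rules. The outage durations are made-up sample numbers and the function name is hypothetical; only the ten-minute threshold comes from Glotzbach’s description.

```python
def counted_downtime(outage_minutes, min_counted=0.0):
    """Sum outage durations, ignoring any shorter than `min_counted` minutes."""
    return sum(m for m in outage_minutes if m >= min_counted)


# Hypothetical outages in one month, in minutes.
outages = [3, 12, 0.5, 9.5, 25]

old_rule = counted_downtime(outages, min_counted=10)  # short blips ignored
new_rule = counted_downtime(outages)                  # every interruption counts
print(old_rule, new_rule)  # 37 50.0
```

In this invented month, 13 minutes of short interruptions would have gone uncounted under the old rule.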

According to Google, Gmail was available 99.984% of the time in 2010 for both businesses and consumers, which works out to about 7 minutes of downtime per month on average. Glotzbach says this represents the accumulation of small delays of a few seconds.
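The seven-minute figure is easy to check. A quick Python sketch, assuming a 365-day year split into twelve equal months (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_month(availability_percent):
    """Average monthly downtime implied by an availability percentage."""
    unavailable = 1 - availability_percent / 100
    return unavailable * MINUTES_PER_YEAR / 12


print(round(downtime_minutes_per_month(99.984), 1))  # 7.0
print(round(downtime_minutes_per_month(99.99), 1))   # 4.4
```

By the same arithmetic, the 99.99% target in the headline allows a little under four and a half minutes of downtime a month.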

"Seven minutes of downtime compares very favorably with on-premises email, which is subject to much higher rates of interruption that hurt employee productivity," he says, providing the following graph:

As you may know, Microsoft’s Hotmail recently experienced some issues, which led to our conversation about users’ dependency and vulnerability when it comes to using web services, particularly for email.

At the end of December, some Hotmail users lost messages and folders from their accounts temporarily. On January 6, Microsoft explained what happened, and said it had all been recovered, though for those who didn’t sign into their accounts between the time of the incident and the time the account was restored, any messages sent to their accounts during that time would have bounced.

While I’d be interested to see another chart like the one above, comparing Gmail with other webmail providers like Microsoft’s Hotmail, Yahoo Mail, etc., Gmail’s case is looking pretty good compared to on-premises email.

January 14, 2011
Communication Breakdown: When Email Goes Down

At the end of December, some Hotmail users experienced problems with their email – it was gone. Messages and folders went completely missing from their accounts. Luckily, for those users, the emails came back.

Microsoft says it recovered 100% of email and folders for the accounts affected. Unfortunately, for those who didn’t sign into their accounts between the time of the incident and the time the account was restored, any messages sent to their accounts during that time would have bounced.

Microsoft has apologized for the incident, but it can’t have been very good for the service’s reputation with users, particularly considering there are plenty of other options out there. Hotmail has hundreds of millions of users, and competitors like Yahoo and Google will be happy to take as many of them as possible.

The whole thing makes you stop and consider how much users are relying on third parties for essential communication. Who’s to say people didn’t miss extremely important messages during that period?

Microsoft’s Mike Schackwitz details exactly what happened on the company’s Inside Windows Live Blog:

In Hotmail, one way we monitor the health of the email service is through automated tests. We set up a number of accounts with different configurations, and then use automated tests to log into these accounts, simulate normal user activity and behavior, and report when errors are found. We use scripts to create and delete these test accounts in bulk. The way we delete a test account is to remove its record from a group of directory servers that route users and incoming mail to the correct mailbox.

On December 30th, we had an error in a script that inadvertently removed the directory records of a small number of real user accounts along with a set of test accounts. Please note that the email messages and folders of impacted users were not deleted; only their inbox location in the directory servers was removed. Therefore when they logged in, a new mailbox was automatically created for them on a new storage server that didn’t contain their old messages and folders. This is why the accounts received the “Welcome to Hotmail” message.
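The failure mode Schackwitz describes, a bulk cleanup script catching real accounts along with test ones, is the kind of thing a simple guard can help with. The Python sketch below is purely illustrative and is not Microsoft’s script or its fix; the "test-" naming convention, the dictionary standing in for the directory servers, and the function name are all assumptions.

```python
def delete_test_records(directory, accounts, test_prefix="test-", dry_run=True):
    """Remove directory records for test accounts only.

    Refuses to run at all if any requested account does not look like a
    test account, and changes nothing unless dry_run is set to False.
    """
    refused = [a for a in accounts if not a.startswith(test_prefix)]
    if refused:
        raise ValueError(f"refusing to delete non-test accounts: {refused}")
    if not dry_run:
        for account in accounts:
            directory.pop(account, None)
    return list(accounts)
```

Failing the whole batch when a single real account slips in is a deliberate choice here: a bulk script that stops loudly is easier to live with than one that quietly removes the wrong records.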

Read the post for further explanation.

It’s not like Microsoft is the first provider to experience downtime. Google has always bragged about its Gmail uptime (and has a dashboard where users can monitor it), but it’s gone down on occasion too. Facebook is trying to redefine email and electronic communication with its social inbox, but Facebook recently went down for a lot of users itself. Twitter is no replacement for email, but a lot of people communicate with it frequently, and that fail whale appears fairly frequently.

Microsoft says it’s updating its infrastructure, and changing its alert process, as well as its feedback process to take preventative action against future incidents. Unfortunately, and this goes for any company, it’s usually the issues you don’t think to prevent that end up costing people.

January 7, 2011
Skype Talks Outage and Prevention of Future Outages

Last week, millions of Skype users lost their connections and experienced various other issues with the service. Though the outage didn’t last much longer than a day, the sheer number of those affected made it a huge blunder for the company.

It didn’t take long for Skype to go into damage control mode, however. CEO Tony Bates himself jumped on the Skype blog a few times to provide updates, explanations, and of course apologies.

Now, Skype CIO Lars Rabbe has chimed in with a "post-mortem on the Skype outage". He details the cause of the failure, how they recovered the service, and most importantly, what the company is doing to prevent such a thing from happening again.

If you’re interested in the technical explanation of what happened, simply refer to Rabbe’s post. To put it in the simplest possible terms, as he did in the intro, Skype’s P2P network became unstable and suffered a "critical failure."

As far as prevention, Rabbe says the company will continue to examine its software for potential issues, and provide "hotfixes" where appropriate, either for download or automatic delivery to users. "We will also be reviewing our processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software," he says. "We believe these measures will reduce the possibility of this type of failure occurring again."

"Second, we are learning the lessons we can from this incident and reviewing our processes and procedures, looking in particular for ways in which we can detect problems more quickly to potentially avoid such outages altogether, and ways to recover the system more rapidly after a failure," he adds. "Third, while our Windows v5 software release was subject to extensive internal testing and months of Beta testing with hundreds of thousands of users, we will be reviewing our testing processes to determine better ways of detecting and avoiding bugs which could affect the system."

He says that Skype will also continue to invest in capacity and resilience, with an investment program already in existence.

December 29, 2010
Millions Lose Their Skype Connections [Updated]

Update 3: Now CEO Tony Bates has provided further explanation and apologies.

Update 2: Parkes provided another update on the issue today:

An update on the downtime which has been affecting many of you around the world: the ability of one Skype user to find another relies on what we call ‘supernodes’, and yesterday, a number of these failed due to a software issue, which we’ve now identified. Our engineers are working to resolve the problem.

Millions of you are already reporting that you can now sign in to Skype normally, and we estimate that there are already almost 5 million people online. As a guide, this is around 30% of what we’d expect at this time of day – and that number is increasing all the time. Unfortunately, it’s not possible for us to predict on an individual level when you’ll be able to sign in again, and we thank you for your patience in the meantime.

It’s worth noting that our enterprise product, Skype Connect, is working normally, though Skype Manager and our other web-based functions will continue to stay offline for a little longer. Additionally, features like group video calling will take longer to return to normal.

Update: Parkes says on the official Skype blog:

Skype isn’t a network like a conventional phone or IM network – instead, it relies on millions of individual connections between computers and phones to keep things up and running. Some of these computers are what we call ‘supernodes’ – they act a bit like phone directories for Skype. If you want to talk to someone, and your Skype app can’t find them immediately (for example, because they’re connecting from a different location or from a different device) your computer or phone will first try to find a supernode to figure out how to reach them.

Under normal circumstances, there are a large number of supernodes available. Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype. As Skype relies on being able to maintain contact with supernodes, it may appear offline for some of you.

What are we doing to help? Our engineers are creating new ‘mega-supernodes’ as fast as they can, which should gradually return things to normal. This may take a few hours, and we sincerely apologise for the disruption to your conversations. Some features, like group video calling, may take longer to return to normal.
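The directory role Parkes describes can be illustrated with a minimal sketch. This is not Skype's actual protocol, which the post does not detail. The `Supernode` class, the `find_user` function, the retry limit, and the sample address are all hypothetical, chosen only to show why a client appears offline once most supernodes are gone.

```python
import random


class Supernode:
    """Hypothetical directory node: maps a user name to a network address."""

    def __init__(self, directory, online=True):
        self.directory = directory
        self.online = online

    def lookup(self, user):
        # An offline supernode cannot answer at all.
        if not self.online:
            raise ConnectionError("supernode offline")
        return self.directory.get(user)


def find_user(user, supernodes, max_attempts=3, rng=random):
    """Ask up to max_attempts randomly chosen supernodes where `user` is.

    Returns the address, or None if no contacted supernode could answer.
    """
    candidates = rng.sample(supernodes, min(max_attempts, len(supernodes)))
    for node in candidates:
        try:
            address = node.lookup(user)
        except ConnectionError:
            continue  # this supernode is down, try the next one
        if address is not None:
            return address
    return None


# Illustrative data only.
directory = {"alice": "10.0.0.5:4000"}
healthy = [Supernode(directory) for _ in range(5)]
all_down = [Supernode(directory, online=False) for _ in range(5)]

print(find_user("alice", healthy))   # an address is found
print(find_user("alice", all_down))  # None: the client would appear offline
```

In this sketch, adding a few online nodes back to the pool is enough for lookups to start succeeding again, which is roughly the effect the "mega-supernodes" in the quote are meant to have.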

Original Article: Skype has been having troubles with sign-ins today, and millions of people have lost their connections on the service. Little is known about the cause of all of this, but just do a Twitter search for Skype and you’ll see that a lot of people are less than thrilled.

Skype is on it, at least. The company posted the following tweets on Twitter:

"Some of you may have problems signing in to Skype – we’re investigating, and we’re sorry for the disruption to your conversations"

"Our engineers and site operations team are working non-stop to get things back to normal – thanks for your continued patience"

Peter Parkes with Skype’s communications team told ReadWriteWeb, "If you’re already signed in, you should be able to continue using Skype as normal."

Some people are apparently able to sign in to the service, but have lost their contacts.

The outage has already dealt a big blow to Skype’s reputation. Influential tech blogger Om Malik notes, "The outage comes at a time when Skype is starting to ask larger corporations for their business. If I am a big business, I would be extremely cautious about adopting Skype for business, especially in the light of this current outage."

There have been a lot of high-profile outages lately. Some have lasted longer than others. It will be interesting to see how Skype handles the damage control with this one.

December 23, 2010
Update: Facebook is Up…Kind Of (After Downtime)

Update 3: Facebook has now tweeted saying, "Facebook is available again after being down for a brief period. We apologize for the inconvenience."

Guess it was page related after all.

Update 2: It seems, as a commenter points out, that Facebook is still down for some people. Tweets continue to roll in to that effect. (5pm Eastern)

People are now taking jabs at Mark Zuckerberg for being named Person of the Year.

Some are saying they were seeing the new page designs before the outage (though we never saw it on ours), so something likely went wrong with the rollout.

Update: It’s back up. It’s rare for FB to go down like that, but to the company’s credit, the outage didn’t last too long, especially compared to MasterCard’s recent two-day outage.

For the record, I’m not seeing this new Page design yet either.

Original Article: Facebook appears to be down, at least for many people. We’re seeing a lot of chatter about it on Twitter.

As Twitter user DJDearest says, "Facebook is down. Productivity at offices across the world just went up 356%."

We don’t know if it has anything to do with the recent "Anonymous" attacks, but Facebook did recently take down the group’s Facebook Page. Then again, Twitter took down the group’s account too, and given Twitter’s reputation for downtime, you would think it would be down as well.

Here is a sample of what some other people have to say about it on Twitter:

It’s very unusual for Facebook to go offline like this, unlike say Twitter. No Fail Whales to look at though.

Mashable speculates that this is tied to a rollout of redesigned brand pages.

I wonder how much money Facebook is losing while this is going on.

Touch.facebook.com is working.

December 16, 2010
Facebook Explains Outage

Facebook was down for over 2.5 hours for some users, according to a post from the company. A post in Facebook’s engineering notes says:

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.

You can see more of the technical details here. Facebook has turned off the system that attempts to correct configuration values, and is exploring new designs for it.
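The failure mode in the engineering note can be illustrated with a small sketch. This is not Facebook's code. The `ConfigStore` class, the `is_valid` rule, and the `timeout` key are hypothetical, and the only behavior taken from the quote is "replace an invalid cached value with the value from the persistent store."

```python
class ConfigStore:
    """Hypothetical persistent store that counts how often it is read."""

    def __init__(self, value):
        self.value = value
        self.reads = 0

    def read(self):
        self.reads += 1
        return self.value


def is_valid(value):
    # Assumed validity rule for illustration: a positive integer.
    return isinstance(value, int) and value > 0


def get_config(cache, store, key):
    """Return the cached value, refreshing it from the store if invalid."""
    value = cache.get(key)
    if not is_valid(value):
        value = store.read()
        cache[key] = value  # written back even if the store's value is bad
    return value


def simulate(cache_value, store_value, requests):
    """Run `requests` lookups and report how many hit the persistent store."""
    cache = {"timeout": cache_value}
    store = ConfigStore(store_value)
    for _ in range(requests):
        get_config(cache, store, "timeout")
    return store.reads


# Transient cache problem: one store read repairs the cache.
print(simulate(cache_value=-1, store_value=30, requests=1000))
# Invalid persistent store: every request falls through to the store.
print(simulate(cache_value=-1, store_value=-1, requests=1000))
```

Under these assumptions the first case costs a single store read, while the second sends all 1000 requests to the store, because the "fix" keeps writing the same invalid value back into the cache.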

I attended a screening of the movie The Social Network last night, and Mark Zuckerberg’s character stressed how much downtime would hurt the reputation of the site, as he was getting it launched. I thought that was kind of funny, considering the timing.

September 24, 2010