On Friday, Google's Gmail and Google+ services went down worldwide. Google's apps dashboard initially showed all Google services in the green, but it was updated at around 2:20 p.m. ET to show a service disruption for Gmail.
Gmail stopped working for both web and mobile apps as well as third-party clients such as Apple Mail, suggesting that the entire back-end (IMAP) service was affected.
Google eventually responded to the issue in a post on its blog. The outage, which took Gmail, Google+, Calendar and Documents down for users around the world, was caused by a software bug.
Ben Treynor, Google's VP of Engineering, said that "our work is now focused on (a) removing the source of failure that caused today’s outage, and (b) speeding up recovery when a problem does occur." He said Google would take the following steps in the next few days:
- Correcting the bug in the configuration generator to prevent recurrence, and auditing all other critical configuration generation systems to ensure they do not contain a similar bug.
- Adding additional input validation checks for configurations, so that a bad configuration generated in the future will not result in service disruption.
- Adding additional targeted monitoring to more quickly detect and diagnose the cause of service failure.
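The second step, validating a configuration before it reaches live services, can be illustrated with a minimal sketch. Google has not published how its checks work, so everything here is an assumption made for illustration: the function names `validate_config` and `push_if_valid`, the dictionary layout with `service` and `backends` keys, and the three rules themselves are all hypothetical.

```python
def validate_config(config, previous):
    """Return a list of problems found in a generated config.

    Hypothetical checks for illustration only; not Google's actual rules.
    """
    errors = []

    # Rule 1: required keys must be present.
    for key in ("service", "backends"):
        if key not in config:
            errors.append(f"missing required key: {key}")

    # Rule 2: a config with no backends would leave requests unserved.
    backends = config.get("backends", [])
    if not backends:
        errors.append("backends list is empty")

    # Rule 3: flag a config that drops more than half of the
    # backends listed in the previous known-good config.
    old = set(previous.get("backends", []))
    if old and len(old - set(backends)) > len(old) / 2:
        errors.append("more than half of the previous backends were removed")

    return errors


def push_if_valid(config, previous):
    """Return the config that should be live after a push attempt.

    A config that fails validation is rejected and the previous
    known-good config stays in place.
    """
    return config if not validate_config(config, previous) else previous
```

With a check like this in front of the push, a bad configuration is rejected and the last known-good one stays live, rather than being sent out and causing errors.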
Google apologized for the outage, which it reported lasted 25 to 55 minutes and affected as many as 10% of users. The company also said it is putting systems in place to prevent similar problems in the future.
If you’re interested in the technical explanation for what happened and how it was fixed, read on.
At 10:55 a.m. PST this morning, an internal system that generates configurations—essentially, information that tells other systems how to behave—encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes, caused users’ requests for their data to be ignored, and those services, in turn, generated errors. Users began seeing these errors on affected services at 11:02 a.m., and at that time our internal monitoring alerted Google’s Site Reliability Team. Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users’ service was restored.
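The intervals in that timeline can be worked out with a short sketch. The timestamps are the ones quoted above; the event labels and the helper `minutes_between` are names chosen here for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day 'HH:MM' timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times (PST) as given in Google's explanation; labels are illustrative.
timeline = {
    "bad_config_generated": "10:55",
    "errors_visible": "11:02",
    "good_config_generated": "11:14",
    "fully_restored": "11:30",
}

durations = {
    # Time from the bad config being generated to users seeing errors.
    "generation_to_errors": minutes_between(
        timeline["bad_config_generated"], timeline["errors_visible"]),
    # Time engineers spent debugging before the system corrected itself.
    "errors_to_good_config": minutes_between(
        timeline["errors_visible"], timeline["good_config_generated"]),
    # Time for the corrected config to go live everywhere.
    "good_config_rollout": minutes_between(
        timeline["good_config_generated"], timeline["fully_restored"]),
    # Total span of user-visible errors.
    "user_visible_outage": minutes_between(
        timeline["errors_visible"], timeline["fully_restored"]),
}

for name, minutes in durations.items():
    print(f"{name}: {minutes} min")
```

By this reckoning users saw errors for 28 minutes, from 11:02 a.m. to 11:30 a.m., which falls within the 25 to 55 minutes Google reported.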