Why it took Google hours to fix a glitch its engineers “identified in 10 minutes”


Google has apologised for a major outage that recently impacted services running on Google Cloud across the globe. The widespread outage disrupted over 70 Google Cloud services, bringing down major third-party platforms like Cloudflare, OpenAI, and Shopify, as well as Google’s own products, including Gmail, Google Calendar, Google Drive, and Google Meet. In an incident report released late Friday (June 17), the company said, “Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests, impacting customers.” Google attributed the hours-long downtime to a series of flawed updates, particularly a new “quota policy checks” feature introduced in May. The feature, designed to evaluate automated incoming requests, was not adequately tested in real-world scenarios. As a result, blank policy entries were replicated across all Google Cloud data centers, triggering widespread system crashes.

What caused the Google Cloud outage

“On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging,” said Google in the incident report.
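For illustration, here is a minimal Go sketch of the two safeguards the report describes: gating a new code path behind a feature flag, and handling a missing or blank policy explicitly instead of letting a null pointer crash the binary. This is not Google’s actual Service Control code; the QuotaPolicy type, featureFlags map, and checkQuotaPolicy function are invented for the example.

```go
// Illustrative sketch only (not Google's Service Control code): a feature flag
// plus explicit nil handling keep a bad quota-policy record from crashing the server.
package main

import (
	"errors"
	"fmt"
)

// QuotaPolicy is a stand-in for the policy data replicated to each region.
type QuotaPolicy struct {
	Limits map[string]int
}

// featureFlags is a placeholder for a per-region, per-project flag service.
var featureFlags = map[string]bool{
	"quota_policy_checks": false, // rolled out gradually, starting with internal projects
}

// checkQuotaPolicy evaluates a request against the policy. Instead of dereferencing
// a possibly-nil policy (the null pointer in the incident), it returns an error and
// lets the caller degrade gracefully.
func checkQuotaPolicy(policy *QuotaPolicy, method string) error {
	if !featureFlags["quota_policy_checks"] {
		return nil // flag off: the new code path is never exercised
	}
	if policy == nil || policy.Limits == nil {
		return errors.New("quota policy missing or blank; skipping enforcement")
	}
	if _, ok := policy.Limits[method]; !ok {
		return fmt.Errorf("no quota limit configured for %q", method)
	}
	return nil
}

func main() {
	// With the flag off (the safe default), the new path is skipped entirely.
	fmt.Println(checkQuotaPolicy(nil, "storage.objects.get"))

	// With the flag on, a blank policy is rejected gracefully rather than crashing.
	featureFlags["quota_policy_checks"] = true
	fmt.Println(checkQuotaPolicy(nil, "storage.objects.get"))
}
```

With the flag off, a bad record never reaches production traffic; with it on, a blank policy produces an error the caller can handle rather than a process crash.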

Why it took Google engineers hours to fix the issue, and the way forward

According to the company, Google engineers identified the issue within 10 minutes, but the outage persisted for seven hours due to overloaded systems in larger regions. Google admitted it failed to use feature flags, a standard industry practice that could have caught the issue during a gradual rollout.

“Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first. Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load. At that point, Service Control and API serving was fully recovered across all regions. Corresponding Google and Google Cloud products started recovering with some taking longer depending upon their architecture,” said Google in its incident report.

To prevent future incidents, Google said it will overhaul its system architecture to ensure operations continue even if one component fails. The company also committed to auditing its systems and improving both automated and human communication channels to keep customers informed during issues.
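As a rough illustration of the safeguard Google says was missing, the Go sketch below shows randomized (“jittered”) exponential backoff: each retry waits roughly twice as long as the previous one, plus a random offset, so a herd of restarting tasks does not hit the backing database in lockstep. The retryWithBackoff helper and its parameters are hypothetical, not Google’s implementation.

```go
// Illustrative sketch: randomized exponential backoff of the kind Google says
// Service Control lacked, which spreads retries out instead of hammering an
// overloaded dependency all at once.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op, doubling the base delay on each attempt and adding
// random jitter so many tasks restarting together do not retry in lockstep.
func retryWithBackoff(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	maxDelay := 30 * time.Second

	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := op(); err == nil {
			return nil
		}
		delay := base << attempt // exponential: 100ms, 200ms, 400ms, ...
		if delay > maxDelay {
			delay = maxDelay
		}
		// Full jitter: sleep a random duration in [0, delay) to decorrelate retries.
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
	}
	return errors.New("operation failed after all retry attempts")
}

func main() {
	attempts := 0
	// A simulated flaky dependency standing in for an overloaded backing store.
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 4 {
			return errors.New("overloaded")
		}
		return nil
	}, 8)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Throttling task creation, which Google also describes doing during recovery, works on the same principle: limit how much concurrent load the recovering tasks put on an already overloaded dependency.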

Google apologises for the outage

“We deeply apologize for the impact this outage has had,” Google wrote in the incident report. “Google Cloud customers and their users trust their businesses to Google, and we will do better. We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems. We are committed to making improvements to help avoid outages like this moving forward.” Thomas Kurian, CEO of Google’s cloud unit, also posted about the outage on Twitter, saying, “we regret the disruption this caused our customers.”




