In the news...

In the news… it happened again!

Microsoft Azure experienced a significant outage this week—just days after the AWS incident—and the ripple effects were felt well beyond tech circles. For food service, grocery, agribusiness, and logistics, the pattern is now unmistakable: when a hyperscaler stumbles, the whole menu shakes.

Who felt it in food & supply chains

During the Azure event, major retailers and travel operators reported issues (payments, portals, check-in, digital passes), with downstream effects on store operations and customer flows. Reports cited impacts to brands like Kroger and Starbucks, airlines, and even government portals in New Zealand—demonstrating how the same edge services touch both commerce and public services.

  • Restaurants & grocers: Portal logins, payment authorizations, loyalty apps, and scheduling tools glitched—slowing lines and forcing manual workarounds.
  • Distributors & logistics: Edge delivery hiccups and DNS faults can break route optimizers, label/ASN services, and proof-of-delivery syncs—delaying loads and backhauls. (See similar dynamics from the AWS outage a week earlier.)
  • Farms & processors: When identity, storage, or telemetry endpoints fail, AI forecasting and temperature/IoT data gaps appear—forcing conservative decisions (hold or waste) until data integrity returns. (This was a hallmark in prior outages.)

Major Global Cloud and Service Outages (2021–2025)

2021

  • AWS Outage – Dec 7: Region: us-east-1. Duration: ~7 hours. Cause: Network congestion within internal devices. Impact: Many AWS services (EC2, DynamoDB, API Gateway) and downstream apps (Netflix, Disney+, Slack) were unavailable. [CNBC report]
  • Fastly CDN Outage – Jun 8: Cause: Software bug triggered by a customer configuration. Impact: Took down major sites worldwide (Amazon, Reddit, GitHub). [Fastly engineering summary]
  • Facebook/Meta Outage – Oct 4: Cause: BGP (Border Gateway Protocol) misconfiguration. Impact: Facebook, Instagram, and WhatsApp offline for ~6 hours. [The Guardian]

2022

  • Rogers Communications Outage – Jul 8: Scope: National telecom failure in Canada. Impact: 12 million+ users offline; ATMs and emergency services disrupted. [Reuters]
  • Azure & Google Cloud Partial Outages: Various regional incidents caused by network and configuration errors prompted renewed focus on resilience testing. [ZDNet]

2023

  • AWS Outage – Jun 13: Region: us-east-1. Duration: several hours. Cause: Networking subsystem overload. Impact: Slack, Asana, and many others affected. [CNN]
  • Optus Telecom Outage (Australia) – Nov 8: Impact: 10 million+ users offline; 400,000+ businesses affected. Cause: Software update error cascading through routing systems. [ABC News Australia]
  • Microsoft 365 / Teams / Outlook Incidents: Multiple service disruptions throughout the year due to configuration rollouts. [BBC News]

2024

  • CrowdStrike Outage – Jul 19: Cause: Faulty Falcon sensor update to Windows systems. Impact: Global downtime of Windows endpoints, major airline and healthcare disruptions. [CISA Alert] | [BBC News]
  • Google Cloud Outage – Oct 10: Cause: Storage service failure in multiple regions. Impact: BigQuery and Gmail latency; intermittent errors for developers and retail users. [The Register]

2025

  • AWS Outage – Oct 20: Cause: DNS automation bug tied to DynamoDB in us-east-1, creating invalid records that cascaded into global failures. [The Guardian] | [Wired]
  • Microsoft Azure Outage – Oct 29: Cause: Inadvertent configuration change affecting Azure Front Door and global edge delivery, impacting Microsoft 365, Xbox, and other services. Impact: Widespread business interruptions across multiple regions. [AP News] | [ZDNet]
  • IBM Cloud Partial Outage – Sep 2025: Cause: Network routing failure in Dallas data center. Impact: Limited latency and timeout issues for financial and manufacturing clients. [CRN]

What’s the common thread?

Small triggers, huge impact. The past five years show recurring patterns: configuration changes, DNS and control-plane fragility, and tightly coupled dependencies (auth, routing, messaging) that spread failure far beyond the origin. Even when core services come back quickly, the long tail of cache rebuilds, retries, and backlogs keeps operators impaired for hours.

Recommendations for operators (food service, grocery, ag, logistics)

  1. Map your blast radius: Identify every external dependency (DNS, auth, payment, label printing, edge/CDN). Note which ones share a single cloud backbone (a toy example follows this list).
  2. Design for graceful degradation: Offline order capture with delayed auth; store-and-forward for POS, scale, and handhelds; deferred sync for inventory and temperature telemetry (see the store-and-forward sketch after this list).
  3. Separate control planes: Keep ops comms and runbooks reachable when cloud identity is down (secondary comms, out-of-band docs, alternate DNS).
  4. Active-active across regions/providers: Critical customer-touching functions (payments, menus, loyalty, pickup/dispatch) should survive a regional cloud failure (see the failover sketch after this list).
  5. Quarterly outage drills: Time your RTO/RPO. Practice paper-mode for front of house; rehearse dispatch without the TMS; test label printing and receiving without live services.
  6. Vendor transparency & SLAs: Ask SaaS/PaaS partners which cloud/edge they ride, what their failover looks like, and how credits/penalties work. (Recent Azure/AWS events justify stricter terms.)
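To make item 1 concrete, here is a toy blast-radius map in Python. The dependency names and provider labels are purely illustrative placeholders, not a real inventory; the point is simply to record which customer-facing functions ride the same backbone and flag the concentration.

```python
# Toy blast-radius map: list each external dependency and which cloud/edge it
# rides, then flag functions that would fail together. Names are illustrative.
from collections import defaultdict

DEPENDENCIES = {
    "payment_auth":    {"provider": "cloud_a", "edge": "edge_x"},
    "loyalty_app":     {"provider": "cloud_a", "edge": "edge_x"},
    "label_printing":  {"provider": "cloud_b", "edge": "cdn_y"},
    "route_optimizer": {"provider": "cloud_a", "edge": "cdn_y"},
}

by_provider = defaultdict(list)
for function, dep in DEPENDENCIES.items():
    by_provider[dep["provider"]].append(function)

for provider, functions in by_provider.items():
    if len(functions) > 1:
        print(f"A {provider} outage would take down: {', '.join(functions)}")
```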
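For item 2, a minimal store-and-forward sketch, assuming a register that can queue tickets locally. `authorize_payment` here is a hypothetical stand-in for whatever payment API you actually call; the pattern is what matters: capture the order locally when the processor is unreachable, and replay the queue when it comes back.

```python
# Minimal store-and-forward sketch: capture orders locally when the payment
# processor is unreachable, then replay them once connectivity returns.
import json
import sqlite3
import time


def authorize_payment(order: dict) -> None:
    """Hypothetical stand-in for a real payment-authorization call.
    Assumed to raise ConnectionError when the cloud endpoint is unreachable."""
    raise ConnectionError("cloud endpoint unreachable")


class OfflineOrderQueue:
    def __init__(self, path: str = "pending_orders.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, payload TEXT, created REAL)"
        )

    def capture(self, order: dict) -> None:
        # Try to authorize now; if the network is down, persist the order for later.
        try:
            authorize_payment(order)
        except ConnectionError:
            self.db.execute(
                "INSERT INTO pending (payload, created) VALUES (?, ?)",
                (json.dumps(order), time.time()),
            )
            self.db.commit()
            print(f"Stored ticket {order['ticket']} for deferred authorization")

    def replay(self) -> None:
        # Drain the local queue once the processor is reachable again.
        rows = self.db.execute("SELECT id, payload FROM pending ORDER BY id").fetchall()
        for row_id, payload in rows:
            try:
                authorize_payment(json.loads(payload))
            except ConnectionError:
                break  # still offline; keep the rest queued
            self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
        self.db.commit()


# Usage: capture() at the register during an outage, replay() from a background job.
queue = OfflineOrderQueue()
queue.capture({"ticket": 1042, "total": 23.75})
```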
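And for item 4, a bare-bones health-check failover probe. The endpoint URLs are hypothetical; in practice they would point at the same ordering or dispatch service deployed on two different clouds or regions.

```python
# Bare-bones failover probe: prefer the primary provider, fall back to the
# secondary when a quick health check fails. URLs are hypothetical placeholders.
from typing import Optional
from urllib.request import urlopen
from urllib.error import URLError

ENDPOINTS = [
    "https://orders.primary-cloud.example.com/health",    # deployment on cloud A
    "https://orders.secondary-cloud.example.com/health",  # deployment on cloud B
]


def pick_healthy_endpoint(timeout: float = 2.0) -> Optional[str]:
    """Return the first endpoint whose health check answers HTTP 200, else None."""
    for url in ENDPOINTS:
        try:
            with urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (URLError, OSError):
            continue  # provider unreachable or timing out; try the next one
    return None  # both down: fall back to offline / paper mode


# Usage: call before dispatching a pickup or loyalty request.
active = pick_healthy_endpoint()
print("Routing to:", active or "offline mode")
```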

Bottom line

It wasn’t just AWS. It wasn’t just Azure. And it won’t be the last. Treat cloud incidents like weather: forecastable, inevitable, survivable—if you build for it. The winners aren’t those who avoid every outage; they’re the teams who keep serving meals, moving loads, and cooling product while the internet catches its breath.

The biggest question for now: who is next?

© 2025 Creative Cooking with AI - All rights reserved.
