
Asana: February 5 & 6
- Duration: Two consecutive outages, with the second lasting approximately 20 minutes
- Symptoms: Service unavailability and degraded performance
- Cause: A configuration change on February 5 overloaded server logs, causing servers to restart. A second outage with similar characteristics occurred the following day.
- Takeaways: “This pair of outages highlights the complexity of modern systems and how it’s difficult to test for every possible interaction scenario,” ThousandEyes reported. Following the incidents, Asana transitioned to staged configuration rollouts.
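Asana hasn’t published its rollout tooling, but the general shape of a staged configuration rollout is straightforward: push the change to a small slice of the fleet, watch health signals, then widen. A minimal Python sketch, with hypothetical apply_config, check_health, and rollback hooks:

```python
import time

# Hypothetical stage sizes: the share of the fleet that gets the new config at each step.
STAGES = [1, 10, 50, 100]

def staged_rollout(apply_config, check_health, rollback, soak_seconds=300):
    """Push a configuration change in widening stages, verifying health between steps."""
    for percent in STAGES:
        apply_config(percent)           # update config on `percent`% of servers
        time.sleep(soak_seconds)        # let the change soak before judging it
        if not check_health():          # e.g. error rates, restarts, log volume
            rollback()                  # revert everywhere on the first bad signal
            return False
    return True
```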
Slack: February 26
- Duration: Nine hours
- Symptoms: Users could log in and browse channels, but experienced issues sending and receiving messages.
- Cause: A problematic maintenance action on Slack’s database systems directed an overload of heavy traffic at the database.
- Takeaways: “At first glance, everything looked fine at Slack—network connectivity was good, there were no latency issues, and no packet loss,” according to ThousandEyes. Only by combining multiple diagnostic observations could investigators determine the true source was the database system, later confirmed by Slack.
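The diagnostic point here is cross-layer correlation: the network-layer signals looked clean, so investigators had to weigh them against application-layer behavior. A rough Python sketch of that comparison, using Slack’s public api.test method purely as an illustrative probe (the messaging path that actually failed is not exercised here):

```python
import socket
import time
import urllib.request

HOST = "slack.com"
URL = "https://slack.com/api/api.test"   # public no-auth Web API method, used only as a probe

def measure(host, url):
    """Contrast a network-layer signal (TCP handshake time) with an
    application-layer one (status and latency of a full HTTP request)."""
    t0 = time.monotonic()
    socket.create_connection((host, 443), timeout=5).close()
    connect_ms = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = resp.status
    total_ms = (time.monotonic() - t1) * 1000
    return connect_ms, status, total_ms

# A fast handshake plus a slow or failing request points past the network, toward the backend.
print("connect %.0f ms, HTTP %s in %.0f ms" % measure(HOST, URL))
```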
X: March 10
- Duration: Several hours, with varying degrees of service downtime
- Symptoms: The platform appeared “down,” with users experiencing connection failures similar to a distributed denial-of-service (DDoS) attack.
- Cause: Network failures occurred, with significant packet loss and connection errors at the TCP signaling phase. “Connection errors typically indicate a deeper problem at the network layer,” according to ThousandEyes.
- Takeaways: ThousandEyes detected traffic being dropped before sessions could be established. But there were no visible BGP route changes, which would typically occur during DDoS mitigation. “It was a network-level failure, but not what it may have first appeared,” ThousandEyes noted.
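ThousandEyes’ description points to failures during connection setup rather than after a session is established. A simple Python sketch of that kind of handshake-level probe (host, port, and attempt counts are illustrative):

```python
import socket

def tcp_connect_failures(host, port=443, attempts=10, timeout=5):
    """Count TCP handshakes that fail outright; failures here happen before any
    HTTP exchange, which is what a network-layer drop looks like from outside."""
    failures = 0
    for _ in range(attempts):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
        except OSError:          # timeouts, resets, unreachable network
            failures += 1
    return failures

print(tcp_connect_failures("x.com"), "of 10 handshake attempts failed")
```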
Zoom: April 16
- Duration: Approximately two hours
- Symptoms: All Zoom services were unavailable globally.
- Cause: Zoom’s name server (NS) records disappeared from the top-level domain (TLD) nameservers, making the service unreachable despite healthy infrastructure.
- Takeaways: “Although the servers themselves were healthy throughout and were answering correctly when queried directly, the DNS resolvers couldn’t find them because of the missing records,” ThousandEyes reported. The incident highlights how failures in the Domain Name System (DNS) hierarchy above an organization’s own infrastructure can completely knock out services.
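The distinction ThousandEyes draws—servers that answer when asked directly versus a delegation that recursive resolvers can no longer follow—can be checked with a couple of DNS queries. A sketch using the dnspython library (2.x API); the authoritative server IP below is a placeholder, not Zoom’s real nameserver:

```python
import dns.exception
import dns.message
import dns.query
import dns.resolver   # pip install dnspython (2.x API)

DOMAIN = "zoom.us"
AUTH_SERVER_IP = "203.0.113.10"   # placeholder only -- substitute a known authoritative server's IP

def resolver_view(domain):
    """What an ordinary recursive resolver returns; this path broke during the incident."""
    try:
        return [rdata.address for rdata in dns.resolver.resolve(domain, "A")]
    except dns.exception.DNSException as exc:
        return f"resolution failed: {type(exc).__name__}"

def direct_view(domain, server_ip):
    """Ask an authoritative server directly, bypassing the TLD delegation;
    during the outage this path reportedly still answered correctly."""
    try:
        reply = dns.query.udp(dns.message.make_query(domain, "A"), server_ip, timeout=5)
        return [rdata.address for rrset in reply.answer for rdata in rrset]
    except dns.exception.DNSException as exc:
        return f"direct query failed: {type(exc).__name__}"

print("via resolver:", resolver_view(DOMAIN))
print("direct query:", direct_view(DOMAIN, AUTH_SERVER_IP))
```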
- Duration: More than two hours
- Symptoms: The application’s front-end loaded normally, but tracks and videos would not play properly.
- Cause: Backend service issues while network connectivity, DNS, and CDN “all looked healthy.”
- Takeaways: “The vital signs were all good: connectivity, DNS, and CDN all looked healthy,” according to ThousandEyes, which added that the incident illustrated how “server-side failures can quietly cripple core functionality while giving the appearance that everything is working normally.”
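That “looks healthy on the surface” failure mode is why synthetic checks often probe both the front end and the backend APIs it depends on. A minimal sketch with hypothetical URLs (not the affected service’s real endpoints):

```python
import urllib.error
import urllib.request

# Hypothetical endpoints: the public front end vs. a backend API the app depends on.
FRONTEND = "https://www.example-streaming.com/"
BACKEND_API = "https://api.example-streaming.com/v1/playback/health"

def probe(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code                    # the server answered, but with an error status
    except (urllib.error.URLError, TimeoutError):
        return None                        # no usable answer at all

print("front end:", probe(FRONTEND))       # 200 here is the "vital signs look good" picture
print("backend  :", probe(BACKEND_API))    # a 5xx or None here explains tracks failing to play
```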
Google Cloud: June 12
- Duration: More than two and a half hours
- Symptoms: Users couldn’t use Google to authenticate on third-party apps such as Spotify and Fitbit; knock-on consequences impacted Cloudflare services and downstream applications.
- Cause: An invalid automated update disrupted the company’s identity and access management (IAM) system.
- Takeaways: “What you had was a three-tier cascade: Google’s failure led to Cloudflare problems, which affected downstream applications relying on Cloudflare,” ThousandEyes explained, adding that the incident is a “reminder to trace a fault all the way back to source.”
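The “trace a fault all the way back to source” advice amounts to walking a dependency chain until the deepest unhealthy component is found. A toy sketch of that traversal; the dependency map and health data are illustrative stand-ins for what real topology and monitoring data would supply:

```python
# Illustrative dependency map for the cascade described above; names are stand-ins.
DEPENDS_ON = {
    "downstream-app": "cloudflare-service",
    "cloudflare-service": "google-iam",
    "google-iam": None,                    # root of this particular chain
}

def trace_to_source(service, healthy):
    """Walk upstream while each dependency is also unhealthy, ending at the likely source."""
    chain = [service]
    while (upstream := DEPENDS_ON.get(service)) and not healthy.get(upstream, True):
        chain.append(upstream)
        service = upstream
    return chain

healthy = {"downstream-app": False, "cloudflare-service": False, "google-iam": False}
print(" -> ".join(trace_to_source("downstream-app", healthy)))
# downstream-app -> cloudflare-service -> google-iam
```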
- Duration: More than one hour
- Symptoms: Traffic couldn’t reach numerous websites and apps that rely on Cloudflare’s 1.1.1.1 DNS resolver.
- Cause: A configuration error introduced weeks before was triggered by an unrelated change, prompting Cloudflare’s BGP route announcements to vanish from the global internet routing table.
- Takeaways: “With no valid routes, traffic couldn’t reach Cloudflare’s 1.1.1.1 DNS resolver,” ThousandEyes reported, adding that the incident highlights “how flaws in configuration updates don’t always trigger an immediate crisis, instead storing up problems for later.”
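One way to catch this class of failure from the outside is to monitor the resolver itself rather than the sites behind it. A small dnspython sketch that queries 1.1.1.1 directly (the probe name and timeout are arbitrary choices):

```python
import dns.exception
import dns.resolver   # pip install dnspython (2.x API)

def resolver_reachable(resolver_ip="1.1.1.1", probe_name="example.com"):
    """Return True if the public resolver answers a simple query within the time budget."""
    r = dns.resolver.Resolver(configure=False)   # don't fall back to the system resolver
    r.nameservers = [resolver_ip]
    r.lifetime = 3                               # total seconds allowed for the query
    try:
        r.resolve(probe_name, "A")
        return True
    except dns.exception.DNSException:
        return False

print("1.1.1.1 reachable:", resolver_reachable())
```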
- Duration: More than two hours
- Symptoms: The company’s mobile app, website, and ATMs all failed simultaneously.
- Cause: A shared backend dependency failed, affecting all customer touchpoints, ThousandEyes estimated.
- Takeaways: “The fact that three different channels with three different frontend technologies failed all at once eliminates app or UI issues,” ThousandEyes noted, explaining that this incident demonstrated “how a single failure can instantly disable every customer touchpoint—and why it’s vital to check all signals before reaching for remedies.”
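The reasoning in that takeaway—simultaneous failure of independent front ends implicates a shared backend—is easy to encode in a synthetic check. A sketch with hypothetical endpoints standing in for the company’s real channels:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical endpoints standing in for each customer channel.
CHANNELS = {
    "mobile-api": "https://api.example-bank.com/health",
    "website": "https://www.example-bank.com/",
    "atm-gateway": "https://atm-gw.example-bank.com/ping",
}

def is_up(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status < 500
    except Exception:                      # any failure counts as "down" for this rough check
        return False

with ThreadPoolExecutor() as pool:
    results = dict(zip(CHANNELS, pool.map(is_up, CHANNELS.values())))

if not any(results.values()):
    print("all channels down at once -- suspect a shared backend dependency:", results)
else:
    print(results)
```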
- Duration: Both incidents lasted several hours
- Symptoms: The first outage affected EMEA region users with slowdowns and failures; the second impacted users worldwide with HTTP 503 errors and connection timeouts.
- Cause: The October 9 incident was caused by software defects that crashed edge sites in the EMEA region; the October 29 outage was triggered by a configuration change.
- Takeaways: “Together, these two outages illustrate an important distinction: infrastructure failures tend to be regional with only certain customers affected, whereas configuration errors typically hit all regions simultaneously,” according to ThousandEyes.
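The regional-versus-global distinction ThousandEyes describes can be approximated by comparing probes from multiple vantage points. A rough Python sketch with made-up probe results:

```python
# Made-up probe results: True means the service answered from that vantage point.
PROBES = {
    "eu-west": False, "eu-north": False,       # EMEA vantage points
    "us-east": True, "us-west": True,
    "ap-southeast": True,
}

def classify(probes):
    """Rough scope classification along the lines ThousandEyes describes."""
    failing = [region for region, ok in probes.items() if not ok]
    if not failing:
        return "healthy"
    if len(failing) == len(probes):
        return "global outage -- configuration or a shared control plane is a prime suspect"
    return f"regional outage affecting {failing} -- local infrastructure is a prime suspect"

print(classify(PROBES))
```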
- Duration: More than 15 hours for some customers
- Symptoms: Long, global service disruptions affected major customers, including Slack, Atlassian, and Snapchat.
- Cause: A failure occurred in the US-EAST-1 region, but global services such as IAM and DynamoDB Global Tables depended on that regional endpoint, so the outage propagated worldwide.
- Takeaways: “The incident highlights how a failure in a single, centralized service can ripple outwards through dependency chains that aren’t always obvious from architecture diagrams,” ThousandEyes noted.
- Duration: Several hours of intermittent, global instability
- Symptoms: Intermittent service disruptions rather than a complete outage
- Cause: A bad configuration file in Cloudflare’s Bot Management system exceeded a hard-coded limit, causing proxies to fail as they loaded the oversized file on staggered five-minute cycles.
- Takeaways: “Because the proxies refreshed configurations on staggered five-minute cycles, we didn’t see a lights-on/lights-off outage, but intermittent, global instability,” ThousandEyes reported, noting that the incident revealed how distributed edge combined with staggered updates can create intermittent issues.
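The root cause—an oversized configuration file tripping a hard-coded limit inside the proxy—is a reminder that config loaders can fail closed instead of crashing. A generic Python sketch (the limit, file format, and function names are illustrative, not Cloudflare’s actual implementation):

```python
import json

MAX_FEATURES = 200        # stand-in for the kind of hard-coded limit described above

class ConfigError(Exception):
    pass

def load_features(path):
    """Load a new feature file, refusing it if it blows past the limit."""
    with open(path) as f:
        features = json.load(f)
    if len(features) > MAX_FEATURES:
        raise ConfigError(f"{len(features)} features exceeds limit of {MAX_FEATURES}")
    return features

def refresh(path, current):
    """Keep serving the last known-good configuration instead of crashing the proxy."""
    try:
        return load_features(path)
    except (ConfigError, json.JSONDecodeError, OSError):
        return current
```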
Lessons learned in 2025
ThousandEyes highlighted several takeaways for network operations teams looking to improve their resilience in 2026:
Look beyond single symptoms, as they can be misleading. The true cause of a disruption often emerges only from a combination of signals. “If the network seems healthy but users are experiencing issues, the problem might be in the backend,” according to ThousandEyes. “Simultaneous failures across channels point to shared dependencies, while intermittent failures could indicate rollout or edge problems.”
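Those heuristics read almost like a decision table. A tiny Python sketch encoding the quoted rules; the inputs would come from real monitoring, and the logic is deliberately naive:

```python
def triage(network_healthy, users_impacted, channels_down, channels_total, intermittent):
    """Turn the quoted heuristics into a first-pass diagnosis."""
    if network_healthy and users_impacted:
        return "look at the backend"
    if channels_total > 1 and channels_down == channels_total:
        return "look for a shared dependency"
    if intermittent:
        return "look at rollouts or the edge"
    return "keep gathering signals"

print(triage(network_healthy=True, users_impacted=True,
             channels_down=1, channels_total=3, intermittent=False))
```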
Focus on rapid detection and response. The complexity of modern systems means it’s unrealistic to prevent every possible issue through testing alone. “Instead, focus on building rapid detection and response capabilities, using techniques such as staged rollouts and clear communication with stakeholders,” ThousandEyes stated.