Cloud Failure Analysis

From operator and user perspectives

I investigated the availability of cloud services and the causes of their failures. I presented a first-of-its-kind analysis of cloud service failures using crowdsourced data (Talluri et al., 2021).

I followed this up with a comparative analysis of failure reports for cloud operators, web services, and online games (Talluri et al., 2025).

I helped my colleague, Xiaoyu Chu, investigate the failure characteristics of a medium-scale scientific datacenter during the rise of AI workloads (Chu et al., 2024). I also helped her investigate the failure characteristics of public LLM services (Chu et al., 2025).

References

2025

  1. Cloud Uptime Archive: Open-Access Availability Data of Web, Cloud, and Gaming Services
    Sacheendra Talluri, Dante Niewenhuis, Xiaoyu Chu, and 4 more authors
    2025
  2. An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models
    Xiaoyu Chu, Sacheendra Talluri, Qingxian Lu, and 1 more author
    In Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering (ICPE 2025), Toronto, Canada, May 5-9, 2025

2024

  1. Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis
    Xiaoyu Chu, Daniel Hofstätter, Shashikant Ilager, and 6 more authors
    In 30th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2024, Belgrade, Serbia, October 10-14, 2024

2021

  1. Empirical Characterization of User Reports about Cloud Failures
    Sacheendra Talluri, Leon Overweel, Laurens Versluis, and 2 more authors
    In IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS 2021, Washington, DC, USA, September 27 - October 1, 2021