Cloud Failure Analysis
From operator and user perspectives
I investigated the availability of cloud services and the causes of their failures. I presented a first-of-its-kind analysis of cloud service failures using crowdsourced data (Talluri et al., 2021).
I followed it up with a comparative analysis of failure reports from cloud operators, web services, and online games (Talluri et al., 2025).
I helped my colleague, Xiaoyu Chu, investigate the failure characteristics of a medium-scale scientific datacenter during the rise of AI workloads (Chu et al., 2024). I also helped her investigate the failure characteristics of LLM services (Chu et al., 2025).
References
2025
- An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models. In Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering (ICPE 2025), Toronto, Canada, May 5-9, 2025