Enhancing IT Infrastructure Resilience

Navigating Cross-System Interaction (CSI) Failures

In hyperscale Data Center environments, a class of problems known as cross-system interaction (CSI) failures has drawn increasing attention. CSI failures are also referred to as Gray Failures in some of the literature ([2], from Microsoft Research and Azure). They arise in IT infrastructure that encompasses multiple subsystems such as compute, storage, and networking. While these components collaborate to provide seamless services, their interactions can produce failures stemming from disparities between them. Recognizing these interactions is vital: focusing solely on individual systems and relying on redundancy falls short of addressing CSI failures.

The repercussions are significant. An analysis of public cloud incidents (e.g., Google, Azure, and AWS) indicates that CSI failures contribute to a substantial 20% of catastrophic incidents, underscoring their seriousness [1]. A study of open-source systems likewise found CSI failures to be widespread, accounting for 37% of the failures examined [1].

CSI Failures and System Crashes

Particularly alarming is the fact that most CSI failures manifest as crashes, which highlights the inadequacy of current fault-tolerance mechanisms in addressing them. To counter CSI failures, a proactive approach is advocated: testing and validating cross-system interactions before deployment. An illustrative case is cross-testing of the Spark-Hive data plane, which uncovered 15 new discrepancies [1]. This underscores the need for thorough testing, seamless integration, and ongoing monitoring of subsystem interactions in cloud environments.
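To make the cross-testing idea concrete, here is a minimal Python sketch that runs the same query through two engines and diffs the results. It is an illustration of the concept, not the methodology of [1]; the two toy "engines" at the bottom are hypothetical stand-ins for real Spark and Hive connectors.

from typing import Callable, List, Tuple

Row = Tuple  # one result row

def cross_test(query: str,
               engine_a: Callable[[str], List[Row]],
               engine_b: Callable[[str], List[Row]]) -> List[str]:
    """Run the same query on two engines and report discrepancies."""
    rows_a, rows_b = sorted(engine_a(query)), sorted(engine_b(query))
    issues = []
    if len(rows_a) != len(rows_b):
        issues.append(f"row count differs: {len(rows_a)} vs {len(rows_b)}")
    issues += [f"row differs: {ra!r} vs {rb!r}"
               for ra, rb in zip(rows_a, rows_b) if ra != rb]
    return issues

# Toy stand-ins for real Spark/Hive connectors (hypothetical wiring):
# the two "engines" disagree on how a NULL value is represented,
# a typical class of cross-system discrepancy.
spark_like = lambda q: [(1, "a"), (2, None)]
hive_like  = lambda q: [(1, "a"), (2, "NULL")]
print(cross_test("SELECT id, tag FROM t", spark_like, hive_like))
# -> ["row differs: (2, None) vs (2, 'NULL')"]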

What About Systems Already in Operation?

Integral to this strategy is meticulous monitoring of file systems and drive components. CSI failures, known for their subtle impact on system performance, often originate in the storage layer. These issues, ranging from high latency and unreliable I/O to capacity depletion, can escape immediate detection by system failure or threshold detectors, yet they profoundly affect application performance and user experience.
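As a starting point for such monitoring, the sketch below samples per-disk I/O counters and filesystem capacity, the raw inputs from which latency and utilization trends can be derived. It assumes the third-party psutil package; the interval and mount point are illustrative defaults.

import time
import psutil  # third-party dependency: pip install psutil

def sample_disk_metrics(mount_point="/", interval_s=1.0):
    """Sample per-disk average I/O latency and capacity over one interval."""
    before = psutil.disk_io_counters(perdisk=True)
    time.sleep(interval_s)
    after = psutil.disk_io_counters(perdisk=True)
    usage = psutil.disk_usage(mount_point)
    metrics = {"capacity_pct_used": usage.percent}
    for disk, a in after.items():
        b = before.get(disk)
        if b is None:
            continue  # disk appeared mid-interval; skip this sample
        # read_time/write_time are cumulative milliseconds spent on I/O;
        # dividing the delta by the op-count delta approximates mean latency.
        reads = a.read_count - b.read_count
        if reads:
            metrics[f"{disk}.avg_read_ms"] = (a.read_time - b.read_time) / reads
        writes = a.write_count - b.write_count
        if writes:
            metrics[f"{disk}.avg_write_ms"] = (a.write_time - b.write_time) / writes
    return metrics

print(sample_disk_metrics())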

A relevant case study describes storage server capacity pressure culminating in cascading failures across nodes: the initial conditions of the failure eluded detection by the storage manager, resulting in a significant outage. As file systems and the underlying drives shift from latent faults to degraded modes, a phenomenon of differential observability emerges: the system labels them as healthy while applications suffer disruptions.

Introducing SMARTDriveAI

Effectively identifying CSI failure symptoms within the storage ecosystem necessitates continuous, comprehensive, end-to-end monitoring of file system and drive health and performance. This means metrics that go beyond basic measurements: performance and health indicators, capacity utilization, and longer-term performance trends that more accurately reflect the application's experience.
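One way to capture the application's view rather than the device's is a periodic end-to-end probe through the same filesystem path the application uses. The sketch below is a minimal illustration of that idea (the probe path and payload size are placeholders), not a description of how SMARTDriveAI is implemented.

import os
import time

def probe_fs_latency(path="/mnt/data/.probe", payload=b"x" * 4096):
    """Time a small write+fsync and a read through the real filesystem path,
    returning (write_ms, read_ms) as an application-level caller sees them."""
    t0 = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the I/O to the device, not just the page cache
    finally:
        os.close(fd)
    write_ms = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        f.read()  # note: this read will usually be served from the page cache
    read_ms = (time.perf_counter() - t0) * 1000
    return write_ms, read_ms

Tracking these probe timings over time surfaces the latency an application actually experiences, which is exactly the signal that differential observability hides from device-level health checks.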

SMARTDriveAI was devised to proactively address these challenges. SMARTDriveAI is a SaaS-based service that uses AI models and analytics to monitor file system and drive health and performance. It is designed to automatically identify patterns indicative of CSI failures, such as latency, I/O, and performance anomalies, effectively bridging the observability gap. For example, if a drive in a RAID set exhibits a fail-slow event, latency across the entire RAID set can increase. That added latency can affect applications that require deterministic response from the file system resident on the RAID set: requests queue up, application response times degrade, and the request queue can eventually overflow, leading to failure. Conventional IT measurements can miss this scenario until it is too late. SMARTDriveAI can detect the initial fail-slow event and notify IT operations, enabling a proactive response.
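A simple way to catch the fail-slow pattern described above is to compare each member drive's latency against its peers in the same RAID set: one drive drifting well above the group median is a classic fail-slow signature. The sketch below shows that peer-comparison idea in plain Python; the 3x threshold and the input format are illustrative assumptions, not SMARTDriveAI's actual model.

from statistics import median

def find_fail_slow(latencies_ms: dict, ratio: float = 3.0) -> list:
    """Flag RAID members whose recent average latency exceeds `ratio` times
    the median of the set -- a common fail-slow signature.
    latencies_ms maps drive name -> recent average latency in milliseconds."""
    if len(latencies_ms) < 3:
        return []  # need enough peers for the median to be meaningful
    med = median(latencies_ms.values())
    return [drive for drive, lat in latencies_ms.items() if lat > ratio * med]

# Example: one member of a six-drive RAID set has begun to fail slow.
raid_set = {"sda": 4.1, "sdb": 3.9, "sdc": 4.4,
            "sdd": 27.8, "sde": 4.0, "sdf": 4.2}
print(find_fail_slow(raid_set))  # -> ['sdd']

Comparing a drive against its peers, rather than against a fixed threshold, is what lets this kind of check fire while the drive still reports itself as healthy.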

Conclusion

As Data Centers expand, novel challenges arise in system interactions: components that appear healthy in isolation can cause issues when they interact with other systems. Vigilant monitoring of file system and drive health and performance is a cornerstone in mitigating the class of CSI failures rooted in storage issues. By prioritizing preemptive analysis and applying machine learning models and analytics, Data Center environments can harden themselves against these failures.


More information on CSI and Gray Failures can be found in the following papers:

[1] Lilia Tang, Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, and Tianyin Xu. 2023. Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems. In Eighteenth European Conference on Computer Systems (EuroSys '23), May 2023, Rome, Italy. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3552326.3587448

[2] Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. 2017. Gray Failure: The Achilles' Heel of Cloud-Scale Systems. In Proceedings of HotOS '17, Whistler, BC, Canada, May 2017, 6 pages. https://doi.org/10.1145/3102980.3103005