Yemanja - A Layered Fault Localization System for Multi-Domain Computing Utilities

Cited by: 14
Authors
Appleby K. [1]
Goldszmidt G. [1]
Steinder M. [2]
Affiliations
[1] IBM T.J. Watson Research Center, Hawthorne, NY 10532
[2] Computer and Information Sciences, University of Delaware, Newark
Keywords
Event correlation; Fault and performance management; Problem determination; Service level agreements
DOI
10.1023/A:1015954732370
Abstract
Yemanja is a model-based event correlation engine for multi-layer fault diagnosis. It targets complex propagating fault scenarios, and can smoothly correlate low-level network events with high-level application performance alerts related to quality-of-service violations. Entity-models that represent devices or abstract components encapsulate their behavior. Distantly associated entity-models are not explicitly aware of each other, and communicate through internal event chains. Yemanja's state-based engine supports generic scenario definitions, prioritization of alternate solutions, integrated problem and device testing, and simultaneous analysis of overlapping problems. The system of correlation rules was developed based on the analysis of device and layer functions, and the dependencies among physical and abstract system components. The primary objectives of this research include the development of reusable, configuration-independent correlation scenarios; adaptability and extensibility of the engine to match the constantly changing topology of a multi-domain server farm; and development of a concise specification language that is relatively simple yet powerful.
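The layering idea in the abstract can be illustrated with a minimal Python sketch. It is not Yemanja's rule language or engine; the class `EntityModel`, its fields, and the `root_causes` method are hypothetical names chosen for illustration. The sketch assumes each entity-model knows only its direct dependencies, so a high-level alert is traced to a low-level event through a chain of neighbor-to-neighbor queries.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntityModel:
    """Hypothetical model of one device or abstract component."""

    name: str
    layer: str
    # Only direct dependencies are known; distant entities are never referenced.
    depends_on: list[EntityModel] = field(default_factory=list)
    symptoms: set[str] = field(default_factory=set)

    def report(self, symptom: str) -> None:
        """Record an event or alert observed at this entity."""
        self.symptoms.add(symptom)

    def root_causes(self) -> list[str]:
        """Return the deepest entities with symptoms that could explain this one."""
        deeper = [c for dep in self.depends_on for c in dep.root_causes()]
        if deeper:
            # A lower-layer fault explains this entity; drop duplicates, keep order.
            return list(dict.fromkeys(deeper))
        return [self.name] if self.symptoms else []


# Assumed three-layer example: application -> server -> switch.
switch = EntityModel("switch-1", "network")
server = EntityModel("web-server-1", "system", depends_on=[switch])
app = EntityModel("storefront", "application", depends_on=[server])

switch.report("link_down")
app.report("response_time_sla_violation")
print(app.root_causes())
```

In this sketch the application model never references the switch directly; the low-level event reaches it only through the server model, which is one simple reading of entity-models that "communicate through internal event chains".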
Pages: 171-194
Page count: 23