A Fault Detection Method for Cloud Platform Based on Machine Learning
WANG Yanyan, ZHANG Wenzheng, SHEN Jiahui, WANG Ting, LI Xiaozhen
Abstract:
With the development of cloud computing, a growing number of enterprises deploy their systems in cloud environments, which greatly improves the flexibility, elasticity, scalability, and efficiency of enterprise application services. The container cloud platform of the Zhejiang power grid is a typical application of cloud computing in power systems. However, the elastic architecture of cloud computing also makes the operation and maintenance of enterprise applications more complex and harder to monitor. Most current operation and maintenance tools lack clear visibility into application access in the cloud, which makes troubleshooting in cloud environments difficult. To address this problem, this paper proposes a fault detection method based on machine learning. First, a hierarchical clustering method dynamically generates the network topology of the nodes, and the performance metrics of every node in the Zhejiang power grid container cloud platform are monitored in real time and used as feature vectors. Then, a support vector machine (SVM) and a random search method are used to classify faults, which enables real-time troubleshooting. The method effectively improves the performance and reliability of the cloud platform and demonstrates the application prospects of machine learning methods in power systems.
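The two stages named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The abstract does not specify the distance measure, the SVM kernel, or what the random search tunes, so several things here are assumptions: node grouping uses average-link clustering on Euclidean distance between hypothetical node coordinates, the classifier is a linear soft-margin SVM fitted by batch subgradient descent, faults are reduced to two classes (-1 healthy, +1 faulty), and the random search samples only the regularisation strength `lam`. All function names and data values are hypothetical.

```python
import math
import random
from itertools import combinations


def average_link_clusters(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two groups whose
    average pairwise Euclidean distance is smallest."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] += clusters[y]  # x < y, so deleting y keeps x valid
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


def train_linear_svm(X, y, lam, steps=2000, lr=0.05):
    """Linear soft-margin SVM: batch subgradient descent on the
    L2-regularised hinge loss. Labels must be -1 or +1."""
    rows = [list(x) + [1.0] for x in X]  # constant feature acts as the bias
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]  # gradient of the regulariser
        for row, label in zip(rows, y):
            if label * sum(wj * xj for wj, xj in zip(w, row)) < 1.0:
                for j in range(d):  # sample violates the margin
                    grad[j] -= label * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def accuracy(w, X, y):
    return sum(predict(w, x) == label for x, label in zip(X, y)) / len(y)


def random_search_svm(X_train, y_train, X_val, y_val, n_trials=10, seed=0):
    """Random search: draw lam log-uniformly from [1e-3, 1] and keep the
    value with the best validation accuracy."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        lam = 10 ** rng.uniform(-3, 0)
        w = train_linear_svm(X_train, y_train, lam)
        score = accuracy(w, X_val, y_val)
        if best is None or score > best[0]:
            best = (score, lam, w)
    return best


# Hypothetical data, not taken from the paper.
node_positions = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
groups = average_link_clusters(node_positions, n_clusters=3)

# Each row is an assumed (CPU utilisation, memory utilisation) pair.
X_train = [(0.20, 0.30), (0.30, 0.20), (0.25, 0.35), (0.10, 0.20),
           (0.90, 0.80), (0.95, 0.90), (0.85, 0.95), (0.80, 0.85)]
y_train = [-1, -1, -1, -1, 1, 1, 1, 1]
X_val = [(0.15, 0.25), (0.30, 0.30), (0.90, 0.90), (0.85, 0.80)]
y_val = [-1, -1, 1, 1]
best_score, best_lam, best_w = random_search_svm(X_train, y_train, X_val, y_val)
```

Here `groups` holds the node indices assigned to each cluster, and `best_lam` is the sampled regularisation strength with the highest validation accuracy. A production system would more likely use a library SVM with a kernel and cross-validated search; the hand-written version only keeps the sketch self-contained.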
Keywords:
machine learning; cloud computing; support vector machine; average link clustering; network topology identification; fault detection
Foundation: Implementation Project of the Integrated Monitoring Platform for Information and Communication Services (B311XT200048)
DOI: 10.19585/j.zjdl.202112017