电力非结构化大文本特征提取研究
Research on feature extraction of unstructured large power texts
王家凯,黄佩卓,李勇乐,盛爽,刘洋,郑玲,魏振华
WANG Jiakai,HUANG Peizhuo,LI Yongle,SHENG Shuang,LIU Yang,ZHENG Ling,WEI Zhenhua
摘要(Abstract):
电力大文本中存在大量专业词汇缩写和别名等不规则表达,现有分词工具无法有效识别电气工程领域专业词汇,这对非结构化文本的分析和利用造成很大影响。首先,根据电气工程领域非结构化文本特点,提出一种电气工程领域词汇索引规则,基于该索引规则构建的索引集进行分词能够有效改善分词效果,为电力文本特征提取提供基础。其次,利用有效的长文本分割算法保留原始文本语义信息,将基于BERT模型提取的文本特征信息与Word2Vec提取的电力词汇特征信息进行联合嵌入,从而提取到准确的电力非结构化大文本特征。最后,通过实验证明了所提出的电力非结构化大文本特征提取方法的有效性。
Large power texts contain numerous irregular expressions, such as abbreviations and aliases of technical terms. Existing word segmentation tools fail to identify specialized vocabulary in the electrical engineering field, which significantly hinders the analysis and utilization of unstructured texts. To address this challenge, this paper first proposes a set of vocabulary indexing rules tailored to the characteristics of unstructured texts in electrical engineering; segmentation based on an index set built from these rules markedly improves segmentation quality and lays the foundation for feature extraction from power texts. Second, an effective long-text segmentation algorithm is employed to preserve the semantic information of the original text, and the text features extracted by the BERT model are jointly embedded with the power-vocabulary features extracted by Word2Vec, yielding accurate features for large unstructured power texts. Finally, experimental results demonstrate the effectiveness of the proposed method for extracting features from large unstructured power texts.
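The two core steps described above, lexicon-aware segmentation and joint embedding of a text-level feature with word-level features, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the domain lexicon, the toy word vectors, and the stand-in text vector are all assumptions introduced for the example.

```python
# Sketch (illustrative, not the paper's code): greedy forward longest-match
# segmentation over a domain lexicon, then a joint embedding formed by
# concatenating a text-level vector (stand-in for a BERT sentence feature)
# with the mean of word-level vectors (stand-in for Word2Vec features).

def segment(text, lexicon, max_len=6):
    """Greedy forward longest-match segmentation using a domain lexicon."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def mean_vec(vecs):
    """Element-wise mean of equal-length vectors."""
    n = len(vecs)
    return [sum(v[k] for v in vecs) / n for k in range(len(vecs[0]))]

def joint_embed(text_vec, word_vecs):
    """Concatenate a text-level feature with averaged word-level features."""
    return text_vec + mean_vec(word_vecs)

# Toy domain lexicon and 2-d word vectors (assumptions for illustration only).
lexicon = {"主变压器", "断路器", "110kV"}
word2vec = {"主变压器": [1.0, 0.0], "断路器": [0.0, 1.0]}

tokens = segment("主变压器断路器", lexicon)  # ['主变压器', '断路器']
text_vec = [0.5, 0.5, 0.5]  # stand-in for a BERT sentence-level vector
feature = joint_embed(text_vec, [word2vec[t] for t in tokens])
print(tokens, feature)
```

In this sketch the lexicon plays the role of the index set built from the proposed indexing rules: without it, a character-level fallback would split "主变压器" into single characters, which is exactly the failure mode the paper attributes to general-purpose segmenters.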
关键词(KeyWords):
电力大文本;特征提取;BERT;文本分割;联合嵌入
large power text;feature extraction;BERT;text segmentation;joint embedding
基金项目(Foundation): 国家自然科学基金(62373150); 国家电网公司大数据中心科技专项资助项目(SGSJ0000YYJS2310054)
DOI: 10.19585/j.zjdl.202406013
参考文献(References):
- [1]王慧芳,曹靖,罗麟.电力文本数据挖掘现状及挑战[J].浙江电力,2019,38(3):1-7.WANG Huifang,CAO Jing,LUO Lin.Current status and challenges of power text data mining[J].Zhejiang Electric Power,2019,38(3):1-7.
- [2]刘文松,胡竹青,张锦辉,等.基于文本特征增强的电力命名实体识别[J].电力系统自动化,2022,46(21):134-142.LIU Wensong,HU Zhuqing,ZHANG Jinhui,et al.Named entity recognition for electric power industry based on enhanced text features[J]. Automation of Electric Power Systems,2022,46(21):134-142.
- [3]王慧芳,叶睿恺,罗斌,等.电力领域数据驱动建模实践与思考[J].浙江电力,2022,41(10):3-10.WANG Huifang,YE Ruikai,LUO Bin,et al.Practice and reflection on data-driven modeling in electric power domain[J].Zhejiang Electric Power,2022,41(10):3-10.
- [4] MIKOLOV T,CHEN K,CORRADO G,et al.Efficient estimation of word representations in vector space[J].arXiv preprint arXiv:1301.3781,2013.
- [5] PENNINGTON J,SOCHER R,MANNING C.GloVe:global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing(EMNLP).Doha,Qatar.Stroudsburg,PA,USA:Association for Computational Linguistics,2014:1532-1543.
- [6] PETERS M E,NEUMANN M,IYYER M,et al.Deep contextualized word representations[J].arXiv preprint arXiv:1802.05365,2018.
- [7] RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving language understanding by generative pre-training[R].OpenAI,2018.
- [8] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.December 4-9,2017,Long Beach,California,USA.ACM,2017:6000-6010.
- [9] DEVLIN J,CHANG M W,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[J].arXiv preprint arXiv:1810.04805,2018.
- [10] DOWDELL T D,ZHANG H Y.Language modelling for source code with transformer-XL[J].arXiv preprint arXiv:2007.15813,2020.
- [11]曾骏,王子威,于扬,等.自然语言处理领域中的词嵌入方法综述[J].计算机科学与探索,2024,18(1):24-43.ZENG Jun,WANG Ziwei,YU Yang, et al.Word embedding methods in natural language processing:a review[J].Journal of Frontiers of Computer Science and Technology,2024,18(1):24-43.
- [12]覃俊,刘璐,刘晶,等.基于BERT与主题模型联合增强的长文档检索模型[J].中南民族大学学报(自然科学版),2023,42(4):469-476.QIN Jun,LIU Lu,LIU Jing,et al. Long document retrieval model based on the joint enhancement of BERT and topic model[J].Journal of South-Central Minzu University(Natural Science Edition),2023,42(4):469-476.
- [13]李景玉.基于BERT的孪生网络计算句子语义相似度[J].科技资讯,2021,19(32):1-4.LI Jingyu.Siamese network computing sentence semantic similarity based on BERT[J].Science & Technology Information,2021,19(32):1-4.
- [14]石慧.基于TF-IDF和机器学习的文本向量化与分类研究[D].武汉:华中科技大学,2022.SHI Hui.Research on text vectorization and classification based on TF-IDF and machine learning[D].Wuhan:Huazhong University of Science and Technology,2022.
- [15]杨先凤,龚睿,李自强.基于MCA-BERT的数学文本分类方法[J].计算机工程与设计,2023,44(8):2312-2319.YANG Xianfeng,GONG Rui,LI Ziqiang. Mathematical text classification method based on MCA-BERT[J].Computer Engineering and Design,2023,44(8):2312-2319.
- [16]罗欣,张爽.深度学习在电力潜在投诉识别分类中的应用[J].浙江电力,2017,36(10):83-86.LUO Xin,ZHANG Shuang.Application of deep learning in identification and classification of potential complaints of electric power[J]. Zhejiang Electric Power,2017,36(10):83-86.