Web page classification based on heterogeneous features and a combination of multiple classifiers
Frontiers of Information Technology & Electronic Engineering (IF 2.7), Pub Date: 2020-07-29, DOI: 10.1631/fitee.1900240
Li Deng, Xin Du, Ji-zhong Shen

Precise web page classification can be achieved by evaluating the features of web pages, and the structural features of a page effectively complement its textual features. Classifiers differ in their characteristics, so multiple classifiers can be combined to complement one another. In this study, a web page classification method based on heterogeneous features and a combination of multiple classifiers is proposed. Rather than computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, calculated as the classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are combined based on confidence using different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification results. Experimental results show that the accuracies on the Amazon dataset, 7-web-genres dataset, and DMOZ dataset are increased to 94.2%, 95.4%, and 95.7%, respectively. The fusion of the textual features with the proposed structural features is a comprehensive approach, and its accuracy is higher than when only textual features are used. In addition, combining multiple classifiers further improves the accuracy of web page classification, which is higher than that of related web page classification algorithms.
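
The abstract names three decision strategies (direct output, voting, and confidence comparison) but does not say how they are ordered or thresholded. The sketch below is one plausible reading, not the authors' algorithm. It assumes that confidence is the accuracy of a classifier among the held-out samples it assigned to a given class, and that a single sufficiently confident classifier may decide alone. The names `class_confidence`, `combine`, and `direct_threshold` are hypothetical.

```python
from collections import Counter


def class_confidence(predicted, actual):
    """Assumed definition: for each predicted class, the fraction of
    held-out samples assigned to that class that truly belong to it."""
    hits, totals = Counter(), Counter()
    for p, a in zip(predicted, actual):
        totals[p] += 1
        hits[p] += (p == a)
    return {label: hits[label] / totals[label] for label in totals}


def combine(votes, confidences, direct_threshold=0.95):
    """Combine one predicted label per classifier.

    votes: list of labels, one per classifier.
    confidences: list of dicts (label -> confidence), one per classifier.
    direct_threshold: hypothetical cut-off for the direct-output strategy.
    """
    scored = [(conf.get(label, 0.0), label)
              for label, conf in zip(votes, confidences)]
    best_conf, best_label = max(scored, key=lambda pair: pair[0])

    # Direct output: one very confident classifier decides on its own.
    if best_conf >= direct_threshold:
        return best_label

    # Voting: a strict majority of classifiers decides.
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) / 2:
        return label

    # Confidence comparison: fall back to the most confident classifier.
    return best_label
```

With three classifiers, `combine(["x", "x", "y"], [{"x": 0.6}, {"x": 0.7}, {"y": 0.8}])` falls through to voting, because no confidence reaches the assumed threshold and two of the three classifiers agree.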




Updated: 2020-07-29