An efficient scheme for automatic web pages categorization using the support vector machine, New Review of Hypermedia and Multimedia



New Review of Hypermedia and Multimedia (IF 1.4), Pub Date: 2016-05-04, DOI: 10.1080/13614568.2016.1152316
Vinod Kumar Bhalla, Neeraj Kumar

ABSTRACT Over the past few years, as the Internet and related technologies have evolved, the number of Internet users has grown exponentially. These users demand access to relevant web pages within a fraction of a second. Meeting this demand requires efficient categorization of web page content. Manually categorizing billions of web pages with high accuracy is a challenging task, and most existing techniques reported in the literature are semi-automatic and cannot achieve a high level of accuracy. To address these limitations, this paper proposes a scheme for automatically categorizing web pages into domain categories. The proposed scheme is based on identifying specific and relevant features of web pages: features are first extracted and evaluated, and the feature set is then filtered for the categorization of domain web pages. A feature extraction tool based on the HTML document object model (DOM) of the web page is developed as part of the scheme. Feature extraction and weight assignment rely on a domain-specific keyword list compiled from various domain pages. The keyword list is reduced on the basis of the ids of the keywords in the list, and keywords and tag text are stemmed to achieve higher accuracy. An extensive feature set is generated to develop a robust classification technique. The proposed scheme was evaluated using a machine learning method in combination with feature extraction and statistical analysis, with a support vector machine kernel as the classification tool. The results confirm the effectiveness of the proposed scheme in terms of its accuracy across different categories of web pages.
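The abstract describes the pipeline only in outline, so the following is a minimal sketch of its feature extraction step under stated assumptions, not the authors' tool. It walks the HTML structure of a page, stems the text found inside each tag, and accumulates a tag-dependent weight for every stem that appears in a domain keyword list. The tag weights, the keyword list, the `crude_stem` helper, and all function names are hypothetical; the paper's actual values and stemmer are not given here.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the paper's values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}          # text here is not page content
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}  # never closed


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical domain keyword list, stemmed the same way as the page text.
KEYWORDS = [crude_stem(k) for k in ("course", "student", "lecture", "exam")]


class TagTextExtractor(HTMLParser):
    """Collects (stem, weight) pairs, weighting text by its innermost tag."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, outermost first
        self.weighted_terms = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        innermost = self.stack[-1] if self.stack else None
        weight = TAG_WEIGHTS.get(innermost, DEFAULT_WEIGHT)
        for token in re.findall(r"[a-z]+", data.lower()):
            self.weighted_terms.append((crude_stem(token), weight))


def extract_features(html):
    """Return one accumulated weight per keyword, in KEYWORDS order."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    index = {kw: i for i, kw in enumerate(KEYWORDS)}
    vector = [0.0] * len(KEYWORDS)
    for stem, weight in parser.weighted_terms:
        if stem in index:
            vector[index[stem]] += weight
    return vector


page = (
    "<html><head><title>Courses</title></head><body>"
    "<h1>Students</h1><p>Exams and courses for students.</p>"
    "<script>var exam = 1;</script></body></html>"
)
features = extract_features(page)
```

Each page becomes a fixed-length numeric vector, which is the form a support vector machine expects as input. In practice such vectors could be passed, together with category labels, to an SVM implementation such as scikit-learn's `sklearn.svm.SVC`; that step is omitted here to keep the sketch free of third-party dependencies.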


Updated: 2016-05-04
