Lossless Pruned Naive Bayes for Big Data Classifications
Big Data Research (IF 3.5), Pub Date: 2018-07-17, DOI: 10.1016/j.bdr.2018.05.007
Nanfei Sun, Bingjun Sun, Jian (Denny) Lin, Michael Yu-Chi Wu

In the fast-growing big data era, the volume and variety of data processed in Internet applications are increasing drastically. Real-world search engines commonly use text classifiers with thousands of classes to improve relevance or data quality. These large-scale classification problems lead to severe runtime performance challenges, so practitioners often resort to fast approximation techniques. However, the increase in classification speed comes at a cost: approximations are lossy, misassigning classes relative to the original reference classification algorithm. To address this problem, we introduce a Lossless Pruned Naive Bayes (LPNB) classification algorithm tailored to real-world big data applications with thousands of classes. LPNB achieves significant speed-ups by drawing on Information Retrieval (IR) techniques for efficient posting list traversal and pruning. We show empirically that LPNB can classify text up to eleven times faster than standard Naive Bayes on a real-world data set with 7205 classes, with larger gains extrapolated for larger taxonomies. In practice, this acceleration is significant because it can greatly cut the required computation time. In addition, it is lossless: the output is identical to that of standard Naive Bayes, in contrast to extant techniques such as hierarchical classification. The acceleration does not rely on the taxonomy structure, so it can be used for both hierarchical and flat taxonomies.
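The abstract does not spell out how LPNB traverses and prunes posting lists, so the following is only a minimal sketch of the general idea of lossless pruning, not the authors' algorithm. It assumes a multinomial Naive Bayes model with Laplace smoothing and uses a simple score upper bound (class prior plus the best per-term log-likelihood over all classes) to stop scoring early. All function names (`train`, `classify_full`, `classify_pruned`) are hypothetical, and the bound is deliberately loose compared with what an index-based method could achieve.

```python
import math
from collections import Counter

def train(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace smoothing (log-space)."""
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    n_docs = Counter(labels)
    prior = {c: math.log(n_docs[c] / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        loglik[c] = {t: math.log((counts[c][t] + alpha) / total) for t in vocab}
    # Best log-likelihood of each term over all classes (used for the bound).
    max_ll = {t: max(loglik[c][t] for c in classes) for t in vocab}
    # Document-independent visiting order: highest prior first.
    order = sorted(classes, key=lambda c: (-prior[c], c))
    return prior, loglik, max_ll, order

def classify_full(model, doc):
    """Reference classifier: score every class, ties go to the smaller name."""
    prior, loglik, max_ll, _ = model
    tf = Counter(t for t in doc if t in max_ll)  # ignore unseen terms
    best_s, best_c = -math.inf, None
    for c in sorted(prior):
        s = prior[c] + sum(n * loglik[c][t] for t, n in tf.items())
        if s > best_s:
            best_s, best_c = s, c
    return best_c

def classify_pruned(model, doc):
    """Same answer as classify_full, but stops once no class can still win."""
    prior, loglik, max_ll, order = model
    tf = Counter(t for t in doc if t in max_ll)
    # No class can score above prior[c] + bound for this document.
    bound = sum(n * max_ll[t] for t, n in tf.items())
    best_s, best_c, scored = -math.inf, None, 0
    for c in order:
        # Priors only decrease from here, so every remaining class is
        # strictly below the current best and can be skipped safely.
        if prior[c] + bound < best_s:
            break
        s = prior[c] + sum(n * loglik[c][t] for t, n in tf.items())
        scored += 1
        if s > best_s or (s == best_s and c < best_c):
            best_s, best_c = s, c
    return best_c, scored
```

`classify_pruned` returns the predicted class together with the number of classes it actually scored, which makes it easy to check both that the prediction matches `classify_full` and how much work was skipped. Because the early exit uses a strict inequality and the same tie-breaking rule as the reference, the two functions agree on every input in this sketch.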




Updated: 2018-07-17