TADOC: Text Analytics Directly on Compression
arXiv - CS - Data Structures and Algorithms. Pub Date: 2020-09-20, DOI: arxiv-2009.09442
Feng Zhang, Jidong Zhai, Xipeng Shen, Dalin Wang, Zheng Chen, Onur Mutlu, Wenguang Chen, Xiaoyong Du

This article provides a comprehensive description of Text Analytics Directly on Compression (TADOC), which enables document analytics to run directly on compressed textual data. The article explains the concept of TADOC and the challenges to realizing it effectively. It also presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
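To make the idea of analytics on hierarchically compressed text concrete, here is a minimal sketch of a word count computed without decompressing. It is not the authors' implementation: it assumes a toy grammar-style compression in which each rule is a list of words and references to other rules, and the names `RULES`, `word_counts`, and `expand` are hypothetical. The point it illustrates is that a repeated rule is counted once and its result reused, rather than being expanded at every occurrence.

```python
from collections import Counter

# Hypothetical toy grammar (an assumption, not taken from the paper).
# A symbol is either a word (str) or a reference to another rule (int).
# Rule 0 is the root and expands to:
#   a b c a b d a b c a b d e
RULES = {
    0: [1, 1, "e"],
    1: [2, "c", 2, "d"],
    2: ["a", "b"],
}


def word_counts(rules, root=0):
    """Count words by walking the rule hierarchy, visiting each rule once."""
    memo = {}

    def count(rule_id):
        if rule_id in memo:
            return memo[rule_id]
        total = Counter()
        for sym in rules[rule_id]:
            if isinstance(sym, int):
                # Reuse the counts of the referenced rule instead of expanding it.
                total.update(count(sym))
            else:
                total[sym] += 1
        memo[rule_id] = total
        return total

    return count(root)


def expand(rules, root=0):
    """Fully decompress the text; used here only to check the result."""
    out = []
    for sym in rules[root]:
        if isinstance(sym, int):
            out.extend(expand(rules, sym))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    print(word_counts(RULES))
```

In this sketch the counting pass processes each of the three rules once, while full decompression materializes all 13 words; the gap grows with the amount of repetition in the input.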

Updated: 2020-09-22