Towards Scalable Dataframe Systems
arXiv - CS - Databases Pub Date : 2020-01-03 , DOI: arxiv-2001.00888
Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya Parameswaran

Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building MODIN, a scaled-up implementation of the most widely-used and complex dataframe API today, Python's pandas. With pandas as a reference, we propose a simple data model and algebra for dataframes to ground discussion in the field. Given this foundation, we lay out an agenda of open research opportunities where the distinct features of dataframes will require extending the state of the art in many dimensions of data management. We discuss the implications of signature dataframe features including flexible schemas, ordering, row/column equivalence, and data/metadata fluidity, as well as the piecemeal, trial-and-error-based approach to interacting with dataframes.
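
To make the signature features named in the abstract concrete, here is a minimal sketch in plain Python. `ToyFrame` and its methods are hypothetical names invented for this illustration; this is neither the data model the paper proposes nor the pandas or MODIN API. It assumes a dataframe can be pictured as ordered row labels, ordered column labels, and a grid of loosely typed cells.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ToyFrame:
    """Hypothetical toy dataframe: ordered labels plus a grid of cells."""
    row_labels: List[Any]
    col_labels: List[Any]
    values: List[List[Any]]  # values[i][j] is the cell at row i, column j

    def transpose(self) -> "ToyFrame":
        # Row/column equivalence: rows and columns swap roles, labels included.
        flipped = [list(col) for col in zip(*self.values)]
        return ToyFrame(self.col_labels, self.row_labels, flipped)

    def promote_first_row(self) -> "ToyFrame":
        # Data/metadata fluidity: the first data row becomes the column labels.
        return ToyFrame(self.row_labels[1:], list(self.values[0]), self.values[1:])

    def head(self, n: int) -> "ToyFrame":
        # Ordering: "the first n rows" is well defined because row order is kept.
        return ToyFrame(self.row_labels[:n], self.col_labels, self.values[:n])


# Flexible schema: the cells start out untyped, with the header mixed into the data.
raw = ToyFrame(
    row_labels=[0, 1, 2],
    col_labels=["c0", "c1"],
    values=[["name", "score"], ["ada", 3], ["bob", 5]],
)

frame = raw.promote_first_row()   # header row moves from data to metadata
flipped = frame.transpose()       # columns become rows and vice versa
first = frame.head(1)             # relies on the preserved row order
```

In this sketch, transposing twice returns the original frame, which is one way to read "row/column equivalence". A relational table, by contrast, has no comparable operation, since its rows are unordered and its column names are fixed schema rather than data.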

Updated: 2020-06-03