


Reshaping Text Data for Efficient Processing on Amazon EC2
Scientific Programming, Pub Date: 2011, DOI: 10.3233/spr-2011-0322
Gabriela Turcu, Ian Foster, Svetlozar Nestorov

Text analysis tools must now process increasingly large corpora, which are often organized as many small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user's perspective, aiming to provide a scheduling strategy that is both timely and cost-effective. We derive an execution plan using an empirically determined application performance model. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. Because the output is then less segmented, this also speeds up retrieval of the application's results. Using predictions of our application's performance based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
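
The abstract does not spell out the merging procedure, so the following is only a minimal sketch of a generic first-fit grouping, assuming each file is described by a name and a size in bytes. The function name `first_fit_merge`, the `target_size` parameter, and the sample file names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def first_fit_merge(file_sizes: Dict[str, int], target_size: int) -> List[List[str]]:
    """Group files so that each group's total size stays at or below target_size.

    First fit: each file goes into the first existing group with enough free
    space; if no group has room, a new group is opened. A file larger than
    target_size ends up alone in its own group.
    (Illustrative assumption, not the authors' implementation.)
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")

    groups: List[List[str]] = []   # file names per merged output file
    free: List[int] = []           # remaining capacity of each group

    for name, size in file_sizes.items():
        for i, room in enumerate(free):
            if size <= room:
                groups[i].append(name)
                free[i] -= size
                break
        else:
            # No existing group can hold this file: open a new one.
            groups.append([name])
            free.append(target_size - size)
    return groups


# Hypothetical sizes (in KB) and a hypothetical target of 100 KB per merged file.
sizes = {"a.txt": 40, "b.txt": 70, "c.txt": 30, "d.txt": 60, "e.txt": 20}
merged = first_fit_merge(sizes, target_size=100)
print(merged)
```

Each inner list names the small files that would be concatenated into one larger input file. The right value for `target_size` would come from the kind of performance measurements the abstract describes, not from this sketch.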


Updated: 2020-09-25
