Design and implementation of ETL processes using BPMN and relational algebra
Data & Knowledge Engineering (IF 2.7), Pub Date: 2020-06-13, DOI: 10.1016/j.datak.2020.101837
Judith Awiti, Alejandro A. Vaisman, Esteban Zimányi

Extraction, transformation, and loading (ETL) processes extract data from an organization's internal and external sources, transform these data, and load them into a data warehouse. The Business Process Model and Notation (BPMN) has been proposed for expressing ETL processes at a conceptual level. This paper studies a different approach, in which relational algebra (RA), extended with update operations, is used to specify ETL processes. In this approach, the data tasks in an ETL workflow can be automatically translated into SQL queries that are executed over a DBMS. To illustrate the study, the paper addresses the problem of updating Slowly Changing Dimensions (SCDs) with dependencies, that is, the case in which updating an SCD table affects associated SCD tables. Tackling this problem requires extending classic RA with update operations. The paper also shows the implementation of a portion of the TPC-DI benchmark under both approaches. It presents three implementations: (a) an SQL implementation based on the extended-RA specification of an ETL process expressed in BPMN4ETL; and (b) two implementations of workflows that follow from BPMN4ETL, one using the Pentaho DI tool and the other using Talend Open Studio for DI. Experiments over these implementations of the TPC-DI benchmark were carried out for different scale factors and are described and discussed in the paper. They show that the extended RA approach yields more efficient processes than those produced by implementing the BPMN4ETL specification with the two ETL tools. The reasons for this result are also discussed.
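To make the notion of an SCD update with dependencies concrete, the following is a minimal sketch, not the paper's specification or its SQL. It assumes a Type 2 (row-versioning) strategy and two hypothetical tables, `customer_dim` and `account_dim`, where each account row references a customer row by surrogate key. All table names, column names, and the helper functions `make_demo_db` and `apply_customer_change` are illustrative assumptions; the SQL is run through Python's standard `sqlite3` module.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with two hypothetical SCD tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer_dim (
            sk INTEGER PRIMARY KEY,      -- surrogate key
            customer_id TEXT NOT NULL,   -- business key
            name TEXT NOT NULL,
            is_current INTEGER NOT NULL,
            effective_date TEXT NOT NULL,
            end_date TEXT
        );
        CREATE TABLE account_dim (
            sk INTEGER PRIMARY KEY,
            account_id TEXT NOT NULL,
            customer_sk INTEGER NOT NULL REFERENCES customer_dim(sk),
            is_current INTEGER NOT NULL,
            effective_date TEXT NOT NULL,
            end_date TEXT
        );
        INSERT INTO customer_dim (customer_id, name, is_current, effective_date)
            VALUES ('C1', 'Alice', 1, '2020-01-01');
        INSERT INTO account_dim (account_id, customer_sk, is_current, effective_date)
            VALUES ('A1', 1, 1, '2020-01-01'), ('A2', 1, 1, '2020-01-01');
        """
    )
    return conn


def apply_customer_change(conn, customer_id, new_name, effective_date):
    """Version a customer row, then propagate the change to dependent accounts."""
    cur = conn.cursor()
    row = cur.execute(
        "SELECT sk FROM customer_dim WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no current row for customer {customer_id!r}")
    old_sk = row[0]

    # Step 1: close the current customer version and insert the new one.
    cur.execute(
        "UPDATE customer_dim SET is_current = 0, end_date = ? WHERE sk = ?",
        (effective_date, old_sk),
    )
    cur.execute(
        "INSERT INTO customer_dim (customer_id, name, is_current, effective_date)"
        " VALUES (?, ?, 1, ?)",
        (customer_id, new_name, effective_date),
    )
    new_sk = cur.lastrowid

    # Step 2: the dependent dimension must follow. Each current account row
    # that pointed at the old customer version is closed and re-inserted
    # with the new customer surrogate key.
    accounts = cur.execute(
        "SELECT account_id FROM account_dim WHERE customer_sk = ? AND is_current = 1",
        (old_sk,),
    ).fetchall()
    cur.execute(
        "UPDATE account_dim SET is_current = 0, end_date = ?"
        " WHERE customer_sk = ? AND is_current = 1",
        (effective_date, old_sk),
    )
    cur.executemany(
        "INSERT INTO account_dim (account_id, customer_sk, is_current, effective_date)"
        " VALUES (?, ?, 1, ?)",
        [(account_id, new_sk, effective_date) for (account_id,) in accounts],
    )
    conn.commit()
    return new_sk
```

Calling `apply_customer_change(make_demo_db(), "C1", "Alice B", "2020-06-13")` leaves the history rows in place and adds one new customer version plus one new version per dependent account. Each step is a plain `UPDATE` or `INSERT`, which is the kind of statement an algebra extended with update operations would need to express.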




Updated: 2020-06-13