Soft and Declarative Fishing of Information in Big Data Lake, IEEE Transactions on Fuzzy Systems


Soft and Declarative Fishing of Information in Big Data Lake
IEEE Transactions on Fuzzy Systems (IF 11.9), Pub Date: 2018-10-01, DOI: 10.1109/tfuzz.2018.2812157
Bozena Malysiak-Mrozek, Marek Stabla, Dariusz Mrozek

In recent years, many fields have experienced a sudden proliferation of data, which increases both the volume of data that must be processed and the variety of formats in which the data are stored. This puts pressure on existing compute infrastructures and data analysis methods, as more and more data are considered a useful source of information for making critical decisions in particular fields. Among these fields are several areas related to human life, e.g., various branches of medicine, where the uncertainty of data complicates data analysis and where the inclusion of fuzzy expert knowledge in data processing brings many advantages. In this paper, we show how fuzzy techniques can be incorporated into big data analytics carried out with the declarative U-SQL language over a big data lake located in the cloud. We define the concept of a big data lake together with the Extract, Process, and Store process performed while schematizing and processing data from the data lake and while storing the results of the processing. Our solution, developed as a Fuzzy Search Library for Data Lake, introduces the possibility of massively parallel, declarative querying of a big data lake with simple and complex fuzzy search criteria, using fuzzy linguistic terms in various data transformations, and fuzzy grouping. The presented ideas are exemplified by a distributed analysis of large volumes of biomedical data on the Microsoft Azure cloud. The results of the performed tests confirm that the presented solution is highly scalable in the cloud and is a successful step toward soft and declarative processing of data on a large scale. The solution presented in this paper directly addresses three characteristics of big data, i.e., volume, variety, and velocity, and indirectly addresses veracity and value.
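
To make the notions of a fuzzy search criterion and fuzzy grouping more concrete, here is a minimal Python sketch. It is not the Fuzzy Search Library for Data Lake and it is not U-SQL: the trapezoidal membership function, the 0.5 threshold, the linguistic terms, and the sample blood pressure records are all assumptions chosen for illustration, and the code runs on a small in-memory list rather than a data lake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Hypothetical linguistic term: support [a, d], core [b, c]."""
    a: float
    b: float
    c: float
    d: float

    def membership(self, x: float) -> float:
        """Return the degree (0.0 to 1.0) to which x matches the term."""
        if self.b <= x <= self.c:
            return 1.0
        if x <= self.a or x >= self.d:
            return 0.0
        if x < self.b:
            return (x - self.a) / (self.b - self.a)
        return (self.d - x) / (self.d - self.c)


def fuzzy_filter(rows, key, term, threshold=0.5):
    """Simple fuzzy search: keep rows whose membership reaches the threshold."""
    return [r for r in rows if term.membership(r[key]) >= threshold]


def fuzzy_and(*degrees):
    """Combine several criteria with the minimum t-norm (one common choice)."""
    return min(degrees)


def fuzzy_group(rows, key, terms):
    """Fuzzy grouping: assign each row to its best-matching linguistic term."""
    groups = {name: [] for name in terms}
    for r in rows:
        best = max(terms, key=lambda name: terms[name].membership(r[key]))
        groups[best].append(r)
    return groups


# Assumed sample data: systolic blood pressure (sbp) and age per patient.
rows = [
    {"id": 1, "sbp": 118, "age": 34},
    {"id": 2, "sbp": 127, "age": 61},
    {"id": 3, "sbp": 155, "age": 70},
]

# Assumed linguistic terms for the sbp attribute.
sbp_terms = {
    "normal": Trapezoid(80, 90, 120, 130),
    "elevated": Trapezoid(120, 130, 140, 150),
    "high": Trapezoid(140, 150, 250, 260),
}

elevated_rows = fuzzy_filter(rows, "sbp", sbp_terms["elevated"])
grouped = fuzzy_group(rows, "sbp", sbp_terms)
```

A complex criterion such as "sbp is elevated AND age is old" could be evaluated per row by passing the two membership degrees to `fuzzy_and` and comparing the result against the threshold. In a declarative setting, the same membership test would appear as a predicate in the query's filter clause instead of a Python list comprehension.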


Updated: 2018-10-01
