On revealing shared conceptualization among open datasets
Journal of Web Semantics (IF 2.5), Pub Date: 2020-12-16, DOI: 10.1016/j.websem.2020.100624
Miloš Bogdanović, Nataša Veljković, Milena Frtunić Gligorijević, Darko Puflović, Leonid Stoimenov

Openness and transparency initiatives are not only milestones of scientific progress but have also influenced many areas of organization and industry. Under this influence, government institutions worldwide have published a large number of datasets through open data portals. Government data covers diverse subjects, and the volume of available data grows every year. Published data is expected to be both accessible and discoverable, and portals rely on the metadata accompanying datasets for these purposes. However, part of the metadata is often missing, which reduces users' ability to find the information they want. The problem worsens as the number of published datasets grows. The approach described in this paper aims to mitigate this problem by implementing knowledge structures and algorithms capable of proposing the best-matching category for an uncategorized dataset. Our aim is twofold: to enrich dataset metadata by suggesting an appropriate category, and to increase the dataset's visibility and discoverability. The approach relies on information about open datasets provided by users, namely the dataset descriptions contained in dataset tags. Because dataset tags exhibit low consistency due to their origin, we present a method for optimizing their usage by means of semantic similarity measures based on natural language processing mechanisms. Optimization is performed by reducing the number of distinct tag values used for dataset description. Once optimized, dataset tags are used to reveal the shared conceptualization originating from their usage by means of Formal Concept Analysis. We demonstrate the advantage of our proposal by comparing concept lattices generated using Formal Concept Analysis before and after the optimization process, and we use the generated structure as a knowledge base to categorize uncategorized open datasets. Finally, we present a categorization mechanism based on the generated knowledge base that uses semantic similarity measures to propose a suitable category for an uncategorized dataset.
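
To make the Formal Concept Analysis step more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a formal context in which datasets are the objects and tags are the attributes, and it enumerates formal concepts by brute force over dataset subsets. The toy `context` data and the helper names `common_tags`, `datasets_with`, and `formal_concepts` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy context (not taken from the paper): dataset -> set of tags.
context = {
    "air_quality": {"environment", "health"},
    "hospital_beds": {"health"},
    "river_levels": {"environment", "water"},
}


def common_tags(datasets, context):
    """Return the tags shared by every dataset in `datasets`.

    For an empty collection this is the full tag set, by convention.
    """
    shared = set().union(*context.values())
    for name in datasets:
        shared &= context[name]
    return shared


def datasets_with(tags, context):
    """Return the datasets that carry every tag in `tags`."""
    return {name for name, own in context.items() if tags <= own}


def formal_concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Brute force over all dataset subsets, so only suitable for tiny contexts.
    """
    concepts = set()
    names = list(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_tags(subset, context)
            extent = datasets_with(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    # Print concepts from the largest extent to the smallest.
    for extent, intent in sorted(formal_concepts(context), key=lambda c: -len(c[0])):
        print(sorted(extent), sorted(intent))
```

Each pair groups a set of datasets with exactly the tags they all share; ordering the pairs by extent inclusion yields a concept lattice. Merging near-synonymous tags before building the context would reduce the number of distinct attributes, which is the kind of effect the before-and-after lattice comparison in the abstract refers to.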




Updated: 2020-12-24