A deep search method to survey data portals in the whole web: toward a machine learning classification model
Government Information Quarterly (IF 7.8) Pub Date: 2020-08-17, DOI: 10.1016/j.giq.2020.101510
Andreiwid Sheffer Correa, Alencar Melo Jr., Flavio Soares Correa da Silva

The emergence of standardized open data software platforms has provided a common set of features that sustain the lifecycle of open data practices, including storing, managing, publishing, and visualizing data, in addition to offering an out-of-the-box solution for data portals. Accordingly, the spread of data portals built on such platforms has paved the way for automation, in which (meta)data extraction supplies the demand for quantity-oriented metrics, mainly for benchmarking purposes. This raises the question of how to survey data portals globally while reducing manual effort and covering the wide variety of sources that may not implement standardized solutions. This study therefore addresses two main problems: searching for standardized open data software platforms and identifying purpose-built web-based software operated as data portals. It aims to develop a method that deeply searches each web page on the internet and formalizes a machine learning classification model to improve the identification of data portals, irrespective of whether these portals implement a standardized open data software platform or comply with open data technical guidelines. The contribution of this work is demonstrated through a list of 1,650 open data portals generalized into a training model that makes it feasible to distinguish between a data portal (which may or may not implement a standardized platform) and an ordinary web page. The results provide new insights into how machine-readable, publicly available data are affected by artificial intelligence, with special focus on how it can be used to understand data openness worldwide.
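To make the classification step concrete, the following is a minimal sketch of the kind of binary classifier the abstract describes: given the text of a crawled page, estimate whether it is a data portal or an ordinary web page. The feature representation (TF-IDF over page text), the model (logistic regression), and the toy examples are assumptions for illustration only; the paper's actual features, algorithm, and 1,650-portal training set are not reproduced here.

```python
# Illustrative sketch only: TF-IDF text features + logistic regression,
# standing in for the paper's (unspecified in the abstract) classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples standing in for crawled page text
# (1 = data portal, 0 = ordinary web page). Invented for illustration.
pages = [
    "open data portal datasets catalog download CSV JSON API metadata license",
    "city open data browse datasets by category filter by format and publisher",
    "welcome to our restaurant menu book a table opening hours contact us",
    "company news blog careers products pricing sign in",
]
labels = [1, 1, 0, 0]

# Fit the pipeline on the labeled pages.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, labels)

# Score an unseen page; in practice this would run over every page found by the crawler.
candidate = "regional government data catalog with downloadable datasets and an API"
print(model.predict_proba([candidate])[0][1])  # estimated probability the page is a data portal
```

In a deep-search setting, a model of this kind would be applied to each crawled page so that portals built on non-standardized, custom software can still be recognized alongside those running standardized platforms.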



Updated: 2020-08-17