Strategies to access web-enabled urban spatial data for socioeconomic research using R functions
Journal of Geographical Systems (IF 2.417) Pub Date: 2019-08-23, DOI: 10.1007/s10109-019-00309-y
Andrés Vallone, Coro Chasco, Beatriz Sánchez

Since the introduction of the World Wide Web in the 1990s, the information available for research purposes has increased exponentially, leading to a significant proliferation of research based on web-enabled data. Nowadays, the use of internet-enabled databases, obtained either from primary online surveys or from secondary official and non-official registers, is common. However, the availability of information varies with data category and country; in particular, collecting microdata at a low geographical level for urban analysis can be a challenge. The most common difficulties when working with secondary web-enabled data can be grouped into two categories: accessibility problems and availability problems. Accessibility problems arise when the way data are published on the servers blocks or delays the download process, turning it into a tedious, repetitive task that can introduce errors when building large databases. Availability problems usually arise when official agencies restrict access to the information for reasons of statistical confidentiality. To overcome some of these problems, this paper presents different strategies based on URL parsing, PDF text extraction, and web scraping. A set of functions, available under a GPL-2 license, was built into an R package specifically to extract and organize databases at the municipality level (NUTS 5) in Spain for population, unemployment, vehicle fleet, and firm characteristics.

Updated: 2019-08-23