Strategies to access web-enabled urban spatial data for socioeconomic research using R functions

Vallone, Andrés; Chasco, Coro; Sánchez, Beatriz

doi:10.1007/s10109-019-00309-y

Original Article
Published: 23 August 2019

Journal of Geographical Systems, volume 22, pages 217–239 (2020)

Andrés Vallone³, Coro Chasco¹,² & Beatriz Sánchez⁴


Abstract

Since the introduction of the World Wide Web in the 1990s, the information available for research purposes has increased exponentially, leading to a significant proliferation of research based on web-enabled data. Nowadays, the use of internet-enabled databases, obtained from either primary online surveys or secondary official and non-official registers, is common. However, data availability varies by data category and country; in particular, collecting microdata at a low geographical level for urban analysis can be a challenge. The most common difficulties when working with secondary web-enabled data can be grouped into two categories: accessibility and availability problems. Accessibility problems arise when the way the data are published on the servers blocks or delays the download process, which becomes a tedious, repetitive task that can introduce errors in the construction of large databases. Availability problems usually arise when official agencies restrict access to the information for reasons of statistical confidentiality. To overcome some of these problems, this paper presents different strategies based on URL parsing, PDF text extraction, and web scraping. A set of functions, available under a GPL-2 license, was built into an R package to extract and organize databases at the municipality level (NUTS 5) in Spain for population, unemployment, vehicle fleet, and firm characteristics.
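
As a rough illustration of the URL-based strategy named in the abstract, the following base R sketch builds a set of download addresses from a template instead of clicking through them one by one. It is only one possible reading of the idea, not the package's code: the function `build_urls`, the host `example.org`, and the file-naming scheme are hypothetical.

```r
# Hypothetical sketch: the host and the file-naming pattern are invented
# for illustration and do not correspond to any real statistical server.
build_urls <- function(years, provinces,
                       template = "https://example.org/data/%d/pop_%02d.csv") {
  # One row per (year, province) combination.
  grid <- expand.grid(year = years, province = provinces)
  # Fill the template; %02d pads the province code to two digits.
  grid$url <- sprintf(template, grid$year, grid$province)
  grid
}

urls <- build_urls(years = 2016:2017, provinces = c(6, 28))
print(urls$url)
```

Each generated address could then be passed to `download.file()` inside a loop, which replaces the repetitive manual download the abstract describes.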


Notes

We consider ‘web-enabled’ to be different from ‘web-based’, which is related to methods used in psychology and behavioral studies (Skitka and Sargis 2006; Denissen et al. 2010).
According to Eurostat, the LAUs (Local Administrative Units) are subdivisions of the NUTS 3 regions, which consist of municipalities or equivalent units (formerly NUTS 5). The NUTS classification (Nomenclature of territorial units for statistics) is a hierarchical system for dividing the economic territory of the EU. NUTS 1 are major socio-economic regions (e.g. Spain), NUTS 2 are basic regions for the application of regional policies (e.g. autonomous community of Extremadura) and NUTS 3 are small regions for specific diagnoses (e.g. province of Badajoz).
This R package is freely available from the site https://github.com/amvallone/DataSpa. It can be installed from the R console with the command: devtools::install_github("amvallone/DataSpa").
http://www.ine.es.
http://www.sepe.es.
All the R functions are in the aforementioned repository: https://github.com/amvallone/DataSpa.
These functions deal with two important difficulties that arise when constructing panels of municipality data in Spain. First, they control for municipality entries and removals, which take place almost every year, adapting the final data frame to the configuration corresponding to the last period. Second, they produce a list of name equivalences, based on the information provided by the INE, to manage the constant changes in municipality names, always assigning the name corresponding to the last period.
www.dgt.es.
http://www.dgt.es/es/seguridad-vial/estadisticas-e-indicadores/informacion-municipal.
https://www.pdf2txt.com.
http://www.ine.es/dynt3/inebase/es/index.htm?padre=51&dh=1.
http://www.minetad.gob.es/industria/RII/Paginas/Index.aspx.
http://www.camerdata.es/index.php.
https://www.bvdinfo.com/en-gb/our-products/data/national/sabi.
http://www.gem-spain.com.
https://www.axesor.es.
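
The panel adjustment described in the note on municipality entries, removals, and name changes can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the DataSpa implementation: the function `harmonise_panel`, the column names (`code`, `name`, `year`, `pop`), and the toy figures are all hypothetical.

```r
# Hypothetical helper, not part of DataSpa: align a municipality panel
# to the configuration (codes and names) of the last available period.
harmonise_panel <- function(panel) {
  last_year <- max(panel$year)

  # Reference configuration: municipalities that exist in the last period.
  reference <- panel[panel$year == last_year, c("code", "name")]

  # Drop municipalities that no longer exist in the last period.
  kept <- panel[panel$code %in% reference$code, ]

  # Name-equivalence table: each (code, name) pair seen in the panel,
  # together with the name used in the last period.
  equivalences <- unique(kept[, c("code", "name")])
  equivalences$last_name <- reference$name[match(equivalences$code, reference$code)]

  # Always assign the name corresponding to the last period.
  kept$name <- reference$name[match(kept$code, reference$code)]

  rownames(kept) <- NULL
  rownames(equivalences) <- NULL
  list(data = kept, equivalences = equivalences)
}

# Toy panel with invented values: municipality 00001 is renamed in 2018,
# and municipality 00002 disappears after 2017.
toy <- data.frame(
  code = c("00001", "00002", "00001"),
  name = c("Old Name", "Merged Town", "New Name"),
  year = c(2017, 2017, 2018),
  pop  = c(100, 50, 150),
  stringsAsFactors = FALSE
)

result <- harmonise_panel(toy)
print(result$data)
print(result$equivalences)
```

In this sketch the removed municipality is simply dropped and the renamed one carries its latest name in every year; municipalities that enter late would keep only the years in which they exist, which a fuller treatment would need to handle explicitly.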

References

Arauzo Carod JM (2005) Determinants of industrial location: an application for Catalan municipalities. Pap Reg Sci 84:105–120. https://doi.org/10.1111/j.1435-5957.2005.00006.x
Arauzo-Carod J-M, Viladecans-Marsal E (2009) Industrial location at the intra-metropolitan level: the role of agglomeration economies. Reg Stud 43:545–558. https://doi.org/10.1080/00343400701874172
Atkinson AB, Brandolini A (2001) Promise and pitfalls in the use of “secondary” data-sets: income inequality in OECD countries as a case study. J Econ Lit 39:771–799. https://doi.org/10.1257/jel.39.3.771
Aumueller D (2009) Retrieving metadata for your local scholarly papers. BTW
Beel J, Langer S, Genzmehr M, Müller C (2013) Docear’s PDF inspector: title extraction from PDF files. In: Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries. ACM, pp 443–444
Bento AM, Cropper ML, Mobarak AM, Vinha K (2005) The effects of urban spatial structure on travel demand in the United States. Rev Econ Stat 87:466–478. https://doi.org/10.1162/0034653054638292
Beretta M, Bjork J, Magnusson M (2018) Moderating ideation in web-enabled ideation systems. J Prod Innov Manag 35:389–409. https://doi.org/10.1111/jpim.12413
Berners-Lee RFT, Masinter L (2015) Uniform Resource Identifier (URI): generic syntax, request for comments: 3986, January 2005
Bhargavan K, Delignat-Lavaud A, Maffeis S (2013) Language-based defenses against untrusted browser origins. In: USENIX security symposium, pp 653–670
Braaksma B, Zeelenberg K (2015) “Re-make/Re-model”: should big data change the modelling paradigm in official statistics? Stat J IAOS 31:193–202. https://doi.org/10.3233/sji-150892
Castillo-Fernández O (2015) Web scraping: applications and tools. European Public Sector Information Platform
Chaabane S, Jaziri W (2018) A novel algorithm for fully automated mapping of geospatial ontologies. J Geogr Syst 20:85–105. https://doi.org/10.1007/s10109-017-0263-0
Chang C-H, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18:1411–1428. https://doi.org/10.1109/TKDE.2006.152
Chen Z, Wenyin L, Zhang F et al (2001) Web mining for web image retrieval. J Am Soc Inform Sci Technol 52:831–839. https://doi.org/10.1002/asi.1132
Chen M, Arribas-Bel D, Singleton A (2019) Understanding the dynamics of urban areas of interest through volunteered geographic information. J Geogr Syst 21:89–109. https://doi.org/10.1007/s10109-018-0284-3
Denissen JJA, Neumann L, van Zalk M (2010) How the internet is changing the implementation of traditional research methods, people’s daily lives, and the way in which developmental scientists conduct research. Int J Behav Dev 34:564–575. https://doi.org/10.1177/0165025410383746
Deniz C (2019) A command line program to get daily tv ratings in Turkey: https://github.com/coskundeniz/ratingpy
Dowell KG, McAndrews-Hill MS, Hill DP et al (2009) Integrating text mining into the MGI biocuration workflow. Database (Oxford). https://doi.org/10.1093/database/bap019
Edelman B (2012) Using internet data for economic research. J Econ Perspect 26:189–206. https://doi.org/10.1257/jep.26.2.189
Eluru N, Bhat CR, Pendyala RM, Konduri KC (2010) A joint flexible econometric model system of household residential location and vehicle fleet composition/usage choices. Transportation 37:603–626. https://doi.org/10.1007/s11116-010-9271-3
Fernández P, Suárez JP, Trujillo A et al (2018) 3D-monitoring big geo data on a seaport infrastructure based on FIWARE. J Geogr Syst 20:139–157. https://doi.org/10.1007/s10109-018-0269-2
Futrelle RP, Shao M, Cieslik C, Grimes AE (2003) Extraction, layout analysis and classification of diagrams in PDF documents. In: Proceedings. seventh international conference on Document analysis and recognition, 2003. IEEE, pp 1007–1013
Glavas C, Mathews S, Russell-Bennett R (2018) Knowledge acquisition via internet-enabled platforms: examining incrementally and non-incrementally internationalizing SMEs. Int Mark Rev 36:74–107. https://doi.org/10.1108/IMR-02-2017-0041
González-Peña D, Lourenço A, López-Fernández H et al (2014) Web scraping technologies in an API world. Brief Bioinform 15:788–797
Gök A, Waterworth A, Shapira P (2015) Use of web mining in studying innovation. Scientometrics 102:653–671. https://doi.org/10.1007/s11192-014-1434-0
Graham M, Hogan B, Straumann RK, Medhat A (2014) Uneven geographies of user-generated information: patterns of increasing informational poverty. Ann Assoc Am Geogr 104:746–764. https://doi.org/10.1080/00045608.2014.910087
Griffioen R, de Haan J, Willenborg L (2014) Collecting clothing data from the Internet. In: Proceedings of meeting of the group of experts on consumer price indexes, pp 26–28
Hadjar K, Rigamonti M, Lalanne D, Ingold R (2004) Xed: a new tool for extracting hidden structures from electronic documents. In: Proceedings of the first international workshop on document image analysis for libraries, 2004, pp 212–224
Hansen MC, Egorov A, Potapov PV et al (2014) Monitoring conterminous United States (CONUS) land cover change with Web-Enabled Landsat Data (WELD). Remote Sens Environ 140:466–484. https://doi.org/10.1016/j.rse.2013.08.014
Herley C (2009) So long, and no thanks for the externalities: the rational rejection of security advice by users. In: Proceedings of the 2009 workshop on new security paradigms workshop. ACM, pp 133–144
Hooley T, Wellens J, Marriott J (2011) What is online research? Using the Internet for social science research. A&C Black
Howard P, Pulcini C, Levy Hara G et al (2015) An international cross-sectional survey of antimicrobial stewardship programmes in hospitals. J Antimicrob Chemother 70:1245. https://doi.org/10.1093/jac/dku497
Jofre-Monseny J, Marín-López R, Viladecans-Marsal E (2011) The mechanisms of agglomeration: evidence from the effect of inter-industry relations on the location of new firms. J Urban Econ 70:61–74. https://doi.org/10.1016/j.jue.2011.05.002
Kahn ME, Schwartz J (2008) Urban air pollution progress despite sprawl: the “greening” of the vehicle fleet. J Urban Econ 63:775–787. https://doi.org/10.1016/j.jue.2007.06.004
Katre P (2019) Web scrapping and exploratory data analysis using beautiful soup and plotly on Indian demographics. katreparitosh/Web-Scrapping-and-EDA
Kumar SN (2015) World towards advance web mining: a review. Am J Syst Softw 3:44–61
Lagacé E (2019) Python script to extract subway turnstile data files from the New York. MTA website: https://github.com/RollingHillsAnalytics/MTA-extraction
LeSage JP (2015) Software for Bayesian cross section and panel spatial model comparison. J Geogr Syst 17:297–310. https://doi.org/10.1007/s10109-015-0217-3
Liu Y, Zhang M (2012) Financial websites oriented heuristic anti-phishing research. In: 2012 IEEE 2nd international conference on cloud computing and intelligence systems, pp 614–618
Mage D, Ozolins G, Peterson P et al (1996) Urban air pollution in megacities of the world. Atmos Environ 30:681–686. https://doi.org/10.1016/1352-2310(95)00219-7
Marinai S (2009) Metadata extraction from PDF papers for digital library Ingest. In: 2009 10th International conference on document analysis and recognition, pp 251–255
Mehlführer A (2009) Web scraping: a tool evaluation. Master's Thesis, Wien University
Munzert S, Rubba C, Meisner P, Nyhuis D (2015) Automated data collection with R: a practical guide to web scraping and text mining. Wiley, Chichester, West Sussex, UK
National Research Council (2005) Expanding access to research data: reconciling risks and opportunities. Division of Behavioral and Social Sciences and Education, The National Academies Press, Washington, DC
Navarro D (2019) This web scraper builds a dataset for São Paulo subway operation status. https://github.com/douglasnavarro/sp-subway-scraper
Nolan D, Temple Lang D (2014) XML and web technologies for data sciences with R. Springer, New York
Nygaard R (2015) The use of online prices in the Norwegian Consumer Price Index. In: Meeting of the Ottawa Group, Tokyo, Japan
Papapesios N, Ellul C, Shakir A, Hart G (2019) Exploring the use of crowdsourced geographic information in defence: challenges and opportunities. J Geogr Syst 21:133–160. https://doi.org/10.1007/s10109-018-0282-5
Paskaleva K, Cooper I (2018) Open innovation and the evaluation of internet-enabled public services in smart cities. Technovation 78:4–14. https://doi.org/10.1016/j.technovation.2018.07.003
Penman RB, Baldwin T, Martinez D (2009) Web scraping made simple with sitescraper. Citeseer
Polidoro F, Giannini R, Conte RL et al (2015) Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation. Stat J IAOS 31:165–176
Rey SJ, Anselin L (2006) Recent advances in software for spatial analysis in the social sciences. Geogr Anal 38:1–4. https://doi.org/10.1111/j.0016-7363.2005.00670.x
Roy DP, Ju J, Kline K et al (2010) Web-Enabled Landsat Data (WELD): Landsat ETM+ composited mosaics of the conterminous United States. Remote Sens Environ 114:35–49. https://doi.org/10.1016/j.rse.2009.08.011
Salamone S, Scannapieco SM, Scarnò M (2014) Web scraping and web mining: new tools for official statistics. In: Proceedings of the Societa Italiana di Statistica (SIS 2014), Cagliari, Sardegna
Santiago G (2019) Web Scraping para coletar os dados da Folha de Pessoal dos Municípios (BA) no site do TCM-Ba: georgevbsantiago/tcmbapessoal
Sellers J (2019) Document-level sentiment analysis of book reviews scraped from the Goodreads website. Technologies used include TensorFlow, Spark, HDFS, Sqoop, Scrapy, and D3.js.: JohnSell620/sentiment-analysis-g
Siewert W, Udani A (2016) Missouri municipal ethics survey: Do ethics measures work at the municipal level? Public Integr 18:269–289. https://doi.org/10.1080/10999922.2016.1139523
Skitka LJ, Sargis EG (2006) The internet as psychological laboratory. Annu Rev Psychol 57:529–555. https://doi.org/10.1146/annurev.psych.57.102904.190048
Thaiprayoon S, Haruechaiyasak AKC (2016) PDF extraction based on lexical analysis for Thai texts. Int J Appl Comput Technol Inf Syst 5:7–9
Vallone A, Chasco C, Sanchez B (2017) DataSpa: functions to collect Spanish data at municipality level. https://github.com/amvallone/DataSpa
Walker K, Eberwein K, Herman M (2019) tidycensus: load US census boundary and attribute data as “tidyverse” and ‘sf’-ready data frames. https://walkerke.github.io/tidycensus/. Accessed 5 Sept 2018
Wang H, Fu L, Lin X et al (2009) A bottom-up methodology to estimate vehicle emissions for the Beijing urban area. Sci Total Environ 407:1947–1953. https://doi.org/10.1016/j.scitotenv.2008.11.008
Westling EL, Lerner DN, Sharp L (2009) Using secondary data to analyse socio-economic impacts of water management actions. J Environ Manag 91:411–422. https://doi.org/10.1016/j.jenvman.2009.09.011
Wickham H (2016) Package ‘rvest’. https://cran.r-project.org/web/packages/rvest/rvest.pdf. Accessed 5 Sept 2018
Wickham H (2017) Package ‘stringr.’ https://cran.r-project.org/web/packages/stringr/stringr.pdf. Accessed 5 Sept 2018
William Xu X, Liu T (2003) A web-enabled PDM system in a collaborative design environment. Robot Comput Integr Manuf 19:315–328. https://doi.org/10.1016/S0736-5845(02)00082-0
Wolf LJ (2019) cenpy: explore and download data from census APIs. https://github.com/ljwolf/cenpy. Accessed 5 Sept 2018
Wright KB (2005) Researching internet-based populations: advantages and disadvantages of online survey research, online questionnaire authoring software packages, and web survey services. J Comput Mediat Commun. https://doi.org/10.1111/j.1083-6101.2005.tb00259.x
Xavier R (2019) Web scraping to obtain laws and decrees approved by the Uruguayan government: rxavier/volnormativo
Zagayevskiy Y, Deutsch CV (2016) Multivariate grid-free geostatistical simulation with point or block scale secondary data. Stoch Environ Res Risk Assess 30:1613–1633. https://doi.org/10.1007/s00477-015-1154-x
Zuhair H, Selamat A, Salleh M (2016) New hybrid features for phish website prediction. Int J Adv Soft Comput Its Appl 8(1):28–43


Acknowledgements

This work was supported by the Spanish Ministry of Economics and Competitiveness (ECO2015-65758-P) and the Regional Government of Extremadura (Spain). The usual disclaimers apply.

Author information

Authors and Affiliations

¹ Department of Applied Economics, Universidad Autónoma de Madrid, C/ Francisco Tomás y Valiente 5, 28049, Madrid, Spain
Coro Chasco
² Nebrija University, C/ Sta. Cruz de Marcenado, 27, 28015, Madrid, Spain
Coro Chasco
³ Escuela de Ciencias Empresariales, Universidad Católica del Norte, Larrondo 1281, Coquimbo, Chile
Andrés Vallone
⁴ Department of Economics and Business, Catholic University of Ávila, Calle de los Canteros, s/n, Ávila, Spain
Beatriz Sánchez


Corresponding author

Correspondence to Coro Chasco.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Vallone, A., Chasco, C. & Sánchez, B. Strategies to access web-enabled urban spatial data for socioeconomic research using R functions. J Geogr Syst 22, 217–239 (2020). https://doi.org/10.1007/s10109-019-00309-y


Received: 08 September 2018
Accepted: 13 August 2019
Published: 23 August 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s10109-019-00309-y

