Abstract
Since the introduction of the World Wide Web in the 1990s, the information available for research purposes has increased exponentially, leading to a significant proliferation of research based on web-enabled data. The use of internet-enabled databases, obtained either from primary online surveys or from secondary official and non-official registers, is now common. However, the availability of information varies by data category and country, and collecting microdata at a low geographical level for urban analysis can be a particular challenge. The most common difficulties when working with secondary web-enabled data fall into two categories: accessibility problems and availability problems. Accessibility problems arise when the way the data are published on the servers blocks or delays the download process, which then becomes a tedious, repetitive task that can introduce errors when building large databases. Availability problems usually arise when official agencies restrict access to the information for reasons of statistical confidentiality. To overcome some of these problems, this paper presents different strategies based on URL parsing, PDF text extraction, and web scraping. A set of functions, available under a GPL-2 license, was built into an R package specifically to extract and organize databases at the municipality level (NUTS 5) in Spain for population, unemployment, vehicle fleet, and firm characteristics.
Notes
According to Eurostat, the LAUs (Local Administrative Units) are subdivisions of the NUTS 3 regions, which consist of municipalities or equivalent units (formerly NUTS 5). The NUTS classification (Nomenclature of territorial units for statistics) is a hierarchical system for dividing the economic territory of the EU. NUTS 1 are major socio-economic regions (e.g. Spain), NUTS 2 are basic regions for the application of regional policies (e.g. autonomous community of Extremadura) and NUTS 3 are small regions for specific diagnoses (e.g. province of Badajoz).
This R package is freely available from the site https://github.com/amvallone/DataSpa. It must be installed in the R console with the command: devtools::install_github("amvallone/DataSpa").
All the R functions are in the aforementioned repository: https://github.com/amvallone/DataSpa.
These functions deal with two important difficulties that arise when building panels of municipality data in Spain. First, they control for the creation and removal of municipalities, which take place almost every year, by adapting the final data frame to the configuration of the last period. Second, they produce a list of name equivalences, based on the information provided by the INE, to handle the frequent changes in municipality names, always assigning the name corresponding to the last period.
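The two adjustments described in this note can be illustrated with a minimal base-R sketch. It is not the DataSpa implementation: the function names harmonise_panel and name_equivalences, the column names, and all codes, names, and values are hypothetical and invented for illustration. The sketch assumes each year is held in a data frame keyed by a stable municipality code.

```r
# Toy yearly tables: codes, names and values are invented for illustration.
pop_2015 <- data.frame(code = c("M1", "M2", "M3"),
                       name = c("Villa A", "Villa B", "Old Name"),
                       pop  = c(800, 5500, 1200),
                       stringsAsFactors = FALSE)
pop_2016 <- data.frame(code = c("M1", "M3", "M4"),
                       name = c("Villa A", "New Name", "Villa D"),
                       pop  = c(790, 1210, 300),
                       stringsAsFactors = FALSE)
yearly <- list("2015" = pop_2015, "2016" = pop_2016)

# Hypothetical helper: stack all years on the last period's configuration.
# Municipalities removed before the last period are dropped; municipalities
# created later get NA in the years before they existed.
harmonise_panel <- function(yearly, last_period) {
  ref <- yearly[[last_period]][, c("code", "name")]
  panel <- lapply(names(yearly), function(yr) {
    d <- yearly[[yr]]
    out <- ref                                  # last-period codes and names
    out$year <- as.integer(yr)
    out$pop <- d$pop[match(ref$code, d$code)]   # NA where the code is absent
    out
  })
  panel <- do.call(rbind, panel)
  rownames(panel) <- NULL
  panel
}

# Hypothetical helper: list every earlier name that differs from the
# last-period name of the same municipality code.
name_equivalences <- function(yearly, last_period) {
  ref <- yearly[[last_period]][, c("code", "name")]
  all_names <- unique(do.call(rbind, lapply(yearly, function(d) {
    d[, c("code", "name")]
  })))
  eq <- merge(all_names, ref, by = "code", suffixes = c("_old", "_last"))
  eq[eq$name_old != eq$name_last, ]
}

panel <- harmonise_panel(yearly, "2016")
equiv <- name_equivalences(yearly, "2016")
```

In this toy example the panel keeps only the three codes present in 2016, the municipality that changed its name carries the 2016 name in both years, and the equivalence table records the single old-to-new name pair.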
Acknowledgements
This work was supported by the Spanish Ministry of Economics and Competitiveness (ECO2015-65758-P) and the Regional Government of Extremadura (Spain). The usual disclaimers apply.
Cite this article
Vallone, A., Chasco, C. & Sánchez, B. Strategies to access web-enabled urban spatial data for socioeconomic research using R functions. J Geogr Syst 22, 217–239 (2020). https://doi.org/10.1007/s10109-019-00309-y
DOI: https://doi.org/10.1007/s10109-019-00309-y