Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data

Kim, Yong-Young; Kim, Yong-Ki; Kim, Dae-Sik; Kim, Mi-Hye

doi:10.1007/s12083-019-00841-0

Published: 16 December 2019

Peer-to-Peer Networking and Applications, Volume 13, pages 659–670 (2020)

Yong-Young Kim¹,
Yong-Ki Kim ORCID: orcid.org/0000-0002-8646-0758²,
Dae-Sik Kim² &
…
Mi-Hye Kim²

Abstract

Web crawlers collect and index the vast amount of data available online, gathering specific types of objective data, such as news, that researchers or practitioners need. As big data are increasingly used in a variety of fields and web data grow exponentially each year, the importance of web crawlers is growing as well. Web servers that handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. A crawler that generates a large amount of traffic to a web server closely resembles a DDoS attack, so its activity tends to be blocked by the web server. A peer-to-peer (P2P) crawler can be used to solve this problem. However, a limitation of the pure P2P crawler is that the entire system is difficult to maintain when network traffic increases or errors occur. To overcome this limitation, we propose a hybrid P2P crawler that collects web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea’s current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, the HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than the server/client distributed web crawler (SC-DWC), with performance similar to that of the SC-DWC. On Portals B and C, however, where the target server blocks crawling, the HP2PNC-AWS outperforms the other methods in both the collection rate and the number of data items collected. The results also confirm that the hybrid P2P networking system can work efficiently in web crawler architectures.
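The abstract does not describe the internals of HP2PNC-AWS, so the following is only a minimal sketch of the general hybrid P2P idea it refers to: a central coordinator hands URLs to peers and falls back to another peer when a target host blocks one of them. The names `Peer`, `Coordinator`, `fetch`, and `blocked_hosts` are hypothetical, the fetch is simulated, and no real network or AWS calls are made.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical crawling peer; `blocked_hosts` simulates servers that reject it."""
    name: str
    blocked_hosts: set = field(default_factory=set)

    def fetch(self, url: str) -> bool:
        # Simulated fetch: succeeds unless the target host blocks this peer.
        host = url.split("/")[2]
        return host not in self.blocked_hosts


class Coordinator:
    """Hypothetical central node of a hybrid design: tracks pending URLs and peers."""

    def __init__(self, peers, urls):
        self.peers = list(peers)
        self.pending = deque(urls)
        self.collected = {}  # url -> name of the peer that fetched it
        self.failed = []     # urls that no peer could fetch

    def run(self):
        turn = 0
        while self.pending:
            url = self.pending.popleft()
            # Try each peer once, starting from a rotating offset (round-robin).
            for i in range(len(self.peers)):
                peer = self.peers[(turn + i) % len(self.peers)]
                if peer.fetch(url):
                    self.collected[url] = peer.name
                    break
            else:
                # Every peer was rejected by the target host.
                self.failed.append(url)
            turn += 1
        return self.collected


if __name__ == "__main__":
    peers = [Peer("p1", {"b.example"}), Peer("p2")]
    urls = ["https://a.example/1", "https://b.example/1", "https://b.example/2"]
    coordinator = Coordinator(peers, urls)
    print(coordinator.run())
```

In this toy setup, a URL on a host that blocks one peer is simply retried on the next peer, which is one plausible reading of why a distributed crawler can keep collecting where a single crawler is blocked; the actual scheduling and recovery logic of the paper's system may differ.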

Acknowledgments

This paper was supported by Konkuk University in 2018.

Author information

Authors and Affiliations

Division of International Business, Konkuk University, 268 Chungwon-daero, Chungju-si, Chungcheongbuk-do, 27478, Republic of Korea
Yong-Young Kim
Department of Computer Engineering, Chungbuk National University, 1 Chungdae-ro Seowon-gu, Cheongju-si, Chungcheongbuk-do, 28644, Republic of Korea
Yong-Ki Kim, Dae-Sik Kim & Mi-Hye Kim

Corresponding author

Correspondence to Yong-Ki Kim.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection: Special Issue on P2P Computing for Intelligence of Things

Guest Editors: Sunmoon Jo, Jieun Lee, Jungsoo Han, and Supratip Ghose

About this article

Cite this article

Kim, YY., Kim, YK., Kim, DS. et al. Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data. Peer-to-Peer Netw. Appl. 13, 659–670 (2020). https://doi.org/10.1007/s12083-019-00841-0

Received: 03 December 2018
Accepted: 25 October 2019
Published: 16 December 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s12083-019-00841-0
