
Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data

Peer-to-Peer Networking and Applications

Abstract

Web crawlers collect and index the vast amount of data available online, gathering specific types of objective data, such as news, that researchers or practitioners need. As big data are increasingly used in a variety of fields and web data grow exponentially each year, the importance of web crawlers is growing as well. Web servers that handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. Because a crawler generates a large amount of traffic to the web server, its behavior closely resembles a DDoS attack, so the web server tends to block it. A peer-to-peer (P2P) crawler can be used to solve this problem. However, a pure P2P crawler has the limitation that the entire system is difficult to maintain when network traffic increases or errors occur. To overcome this limitation, we propose a hybrid P2P crawler that collects web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea’s current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, the HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than the server/client distributed web crawler (SC-DWC), with performance comparable to the SC-DWC. On Portals B and C, however, where the target server blocks crawling, the HP2PNC-AWS outperforms the other methods in both the collection rate and the number of data items collected. The results also confirm that a hybrid P2P networking system can work efficiently in web crawler architectures.
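The abstract does not specify the internal design of the HP2PNC-AWS, so the following is only a minimal sketch of the general hybrid P2P idea it describes: a central coordinator keeps the URL queue and the peer list, spreads requests across peers so that no single peer sends heavy traffic to the target server, and reassigns work when a peer fails or is blocked. The names `Peer`, `Coordinator`, `blocked`, and `fetch` are hypothetical, the round-robin policy is an assumption, and the network request is simulated rather than performed.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical crawling peer; `blocked` simulates the target server refusing it."""
    name: str
    blocked: bool = False
    fetched: list = field(default_factory=list)

    def fetch(self, url: str) -> bool:
        # Stand-in for an HTTP request; a real peer would download the page here.
        if self.blocked:
            return False
        self.fetched.append(url)
        return True


class Coordinator:
    """Hypothetical central node of a hybrid design: owns the URL queue and peer list."""

    def __init__(self, peers, urls):
        self.peers = deque(peers)
        self.queue = deque(urls)

    def run(self):
        """Assign URLs round-robin; return the URLs that could not be collected."""
        while self.queue and self.peers:
            url = self.queue.popleft()
            peer = self.peers[0]
            self.peers.rotate(-1)  # next URL goes to the next peer
            if not peer.fetch(url):
                # The peer failed or was blocked: drop it and requeue the URL
                # so that another peer picks it up.
                self.peers.remove(peer)
                self.queue.appendleft(url)
        return list(self.queue)


if __name__ == "__main__":
    peers = [Peer("a"), Peer("b", blocked=True), Peer("c")]
    urls = [f"https://example.com/news/{i}" for i in range(1, 7)]
    leftover = Coordinator(peers, urls).run()
    for p in peers:
        print(p.name, len(p.fetched))
    print("uncollected:", leftover)
```

In this sketch the coordinator is the only component that sees the whole system, which is what distinguishes a hybrid design from a pure P2P one: a failed peer is detected and its URL is reassigned centrally rather than through peer-to-peer negotiation.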




Acknowledgments

This paper was supported by Konkuk University in 2018.

Author information

Corresponding author

Correspondence to Yong-Ki Kim.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection: Special Issue on P2P Computing for Intelligence of Things

Guest Editors: Sunmoon Jo, Jieun Lee, Jungsoo Han, and Supratip Ghose

About this article

Cite this article

Kim, YY., Kim, YK., Kim, DS. et al. Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data. Peer-to-Peer Netw. Appl. 13, 659–670 (2020). https://doi.org/10.1007/s12083-019-00841-0

  • DOI: https://doi.org/10.1007/s12083-019-00841-0