Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data, Peer-to-Peer Networking and Applications



Peer-to-Peer Networking and Applications (IF 3.3), Pub Date: 2019-12-16, DOI: 10.1007/s12083-019-00841-0
Yong-Young Kim, Yong-Ki Kim, Dae-Sik Kim, Mi-Hye Kim

Web crawlers collect and index the vast amount of data available online, gathering specific types of objective data, such as news, that researchers and practitioners need. As big data is used in an increasing variety of fields and the volume of web data grows exponentially each year, web crawlers are becoming more important. Web servers that handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. A crawler that sends a large amount of traffic to a web server closely resembles a DDoS attack, so its activity tends to be blocked by the server. A peer-to-peer (P2P) crawler can be used to solve this problem. However, a pure P2P crawler has a limitation: the system as a whole is difficult to maintain when network traffic increases or errors occur. To overcome this limitation, we propose a hybrid P2P crawler that collects web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea’s current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than the server/client distributed web crawler (SC-DWC), with performance comparable to SC-DWC. On Portals B and C, where the target server blocks crawling, HP2PNC-AWS outperforms the other methods in both collection rate and the amount of data collected in the same time. The results also confirm that a hybrid P2P networking system can work efficiently in web crawler architectures.
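
The abstract does not describe the internals of HP2PNC-AWS, so the following is only a minimal sketch of the general hybrid P2P idea it refers to: a central coordinator tracks the work while peers do the fetching, and a URL that one peer fails to fetch (for example, because the target server blocked it) is handed to a different peer. The names `Coordinator`, `next_url`, `report`, `run`, and the `fetch` callback are hypothetical and are not taken from the paper; no AWS service is used here.

```python
from collections import deque
from urllib.parse import urlparse


class Coordinator:
    """Hypothetical central tracker: hands out URLs, re-queues failed ones."""

    def __init__(self, urls):
        self.pending = deque(urls)
        self.blocked = {}  # peer_id -> set of hosts that blocked that peer
        self.done = {}     # url -> peer_id that fetched it successfully

    def next_url(self, peer_id):
        """Return the first pending URL whose host has not blocked this peer."""
        banned = self.blocked.get(peer_id, set())
        for _ in range(len(self.pending)):
            url = self.pending.popleft()
            if urlparse(url).netloc not in banned:
                return url
            self.pending.append(url)  # leave it for another peer
        return None

    def report(self, peer_id, url, ok):
        """Record a result; on failure, mark the host and re-queue the URL."""
        if ok:
            self.done[url] = peer_id
        else:
            host = urlparse(url).netloc
            self.blocked.setdefault(peer_id, set()).add(host)
            self.pending.append(url)


def run(coordinator, peers, fetch):
    """Round-robin over peers until no peer can take any further work.

    `fetch(peer_id, url)` is an assumed callback returning True on success.
    """
    progress = True
    while progress:
        progress = False
        for peer_id in peers:
            url = coordinator.next_url(peer_id)
            if url is None:
                continue
            progress = True
            coordinator.report(peer_id, url, fetch(peer_id, url))
    return coordinator.done
```

In a real deployment the `fetch` callback would issue HTTP requests from each peer's own network address; here it can be replaced by a stub to see how blocked work migrates between peers. URLs that every peer is blocked from remain in `pending` when `run` returns.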


Updated: 2019-12-16
