IHWC: intelligent hidden web crawler for harvesting data in urban domains
Complex & Intelligent Systems (IF 5.8), Pub Date: 2021-07-24, DOI: 10.1007/s40747-021-00471-1
Sawroop Kaur 1 , Aman Singh 1 , G. Geetha 2 , Xiaochun Cheng 3

Because of the massive size of the hidden web, searching, retrieving and mining rich, high-quality data can be a daunting task. Moreover, data that sits behind forms cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, and their application in special cases, still need to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is rarely implemented and a crawler must work through the huge web to narrow the search down to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address related problems such as classifying domains, preventing exhaustive searching, and prioritizing URLs. The crawler also performs well in curating pollution-related data. It targets relevant web pages and discards irrelevant ones by applying rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls websites in priority order for a given topic. The crawler fulfils the dual objective of developing an effective hidden web crawler that can focus on diverse domains and of checking its integration in searching for pollution data in smart cities. One of the objectives of smart cities is to reduce pollution. The resulting crawled data can be used to find the causes of pollution, and the crawler can help a user look up the level of pollution in a specific area. The harvest rate of the crawler is compared with pioneering existing work. As the size of the dataset increases, the presented crawler can add significant value to emission accuracy.
Our results demonstrate the accuracy and harvest rate of the proposed framework: it efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.
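The abstract names three mechanisms without detailing them: rejection rules, URL prioritization, and the harvest rate. The Python sketch below illustrates one plausible reading of each and is not the authors' implementation. The keyword list, the rejected file suffixes, the scoring function, and all function names are hypothetical. The harvest rate is assumed to follow the common focused-crawling definition (relevant pages divided by pages crawled), which may differ from the paper's exact formula.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical topic keywords and rejection rules; not taken from the paper.
TOPIC_TERMS = ("pollution", "air-quality", "emission")
REJECTED_SUFFIXES = (".jpg", ".png", ".css", ".js", ".pdf")


def is_rejected(url):
    """Return True if a URL matches one of the illustrative rejection rules."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return True
    return parsed.path.lower().endswith(REJECTED_SUFFIXES)


def priority(url):
    """Score a URL by how many topic terms it contains (higher is better)."""
    lowered = url.lower()
    return sum(term in lowered for term in TOPIC_TERMS)


def order_frontier(urls):
    """Drop rejected URLs and return the rest, highest priority first."""
    heap = []
    for index, url in enumerate(urls):
        if not is_rejected(url):
            # heapq is a min-heap, so negate the score; index breaks ties.
            heapq.heappush(heap, (-priority(url), index, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def harvest_rate(relevant_pages, crawled_pages):
    """Fraction of crawled pages judged relevant (assumed definition)."""
    return relevant_pages / crawled_pages if crawled_pages else 0.0
```

A caller would pass the links extracted from a fetched page to `order_frontier` and visit the returned URLs in order, updating `harvest_rate` as pages are judged relevant or not. A real crawler would score page content and form fields as well as the URL string.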




Updated: 2021-07-24