Exploring the Potentialities of Automatic Extraction of University Webometric Information
Journal of Data and Information Science Pub Date : 2020-11-01 , DOI: 10.2478/jdis-2020-0040
Gianpiero Bianchi 1 , Renato Bruni 2 , Cinzia Daraio 2 , Antonio Laureti Palma 1 , Giulio Perani 1 , Francesco Scalfati 1

Abstract

Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The automatically extracted information can potentially be updated more than once per year and be safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information used to compute our indicators has been extracted from the universities' websites by using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form so that it can be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of Web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the Web has been combined with the university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at European level. All of the above was used to cluster 79 Italian universities on the basis of structural and digital indicators.

Findings: The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used, through clustering techniques applied to them, to identify groups of universities with common features.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
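
As a rough illustration of the idea in the design section, where scraped pages are kept in a semi-structured form and indicators are computed from them by text mining, the following minimal Python sketch may help. It is not the authors' implementation: the document fields, the example universities and URLs, and the `keyword_coverage` indicator are all hypothetical assumptions, and a plain in-memory list stands in for the NoSQL DB.

```python
import re
from collections import defaultdict

# Hypothetical scraped pages kept as semi-structured documents, similar in
# shape to what a document-oriented NoSQL store could hold. All field names,
# URLs and texts are illustrative assumptions, not data from the study.
pages = [
    {"university": "Univ A", "url": "https://a.example/research",
     "text": "Research projects and PhD programmes"},
    {"university": "Univ A", "url": "https://a.example/contacts",
     "text": "Contacts and campus map"},
    {"university": "Univ B", "url": "https://b.example/home",
     "text": "Welcome to our university"},
    {"university": "Univ B", "url": "https://b.example/phd",
     "text": "PhD admissions and research groups"},
    {"university": "Univ B", "url": "https://b.example/news",
     "text": "Latest research news"},
]


def keyword_coverage(docs, keyword):
    """Return, per university, the share of pages whose text mentions `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    total = defaultdict(int)  # pages seen per university
    hits = defaultdict(int)   # pages mentioning the keyword per university
    for doc in docs:
        total[doc["university"]] += 1
        # Documents are semi-structured, so the "text" field may be missing.
        if pattern.search(doc.get("text", "")):
            hits[doc["university"]] += 1
    return {name: hits[name] / total[name] for name in total}


# Hypothetical content indicator: share of pages that mention "research".
coverage = keyword_coverage(pages, "research")
```

An indicator of this kind could then be placed alongside structural variables as one input to a clustering step; the choice of keyword and the definition of the indicator here are purely illustrative.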


