Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities, Computer Networks


Computer Networks (IF 4.4), Pub Date: 2021-07-27, DOI: 10.1016/j.comnet.2021.108357
Daniel Perdices¹, Javier Ramos¹, José L. García-Dorado¹, Iván González¹,², Jorge E. López de Vergara¹,²


In an Internet arena where the revenues of search engines and other digital marketing firms are peaking, other actors still have open opportunities to monetize their users' data. After appropriate anonymization, aggregation, and agreement, the set of websites that users visit may become exploitable data for ISPs. Uses range from assessing the reach of advertising campaigns to reinforcing user loyalty, among other marketing approaches, as well as addressing security issues. However, sniffers based on HTTP, DNS, TLS, or flow features do not suffice for this task. Modern websites are designed to preload and prefetch some content, in addition to embedding banners, social network links, images, and scripts from other websites. This self-triggered traffic makes it difficult to assess which websites users visited on purpose. Moreover, DNS caches prevent some queries for actively visited websites from even being sent. Given this limited input, we propose to treat domains as words and sequences of domains as documents. In this way, the visited websites can be identified by translating the problem into a text classification context and applying the most promising techniques from the natural language processing and neural network fields. After applying different representation methods such as TF–IDF, Word2vec, Doc2vec, and custom neural networks in diverse scenarios and with several datasets, we can identify websites visited on purpose with accuracy figures over 90%, with peaks close to 100%, through processes that are fully automated and free of any human parametrization.
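The "domains as words, domain sequences as documents" idea from the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only a hand-rolled TF-IDF representation and swaps in a simple nearest-centroid cosine classifier where the paper reports Word2vec, Doc2vec, and custom neural networks. The domain names, the toy training windows, and the function names (`fit_idf`, `tfidf`, `train`, `classify`) are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def fit_idf(windows):
    """Smoothed inverse document frequency for every domain seen in training."""
    n = len(windows)
    df = Counter(domain for window in windows for domain in set(window))
    return {d: math.log((1 + n) / (1 + count)) + 1.0 for d, count in df.items()}


def tfidf(window, idf):
    """L2-normalised TF-IDF vector as a dict; domains unseen in training are ignored."""
    tf = Counter(domain for domain in window if domain in idf)
    vec = {d: count * idf[d] for d, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {d: v / norm for d, v in vec.items()} if norm else {}


def train(labelled_windows):
    """Build one L2-normalised centroid per intentionally visited site."""
    idf = fit_idf([window for window, _ in labelled_windows])
    sums = defaultdict(Counter)
    for window, site in labelled_windows:
        sums[site].update(tfidf(window, idf))  # accumulate vector components
    centroids = {}
    for site, total in sums.items():
        norm = math.sqrt(sum(v * v for v in total.values()))
        centroids[site] = {d: v / norm for d, v in total.items()}
    return idf, centroids


def classify(window, idf, centroids):
    """Return the site with the most cosine-similar centroid, or None if no domain is known."""
    query = tfidf(window, idf)
    if not query:
        return None

    def score(site):
        return sum(v * centroids[site].get(d, 0.0) for d, v in query.items())

    return max(centroids, key=score)


# Hypothetical training data: each "document" is the sequence of domains observed
# in a traffic window; the label is the website the user visited on purpose.
training = [
    (["news.example", "cdn.example", "ads.example", "fonts.example"], "news.example"),
    (["news.example", "cdn.example", "ads.example"], "news.example"),
    (["shop.example", "cdn.example", "pay.example", "ads.example"], "shop.example"),
    (["shop.example", "pay.example", "img.example"], "shop.example"),
]
idf, centroids = train(training)

# The main domain is absent here (as if its DNS answer were cached), yet the
# embedded third-party domains still point to the intended site.
print(classify(["cdn.example", "pay.example", "img.example"], idf, centroids))
```

On this toy data the last window is attributed to `shop.example` even though that domain never appears in it, which mirrors the DNS-cache situation the abstract describes. Real traffic would need far larger labelled datasets and, as the abstract notes, richer representations.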


Updated: 2021-08-09
