An Efficient Minimal Text Segmentation Method for URL Domain Names
Scientific Programming Pub Date : 2021-07-02 , DOI: 10.1155/2021/9946729
Yiqian Li 1 , Tao Du 1, 2 , Lianjiang Zhu 1 , Shouning Qu 1, 2

Text segmentation of URL domain names is a straightforward and convenient way to analyze users' online behavior and is crucial for determining their areas of interest. However, popular word segmentation tools perform relatively poorly on website domain names because of their unique structure: extremely short length, irregular naming, and no contextual relationship. To address this issue, this paper proposes an efficient minimal text segmentation (EMTS) method for URL domain names to achieve efficient adaptive text mining. We first design a targeted hierarchical task model to reduce noise interference in minimal texts. We then present a novel method that integrates a conflict game into the two-directional maximum matching algorithm, so that words with higher weight and greater probability are selected, thereby enhancing recognition accuracy. Next, Chinese Pinyin and English mappings are embedded in the word segmentation rules. In addition, we incorporate a correction factor that accounts for text length into the F1-score to optimize the performance evaluation of text segmentation. The experimental results show that EMTS improves on other word segmentation tools by around 20 percentage points in terms of accuracy and topic extraction, providing high-quality data for subsequent text analysis.
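To make the two-directional maximum matching idea concrete, the following is a minimal sketch, not the authors' EMTS implementation. The abstract does not specify how the conflict game works, so this sketch substitutes a simple rule: when the forward and backward segmentations disagree, keep the one with the higher total lexicon weight. The lexicon, its weights, and the function names (`forward_mm`, `backward_mm`, `segment`) are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical toy lexicon: token -> weight. The weights are made up;
# "tao" and "bao" stand in for Pinyin syllables.
LEXICON: Dict[str, float] = {
    "news": 5.0, "new": 4.0, "now": 3.0, "snow": 2.0,
    "tao": 3.0, "bao": 3.0,
}


def forward_mm(text: str, lexicon: Dict[str, float], max_len: int) -> List[str]:
    """Greedy left-to-right matching: take the longest lexicon word at each step."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop terminates.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_mm(text: str, lexicon: Dict[str, float], max_len: int) -> List[str]:
    """Greedy right-to-left matching: take the longest lexicon word ending at each step."""
    tokens: List[str] = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


def total_weight(tokens: List[str], lexicon: Dict[str, float]) -> float:
    """Sum of lexicon weights; unknown tokens contribute nothing."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)


def segment(text: str, lexicon: Dict[str, float]) -> List[str]:
    """Run both directions and resolve disagreement by total weight.

    Assumption: this weight comparison is a stand-in for the paper's
    conflict game, whose details the abstract does not give.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    fwd = forward_mm(text, lexicon, max_len)
    bwd = backward_mm(text, lexicon, max_len)
    if fwd == bwd:
        return fwd
    # Prefer higher total weight; on a tie, prefer fewer tokens.
    return max((fwd, bwd), key=lambda toks: (total_weight(toks, lexicon), -len(toks)))


if __name__ == "__main__":
    print(segment("newsnow", LEXICON))
    print(segment("taobao", LEXICON))
```

For the domain label `newsnow`, forward matching gives `news` + `now` while backward matching gives `new` + `snow`, so the two directions conflict and the weight comparison decides between them. For `taobao`, both directions agree and no conflict resolution is needed.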

Updated: 2021-07-02