Multi-label classification and interactive NLP-based visualization of electric vehicle patent data
World Patent Information (IF 2.2), Pub Date: 2019-09-01, DOI: 10.1016/j.wpi.2019.101903
Djavan De Clercq, Ndeye-Fatou Diop, Devina Jain, Benjamin Tan, Zongguo Wen

Abstract: The objectives of this study are to (1) interactively visualize information embedded in patent texts, and (2) train a high-accuracy multi-label classification algorithm capable of assigning patents to multiple Cooperative Patent Classification (CPC) classes. The case study involved metadata and text data from 17,500 electric vehicle patents. The methodology had two steps. First, feature engineering was based on topic extraction from patent texts using latent Dirichlet allocation (LDA), guided by the perplexity metric. Second, multi-label implementations of the random forest, decision tree, and k-nearest neighbors (KNN) algorithms were trained on the data to predict the multiple class labels corresponding to a given electric vehicle patent. The results were promising: the best scores reported across performance metrics such as accuracy, precision, recall, F-score, and Hamming loss were 0.91, 0.92, 0.74, and 0.02. The implications are two-fold. First, the study demonstrates the effectiveness of open-source tools for building customized patent analysis pipelines that combine interactive data visualization and machine learning. Second, the results provide a strong basis for automated multi-label classification of patents into CPC classes.
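The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the toy texts, the CPC-style labels, the candidate topic counts, and the helper name `select_lda` are all assumptions made for the example, and perplexity is computed on the training data only to keep the sketch short (a held-out set would normally be used).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-ins for patent texts and their CPC class labels.
texts = [
    "battery cell thermal management for electric vehicle",
    "charging station connector for electric vehicle battery",
    "electric motor drive control inverter",
    "battery pack cooling system thermal",
    "inverter control for traction motor drive",
    "wireless charging pad for vehicle battery",
]
labels = [
    ["B60L", "H01M"],
    ["B60L", "H02J"],
    ["B60L", "H02P"],
    ["H01M"],
    ["H02P"],
    ["B60L", "H02J"],
]


def select_lda(counts, candidates, seed=0):
    """Fit one LDA model per candidate topic count; keep the lowest perplexity."""
    best_score, best_model = None, None
    for k in candidates:
        model = LatentDirichletAllocation(n_components=k, random_state=seed)
        model.fit(counts)
        score = model.perplexity(counts)  # lower is better
        if best_score is None or score < best_score:
            best_score, best_model = score, model
    return best_model


# Step 1: bag-of-words counts -> per-document topic proportions as features.
counts = CountVectorizer().fit_transform(texts)
lda = select_lda(counts, candidates=[2, 3, 4])
topic_features = lda.transform(counts)

# Step 2: multi-label classification on a binary indicator matrix.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(topic_features, Y)
pred = clf.predict(topic_features)

# Hamming loss: fraction of individual label assignments that are wrong.
loss = hamming_loss(Y, pred)
print("classes:", list(binarizer.classes_))
print("training Hamming loss:", loss)
```

Passing a two-dimensional indicator matrix to `RandomForestClassifier` is what makes the model multi-label here; a decision tree or KNN classifier from scikit-learn could be swapped in the same way.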

Updated: 2019-09-01