Leveraging Subword Embeddings for Multinational Address Parsing
arXiv - CS - Computation and Language. Pub Date: 2020-06-29, DOI: arXiv:2006.16152
Marouane Yassine, David Beauchemin, François Laviolette, Luc Lamontagne

Address parsing consists of identifying the segments that make up an address, such as a street name or a postal code. Because of its importance for tasks like record linkage, address parsing has been approached with many techniques, and neural network methods have defined a new state of the art. While this approach has yielded notable results, previous work has focused only on applying neural networks to parse addresses from a single source country. We propose an approach that employs subword embeddings and a Recurrent Neural Network architecture to build a single model capable of learning to parse addresses from multiple countries at the same time, while taking into account differences in languages and address formatting systems. We achieved accuracies of around 99% on the countries used for training, with no pre-processing or post-processing needed. We also explore transferring the address parsing knowledge obtained by training on some countries' addresses to other countries with no further training, in a zero-shot transfer learning setting. We achieve good results for 80% of the countries (33 out of 41), and for almost 50% of them (20 out of 41) performance is near state of the art. In addition, we provide an open-source Python implementation of our trained models.
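The task the abstract describes can be pictured as sequence tagging: each token of an address receives one label such as StreetNumber, StreetName, Municipality, or PostalCode, and the paper's model learns this mapping with subword embeddings and an RNN. As a minimal, hypothetical sketch of the tagging formulation only (the tag names and hand-written rules below are illustrative stand-ins, not the paper's learned model):

```python
# Toy illustration of address parsing as sequence tagging.
# The paper's approach uses subword embeddings and an RNN to
# predict one tag per token; here simple regex rules stand in
# for the learned tagger, just to show the input/output shape.
import re

def toy_tag(tokens):
    """Assign one illustrative tag per address token."""
    labels = []
    for i, tok in enumerate(tokens):
        if i == 0 and re.fullmatch(r"\d+", tok):
            labels.append("StreetNumber")   # leading digits
        elif i == len(tokens) - 1:
            labels.append("Municipality")   # last token, in this toy layout
        else:
            labels.append("StreetName")     # everything in between
    return list(zip(tokens, labels))

print(toy_tag("350 des Lilas Quebec".split()))
# → [('350', 'StreetNumber'), ('des', 'StreetName'),
#    ('Lilas', 'StreetName'), ('Quebec', 'Municipality')]
```

A learned model replaces these rules with per-token predictions, which is what lets a single tagger generalize across the differing address formats of multiple countries.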

Updated: 2020-10-19