Survey on categorical data for neural networks
Journal of Big Data ( IF 8.1 ) Pub Date : 2020-04-10 , DOI: 10.1186/s40537-020-00305-w
John T. Hancock , Taghi M. Khoshgoftaar

This survey investigates current techniques for representing qualitative data for use as input to neural networks. Techniques for using qualitative data in neural networks are well known, yet researchers continue to discover new variations or entirely new methods for working with categorical data in neural networks. Our primary contribution is to cover these representation techniques in a single work. Practitioners working with big data often need to encode categorical values in their datasets in order to leverage machine learning algorithms. Moreover, the size of the datasets we consider big data may lead one to reject some encoding techniques as impractical because of their running-time complexity. Neural networks take vectors of real numbers as inputs, so one must map qualitative values to numerical values before using them as input to a neural network. The techniques for doing so are known as embeddings, encodings, representations, or distributed representations. Our second contribution is to provide references to the source code of various techniques, where we are able to verify the authenticity of the source code. It is our intention that the reader use these implementations as a starting point for designing experiments that evaluate various techniques for working with qualitative data in neural networks. We cover recent research in several domains where researchers use categorical data in neural networks, including natural language processing, fraud detection, and clinical document automation. This study provides a starting point for research into which techniques for preparing qualitative data for use with neural networks are best. Our third contribution is a new perspective on techniques for using categorical data in neural networks: we find three distinct patterns that identify a technique as determined, algorithmic, or automated, and we organize the techniques into these three categories. Our fourth contribution is to identify several opportunities for future research. The form of the data that one uses as input to a neural network is crucial for using neural networks effectively. This work is a tool for researchers to find the most effective technique for working with categorical data in neural networks in big data settings. To the best of our knowledge, this is the first in-depth look at techniques for working with categorical data in neural networks.
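
The abstract's point that qualitative values must be mapped to vectors of real numbers can be illustrated with one-hot encoding, one widely used encoding. The sketch below is only an illustration and is not taken from the survey's referenced source code; the function name `one_hot_encode` and the sample values are hypothetical.

```python
def one_hot_encode(values):
    """Map each categorical value to a one-hot vector of floats.

    Hypothetical helper for illustration only.
    """
    # Fix an ordering of the distinct categories so each gets one position.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    vectors = []
    for value in values:
        # One slot per distinct category; only the matching slot is 1.0.
        vector = [0.0] * len(categories)
        vector[index[value]] = 1.0
        vectors.append(vector)
    return categories, vectors


# Hypothetical example data: a single categorical feature.
colors = ["red", "green", "blue", "green"]
categories, vectors = one_hot_encode(colors)
for value, vector in zip(colors, vectors):
    print(value, vector)
```

Each vector has one entry per distinct category, so its length grows with the number of distinct values in the feature. That growth is one reason an encoding may become costly on large datasets with many distinct values.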

Updated: 2020-04-10