Bi-discriminator GAN for tabular data synthesis
Introduction
Tabular data is among the most common data modalities and is widely used to maintain the massive databases of financial institutions, insurance corporations, networking companies, healthcare providers, etc. [1], [2], [3], [4], [5]. These databases include an immense combination of personal, confidential, and general records for every customer, client, and patient in different formats (e.g., continuous and discrete data types). Semantic patterns derivable from such records help companies extract meaningful information for purposes such as large-scale decision-making [6], risk management [7], long-term investment [8], and detection of fraud or unusual activity [9]. However, exploiting these patterns is challenging because tabular datasets are heterogeneous [10], [11]: they contain sparse representations of discrete and continuous records with low correlation compared to homogeneous datasets (e.g., speech, environmental audio, images) [12]. Unfortunately, extracting semantic relational patterns from heterogeneous datasets requires costly data-driven algorithms [13], [14].
During the last decade, and especially after the proliferation of deep learning (DL) algorithms, various cutting-edge approaches have been introduced for processing tabular datasets in different frameworks [15], [16], [17], particularly for synthesis purposes [18]. Presumably, this is due to two major applications. Firstly, complex DL algorithms configured in generative adversarial network (GAN) [19] synthesis platforms can be used to augment sparse datasets with low cardinality and poor sample quality [20]. Often, this data augmentation procedure facilitates the semantic pattern extraction operations. Secondly, GAN-based synthesis approaches yield models capable of generating new records that are similar, but not identical, to the ground-truth samples in the original databases. The synthesized records can be used for development purposes such as extracting relational patterns [21] without publicly releasing the original dataset. This helps protect the privacy of the people and clients whose information is stored in the tabular datasets of companies. Our focus in this paper is on the latter application, since it has been among the most pressing demands of some large-scale financial institutions seeking to avoid data leakage [22], [23]. Briefly, we make the following contributions in this paper:
- (i) developing a novel data preprocessing scheme and defining a new conditional term (vector) for the generator network configured in a GAN synthesis setup,
- (ii) implementing a bi-discriminator GAN for providing more gradient information to the generator network in order to improve its performance in runtime,
- (iii) designing straightforward architectures for generator and discriminator networks.
The organization of this paper is as follows. Section 2 summarizes synthesis approaches for tabular datasets based on state-of-the-art GANs. In Section 3, we explain the details of our proposed synthesis approach, and finally, in Section 4, we report and analyze our experiments on four benchmark databases.
Background: Tabular data synthesis
Over the past years, variational autoencoder (VAE) [24] and GAN frameworks have been recognized as the state-of-the-art approaches for data fusion, particularly in the context of tabular data synthesis [6]. Fundamentally, these two frameworks are similar to the baseline classical Bayesian network (CLBN) [25] and its variants such as the private Bayesian network (PrivBN) [26]. Thus far, many modern forms of these generative models (e.g., [27], [28]) have been introduced and practically implemented
Proposed approach: Bi-discriminator class-conditional tabular GAN (BCT-GAN)
Our proposed tabular synthesis approach is based on CT-GAN [47], with three major improvements: normalizing categorical records (preprocessing), defining the conditional term for the generator, and designing the architecture of the generator and the two discriminator networks. The motivation behind employing double discriminators in our synthesis setup is the possibility of gaining more gradient information for the benefit of the generator during training. However, we do not provide
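As a rough illustration of why a second discriminator can supply additional gradient information, the following minimal sketch sums the generator losses obtained from two discriminators. It is not the BCT-GAN implementation: the logistic discriminators, the non-saturating loss, the equal weighting of the two terms, and all names (generator_loss_and_grad, w1, w2) are assumptions made only for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generator_loss_and_grad(fake, w, b):
    """Non-saturating generator loss -log D(fake) for a toy logistic
    discriminator D(x) = sigmoid(w.x + b), with its gradient w.r.t. fake."""
    p = sigmoid(fake @ w + b)              # D(fake) per row, shape (n,)
    loss = -np.mean(np.log(p))
    # d/dx [-log sigmoid(w.x + b)] = -(1 - sigmoid(w.x + b)) * w
    grad = -((1.0 - p)[:, None] * w[None, :]) / fake.shape[0]
    return loss, grad

rng = np.random.default_rng(0)
fake = rng.normal(size=(8, 5))             # batch of 8 synthetic rows, 5 features
w1, b1 = rng.normal(size=5), 0.0           # hypothetical discriminator 1
w2, b2 = rng.normal(size=5), 0.0           # hypothetical discriminator 2

loss1, grad1 = generator_loss_and_grad(fake, w1, b1)
loss2, grad2 = generator_loss_and_grad(fake, w2, b2)

# Assumed combination: the generator minimizes the sum of both losses, so the
# gradient it receives is the sum of the two per-discriminator gradients.
total_loss = loss1 + loss2
total_grad = grad1 + grad2
```

In this toy setting the two discriminators have different weights, so each one pushes the synthetic rows in a different direction; the generator update uses both signals at once.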
Experiments
This section provides the details of our experiments on four public datasets that have been benchmarked for tabular data processing, particularly for synthesis purposes [47]. Adult and Census are among the selected datasets from the UCI online repository [67]; they contain extensive combinations of continuous, binary, and complex discrete records. Following the baseline approaches
Discussion
This section further discusses other important aspects of our proposed BCT-GAN. To avoid any potential misinterpretation, we split this section into independent subsections as follows.
Conclusion
In this paper, we introduced a bi-discriminator GAN for synthesizing large-scale tabular databases. The major novelties of our approach are, firstly, the development of a preprocessing scheme and, secondly, the definition of a solid conditional term for the generator network to improve the overall performance of the generative model. This term is a vector based on a masked function using a probability density function for more effectively constraining the generator and consequently better capturing the
Declaration of Competing Interest
We declare that we have no competing interests with any person or research lab.
Acknowledgment
This work was funded by the Fédération des Caisses Desjardins du Québec, the IVADO Institution, and the Mitacs Accelerate program under agreement number IT25105.
References (80)
- et al., A survey of data leakage detection and prevention solutions (2012)
- et al., Auto-encoding variational Bayes, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings (2014)
- et al., Facial age synthesis with label distribution-guided generative adversarial network, IEEE Trans. Inf. Forensics Secur. (2020)
- et al., Generating multi-label discrete patient records using generative adversarial networks, Machine Learning for Healthcare Conference (2017)
- et al., Adam: A method for stochastic optimization, 3rd Intl Conf Learn Repres (2015)
- Statistical Inference Based on Divergence Measures (2018)
- et al., Detection of adversarial attacks and characterization of adversarial subspace, IEEE Intl Conf Acoust, Speech and Signal Process (2020)
- et al., Economics-driven data management: an application to the design of tabular data sets, IEEE Trans Knowl Data Eng (2007)
(2007) - R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, arXiv preprint arXiv:2106.03253...
- J.M. Clements, D. Xu, N. Yousefi, D. Efimov, Sequential deep learning for credit risk monitoring with tabular financial...
- A survey of data mining and machine learning methods for cyber security intrusion detection, IEEE Communications Surveys & Tutorials
- Trust issues: Uncertainty estimation does not enable reliable OOD detection on medical tabular data, Machine Learning for Health
- Synthesizing tabular data using conditional GAN
- Risk management, Risk Management and Governance
- Automatically locating, extracting and analyzing tabular data, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
- Adversarial attacks for tabular data: application to fraud detection and imbalanced data
- Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys (CSUR)
- Exploring the possibilities of embedding heterogeneous data attributes in familiar visualizations, IEEE Trans Vis Comput Graph
- Toward developing efficient conv-AE-based intrusion detection system using heterogeneous dataset, Electronics (Basel)
- Deep learning for NLP (without Magic), Tutorial Abstracts of ACL 2012
- Deep learning for the detection of tabular information from electronic component datasheets, 2019 IEEE Symposium on Computers and Communications (ISCC)
- A review of tabular data synthesis using GANs on an IDS dataset, Information
- Generative adversarial nets, Adv Neural Inf Process Syst
- Mining relational patterns from multiple relational tables, Decis Support Syst
- A survey on data leakage prevention systems, Journal of Network and Computer Applications
- Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory
- PrivBayes: private data release via Bayesian networks, ACM Transactions on Database Systems (TODS)
- VAEM: a deep generative model for heterogeneous mixed type data, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual
- Data synthesis based on generative adversarial networks, Proc. VLDB Endow.
- Understanding GANs in the LQG setting: formulation, generalization and stability, IEEE J. Sel. Areas Inf. Theory
- Deep learning
- CAN-GAN: conditioned-attention normalized GAN for face age synthesis, Pattern Recognit Lett
- Face aging with contextual generative adversarial nets, Proceedings of the 25th ACM International Conference on Multimedia
- Facial makeup transfer with GAN for different aging faces, J Vis Commun Image Represent
- 1 Supplementary materials and source code are available at this github-Repo.