Bi-discriminator GAN for tabular data synthesis

doi:10.1016/j.patrec.2022.05.023

Pattern Recognition Letters

Volume 159, July 2022, Pages 204-210

https://doi.org/10.1016/j.patrec.2022.05.023

Highlights

• Developing a novel data preprocessing scheme using Chi-squared function.
• Proposing a new conditional term for the generator network in a GAN setup.
• Implementing a bi-discriminator GAN for stable training.
• Designing straightforward architectures for generator and discriminator networks.

Abstract

This paper introduces a bi-discriminator GAN for synthesizing tabular datasets containing continuous, binary, and discrete columns. Our proposed approach employs an adapted preprocessing scheme and a novel conditional term for the generator network, based on the $χ_{β}^{2}$ distribution, to capture the input sample distributions more effectively. Additionally, we implement straightforward yet effective architectures for the discriminator networks, aiming to provide more discriminative gradient information to the generator. Our experimental results on four public benchmark datasets corroborate the superior performance of our GAN in terms of both the likelihood fitness metric and machine learning efficacy.

Introduction

Tabular data is among the most common data modalities and is widely used for maintaining the massive databases of financial institutions, insurance corporations, networking companies, healthcare industries, etc. [1], [2], [3], [4], [5]. These databases include an immense combination of personal, confidential, and general records for every customer, client, and patient in different formats (e.g., continuous and discrete data types). Semantic patterns derivable from such records help extract meaningful information that benefits companies in various respects, such as large-scale decision-making [6], risk management [7], long-term investment [8], and fraud or unusual activity detection [9]. However, exploiting these patterns is challenging because tabular datasets are heterogeneous [10], [11] and contain sparse representations of discrete and continuous records with low correlation compared to homogeneous datasets (e.g., speech, environmental audio, image, etc.) [12]. Unfortunately, extracting semantic relational patterns from heterogeneous datasets requires costly data-driven algorithms [13], [14].

Over the last decade, and especially after the proliferation of deep learning (DL) algorithms, various cutting-edge approaches have been introduced for processing tabular datasets in different frameworks [15], [16], [17], particularly for synthesis purposes [18]. Presumably, this is due to two major applications. Firstly, complex DL algorithms configured in generative adversarial network (GAN) [19] synthesis platforms can be used for augmenting sparse datasets with low cardinality and poor sample quality [20]. Often, this data augmentation procedure effectively supports semantic pattern extraction. Secondly, GAN-based synthesis approaches yield models capable of generating new records that are similar, but not identical, to the ground-truth samples in the original databases. The synthesized records can be used for development purposes such as extracting relational patterns [21] without publicly releasing the original dataset. This helps protect the privacy of the people and clients whose information is stored in companies' tabular datasets. Our focus in this paper is on the latter application since it has been among the most pressing demands of some large-scale financial institutions seeking to avoid data leakage [22], [23]. Briefly, we make the following contributions in this paper:

(i) developing a novel data preprocessing scheme and defining a new conditional term (vector) for the generator network configured in a GAN synthesis setup,
(ii) implementing a bi-discriminator GAN for providing more gradient information to the generator network in order to improve its performance at runtime,
(iii) designing straightforward architectures for generator and discriminator networks.

This paper is organized as follows. Section 2 summarizes state-of-the-art GAN-based synthesis approaches for tabular datasets. Section 3 explains the details of our proposed synthesis approach, and Section 4 reports and analyzes our experiments on four benchmark databases.

Section snippets

Background: Tabular data synthesis

Over the past years, variational autoencoder (VAE) [24] and GAN frameworks have been recognized as the state-of-the-art approaches for data fusion, particularly in the context of tabular data synthesis [6]. Fundamentally, these two frameworks are similar to the baseline classical Bayesian network (CLBN) [25] and its variants such as private Bayesian network (PrivBN) [26]. Thus far, many modern forms of these generative models (e.g., [27], [28]) have been introduced and practically implemented

Proposed approach: Bi-discriminator class-conditional tabular GAN (BCT-GAN)

Our proposed tabular synthesis approach is based on CT-GAN [47], but with three major improvements: normalizing categorical records (preprocessing), defining the conditional term for the generator, and designing the architectures of the generator and the two discriminator networks. The motivation behind employing two discriminators in our synthesis setup is the possibility of gaining more gradient information for the benefit of the generator during training. However, we do not provide
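As a rough illustration of how two discriminators can supply gradient information to a single generator, the sketch below combines a non-saturating adversarial loss over the scores of both discriminators. It is not the loss used in BCT-GAN, which this snippet does not reproduce. The function names, the mixing weights `w1` and `w2`, and the choice of the non-saturating loss are all assumptions made for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def generator_loss(logits_d1, logits_d2, w1=0.5, w2=0.5):
    """Weighted non-saturating generator loss over two discriminators.

    logits_d1 / logits_d2: raw scores that each discriminator assigns to
    the same batch of generated rows (higher means "looks real").
    w1 / w2: hypothetical mixing weights, not taken from the paper.
    """
    def nonsaturating(logits):
        # -log(sigmoid(l)) equals softplus(-l); average over the batch.
        return sum(softplus(-l) for l in logits) / len(logits)

    return w1 * nonsaturating(logits_d1) + w2 * nonsaturating(logits_d2)
```

With this formulation, the generator receives a gradient contribution from each discriminator, so a batch that fools one discriminator but not the other is still penalized.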

Experiments

This section provides the details of our experiments on four public datasets that have been used as benchmarks for tabular data processing, particularly for synthesis purposes [47]. Adult² and Census³ are among the selected datasets from the UCI online repository [67] and they contain extensive combinations of continuous, binary, and complex discrete records. Following the baseline approaches
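Machine learning efficacy is commonly measured by training a model on synthetic rows and scoring it on held-out real rows. The sketch below shows that protocol in plain Python. The nearest-centroid classifier is a stand-in chosen for brevity, the function names are hypothetical, and the toy numbers are made up; none of these come from the paper's experimental setup.

```python
from collections import defaultdict


def fit_centroids(rows, labels):
    """Return the per-class mean vector of the training rows."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], row))
    return min(centroids, key=dist)


def ml_efficacy(synthetic_rows, synthetic_labels, real_rows, real_labels):
    """Train on synthetic data, return accuracy on real data."""
    centroids = fit_centroids(synthetic_rows, synthetic_labels)
    hits = sum(predict(centroids, r) == y
               for r, y in zip(real_rows, real_labels))
    return hits / len(real_rows)


# Toy, made-up values purely to exercise the functions.
synthetic_X = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.1, 0.8]]
synthetic_y = [0, 0, 1, 1]
real_X = [[0.2, 0.1], [1.0, 0.9], [0.8, 1.2]]
real_y = [0, 1, 1]
score = ml_efficacy(synthetic_X, synthetic_y, real_X, real_y)
```

A score close to what the same classifier reaches when trained on real data suggests the synthetic rows preserve the structure the classifier relies on.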

Discussion

This section further discusses other important aspects of our proposed BCT-GAN. To avoid any potential misinterpretation, we split this section into independent subsections as follows.

Conclusion

In this paper, we introduced a bi-discriminator GAN for synthesizing large-scale tabular databases. The major novelties of our approach are, firstly, the development of a preprocessing scheme and, secondly, the definition of a solid conditional term for the generator network to improve the overall performance of the generative model. This term is a vector based on a masked function using the $χ_{β}^{2}$ probability density function for more effectively constraining the generator and consequently better capturing the
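For reference, the standard chi-squared probability density function is given below. Reading $β$ as the degrees-of-freedom parameter is an assumption, since this snippet does not define it, and the masked function built on top of the density is not reproduced here.

$$
f(x; β) = \frac{1}{2^{β/2}\,\Gamma(β/2)}\, x^{β/2 - 1}\, e^{-x/2}, \qquad x > 0
$$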

Declaration of Competing Interest

We declare that we do not have any competing interests with any person or research lab.

Acknowledgment

This work was funded by Fédération des Caisses Desjardins du Québec, IVADO Institution, and the Mitacs Accelerate program with agreement number IT25105.

References (80)

A. Shabtai et al.
A survey of data leakage detection and prevention solutions
(2012)
D.P. Kingma et al.
Auto-encoding variational bayes
2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings
(2014)
Y. Sun et al.
Facial age synthesis with label distribution-guided generative adversarial network
IEEE Trans. Inf. Forensics Secur.
(2020)
E. Choi et al.
Generating multi-label discrete patient records using generative adversarial networks
Machine Learning for Healthcare Conference
(2017)
D.P. Kingma et al.
Adam: A method for stochastic optimization
3rd Intl Conf Learn Repres
(2015)
L. Pardo
Statistical Inference Based on Divergence Measures
(2018)
M. Esmaeilpour et al.
Detection of adversarial attacks and characterization of adversarial subspace
IEEE Intl Conf Acoust, Speech and Signal Process
(2020)
A. Even et al.
Economics-driven data management: an application to the design of tabular data sets
IEEE Trans Knowl Data Eng
(2007)
R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, arXiv preprint arXiv:2106.03253...
J.M. Clements, D. Xu, N. Yousefi, D. Efimov, Sequential deep learning for credit risk monitoring with tabular financial...
A.L. Buczak et al.
A survey of data mining and machine learning methods for cyber security intrusion detection
IEEE Communications surveys & tutorials
(2015)
D. Ulmer et al.
Trust issues: Uncertainty estimation does not enable reliable ood detection on medical tabular data
Machine Learning for Health
(2020)
L. Xu
Synthesizing tabular data using conditional GAN
(2020)
T. Aven et al.
Risk management
Risk Management and Governance
(2010)
W. Kornfeld et al.
Automatically locating, extracting and analyzing tabular data
Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
(1998)
F. Cartella et al.
Adversarial attacks for tabular data: application to fraud detection and imbalanced data
(2021)
A.P. Sheth et al.
Federated database systems for managing distributed, heterogeneous, and autonomous databases
ACM Computing Surveys (CSUR)
(1990)
Y.R. Wang, S.E. Madnick, et al., A polygen model for heterogeneous database systems: The source tagging perspective...
V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, G. Kasneci, Deep neural networks and tabular data: A survey,...
M.H. Loorak et al.
Exploring the possibilities of embedding heterogeneous data attributes in familiar visualizations
IEEE Trans Vis Comput Graph
(2016)
M.A. Khan et al.
Toward developing efficient conv-ae-based intrusion detection system using heterogeneous dataset
Electronics (Basel)
(2020)
R. Socher et al.
Deep learning for NLP (without Magic)
Tutorial Abstracts of ACL 2012
(2012)
M. Traquair et al.
Deep learning for the detection of tabular information from electronic component datasheets
2019 IEEE Symposium on Computers and Communications (ISCC)
(2019)
Y. Gorishniy, I. Rubachev, V. Khrulkov, A. Babenko, Revisiting deep learning models for tabular data, arXiv preprint...
S. Bourou et al.
A review of tabular data synthesis using GANs on an IDS dataset
Information
(2021)
I. Goodfellow et al.
Generative adversarial nets
Adv Neural Inf Process Syst
(2014)
D. Shanmugam, D. Blalock, G. Balakrishnan, J. Guttag, When and why test-time augmentation works, arXiv preprint...
M.S. Tsechansky et al.
Mining relational patterns from multiple relational tables
Decis Support Syst
(1999)
S. Alneyadi et al.
A survey on data leakage prevention systems
Journal of Network and Computer Applications
(2016)
C. Chow et al.
Approximating discrete probability distributions with dependence trees
IEEE Trans. Inf. Theory
(1968)
J. Zhang et al.
Privbayes: private data release via Bayesian networks
ACM Transactions on Database Systems (TODS)
(2017)
C. Ma et al.
VAEM: a deep generative model for heterogeneous mixed type data
Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual
(2020)
N. Park et al.
Data synthesis based on generative adversarial networks
Proc. VLDB Endow.
(2018)
A. Genevay, G. Peyré, M. Cuturi, Gan and vae from an optimal transport point of view, arXiv preprint arXiv:1706.01807...
L. Mi, M. Shen, J. Zhang, A probe towards understanding GAN and VAE models, arXiv preprint arXiv:1812.05676...
S. Feizi et al.
Understanding gans in the LQG setting: formulation, generalization and stability
IEEE J. Sel. Areas Inf. Theory
(2020)
I. Goodfellow et al.
Deep learning
(2016)
C. Shi et al.
Can-gan: conditioned-attention normalized gan for face age synthesis
Pattern Recognit Lett
(2020)
S. Liu et al.
Face aging with contextual generative adversarial nets
Proceedings of the 25th ACM international conference on Multimedia
(2017)
S. Fang et al.
Facial makeup transfer with gan for different aging faces
J Vis Commun Image Represent
(2022)

Cited by (6)

A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis
2024, Proceedings of the AAAI Conference on Artificial Intelligence
CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis
2024, IEEE Access
Super-Resolution Reconstruction of Particleboard Images Based on Improved SRGAN
2023, Forests
Raw Binary Data Usage with a Deep Learning Stack for Advanced Persistent Threat Attack Detection
2023, SSRN
Synthesizing Microbiome-Disease Association Data using GANs
2023, 2023 2nd International Conference on Advances in Computational Intelligence and Communication, ICACIC 2023
Poisoning the Competition: Fake Gradient Attacks on Distributed Generative Adversarial Networks
2023, Proceedings - 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems, MASS 2023

¹: Supplementary materials and source codes are available at this github-Repo.
