Elsevier

Pattern Recognition Letters

Volume 159, July 2022, Pages 204-210

Bi-discriminator GAN for tabular data synthesis

https://doi.org/10.1016/j.patrec.2022.05.023

Highlights

  • Developing a novel data preprocessing scheme using Chi-squared function.

  • Proposing a new conditional term for the generator network in a GAN setup.

  • Implementing a bi-discriminator GAN for stable training.

  • Designing straightforward architectures for generator and discriminator networks.

Abstract

This paper introduces a bi-discriminator GAN for synthesizing tabular datasets containing continuous, binary, and discrete columns. Our proposed approach employs an adapted preprocessing scheme and a novel conditional term using the $\chi^2_\beta$ distribution for the generator network to more effectively capture the input sample distributions. Additionally, we implement straightforward yet effective architectures for the discriminator networks, aiming to provide more discriminative gradient information to the generator. Our experimental results on four public benchmark datasets corroborate the superior performance of our GAN in terms of both the likelihood fitness metric and machine learning efficacy.
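For reference, the chi-squared density named in the abstract can be evaluated in a few lines. The sketch below is only illustrative: it reads the subscript β as the degrees of freedom (an assumption), and `masked_density_vector` is a hypothetical helper showing one way a binary mask could select which entries of a vector receive a density value. It is not the paper's definition of the conditional term.

```python
import math


def chi2_pdf(x: float, beta: float) -> float:
    """Chi-squared density with `beta` degrees of freedom (assumed reading of the subscript)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    if x < 0:
        return 0.0
    if x == 0:
        # Limit of the density at zero depends on beta.
        if beta > 2:
            return 0.0
        if beta == 2:
            return 0.5
        return math.inf
    half = beta / 2.0
    # Work in log space so large beta does not overflow the gamma function.
    log_pdf = (half - 1.0) * math.log(x) - x / 2.0 - half * math.log(2.0) - math.lgamma(half)
    return math.exp(log_pdf)


def masked_density_vector(values, mask, beta):
    """Hypothetical helper: density where mask is 1, zero elsewhere."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [chi2_pdf(v, beta) if m else 0.0 for v, m in zip(values, mask)]
```

Calling `masked_density_vector([0.5, 2.0, 4.0], [0, 1, 0], beta=2.0)` leaves only the second entry non-zero, which mirrors the idea of conditioning on a single selected column.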

Introduction

Tabular data is among the most common data modalities and is widely used to maintain the massive databases of financial institutions, insurance corporations, networking companies, healthcare industries, etc. [1], [2], [3], [4], [5]. These databases hold immense combinations of personal, confidential, and general records for every customer, client, and patient in different formats (e.g., continuous and discrete data types). Semantic patterns derived from such records help extract meaningful information that benefits companies in areas such as large-scale decision-making [6], risk management [7], long-term investment [8], and fraud or unusual activity detection [9]. However, exploiting these patterns is challenging because tabular datasets are heterogeneous [10], [11] and contain sparse representations of discrete and continuous records with low correlation compared to homogeneous datasets (e.g., speech, environmental audio, image, etc.) [12]. Unfortunately, extracting semantic relational patterns from heterogeneous datasets requires costly data-driven algorithms [13], [14].

During the last decade, and especially after the proliferation of deep learning (DL) algorithms, various cutting-edge approaches have been introduced for processing tabular datasets in different frameworks [15], [16], [17], particularly for synthesis purposes [18]. Presumably, this is due to two major applications. Firstly, complex DL algorithms configured in generative adversarial network (GAN) [19] synthesis platforms can be used to augment sparse datasets with low cardinality and poor sample quality [20]. Often, this data augmentation procedure effectively triggers the semantic pattern extraction operations. Secondly, GAN-based synthesis approaches yield models capable of generating new records that are similar, but not identical, to the ground-truth samples in the original databases. The synthesized records can be used for development purposes such as extracting relational patterns [21] without publicly releasing the original dataset. This helps protect the privacy of the people and clients whose information is stored in companies' tabular datasets. Our focus in this paper is on the latter application, since avoiding data leakage has been among the most pressing demands of some large-scale financial institutions [22], [23]. Briefly, we make the following contributions in this paper:

  • (i) developing a novel data preprocessing scheme and defining a new conditional term (vector) for the generator network configured in a GAN synthesis setup,

  • (ii) implementing a bi-discriminator GAN for providing more gradient information to the generator network in order to improve its performance in runtime,

  • (iii) designing straightforward architectures for generator and discriminator networks.
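To make contribution (ii) concrete, here is a minimal sketch of how a generator could receive feedback from two discriminators at once. The text available here does not state how the two signals are combined, so the weighted average, the non-saturating loss form, and the names `generator_loss` and `weight` are all assumptions made for illustration.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def generator_loss(scores_d1, scores_d2, weight=0.5):
    """Non-saturating generator loss mixed over two discriminators.

    scores_d1, scores_d2: raw scores (logits) that two hypothetical
    discriminators assign to the same batch of synthetic rows.
    weight: assumed mixing coefficient in [0, 1] for the first discriminator.
    """
    if not scores_d1 or len(scores_d1) != len(scores_d2):
        raise ValueError("both discriminators must score the same non-empty batch")
    # -log(sigmoid(s)) equals softplus(-s); average it over the batch.
    loss_1 = sum(softplus(-s) for s in scores_d1) / len(scores_d1)
    loss_2 = sum(softplus(-s) for s in scores_d2) / len(scores_d2)
    return weight * loss_1 + (1.0 - weight) * loss_2
```

With this form, the generator is penalized whenever either discriminator confidently rejects its samples, so a gradient signal remains available even if one discriminator saturates. Other combinations (for example, a minimum or a learned weighting) would be equally plausible under the description given.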

This paper is organized as follows. Section 2 summarizes synthesis approaches for tabular datasets based on state-of-the-art GANs. In Section 3, we explain the details of our proposed synthesis approach, and in Section 4, we report and analyze our experiments on four benchmark databases.

Section snippets

Background: Tabular data synthesis

Over the past years, variational autoencoder (VAE) [24] and GAN frameworks have been recognized as the state-of-the-art approaches for data fusion, particularly in the context of tabular data synthesis [6]. Fundamentally, these two frameworks are similar to the baseline classical Bayesian network (CLBN) [25] and its variants such as the private Bayesian network (PrivBN) [26]. Thus far, many modern forms of these generative models (e.g., [27], [28]) have been introduced and practically implemented

Proposed approach: Bi-discriminator class-conditional tabular GAN (BCT-GAN)

Our proposed tabular synthesis approach is based on CT-GAN [47], but with three major improvements: normalizing categorical records (preprocessing), defining the conditional term for the generator, and designing the architecture of the generator and the two discriminator networks. The motivation for employing two discriminators in our synthesis setup is the possibility of gaining more gradient information for the benefit of the generator during training. However, we do not provide

Experiments

This section provides the details of our experiments on four public datasets that have been used as benchmarks for tabular data processing, particularly for synthesis purposes [47]. Adult and Census are among the datasets selected from the UCI online repository [67], and they contain extensive combinations of continuous, binary, and complex discrete records. Following the baseline approaches

Discussion

This section discusses other important aspects of our proposed BCT-GAN. To avoid any potential misinterpretation, we split this section into independent subsections as follows.

Conclusion

In this paper, we introduced a bi-discriminator GAN for synthesizing large-scale tabular databases. The major novelties of our approach are, firstly, the development of a preprocessing scheme and, secondly, the definition of a solid conditional term for the generator network to improve the overall performance of the generative model. This term is a vector based on a masked function using the $\chi^2_\beta$ probability density function for more effectively constraining the generator and consequently better capturing the

Declaration of Competing Interest

We declare that we do not have any competing interests with any person or research lab.

Acknowledgment

This work was funded by Fédération des Caisses Desjardins du Québec, IVADO Institution, and Mitacs accelerate program with agreement number IT25105.

References (80)

  • A.L. Buczak et al., A survey of data mining and machine learning methods for cyber security intrusion detection, IEEE Communications Surveys & Tutorials (2015)
  • D. Ulmer et al., Trust issues: Uncertainty estimation does not enable reliable ood detection on medical tabular data, Machine Learning for Health (2020)
  • L. Xu, Synthesizing tabular data using conditional GAN (2020)
  • T. Aven et al., Risk management, Risk Management and Governance (2010)
  • W. Kornfeld et al., Automatically locating, extracting and analyzing tabular data, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998)
  • F. Cartella et al., Adversarial attacks for tabular data: application to fraud detection and imbalanced data (2021)
  • A.P. Sheth et al., Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys (CSUR) (1990)
  • Y.R. Wang, S.E. Madnick, et al., A polygen model for heterogeneous database systems: The source tagging perspective...
  • V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, G. Kasneci, Deep neural networks and tabular data: A survey,...
  • M.H. Loorak et al., Exploring the possibilities of embedding heterogeneous data attributes in familiar visualizations, IEEE Trans Vis Comput Graph (2016)
  • M.A. Khan et al., Toward developing efficient conv-ae-based intrusion detection system using heterogeneous dataset, Electronics (Basel) (2020)
  • R. Socher et al., Deep learning for NLP (without Magic), Tutorial Abstracts of ACL 2012 (2012)
  • M. Traquair et al., Deep learning for the detection of tabular information from electronic component datasheets, 2019 IEEE Symposium on Computers and Communications (ISCC) (2019)
  • Y. Gorishniy, I. Rubachev, V. Khrulkov, A. Babenko, Revisiting deep learning models for tabular data, arXiv preprint...
  • S. Bourou et al., A review of tabular data synthesis using GANs on an IDS dataset, Information (2021)
  • I. Goodfellow et al., Generative adversarial nets, Adv Neural Inf Process Syst (2014)
  • D. Shanmugam, D. Blalock, G. Balakrishnan, J. Guttag, When and why test-time augmentation works, arXiv preprint...
  • M.S. Tsechansky et al., Mining relational patterns from multiple relational tables, Decis Support Syst (1999)
  • S. Alneyadi et al., A survey on data leakage prevention systems, Journal of Network and Computer Applications (2016)
  • C. Chow et al., Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory (1968)
  • J. Zhang et al., Privbayes: private data release via Bayesian networks, ACM Transactions on Database Systems (TODS) (2017)
  • C. Ma et al., VAEM: a deep generative model for heterogeneous mixed type data, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual (2020)
  • N. Park et al., Data synthesis based on generative adversarial networks, Proc. VLDB Endow. (2018)
  • A. Genevay, G. Peyré, M. Cuturi, Gan and vae from an optimal transport point of view, arXiv preprint arXiv:1706.01807...
  • L. Mi, M. Shen, J. Zhang, A probe towards understanding GAN and VAE models, arXiv preprint arXiv:1812.05676...
  • S. Feizi et al., Understanding gans in the LQG setting: formulation, generalization and stability, IEEE J. Sel. Areas Inf. Theory (2020)
  • I. Goodfellow et al., Deep learning (2016)
  • C. Shi et al., Can-gan: conditioned-attention normalized gan for face age synthesis, Pattern Recognit Lett (2020)
  • S. Liu et al., Face aging with contextual generative adversarial nets, Proceedings of the 25th ACM international conference on Multimedia (2017)
  • S. Fang et al., Facial makeup transfer with gan for different aging faces, J Vis Commun Image Represent (2022)
1 Supplementary materials and source code are available at this github-Repo.
