1 Introduction

Data analytics can be defined as the process of generating useful insights from raw data sets. Central to the entire process is the concept of data wrangling, which refers to the task of understanding, interpreting, preparing a raw data set and turning it into a usable format. This task can lead to a frustrating and time-consuming process for large data sets, and even possibly for some small sized ones (Dasu and Johnson 2003). In order to design corresponding transformations easily, one may desire to know characteristics of given data, e.g., data types, missing data, and anomalies. However, raw data often do not contain any well-documented prior information such as meta-data.

A data type such as integer, string, date, and float can be can be ascribed to a data column. Our focus here is on inferring the data type for each column in a table of data. Numerous studies have attempted to tackle type inference, including wrangling tools (Raman and Hellerstein 2001; Kandel et al. 2011; Guo et al. 2011; Trifacta 2018; Fisher and Gruber 2005; Fisher et al. 2008), software packages (Petricek et al. 2016; Lindenberg 2017; Stochastic Solutions 2018; Döhmen et al. 2017a; Wickham et al. 2017) and probabilistic approaches (Valera and Ghahramani 2017; Vergari et al. 2019; Limaye et al. 2010). However, often they do not work very well in the presence of missing and anomalous data, which are commonly found in raw data sets due to the lack of a well-organized data collection procedure.

One can develop a more accurate system by detecting missing and anomalous entries and then performing type inference on the valid data values only. However, distinguishing such entries can be challenging when missing data is not encoded explicitly, but as e.g., -1 or -99 in a column of integer type. Pearson (2006) refers to such problems as the problem of disguised missing data. He further shows how interpreting missing data as valid values can mislead the analysis. Similar problems are also discussed by Qahtan et al. (2018). Existing type inference methods do not consider such problems comprehensively. For example, they may fail when a column of integers contains string-valued missing data encodings such as null and NA. In this work, we incorporate such known missing data encodings into our probabilistic model.

Chandola et al. (2009) define an anomaly as “a pattern that does not conform to expected normal behavior”. In our context, anomalies refer to unexpected or invalid entries for a given column, which might result from the data collection procedure, e.g., error and warning messages generated by servers or the use of open-ended entries while collecting data. The challenge here is mostly due to the difficulty of modeling normal and anomalous data values such that the corresponding entries become distinguishable. Moreover, one may need to consider separate strategies to model anomalies for different data types since their structures may vary with respect to the data types. For example, anomalies in date columns might have a different form than those in integer columns.

Up to now, too little attention has been paid to the aforementioned problems by the data mining community. To this end, we introduce ptype, a probabilistic type inference method that can robustly annotate a column of data. The proposed model is built upon Probabilistic Finite-State Machines (PFSMs) that are used to model known data types, missing and anomalous data. In contrast to the standard use of regular expressions, PFSMs have the advantage of generating weighted posterior predictions even when a column of data is consistent with more than one type model. Our method generally outperforms existing type inference approaches for inferring data types, and also allows us to identify missing and anomalous data entries.

In the rest of this paper, we first describe PFSMs and introduce our model (Sect. 2). We then present the related work (Sect. 3), which is followed by the experiments and the results (Sect. 4). Finally, we summarize our work and discuss the possible future research directions (Sect. 5).

2 Methodology

This section consists of four parts. Section 2.1 gives background information on PFSMs used to model regular data types, missing data and anomalies. Section 2.2 introduces our model that uses a mixture of PFSMs. Lastly, Sects. 2.3 and  2.4 describe respectively inference in and training of this model.

The data type, missing data and anomalies can be defined in broad terms as follows: The data type is the common characteristic that is expected to be shared by entries in a column, such as integers, strings, IP addresses, dates, etc., while missing data denotes an absence of a data value which can be encoded in various ways, and anomalies refer to values whose types differ from the given column type or the missing type.

In order to model the above types, we have developed PFSMs that can generate values from the corresponding domains. This, in turn, allows us to calculate the probability of a given data value being generated by a particular PFSM. We then combine these PFSMs in our model such that a data column can be annotated via probabilistic inference in the proposed model, i.e., given a column of data, we can infer column type and rows with missing and anomalous values.

2.1 Probabilistic finite-state machines

Finite-State Machines (FSMs) are a class of mathematical models used to represent systems consisting of a finite number of internal states. The idea is to model a system by defining its states (including initial and final states), transitions between the states, and external inputs/outputs. FSMs have a long history going back at least to Rabin and Scott (1959) and Gill (1962). A more recent overview of FSMs is given by Hopcroft et al. (2001).

In this study, we are interested in a special type of FSMs called Probabilistic Finite-State Machines (PFSMs) in which transitions between states occur w.r.t. probability distributions (Paz 1971; Rabin 1963). Vidal et al. (2005) discuss various PFSMs in detail. Following a similar notation, we define a PFSM as a tuple \(A=\left( \theta , \varSigma , \delta , I, F, T\right) \), where \(\theta \) is a finite set of states, \(\varSigma \) is a set of observed symbols, \(\delta \subseteq \theta \times \varSigma \times \theta \) is a set of transitions among states w.r.t. to observed symbols, \(I : \theta \rightarrow \mathbb {R}^+\) is the initial-state probabilities, \(F : \theta \rightarrow \mathbb {R}^+\) is the final-state probabilities, and \(T : \delta \rightarrow \mathbb {R}^+\) is the transition probabilities for elements of \(\delta \). During each possible transition between states, a symbol is emitted. We denote such an event by a triple \((q,\alpha ,q')\), which corresponds to a transition from a state \(q \in \theta \) to a state \(q' \in \theta \) emitting a symbol \(\alpha \in \varSigma \). Note that \(\delta \) and T store respectively all the possible triples and their corresponding probabilities.

A PFSM has to satisfy certain conditions. First, the sum of the initial-state probabilities has to be equal to 1. Secondly, at each state q, it can either transition to another state \(q' \in \theta \) and emit a symbol \(\alpha \in \varSigma \), or stop at state q without emitting any symbol. This condition can be expressed mathematically as \(F(q) + \sum _{\alpha \in \varSigma , q' \in \theta } T(q,\alpha ,q')=1\) for each state q, where \(T(q,\alpha ,q')\) represents the probability of a triple \((q,\alpha ,q')\), and F(q) denotes the final-state probability of state q. Based on the definition given above, a PFSM can generate a set of strings, denoted by \(\varSigma ^*\).Footnote 1 For each string \(s \in \varSigma ^*\), we can calculate a probability that represents how likely it is for a given PFSM to generate the corresponding string.

Note that PFSMs resemble Hidden Markov Models (HMMs) except that we now have the final state probabilities. Recall that each state in a HMM has a probability distribution over the possible states that it can transition to. In PFSMs, each state also takes into account the probability of being a final state. Hence, the probability distribution is not only defined over the possible transitions to next states; it also includes the case of the current state being a final state. On the other hand, emissions are carried out similarly in PFSMs and HMMs: the observations are generated conditioned on the hidden states in HMMs; an observation is emitted through a transition in PFSMs since each transition is associated with a symbol. A detailed analysis of the link between PFSMs and HMMs can be found in Dupont et al. (2005).

One can develop PFSMs to represent types described previously and calculate the probabilities for each observed data value. We now explain the corresponding PFSMs in detail.

2.1.1 Representing types with PFSMs

Here, we show how a PFSM can be used to model a data type. We divide types into two groups: (i) primitive types consisting of integers, floats, Booleans and strings and (ii) complex types such as IP addresses, email addresses, phone numbers, dates, postcodes, genders and URLs. The details regarding the implementation of the corresponding PFSMs can be found in “Appendix A”.

Consider integer numbers whose domain is \([-\infty ,\infty ]\). We can represent the corresponding PFSM as in the diagram given in Fig. 1. The machine has two initial states, namely \(q_0\) and \(q_1\), and one final state \(q_2\). Here, \(q_0\) and \(q_1\) respectively allow us to represent integer numbers with a sign (plus or minus), or without any sign.Footnote 2 The machine eventually transitions to the state \(q_2\), which stops with a stopping probability \(F(q_2) = P_{stop}\). Otherwise, it transitions to itself by emitting a digit with an equal probability, \((1-P_{stop})/10\). Similarly, we can develop a PFSM to represent each one of the other column types.

Fig. 1
figure 1

Graphical representation of a PFSM with states \(\theta = \{q_0, q_1, q_2\}\) and alphabet \(\varSigma = \{\texttt {+}, \texttt {-}, 0, 1, \dots , 9\}\) where p denotes \(\frac{1-P_{stop}}{10}\)

The PFSM for missing values can be developed by using a predefined set of codes such as \(\{\texttt {-1}, \texttt {-9}, \texttt {-99}, \texttt {NA}, \texttt {NULL}, \texttt {N/A}, \texttt {-}, \dots \}\). It assigns non-zero probabilities to each element in this set. Note that the corresponding PFSM already supports a wide range of codes, but it can be easily extended by the user to incorporate other known missing data encodings, leading to semi-automated operation.

In order to model anomalies, we adapt the idea of an X-factor proposed by Quinn et al. (2009). We define a machine with the widest domain among all the PFSMs that supports all of the possible characters. This choice of PFSM lets the probabilistic model become more robust against anomalies since it assigns probabilities to data values that do not belong to any of the PFMSs representing the data types. Note that as this PFSM covers a wider domain, it will assign lower probabilities to known data types than the specific models.

Constructing PFSMs for complex types might require more human engineering than the other types. We reduce this need by building such PFSMs automatically from corresponding regular expressions. We first convert regular expressions to FSMs by using greenery (2018) (see the function named to_fsm()). We then build their probabilistic variants, where the parameters are assigned equal probabilities. Note that these parameters can be updated with the training.

2.2 The proposed model

We propose a new probabilistic mixture model with a noisy observation model, allowing us to detect missing and anomalous data entries. Our model first generates a column type from a set of possible regular data types. This is followed by a “deficiency” process that can potentially change the data type of each row. Consequently, each row might have a different type rather than the generated column type. The observation model then generates a data value for each entry according to its type. We now introduce our notation to represent this process.

We assume that a column of data \(\mathbf{x} =\{x_i\}_{i=1}^N\) has been read in, where each \(x_i\) denotes the characters in the ith row. We propose a generative model with a set of latent variables \(t \in \{1,2,\ldots ,K\}\) and \(\mathbf{z} =\{z_i\}_{i=1}^N\), where t and \(z_i\) respectively denote the data type of a column and its ith row. Here, N is the number of rows in a data column, and K is the number of possible data types for a column. We also use the additional missing and anomaly types, denoted by m and a respectively, and described above. Note that \(z_i\) can be of type m or a alongside a regular data type, i.e., \(z_i \in \{1,2,\ldots ,K,m,a\}\). This noisy observation model allows a type inference procedure robustified for missing and anomalous data values.

Hence the model has the following generative process:

$$\begin{aligned} \text {column type } t\sim & {} \mathcal {U}(1,K), \nonumber \\ \text {row type } z_i= & {} {\left\{ \begin{array}{ll} t &{} \text {with probability } \pi _t^t\ ,\\ m &{} \text {with probability } \pi _t^m,\\ a &{} \text {with probability } \pi _t^a\ ,\\ \end{array}\right. } \nonumber \\ \text {row value } x_i\sim & {} p(x_i | z_i), \end{aligned}$$

where \(\varvec{\varPi }\) and \(p(x_i|z_i)\) are respectively the model parameter and the observation model. \(\mathcal {U}\) denotes a discrete Uniform distribution. Here \(\pi _t^t + \pi _t^m + \pi _t^a = 1\) for each column type t. Since entries are often expected to be of a regular data type rather than missing or anomaly types, we favour regular types during inference by using lower coefficients for missing and anomaly types, i.e., \(\pi _t^m < \pi _t^t\) and \(\pi _t^a < \pi _t^t\). These weight parameters \(\varvec{\varPi }\) are assumed to be fixed and known. Even though one can also learn the weights, this may not be vital as long as the coefficients of the regular types are larger than the others.

We use a uniform distribution for column types as in most cases we do not have any prior information regarding the type of a column that would allow us to favour a particular one. To represent the conditional distribution \(p(x_i | z_i)\), we have developed PFSMs as described above.

2.3 Inference

Given a data column \(\mathbf{x} \), our initial goal is to infer the column type t, which is cast to the problem of calculating the posterior distribution of t given \(\mathbf{x} \), denoted by \(p(t|\mathbf{x} )\). Then, we assume that each row can be of three types: (i) the same as the column type, (ii) the missing type, and (iii) the anomaly type. In order to identify missing or anomalous data entries, we calculate the posterior probabilities of each row type, namely \(p(z_i|t, \mathbf{x} )\). In this section, we now briefly discuss the corresponding calculations; detailed derivations are presented in “Appendix D”.

Assuming that entries of a data column are conditionally independent given t, the posterior distribution of column type t is given as follows:

$$\begin{aligned} p(t=k|\mathbf{x} ) \propto p(t=k) \prod _{i=1}^N&\Big ( \pi _k^k p(x_i | z_i = k) \; \nonumber \\&\quad +\,\pi _k^m p(x_i | z_i = m) + \pi _k^a p(x_i | z_i = a) \Big ), \end{aligned}$$
(1)

which can be used to estimate the column type t, since the one with maximum posterior probability is the most likely data type corresponding to the column \(\mathbf{x} \).

As per Eq. 1, the model estimates the column type by considering all the data rows, i.e., having missing data or anomalies does not confuse the type inference. Note that such entries would have similar likelihoods for each column type, which allows the model to choose the dominant data type for regular entries.

Following the inference of column type, we can also identify entries of \(\mathbf{x} \) which are more likely to be missing or anomalies rather than the inferred type. For this, we compare the posterior probabilities of each row type \(z_i\) given \(t=k\) and \(x_i\), namely \(p(z_i=j | t=k, x_i)\), which can be written as:

$$\begin{aligned} p(z_i=j | t=k, x_i) = \frac{\pi ^{j}_{k} p(x_i | z_i=j)}{\sum _{\ell \in \{ k, m, a\}} \pi ^{\ell }_{k} p(x_i | z_i=\ell )}. \end{aligned}$$
(2)

2.3.1 Complexity analysis

The computational bottleneck in the inference is the calculation of \(p(\mathbf{x} | t = k)\) for each type k, which is the calculation of the probability assigned for a data column \(\mathbf{x} \) by the kth PFSM. Note that this calculation can be carried out by taking into account the counts of the unique data entries, for efficiency. Denoting the uth unique data value by \(x_u\), we need to consider the complexity of calculating \(p(x_u | t = k)\) which can be done via the PFSM Forward algorithm. Each iteration of the algorithm has the complexity of \(O(M_k^2)\) where \(M_k\) is the number of hidden states in the kth PFSM. As the number of iterations equals to the length of \(x_u\) denoted by L, the overall complexity of the inference becomes \(O(U K M^2 L)\), where U is the number of unique data entries, K is the number of types, and M is the maximum number of hidden states in the PFSMs.

2.4 Training of the model

The aim of the training is to tune the probabilities assigned by the PFSMs so that column types are inferred accurately. Given a set of columns and their annotated column types, the task is to find the parameters of the PFSMs (i.e., the initial-state, the transition and the final-state probabilities) that allow the “correct” machine to give higher probabilities to the observed entries. This is crucial as multiple PFSMs can assign non-zero probabilities for certain strings, e.g., \(\texttt {1}\), \(\texttt {True}\), etc.

We employ a discriminative training procedure on our generative model, as done in discriminative training of HMMs (Bahl et al. 1986; Jiang 2010; Brown 1987; Nádas et al. 1988; Williams and Hinton 1991). This is shown to be generally superior to maximum likelihood estimations (Jiang 2010), since a discriminative criterion is more consistent with the task being optimized. Moreover, it allows us to update not only the parameters of the “correct” PFSM but also the parameters of the other PFSMs given a column of data and its type, which in turn helps the correct one to generate the highest probability.

We choose \(\sum _{j=1}^M \log p(t^j|\mathbf {x}^j)\) as the objective function to maximize, where \(\mathbf {x}^j\) and \(t^j\) respectively denote the jth column of a given data matrix X and its type, and M is the number of columns. We then apply the Conjugate Gradient algorithm to find the parameters that maximize this objective function (please see “Appendix C” for detailed derivations of the gradients).

We study different parameter settings for our model. We first explore tuning the parameters by hand to incorporate certain preferences over the types, e.g., Boolean over Integer for \(\texttt {1}\). Then, we learn the parameters via the discriminative training described above where the parameters are initialized at the hand-crafted values. Note that due to the absence of explicit labels for the missing and anomaly types, these are not updated from the hand-crafted parameters. We have also employed the training by initializing the parameters uniformly. However, we do not report these results as they are not competitive with the others.

As the PFSMs are generative models it would be possible to train them unsupervised, to maximize \(\sum _{i, j} \log p(x^j_i)\), where \(p(x^i_j)\) is defined as a mixture model over all types (including missing and anomaly) for the ith row and jth column. The component PFSMs could then be updated using the Expectation-Maximization (EM) algorithm. However, such training would be unlikely to give as good classification performance as supervised training.

3 Related work

Numerous studies have attempted to tackle type inference, including wrangling tools and software libraries. Such approaches have made limited use of probabilistic methods, while the existing probabilistic approaches do not address the same type inference problem that we do. Below we discuss the capabilities of these existing methods in the presence of missing and anomalous data.

Type inference is commonly carried out by validation functions and regular expressions. For example, several wrangling tools, including Trifacta (2018) and its preceding versions (Raman and Hellerstein 2001; Kandel et al. 2011; Guo et al. 2011) apply validation functions to a sample of data to infer types, e.g., assign the one validated for more than half of the non-missing entries as the column type (Kandel et al. 2011). When multiple types satisfy this criterion, the more specific one is chosen as the column type, e.g., an integer is assumed to be more specific than a float. Trifacta supports a comprehensive set of data types, and provides an automatic discrepancy detector to detect errors in data (Raman and Hellerstein 2001). However, in our experience, its performance on type inference can be limited on messy data sets.

Fisher and Gruber (2005) and Fisher et al. (2008) have developed data description languages for processing ad hoc data (PADS) which enables generating a human-readable description of a dataset, based on data types inferred using regular expressions. However, their focus is on learning regular expressions to describe a dataset, rather than classifying the data columns into known types. Unlike the PADS library which parses a data file itself, Test-Driven Data Analytics (TDDA, Stochastic Solutions 2018) uses the Pandas CSV reader to read the data into a data frame. It then uses the Pandas dtypes attributes,Footnote 3 to determine the data type of the columns. However, this leads to a poor type detection performance since the Pandas reader is not robust against missing data and anomalies, where only empty string, NaN and NULL are treated as missing data.

Petricek et al. (2016) propose another use of regular expressions with F#, where types, referred as shapes, are inferred w.r.t. a set of preferred shape relations. Such relations are used to resolve ambiguous cases in which a data value fits multiple types. This procedure allows integrating inferred types into the process of coding, which can be useful to interact with data. However, it does not address the problems of missing and anomalous data comprehensively, where only three encodings of missing data, NA, #NA and :, are supported. This may lead to poor performance on type inference when missing data is encoded differently, or there are anomalies such as error messages.

A set of software packages in R and Python can also infer data types. messytables (Lindenberg 2017) determines the most probable type for a column by weighting the number of successful conversions of its elements to each type. This strategy can potentially help to cope with certain data errors; however, it might be difficult to find an effective configuration of the weights for a good performance. Moreover, it cannot handle the disguised missing data values, e.g., -1 in a column of type integer, which can be misleading for the analysis. Döhmen et al. (2017b) propose a CSV parser named hypoparsr that treats type inference as a parsing step. It takes into account a wide range of missing data encodings; however, it does not address anomalies, leading to poor performance on type inference. readr (Wickham et al. 2017) is an R package to read tabular data, such as CSV and TSV files. However, in contrast to hypoparsr, a limited set of values are considered as missing data, unless the user specifies otherwise. Furthermore, it employs a heuristic search procedure using a set of matching criteria for type inference. The details regarding the criteria are given in Wickham and Grolemund (2016). The search continues until one criterion is satisfied, which can be applied successfully in certain scenarios. Whenever such conditions do not hold, the column type is assigned to string.

Valera and Ghahramani (2017) propose a model for discovery of statistical types, but this model is tackling a very different problem than the one that we address. Their method assumes that all of the entries in a column contain numerical data values, i.e., it cannot handle data types such as string, date, etc. Moreover, it can only handle clean integer and float columns that do not have any missing data and anomalies. Given this, they address the problem of making fine-grained distinctions between different types of continuous variables (real-valued data, positive real-valued data and interval data) and discrete variables (categorical data, ordinal data and count data). It would be possible to combine their method with ours to provide more fine-grained distinctions of real-valued and discrete-valued data once the issues of detecting missing and anomalous data have been resolved.

Valera and Ghahramani (2017) is extended in Vergari et al. (2019) where the proposed method can be used to impute missing data and detect numerical anomalies. However, they also assume that each entry of a data column contains a numerical value or an explicit missing data indicator. Their method cannot be used when a data column contains non-numerical missing data encodings and anomalies. Moreover, the authors address a completely different missing data problem, i.e., they tackle missing data imputation rather than missing data detection. This means that the proposed method cannot detect whether an entry is missing or not, but it can impute missing data once the location of missing data is known. Lastly, their focus on anomaly detection is also different than ours in the sense that they attempt to detect numerical outliers, but do not detect string anomalies in a column.

Limaye et al. (2010) propose a log-linear model based method to annotate a table in terms of the semantic column type, cell entity and column relations, given an ontology of relevant information. For example, given a table that contains information about actors such as names and ages, the task is to find semantic column types such as actor and age, the entities each row refers to, and the relationship of actor-age between two columns. Even though the scope of these annotations are wider than ours, missing data and anomalies are not considered explicitly, and an appropriate ontology is required.

4 Experiments

In this section, we first in Sect. 4.1 describe the datasets and evaluation metrics used, and then in Sect. 4.2 compare ptype with competitor methods on two tasks: (i) column type inference and (ii) type/non-type inference. Lastly, we present a qualitative evaluation of our method on challenging cases in Sect. 4.3. The goal of our experiments is to evaluate (i) the robustness of the methods against missing data and anomalies for column type inference and (ii) the effectiveness of type/non-type inference. These are evaluated both quantitatively (Sect. 4.2) and quantitatively (Sect. 4.3). We release our implementation of the proposed method at https://github.com/tahaceritli/ptype-dmkd.

4.1 Experimental setup

We have trained ptype on 25 and tested on 43 data sets obtained from various sources including UCI ML,Footnote 4https://data.gov.uk, https://ukdataservice.ac.uk and https://data.gov. The data types were annotated by hand for these sets. We also annotated each dataset in terms of missing data and anomalies by using the available meta-data and checking the unique values.

On column type inference, we compare our method with F# (Petricek et al. 2016), hypoparsr (Döhmen et al. 2017a), messytables (Lindenberg 2017), readr (Wickham et al. 2017), TDDA (Stochastic Solutions 2018) and Trifacta (2018). Note that some of the related works are not directly applicable to this task, and these are not included in these experiments. For example, we are not able to use Raman and Hellerstein (2001), Kandel et al. (2011) and Guo et al. (2011) as they are no longer available. However, we use their latest version Trifacta in our experiments. We also exclude the PADS library (Fisher and Gruber 2005; Fisher et al. 2008), since it does not necessarily produce columns and their types. The methods proposed by Valera and Ghahramani (2017) and Vergari et al. (2019) are also not applicable to this task. First, they do not consider data types of Boolean, string, date. Secondly, they only address integer and float columns that do not contain any non-numerical missing data or anomalies, which are commonly found in real-world datasets. Note that Vergari et al. (2019) do not address missing data detection but missing data imputation, and can only handle numerical outliers but not non-numerical outliers, whereas Valera and Ghahramani (2017) do not address these questions. Lastly, we exclude the method presented by Limaye et al. (2010) as their goal is to infer semantic entity types rather than syntactic data types.

On type/non-type inference, we evaluate the performance of our method in detecting missing data and anomalies. We label such entries as non-type and classify each entry as either type or non-type. On this task, we compare our method with Trifacta only, as it is the leading competitor method on column type inference, and the others do not address this task comprehensively.

4.1.1 Data sets

We have conducted experiments on the data sets chosen according to two criteria: (i) coverage of the data types and (ii) data quality. In our experiments, we consider five common column types, including Boolean, date, float, integer and string. Table 1 presents the distribution of the column types found in our data sets. Any other columns not conforming to the supported data types are omitted from the evaluations. Secondly, we have selected messy data sets in order to evaluate the robustness against missing data and anomalies. The fraction of the non-type entries in the test datasets can be as large as 0.56, while the average fraction is 0.11 (see Fig. 2 for the details). Note that available data sets, their descriptions and the corresponding annotations can be accessed via https://goo.gl/v298ER.

Fig. 2
figure 2

Fraction of the non-type entries in a dataset, calculated by aggregating over its columns. Note that ‘overall’ denotes the fraction after aggregating over the datasets

Table 1 Histogram of the column types observed in the training and the test data sets

4.1.2 Evaluation metrics

For column type inference, we first evaluate the overall accuracy of the methods on type inference by using the accuracy. However, this evaluation may not be informative enough due to the imbalanced data sets. Note that the task of type inference can be seen as a multi-class classification problem, where each column is classified into one of the possible column types. In order to measure the performance separately for each type, we follow a one-vs-rest approach. In such cases a common choice of metric is the Jaccard index J (see e.g., Hand et al. (2001, sec 2.3)) defined as \(\text {TP}/(\text {TP} + \text {FP} + \text {FN})\), where TP, FP and FN respectively denote the number of True Positives, False Positives and False Negatives. (Note that one-vs-rest is an asymmetric labelling, so True Negatives are not meaningful in this case.)

To measure the performance on type/non-type inference, we report Area Under Curve (AUC) of Receiver Operating Characteristic (ROC) curves, as well as the percentages of TPs, FPs and FNs. Note that here we denote non-type and type entries as Positive and Negative respectively.

4.2 Quantitative results

We present quantitative results on two tasks: (i) column type inference and (ii) type/non-type detection. In column type inference, we evaluate the performance of the methods on detecting data types of columns, and investigate their scalability. Then, in type/non-type detection we evaluate their capability of detecting missing data and anomalies.

4.2.1 Column type inference

We present the performance of the methods in Table 2, which indicates that our method performs better than the others for all types, except for the date type where it is slightly worse than Trifacta. In the table ptype denotes the discriminatively trained model, and ptype-hc the version with hand-crafted parameters. These improvements are generally due to the robustness of our method against missing data and anomalies.

Notice that the discriminative training improves the performance, specifically for Boolean and integer types. This result shows that the corresponding confusions can be reduced by finding more optimal parameter values, which can be difficult otherwise. Note that the training has a slightly negative effect on float and string types, but it still performs better than the other methods. These changes can be explained by the fact that the cost function aims at decreasing the overall error over columns rather than considering individual performances.

Table 2 Performance of the methods using the Jaccard index and overall accuracy, for the types Boolean, Date, Float, Integer and String
Fig. 3
figure 3

Normalized confusion matrices for a ptype and b Trifacta plotted as Hinton diagrams, where the area of a square is proportional to the magnitude of the entry

Figure 3 shows normalized confusion matrices for ptype and Trifacta, normalized so that a column sums to 1. This shows that they both tend to infer other column types as string even if they are not. However, ptype has few of such confusions, especially when the true type is Boolean or float.

It is noticeable from Table 2 that dates are difficult to infer accurately. Detailed inspection shows that this is due to non-standard formats used to denote dates. We could improve the PFSM for dates in ptype to include such formats, but we have not done so, to avoid optimizing on test datasets. The performance of Trifacta on dates can be explained by the engineering power behind the tool and indicates its capability to represent non-standard formats using validation functions.

To determine whether the column type predictions of ptype and Trifacta are significantly different, we apply the McNemar’s test (see e.g., Dietterich 1998), which assumes that the two methods should have the same error rate under the null hypothesis. We compute the test statistic \((|n_{01}-n_{10}|-1)^2)/(n_{01}+n_{10})\), where \(n_{01}\) and \(n_{10}\) denote the number of test columns misclassified by only Trifacta, and by only ptype respectively. In our case, \(n_{01}\) and \(n_{10}\) are respectively equal to 19 and 6, which results in a statistic of 5.76. If the null hypothesis is correct, then the probability that this statistic is greater than 3.84 is less than 0.05 (Dietterich 1998). Thus this result provides evidence to reject the null hypothesis, and confirms that the methods are statistically significantly different to each other.

The large performance gap for Booleans suggests that our method handles confusions with integers and strings better. The analysis shows that such confusions occur respectively in the presence of \(\{\texttt {0},\texttt {1}\}\), and \(\{\texttt {Yes}, \texttt {No}\}\).Footnote 5 We further note that F#, messytables and readr perform similarly, especially on floats, integers and strings; which is most likely explained by the fact that they employ similar heuristics.

Since hypoparsr names column types differently, except for the date type, we need to rename the annotations, and re-evaluate the methods in order to compare them with hypoparsr. It refers to Boolean and string respectively as logical and text. Moreover, integer and float are grouped into a single type called numeric. The resulting evaluations, which are reported in Table 6 in “Appendix F”, shows a similar trend to as before in that our method performs better. However, we see some variations, which result from the fact that we now evaluate on a smaller number of data sets since hypoparsr, which is said to be designed for small sized files, was able to parse only 33 out of the 43 test data sets.Footnote 6 This left us 358 columns including 29 date, 21 logical, 159 numeric and 149 text. Lastly, we observe that hypoparsr results in a higher number of confusions by inferring the type as integer, whereas the true type is text. Such cases mostly occur when the data values consist of a combination of numeric and text, e.g., ID and address columns.

Fig. 4
figure 4

The time in seconds taken to infer a column type with ptype, as a function of the number of unique values U in the column. Also shown is the line \(c_0 + c_1 U\), where \(c_0\) is a small constant

4.2.2 Scalability

We describe the complexity of the inference in our model in Sect. 2.3. Here, we demonstrate its scaling by measuring the time it takes to infer the column types for each test dataset.

Recall that the complexity is \(O(U K M^2 L)\), where U is the number of unique data entries, K is the number of data types, M is the maximum number of hidden states in the PFSMs, and L is the maximum length of data values. Notice that the complexity depends on data through U and L, and does not necessarily increase with the number of rows. In fact, it grows linearly with the number of unique values assuming L is constant. As shown in Fig. 4, the runtime for ptype is upper bounded by a line \(c_0 + c_1 U\), where \(c_0\) is a small constant. The runtime thus scales linearly with the number of unique values U, handling around 10K unique values per second. The variations can be explained by changes in L.

We also report the size of the datasets and the times the methods take in “Appendix G”. We have observed similar performance with messytables, whereas readr and TDDA seem much faster even though they do not only predict the data types but also parse a given dataset. On the other hand, hypoparsr takes much longer compared to the others. Lastly, we measure the processing times for Trifacta via command line. We have observed that Trifacta takes less time than hypoparsr, but it often takes longer compared to the other methods.

Fig. 5
figure 5

AUC(ptype)—AUC(Trifacta) plotted for each test dataset

4.2.3 Type/non-type inference

Trifacta labels each entry in a data table either as type or non-type, whereas our model presents a probability distribution over the two labels. One could apply a threshold on these probabilities in order to assign a label to each entry. Here, we demonstrate how the methods behave under different thresholds. We aggregate the entries of each dataset over its columns and compute the ROC curve for each method.

Figure 5 presents the difference AUC(ptype)—AUC(Trifacta) per dataset. Note that we exclude five datasets as the corresponding AUCs are undefined due to the definition of True Positive Rate (\(\frac{TP}{TP+FN}\)) which becomes undefined when both TP and FN are equal to zero, which occurs naturally when a dataset does not contain any missing data and anomalies. The average of AUCs of the remaining datasets are respectively 0.77 and 0.93 for Trifacta and ptype. To compare these two sets of AUCs, we apply a paired t-test, which results in the t-statistic of 4.59 and p-value of 0.00005. These results reject the null hypothesis that the means are equal, and confirm that they are significantly different.

Lastly, we compare Trifacta and ptype in terms of percentages of TPs, FPs and FNs which are presented in Table 3, where the labels for ptype are generated by applying a threshold of 0.5 on the posterior distributions. Note that here we aggregate the predictions over the datasets. As per the table, ptype results in a higher number of FPs than Trifacta, but Trifacta produces a higher number of FNs and a lower number of TPs than ptype. Note that here we denote non-type and type entries as Positive and Negative respectively. This indicates that ptype is more likely to identify non-types than Trifacta, but it can also label type entries as non-types more often. However, the overall performance of ptype is better than Trifacta, as we have also observed in the AUCs.

Table 3 The percentages of FPs, FNs and TPs for Trifacta and ptype on type/non-type detection

We now present two examples to give insight into Table 3. Consider the “Substantial_growth_of_knowledge_skills” column of the mass_6 dataset which consists of floating-point numbers. However, most of the 3148 entries are non-types, i.e., empty entries, -, N/A and NA which occur 780, 470, 1063 and 424 times respectively. Such non-type entries are labelled correctly by ptype, whereas Trifacta can only classify the empty entries correctly as non-type. The remainder of the non-type entries are considered to be valid type entries since they conform with the column type which is inferred as string by Trifacta. Note that this confusion is highly likely due to the low number of floats in the column. Here, Trifacta results in a high percentage of FNs, and a lower percentage of TPs than ptype.

The second example illustrates the extent of cases in which ptype results in FPs. For example, the “CAUSE_NAME” column in the data_gov_323_1 dataset consists of data values such as Cancer, Stroke, and Suicide etc. Here, ptype and Trifacta infer the column type as string, and label such entries correctly as type entries. However, Alzheimer’s disease and Parkinson’s disease are misclassified as non-types by ptype (1,860 FPs out of 13,260 entries) as our string model does not support the apostrophe. To handle this, we could include in the corresponding alphabet, but we also find it helpful to detect “true” non-type entries having that character. We believe that such cases should be left to users with domain knowledge as they can easily extend the alphabet of the string type.

4.3 Qualitative results

We now give some examples of predicting missing and anomalous data.

Missing Data We support an extensive range of values that are used to denote missing data. Note that multiple such encodings can be detected at the same time. Consider a T2Dv2 datasetFootnote 7 where missing entries are denoted by encodings such as NULL and n/a. Our method can successfully annotate such entries as shown in Fig. 6.

Fig. 6
figure 6

Annotating a subset of a T2D data set using ptype as normal, missing, and anomalous entries, denoted by green, yellow, and red, respectively (Color figure online)

Next, we show how our model approaches unseen missing data encodings which are not explicitly considered as missing data in our model, but can be handled with the anomaly type. For example, “Time of Incident” column of Reported Taser 2015 data set is expected to contain date values. However, some entries have the value Unknown. Thanks to the PFSM for anomaly type, our model detects such encodings as anomalies as long as the “correct” column type is not string, resulting in a better performance of type inference.

Another advantage of ptype is that it can detect missing data even if their types fit the column type. Consider an integer-typed column Stake in the Rodents data set where -99 is used to denote missing entries. Our method flags those entries as missing instead of treating them as regular integers since the missing type accepts -99. Similarly, we are able to detect string-valued missing data in string-typed columns. When the column type, however, is not string, our model may result in several false alarms by flagging normal entries as missing data. For example, integer values of -1, -99, etc. can also represent normal data instances. We will investigate this using methods as in Qahtan et al. (2018), and develop missing data models specific to each data type, in order to improve this issue.

Anomalies As mentioned earlier, we can also use ptype to detect anomalies. Our model flags such entries automatically since the anomaly model covers a wide range of values, including those that are not supported by the column type.

Figure 6 shows the capability of detecting anomalies when the column type is string. As we do not have the character & in the alphabet of the PFSM for strings, the anomaly machine allows us to detect the anomalies in the “country” column. Similarly, the characters “refer to euro” are not supported by the PFSM for integers, letting us detect the corresponding entry as anomalous. Moreover, we can separately detect the anomalous and missing data as in the “selling rate” column.

Interestingly, we notice that the question mark character ? is used to express the doubt about the collected data in the HES data set, where a data entry contains the value of 6?. We can also see that missing data encodings not incorporated in our missing data model such as NOT AVAILABLE, ?? (double question marks), and , are detected as anomalies.

We now discuss some other aspects of our model, such as ambiguities that can occur, and how they can be handled; and failure cases which can be challenging to avoid.

4.3.1 The limitations of our work

Ambiguous Cases In certain cases the posterior probability distribution over types is not heavily weighted on a particular type. For example, consider a column of data that contains values of NULL and 1. This column could fit multiple PFSMs as 1 can either be an integer, a float, a string, or a Boolean value. However, we have assumed that 0 and 1 are more likely to be indicators of Booleans and have thus tuned the parameters of the corresponding machine such that the probabilities associated with 0 and 1 are slightly higher than the ones assigned by the other machines. This leads to a posterior probability distribution with values of 0.29, 0.26, and 0.44 respectively for integers, floats, and Booleans. One can also exploit the probabilistic nature of our model to treat such ambiguous cases differently. Instead of directly taking the type with the highest posterior probability as the column type, one can detect such ambiguities and then exploit user feedback to improve the decision.

A uniformly distributed posterior distribution over types is observed when all of the entries of a column are assigned zero probabilities by the PFSMs of the regular data types. This result is not surprising as we only support a limited set of characters in the regular machines, i.e., the widest alphabet among the regular data types, which is of the string type, consists of the letters, digits and a set of punctuations. For example, “per capitagdp (us$)[51]” column of a T2D data setFootnote 8 has values such as $2949.57. Similarly, the 23rd column of the fuel data set contains values such as “( 3.0L) ”. Note that the anomaly type still assigns positive probabilities to these entries, as its alphabet includes all the possible characters. However, when its weight \(\pi ^a_t\) is the same regardless of the type, the corresponding posterior probabilities become equal. One can handle such cases by learning different weight parameters for the types. However, here, to represent such unsupported/unknown types, we have developed X-type, an anomaly-like PFSM which is used as a regular column type. This can be further exploited to create new data types, which can be automated via a probabilistic training procedure.

Failure Cases We now present two cases for which ptype-hc fails to infer the column types correctly. For example, consider a column (“BsmtHalfBath” of Housing Price data set) which denotes the number of half bathrooms in the basement of houses, consisting of the values in \(\{\texttt {0}, \texttt {1}, \texttt {2}\}\). In this case, ptype-hc puts higher posterior probability on the Boolean type, whereas the actual type is integer. This may not be surprising, considering the fact that 0 and 1 have higher probabilities of being a Boolean and 2, occurring only twice out of 1460 entries, is treated as an anomaly. However, ptype can correct this failure thanks to the discriminative training. Note that the competitor methods fail in this case.

After the evaluations, we have discovered a set of cases we have not considered in the beginning. For example, several Boolean columns of the Census Income KDD data set have leading whitespace as in “ No”, “ Yes”. Our model infers the types of these Boolean columns as string since such whitespace is not considered in the Boolean type. In order to avoid optimizing the model on the test sets, we have not addressed such cases. However, they can easily be handled by updating the corresponding PFSMs to include the whitespace. Note that the other methods also detect the column types as string in these cases.

There are cases of non-standard dates we do not currently handle. For example, dates are sometimes divided into multiple columns as day, month and year. Our model detects day and month columns separately as integers. One could develop a model that checks for this pattern, making use of constraints on valid day, month and year values.

5 Summary

We have presented ptype, a probabilistic model for column type inference that can robustly detect the type of each column in a given data table, and label non-type entries in each column. The proposed model is built on PFSMs to represent regular data types (e.g., integers, strings, dates, Booleans, etc.), missing and anomaly types. An advantage of PFSMs over regular expressions is their ability to generate weighted posterior predictions even when a column of data is consistent with more than one type model. We have also presented a discriminative training procedure which helps to improve column type inference. Our experiments have demonstrated that we generally achieve better results than competitor methods on messy data sets.

Future work includes extending the supported data types, such as categorical data, etc.; developing subtypes, e.g., for Booleans expecting either True and False, or yes and no; and improving anomaly detection for string-typed data by addressing semantic and syntactic errors.