Abstract
Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms existing methods.
Similar content being viewed by others
1 Introduction
Data analytics can be defined as the process of generating useful insights from raw data sets. Central to the entire process is the concept of data wrangling, which refers to the task of understanding, interpreting, preparing a raw data set and turning it into a usable format. This task can lead to a frustrating and time-consuming process for large data sets, and even possibly for some small sized ones (Dasu and Johnson 2003). In order to design corresponding transformations easily, one may desire to know characteristics of given data, e.g., data types, missing data, and anomalies. However, raw data often do not contain any well-documented prior information such as meta-data.
A data type such as integer, string, date, and float can be can be ascribed to a data column. Our focus here is on inferring the data type for each column in a table of data. Numerous studies have attempted to tackle type inference, including wrangling tools (Raman and Hellerstein 2001; Kandel et al. 2011; Guo et al. 2011; Trifacta 2018; Fisher and Gruber 2005; Fisher et al. 2008), software packages (Petricek et al. 2016; Lindenberg 2017; Stochastic Solutions 2018; Döhmen et al. 2017a; Wickham et al. 2017) and probabilistic approaches (Valera and Ghahramani 2017; Vergari et al. 2019; Limaye et al. 2010). However, often they do not work very well in the presence of missing and anomalous data, which are commonly found in raw data sets due to the lack of a well-organized data collection procedure.
One can develop a more accurate system by detecting missing and anomalous entries and then performing type inference on the valid data values only. However, distinguishing such entries can be challenging when missing data is not encoded explicitly, but as e.g., -1 or -99 in a column of integer type. Pearson (2006) refers to such problems as the problem of disguised missing data. He further shows how interpreting missing data as valid values can mislead the analysis. Similar problems are also discussed by Qahtan et al. (2018). Existing type inference methods do not consider such problems comprehensively. For example, they may fail when a column of integers contains string-valued missing data encodings such as null and NA. In this work, we incorporate such known missing data encodings into our probabilistic model.
Chandola et al. (2009) define an anomaly as “a pattern that does not conform to expected normal behavior”. In our context, anomalies refer to unexpected or invalid entries for a given column, which might result from the data collection procedure, e.g., error and warning messages generated by servers or the use of open-ended entries while collecting data. The challenge here is mostly due to the difficulty of modeling normal and anomalous data values such that the corresponding entries become distinguishable. Moreover, one may need to consider separate strategies to model anomalies for different data types since their structures may vary with respect to the data types. For example, anomalies in date columns might have a different form than those in integer columns.
Up to now, too little attention has been paid to the aforementioned problems by the data mining community. To this end, we introduce ptype, a probabilistic type inference method that can robustly annotate a column of data. The proposed model is built upon Probabilistic Finite-State Machines (PFSMs) that are used to model known data types, missing and anomalous data. In contrast to the standard use of regular expressions, PFSMs have the advantage of generating weighted posterior predictions even when a column of data is consistent with more than one type model. Our method generally outperforms existing type inference approaches for inferring data types, and also allows us to identify missing and anomalous data entries.
In the rest of this paper, we first describe PFSMs and introduce our model (Sect. 2). We then present the related work (Sect. 3), which is followed by the experiments and the results (Sect. 4). Finally, we summarize our work and discuss the possible future research directions (Sect. 5).
2 Methodology
This section consists of four parts. Section 2.1 gives background information on PFSMs used to model regular data types, missing data and anomalies. Section 2.2 introduces our model that uses a mixture of PFSMs. Lastly, Sects. 2.3 and 2.4 describe respectively inference in and training of this model.
The data type, missing data and anomalies can be defined in broad terms as follows: The data type is the common characteristic that is expected to be shared by entries in a column, such as integers, strings, IP addresses, dates, etc., while missing data denotes an absence of a data value which can be encoded in various ways, and anomalies refer to values whose types differ from the given column type or the missing type.
In order to model the above types, we have developed PFSMs that can generate values from the corresponding domains. This, in turn, allows us to calculate the probability of a given data value being generated by a particular PFSM. We then combine these PFSMs in our model such that a data column can be annotated via probabilistic inference in the proposed model, i.e., given a column of data, we can infer column type and rows with missing and anomalous values.
2.1 Probabilistic finite-state machines
Finite-State Machines (FSMs) are a class of mathematical models used to represent systems consisting of a finite number of internal states. The idea is to model a system by defining its states (including initial and final states), transitions between the states, and external inputs/outputs. FSMs have a long history going back at least to Rabin and Scott (1959) and Gill (1962). A more recent overview of FSMs is given by Hopcroft et al. (2001).
In this study, we are interested in a special type of FSMs called Probabilistic Finite-State Machines (PFSMs) in which transitions between states occur w.r.t. probability distributions (Paz 1971; Rabin 1963). Vidal et al. (2005) discuss various PFSMs in detail. Following a similar notation, we define a PFSM as a tuple \(A=\left( \theta , \varSigma , \delta , I, F, T\right) \), where \(\theta \) is a finite set of states, \(\varSigma \) is a set of observed symbols, \(\delta \subseteq \theta \times \varSigma \times \theta \) is a set of transitions among states w.r.t. to observed symbols, \(I : \theta \rightarrow \mathbb {R}^+\) is the initial-state probabilities, \(F : \theta \rightarrow \mathbb {R}^+\) is the final-state probabilities, and \(T : \delta \rightarrow \mathbb {R}^+\) is the transition probabilities for elements of \(\delta \). During each possible transition between states, a symbol is emitted. We denote such an event by a triple \((q,\alpha ,q')\), which corresponds to a transition from a state \(q \in \theta \) to a state \(q' \in \theta \) emitting a symbol \(\alpha \in \varSigma \). Note that \(\delta \) and T store respectively all the possible triples and their corresponding probabilities.
A PFSM has to satisfy certain conditions. First, the sum of the initial-state probabilities has to be equal to 1. Secondly, at each state q, it can either transition to another state \(q' \in \theta \) and emit a symbol \(\alpha \in \varSigma \), or stop at state q without emitting any symbol. This condition can be expressed mathematically as \(F(q) + \sum _{\alpha \in \varSigma , q' \in \theta } T(q,\alpha ,q')=1\) for each state q, where \(T(q,\alpha ,q')\) represents the probability of a triple \((q,\alpha ,q')\), and F(q) denotes the final-state probability of state q. Based on the definition given above, a PFSM can generate a set of strings, denoted by \(\varSigma ^*\).Footnote 1 For each string \(s \in \varSigma ^*\), we can calculate a probability that represents how likely it is for a given PFSM to generate the corresponding string.
Note that PFSMs resemble Hidden Markov Models (HMMs) except that we now have the final state probabilities. Recall that each state in a HMM has a probability distribution over the possible states that it can transition to. In PFSMs, each state also takes into account the probability of being a final state. Hence, the probability distribution is not only defined over the possible transitions to next states; it also includes the case of the current state being a final state. On the other hand, emissions are carried out similarly in PFSMs and HMMs: the observations are generated conditioned on the hidden states in HMMs; an observation is emitted through a transition in PFSMs since each transition is associated with a symbol. A detailed analysis of the link between PFSMs and HMMs can be found in Dupont et al. (2005).
One can develop PFSMs to represent types described previously and calculate the probabilities for each observed data value. We now explain the corresponding PFSMs in detail.
2.1.1 Representing types with PFSMs
Here, we show how a PFSM can be used to model a data type. We divide types into two groups: (i) primitive types consisting of integers, floats, Booleans and strings and (ii) complex types such as IP addresses, email addresses, phone numbers, dates, postcodes, genders and URLs. The details regarding the implementation of the corresponding PFSMs can be found in “Appendix A”.
Consider integer numbers whose domain is \([-\infty ,\infty ]\). We can represent the corresponding PFSM as in the diagram given in Fig. 1. The machine has two initial states, namely \(q_0\) and \(q_1\), and one final state \(q_2\). Here, \(q_0\) and \(q_1\) respectively allow us to represent integer numbers with a sign (plus or minus), or without any sign.Footnote 2 The machine eventually transitions to the state \(q_2\), which stops with a stopping probability \(F(q_2) = P_{stop}\). Otherwise, it transitions to itself by emitting a digit with an equal probability, \((1-P_{stop})/10\). Similarly, we can develop a PFSM to represent each one of the other column types.
The PFSM for missing values can be developed by using a predefined set of codes such as \(\{\texttt {-1}, \texttt {-9}, \texttt {-99}, \texttt {NA}, \texttt {NULL}, \texttt {N/A}, \texttt {-}, \dots \}\). It assigns non-zero probabilities to each element in this set. Note that the corresponding PFSM already supports a wide range of codes, but it can be easily extended by the user to incorporate other known missing data encodings, leading to semi-automated operation.
In order to model anomalies, we adapt the idea of an X-factor proposed by Quinn et al. (2009). We define a machine with the widest domain among all the PFSMs that supports all of the possible characters. This choice of PFSM lets the probabilistic model become more robust against anomalies since it assigns probabilities to data values that do not belong to any of the PFMSs representing the data types. Note that as this PFSM covers a wider domain, it will assign lower probabilities to known data types than the specific models.
Constructing PFSMs for complex types might require more human engineering than the other types. We reduce this need by building such PFSMs automatically from corresponding regular expressions. We first convert regular expressions to FSMs by using greenery (2018) (see the function named to_fsm()). We then build their probabilistic variants, where the parameters are assigned equal probabilities. Note that these parameters can be updated with the training.
2.2 The proposed model
We propose a new probabilistic mixture model with a noisy observation model, allowing us to detect missing and anomalous data entries. Our model first generates a column type from a set of possible regular data types. This is followed by a “deficiency” process that can potentially change the data type of each row. Consequently, each row might have a different type rather than the generated column type. The observation model then generates a data value for each entry according to its type. We now introduce our notation to represent this process.
We assume that a column of data \(\mathbf{x} =\{x_i\}_{i=1}^N\) has been read in, where each \(x_i\) denotes the characters in the ith row. We propose a generative model with a set of latent variables \(t \in \{1,2,\ldots ,K\}\) and \(\mathbf{z} =\{z_i\}_{i=1}^N\), where t and \(z_i\) respectively denote the data type of a column and its ith row. Here, N is the number of rows in a data column, and K is the number of possible data types for a column. We also use the additional missing and anomaly types, denoted by m and a respectively, and described above. Note that \(z_i\) can be of type m or a alongside a regular data type, i.e., \(z_i \in \{1,2,\ldots ,K,m,a\}\). This noisy observation model allows a type inference procedure robustified for missing and anomalous data values.
Hence the model has the following generative process:
where \(\varvec{\varPi }\) and \(p(x_i|z_i)\) are respectively the model parameter and the observation model. \(\mathcal {U}\) denotes a discrete Uniform distribution. Here \(\pi _t^t + \pi _t^m + \pi _t^a = 1\) for each column type t. Since entries are often expected to be of a regular data type rather than missing or anomaly types, we favour regular types during inference by using lower coefficients for missing and anomaly types, i.e., \(\pi _t^m < \pi _t^t\) and \(\pi _t^a < \pi _t^t\). These weight parameters \(\varvec{\varPi }\) are assumed to be fixed and known. Even though one can also learn the weights, this may not be vital as long as the coefficients of the regular types are larger than the others.
We use a uniform distribution for column types as in most cases we do not have any prior information regarding the type of a column that would allow us to favour a particular one. To represent the conditional distribution \(p(x_i | z_i)\), we have developed PFSMs as described above.
2.3 Inference
Given a data column \(\mathbf{x} \), our initial goal is to infer the column type t, which is cast to the problem of calculating the posterior distribution of t given \(\mathbf{x} \), denoted by \(p(t|\mathbf{x} )\). Then, we assume that each row can be of three types: (i) the same as the column type, (ii) the missing type, and (iii) the anomaly type. In order to identify missing or anomalous data entries, we calculate the posterior probabilities of each row type, namely \(p(z_i|t, \mathbf{x} )\). In this section, we now briefly discuss the corresponding calculations; detailed derivations are presented in “Appendix D”.
Assuming that entries of a data column are conditionally independent given t, the posterior distribution of column type t is given as follows:
which can be used to estimate the column type t, since the one with maximum posterior probability is the most likely data type corresponding to the column \(\mathbf{x} \).
As per Eq. 1, the model estimates the column type by considering all the data rows, i.e., having missing data or anomalies does not confuse the type inference. Note that such entries would have similar likelihoods for each column type, which allows the model to choose the dominant data type for regular entries.
Following the inference of column type, we can also identify entries of \(\mathbf{x} \) which are more likely to be missing or anomalies rather than the inferred type. For this, we compare the posterior probabilities of each row type \(z_i\) given \(t=k\) and \(x_i\), namely \(p(z_i=j | t=k, x_i)\), which can be written as:
2.3.1 Complexity analysis
The computational bottleneck in the inference is the calculation of \(p(\mathbf{x} | t = k)\) for each type k, which is the calculation of the probability assigned for a data column \(\mathbf{x} \) by the kth PFSM. Note that this calculation can be carried out by taking into account the counts of the unique data entries, for efficiency. Denoting the uth unique data value by \(x_u\), we need to consider the complexity of calculating \(p(x_u | t = k)\) which can be done via the PFSM Forward algorithm. Each iteration of the algorithm has the complexity of \(O(M_k^2)\) where \(M_k\) is the number of hidden states in the kth PFSM. As the number of iterations equals to the length of \(x_u\) denoted by L, the overall complexity of the inference becomes \(O(U K M^2 L)\), where U is the number of unique data entries, K is the number of types, and M is the maximum number of hidden states in the PFSMs.
2.4 Training of the model
The aim of the training is to tune the probabilities assigned by the PFSMs so that column types are inferred accurately. Given a set of columns and their annotated column types, the task is to find the parameters of the PFSMs (i.e., the initial-state, the transition and the final-state probabilities) that allow the “correct” machine to give higher probabilities to the observed entries. This is crucial as multiple PFSMs can assign non-zero probabilities for certain strings, e.g., \(\texttt {1}\), \(\texttt {True}\), etc.
We employ a discriminative training procedure on our generative model, as done in discriminative training of HMMs (Bahl et al. 1986; Jiang 2010; Brown 1987; Nádas et al. 1988; Williams and Hinton 1991). This is shown to be generally superior to maximum likelihood estimations (Jiang 2010), since a discriminative criterion is more consistent with the task being optimized. Moreover, it allows us to update not only the parameters of the “correct” PFSM but also the parameters of the other PFSMs given a column of data and its type, which in turn helps the correct one to generate the highest probability.
We choose \(\sum _{j=1}^M \log p(t^j|\mathbf {x}^j)\) as the objective function to maximize, where \(\mathbf {x}^j\) and \(t^j\) respectively denote the jth column of a given data matrix X and its type, and M is the number of columns. We then apply the Conjugate Gradient algorithm to find the parameters that maximize this objective function (please see “Appendix C” for detailed derivations of the gradients).
We study different parameter settings for our model. We first explore tuning the parameters by hand to incorporate certain preferences over the types, e.g., Boolean over Integer for \(\texttt {1}\). Then, we learn the parameters via the discriminative training described above where the parameters are initialized at the hand-crafted values. Note that due to the absence of explicit labels for the missing and anomaly types, these are not updated from the hand-crafted parameters. We have also employed the training by initializing the parameters uniformly. However, we do not report these results as they are not competitive with the others.
As the PFSMs are generative models it would be possible to train them unsupervised, to maximize \(\sum _{i, j} \log p(x^j_i)\), where \(p(x^i_j)\) is defined as a mixture model over all types (including missing and anomaly) for the ith row and jth column. The component PFSMs could then be updated using the Expectation-Maximization (EM) algorithm. However, such training would be unlikely to give as good classification performance as supervised training.
3 Related work
Numerous studies have attempted to tackle type inference, including wrangling tools and software libraries. Such approaches have made limited use of probabilistic methods, while the existing probabilistic approaches do not address the same type inference problem that we do. Below we discuss the capabilities of these existing methods in the presence of missing and anomalous data.
Type inference is commonly carried out by validation functions and regular expressions. For example, several wrangling tools, including Trifacta (2018) and its preceding versions (Raman and Hellerstein 2001; Kandel et al. 2011; Guo et al. 2011) apply validation functions to a sample of data to infer types, e.g., assign the one validated for more than half of the non-missing entries as the column type (Kandel et al. 2011). When multiple types satisfy this criterion, the more specific one is chosen as the column type, e.g., an integer is assumed to be more specific than a float. Trifacta supports a comprehensive set of data types, and provides an automatic discrepancy detector to detect errors in data (Raman and Hellerstein 2001). However, in our experience, its performance on type inference can be limited on messy data sets.
Fisher and Gruber (2005) and Fisher et al. (2008) have developed data description languages for processing ad hoc data (PADS) which enables generating a human-readable description of a dataset, based on data types inferred using regular expressions. However, their focus is on learning regular expressions to describe a dataset, rather than classifying the data columns into known types. Unlike the PADS library which parses a data file itself, Test-Driven Data Analytics (TDDA, Stochastic Solutions 2018) uses the Pandas CSV reader to read the data into a data frame. It then uses the Pandas dtypes attributes,Footnote 3 to determine the data type of the columns. However, this leads to a poor type detection performance since the Pandas reader is not robust against missing data and anomalies, where only empty string, NaN and NULL are treated as missing data.
Petricek et al. (2016) propose another use of regular expressions with F#, where types, referred as shapes, are inferred w.r.t. a set of preferred shape relations. Such relations are used to resolve ambiguous cases in which a data value fits multiple types. This procedure allows integrating inferred types into the process of coding, which can be useful to interact with data. However, it does not address the problems of missing and anomalous data comprehensively, where only three encodings of missing data, NA, #NA and :, are supported. This may lead to poor performance on type inference when missing data is encoded differently, or there are anomalies such as error messages.
A set of software packages in R and Python can also infer data types. messytables (Lindenberg 2017) determines the most probable type for a column by weighting the number of successful conversions of its elements to each type. This strategy can potentially help to cope with certain data errors; however, it might be difficult to find an effective configuration of the weights for a good performance. Moreover, it cannot handle the disguised missing data values, e.g., -1 in a column of type integer, which can be misleading for the analysis. Döhmen et al. (2017b) propose a CSV parser named hypoparsr that treats type inference as a parsing step. It takes into account a wide range of missing data encodings; however, it does not address anomalies, leading to poor performance on type inference. readr (Wickham et al. 2017) is an R package to read tabular data, such as CSV and TSV files. However, in contrast to hypoparsr, a limited set of values are considered as missing data, unless the user specifies otherwise. Furthermore, it employs a heuristic search procedure using a set of matching criteria for type inference. The details regarding the criteria are given in Wickham and Grolemund (2016). The search continues until one criterion is satisfied, which can be applied successfully in certain scenarios. Whenever such conditions do not hold, the column type is assigned to string.
Valera and Ghahramani (2017) propose a model for discovery of statistical types, but this model is tackling a very different problem than the one that we address. Their method assumes that all of the entries in a column contain numerical data values, i.e., it cannot handle data types such as string, date, etc. Moreover, it can only handle clean integer and float columns that do not have any missing data and anomalies. Given this, they address the problem of making fine-grained distinctions between different types of continuous variables (real-valued data, positive real-valued data and interval data) and discrete variables (categorical data, ordinal data and count data). It would be possible to combine their method with ours to provide more fine-grained distinctions of real-valued and discrete-valued data once the issues of detecting missing and anomalous data have been resolved.
Valera and Ghahramani (2017) is extended in Vergari et al. (2019) where the proposed method can be used to impute missing data and detect numerical anomalies. However, they also assume that each entry of a data column contains a numerical value or an explicit missing data indicator. Their method cannot be used when a data column contains non-numerical missing data encodings and anomalies. Moreover, the authors address a completely different missing data problem, i.e., they tackle missing data imputation rather than missing data detection. This means that the proposed method cannot detect whether an entry is missing or not, but it can impute missing data once the location of missing data is known. Lastly, their focus on anomaly detection is also different than ours in the sense that they attempt to detect numerical outliers, but do not detect string anomalies in a column.
Limaye et al. (2010) propose a log-linear model based method to annotate a table in terms of the semantic column type, cell entity and column relations, given an ontology of relevant information. For example, given a table that contains information about actors such as names and ages, the task is to find semantic column types such as actor and age, the entities each row refers to, and the relationship of actor-age between two columns. Even though the scope of these annotations are wider than ours, missing data and anomalies are not considered explicitly, and an appropriate ontology is required.
4 Experiments
In this section, we first in Sect. 4.1 describe the datasets and evaluation metrics used, and then in Sect. 4.2 compare ptype with competitor methods on two tasks: (i) column type inference and (ii) type/non-type inference. Lastly, we present a qualitative evaluation of our method on challenging cases in Sect. 4.3. The goal of our experiments is to evaluate (i) the robustness of the methods against missing data and anomalies for column type inference and (ii) the effectiveness of type/non-type inference. These are evaluated both quantitatively (Sect. 4.2) and quantitatively (Sect. 4.3). We release our implementation of the proposed method at https://github.com/tahaceritli/ptype-dmkd.
4.1 Experimental setup
We have trained ptype on 25 and tested on 43 data sets obtained from various sources including UCI ML,Footnote 4https://data.gov.uk, https://ukdataservice.ac.uk and https://data.gov. The data types were annotated by hand for these sets. We also annotated each dataset in terms of missing data and anomalies by using the available meta-data and checking the unique values.
On column type inference, we compare our method with F# (Petricek et al. 2016), hypoparsr (Döhmen et al. 2017a), messytables (Lindenberg 2017), readr (Wickham et al. 2017), TDDA (Stochastic Solutions 2018) and Trifacta (2018). Note that some of the related works are not directly applicable to this task, and these are not included in these experiments. For example, we are not able to use Raman and Hellerstein (2001), Kandel et al. (2011) and Guo et al. (2011) as they are no longer available. However, we use their latest version Trifacta in our experiments. We also exclude the PADS library (Fisher and Gruber 2005; Fisher et al. 2008), since it does not necessarily produce columns and their types. The methods proposed by Valera and Ghahramani (2017) and Vergari et al. (2019) are also not applicable to this task. First, they do not consider data types of Boolean, string, date. Secondly, they only address integer and float columns that do not contain any non-numerical missing data or anomalies, which are commonly found in real-world datasets. Note that Vergari et al. (2019) do not address missing data detection but missing data imputation, and can only handle numerical outliers but not non-numerical outliers, whereas Valera and Ghahramani (2017) do not address these questions. Lastly, we exclude the method presented by Limaye et al. (2010) as their goal is to infer semantic entity types rather than syntactic data types.
On type/non-type inference, we evaluate the performance of our method in detecting missing data and anomalies. We label such entries as non-type and classify each entry as either type or non-type. On this task, we compare our method with Trifacta only, as it is the leading competitor method on column type inference, and the others do not address this task comprehensively.
4.1.1 Data sets
We have conducted experiments on the data sets chosen according to two criteria: (i) coverage of the data types and (ii) data quality. In our experiments, we consider five common column types, including Boolean, date, float, integer and string. Table 1 presents the distribution of the column types found in our data sets. Any other columns not conforming to the supported data types are omitted from the evaluations. Secondly, we have selected messy data sets in order to evaluate the robustness against missing data and anomalies. The fraction of the non-type entries in the test datasets can be as large as 0.56, while the average fraction is 0.11 (see Fig. 2 for the details). Note that available data sets, their descriptions and the corresponding annotations can be accessed via https://goo.gl/v298ER.
4.1.2 Evaluation metrics
For column type inference, we first evaluate the overall accuracy of the methods on type inference by using the accuracy. However, this evaluation may not be informative enough due to the imbalanced data sets. Note that the task of type inference can be seen as a multi-class classification problem, where each column is classified into one of the possible column types. In order to measure the performance separately for each type, we follow a one-vs-rest approach. In such cases a common choice of metric is the Jaccard index J (see e.g., Hand et al. (2001, sec 2.3)) defined as \(\text {TP}/(\text {TP} + \text {FP} + \text {FN})\), where TP, FP and FN respectively denote the number of True Positives, False Positives and False Negatives. (Note that one-vs-rest is an asymmetric labelling, so True Negatives are not meaningful in this case.)
To measure the performance on type/non-type inference, we report Area Under Curve (AUC) of Receiver Operating Characteristic (ROC) curves, as well as the percentages of TPs, FPs and FNs. Note that here we denote non-type and type entries as Positive and Negative respectively.
4.2 Quantitative results
We present quantitative results on two tasks: (i) column type inference and (ii) type/non-type detection. In column type inference, we evaluate the performance of the methods on detecting data types of columns, and investigate their scalability. Then, in type/non-type detection we evaluate their capability of detecting missing data and anomalies.
4.2.1 Column type inference
We present the performance of the methods in Table 2, which indicates that our method performs better than the others for all types, except for the date type where it is slightly worse than Trifacta. In the table ptype denotes the discriminatively trained model, and ptype-hc the version with hand-crafted parameters. These improvements are generally due to the robustness of our method against missing data and anomalies.
Notice that the discriminative training improves the performance, specifically for Boolean and integer types. This result shows that the corresponding confusions can be reduced by finding more optimal parameter values, which can be difficult otherwise. Note that the training has a slightly negative effect on float and string types, but it still performs better than the other methods. These changes can be explained by the fact that the cost function aims at decreasing the overall error over columns rather than considering individual performances.
Figure 3 shows normalized confusion matrices for ptype and Trifacta, normalized so that a column sums to 1. This shows that they both tend to infer other column types as string even if they are not. However, ptype has few of such confusions, especially when the true type is Boolean or float.
It is noticeable from Table 2 that dates are difficult to infer accurately. Detailed inspection shows that this is due to non-standard formats used to denote dates. We could improve the PFSM for dates in ptype to include such formats, but we have not done so, to avoid optimizing on test datasets. The performance of Trifacta on dates can be explained by the engineering power behind the tool and indicates its capability to represent non-standard formats using validation functions.
To determine whether the column type predictions of ptype and Trifacta are significantly different, we apply the McNemar’s test (see e.g., Dietterich 1998), which assumes that the two methods should have the same error rate under the null hypothesis. We compute the test statistic \((|n_{01}-n_{10}|-1)^2)/(n_{01}+n_{10})\), where \(n_{01}\) and \(n_{10}\) denote the number of test columns misclassified by only Trifacta, and by only ptype respectively. In our case, \(n_{01}\) and \(n_{10}\) are respectively equal to 19 and 6, which results in a statistic of 5.76. If the null hypothesis is correct, then the probability that this statistic is greater than 3.84 is less than 0.05 (Dietterich 1998). Thus this result provides evidence to reject the null hypothesis, and confirms that the methods are statistically significantly different to each other.
The large performance gap for Booleans suggests that our method handles confusions with integers and strings better. The analysis shows that such confusions occur respectively in the presence of \(\{\texttt {0},\texttt {1}\}\), and \(\{\texttt {Yes}, \texttt {No}\}\).Footnote 5 We further note that F#, messytables and readr perform similarly, especially on floats, integers and strings; which is most likely explained by the fact that they employ similar heuristics.
Since hypoparsr names column types differently, except for the date type, we need to rename the annotations, and re-evaluate the methods in order to compare them with hypoparsr. It refers to Boolean and string respectively as logical and text. Moreover, integer and float are grouped into a single type called numeric. The resulting evaluations, which are reported in Table 6 in “Appendix F”, shows a similar trend to as before in that our method performs better. However, we see some variations, which result from the fact that we now evaluate on a smaller number of data sets since hypoparsr, which is said to be designed for small sized files, was able to parse only 33 out of the 43 test data sets.Footnote 6 This left us 358 columns including 29 date, 21 logical, 159 numeric and 149 text. Lastly, we observe that hypoparsr results in a higher number of confusions by inferring the type as integer, whereas the true type is text. Such cases mostly occur when the data values consist of a combination of numeric and text, e.g., ID and address columns.
4.2.2 Scalability
We describe the complexity of the inference in our model in Sect. 2.3. Here, we demonstrate its scaling by measuring the time it takes to infer the column types for each test dataset.
Recall that the complexity is \(O(U K M^2 L)\), where U is the number of unique data entries, K is the number of data types, M is the maximum number of hidden states in the PFSMs, and L is the maximum length of data values. Notice that the complexity depends on data through U and L, and does not necessarily increase with the number of rows. In fact, it grows linearly with the number of unique values assuming L is constant. As shown in Fig. 4, the runtime for ptype is upper bounded by a line \(c_0 + c_1 U\), where \(c_0\) is a small constant. The runtime thus scales linearly with the number of unique values U, handling around 10K unique values per second. The variations can be explained by changes in L.
We also report the size of the datasets and the times the methods take in “Appendix G”. We have observed similar performance with messytables, whereas readr and TDDA seem much faster even though they do not only predict the data types but also parse a given dataset. On the other hand, hypoparsr takes much longer compared to the others. Lastly, we measure the processing times for Trifacta via command line. We have observed that Trifacta takes less time than hypoparsr, but it often takes longer compared to the other methods.
4.2.3 Type/non-type inference
Trifacta labels each entry in a data table either as type or non-type, whereas our model presents a probability distribution over the two labels. One could apply a threshold on these probabilities in order to assign a label to each entry. Here, we demonstrate how the methods behave under different thresholds. We aggregate the entries of each dataset over its columns and compute the ROC curve for each method.
Figure 5 presents the difference AUC(ptype)—AUC(Trifacta) per dataset. Note that we exclude five datasets as the corresponding AUCs are undefined due to the definition of True Positive Rate (\(\frac{TP}{TP+FN}\)) which becomes undefined when both TP and FN are equal to zero, which occurs naturally when a dataset does not contain any missing data and anomalies. The average of AUCs of the remaining datasets are respectively 0.77 and 0.93 for Trifacta and ptype. To compare these two sets of AUCs, we apply a paired t-test, which results in the t-statistic of 4.59 and p-value of 0.00005. These results reject the null hypothesis that the means are equal, and confirm that they are significantly different.
Lastly, we compare Trifacta and ptype in terms of percentages of TPs, FPs and FNs which are presented in Table 3, where the labels for ptype are generated by applying a threshold of 0.5 on the posterior distributions. Note that here we aggregate the predictions over the datasets. As per the table, ptype results in a higher number of FPs than Trifacta, but Trifacta produces a higher number of FNs and a lower number of TPs than ptype. Note that here we denote non-type and type entries as Positive and Negative respectively. This indicates that ptype is more likely to identify non-types than Trifacta, but it can also label type entries as non-types more often. However, the overall performance of ptype is better than Trifacta, as we have also observed in the AUCs.
We now present two examples to give insight into Table 3. Consider the “Substantial_growth_of_knowledge_skills” column of the mass_6 dataset which consists of floating-point numbers. However, most of the 3148 entries are non-types, i.e., empty entries, -, N/A and NA which occur 780, 470, 1063 and 424 times respectively. Such non-type entries are labelled correctly by ptype, whereas Trifacta can only classify the empty entries correctly as non-type. The remainder of the non-type entries are considered to be valid type entries since they conform with the column type which is inferred as string by Trifacta. Note that this confusion is highly likely due to the low number of floats in the column. Here, Trifacta results in a high percentage of FNs, and a lower percentage of TPs than ptype.
The second example illustrates the extent of cases in which ptype results in FPs. For example, the “CAUSE_NAME” column in the data_gov_323_1 dataset consists of data values such as Cancer, Stroke, and Suicide etc. Here, ptype and Trifacta infer the column type as string, and label such entries correctly as type entries. However, Alzheimer’s disease and Parkinson’s disease are misclassified as non-types by ptype (1,860 FPs out of 13,260 entries) as our string model does not support the apostrophe. To handle this, we could include ’ in the corresponding alphabet, but we also find it helpful to detect “true” non-type entries having that character. We believe that such cases should be left to users with domain knowledge as they can easily extend the alphabet of the string type.
4.3 Qualitative results
We now give some examples of predicting missing and anomalous data.
Missing Data We support an extensive range of values that are used to denote missing data. Note that multiple such encodings can be detected at the same time. Consider a T2Dv2 datasetFootnote 7 where missing entries are denoted by encodings such as NULL and n/a. Our method can successfully annotate such entries as shown in Fig. 6.
Next, we show how our model approaches unseen missing data encodings which are not explicitly considered as missing data in our model, but can be handled with the anomaly type. For example, “Time of Incident” column of Reported Taser 2015 data set is expected to contain date values. However, some entries have the value Unknown. Thanks to the PFSM for anomaly type, our model detects such encodings as anomalies as long as the “correct” column type is not string, resulting in a better performance of type inference.
Another advantage of ptype is that it can detect missing data even if their types fit the column type. Consider an integer-typed column Stake in the Rodents data set where -99 is used to denote missing entries. Our method flags those entries as missing instead of treating them as regular integers since the missing type accepts -99. Similarly, we are able to detect string-valued missing data in string-typed columns. When the column type, however, is not string, our model may result in several false alarms by flagging normal entries as missing data. For example, integer values of -1, -99, etc. can also represent normal data instances. We will investigate this using methods as in Qahtan et al. (2018), and develop missing data models specific to each data type, in order to improve this issue.
Anomalies As mentioned earlier, we can also use ptype to detect anomalies. Our model flags such entries automatically since the anomaly model covers a wide range of values, including those that are not supported by the column type.
Figure 6 shows the capability of detecting anomalies when the column type is string. As we do not have the character & in the alphabet of the PFSM for strings, the anomaly machine allows us to detect the anomalies in the “country” column. Similarly, the characters “refer to euro” are not supported by the PFSM for integers, letting us detect the corresponding entry as anomalous. Moreover, we can separately detect the anomalous and missing data as in the “selling rate” column.
Interestingly, we notice that the question mark character ? is used to express the doubt about the collected data in the HES data set, where a data entry contains the value of 6?. We can also see that missing data encodings not incorporated in our missing data model such as NOT AVAILABLE, ?? (double question marks), and –, are detected as anomalies.
We now discuss some other aspects of our model, such as ambiguities that can occur, and how they can be handled; and failure cases which can be challenging to avoid.
4.3.1 The limitations of our work
Ambiguous Cases In certain cases the posterior probability distribution over types is not heavily weighted on a particular type. For example, consider a column of data that contains values of NULL and 1. This column could fit multiple PFSMs as 1 can either be an integer, a float, a string, or a Boolean value. However, we have assumed that 0 and 1 are more likely to be indicators of Booleans and have thus tuned the parameters of the corresponding machine such that the probabilities associated with 0 and 1 are slightly higher than the ones assigned by the other machines. This leads to a posterior probability distribution with values of 0.29, 0.26, and 0.44 respectively for integers, floats, and Booleans. One can also exploit the probabilistic nature of our model to treat such ambiguous cases differently. Instead of directly taking the type with the highest posterior probability as the column type, one can detect such ambiguities and then exploit user feedback to improve the decision.
A uniformly distributed posterior distribution over types is observed when all of the entries of a column are assigned zero probabilities by the PFSMs of the regular data types. This result is not surprising as we only support a limited set of characters in the regular machines, i.e., the widest alphabet among the regular data types, which is of the string type, consists of the letters, digits and a set of punctuations. For example, “per capitagdp (us$)[51]” column of a T2D data setFootnote 8 has values such as $2949.57. Similarly, the 23rd column of the fuel data set contains values such as “( 3.0L) ”. Note that the anomaly type still assigns positive probabilities to these entries, as its alphabet includes all the possible characters. However, when its weight \(\pi ^a_t\) is the same regardless of the type, the corresponding posterior probabilities become equal. One can handle such cases by learning different weight parameters for the types. However, here, to represent such unsupported/unknown types, we have developed X-type, an anomaly-like PFSM which is used as a regular column type. This can be further exploited to create new data types, which can be automated via a probabilistic training procedure.
Failure Cases We now present two cases for which ptype-hc fails to infer the column types correctly. For example, consider a column (“BsmtHalfBath” of Housing Price data set) which denotes the number of half bathrooms in the basement of houses, consisting of the values in \(\{\texttt {0}, \texttt {1}, \texttt {2}\}\). In this case, ptype-hc puts higher posterior probability on the Boolean type, whereas the actual type is integer. This may not be surprising, considering the fact that 0 and 1 have higher probabilities of being a Boolean and 2, occurring only twice out of 1460 entries, is treated as an anomaly. However, ptype can correct this failure thanks to the discriminative training. Note that the competitor methods fail in this case.
After the evaluations, we have discovered a set of cases we have not considered in the beginning. For example, several Boolean columns of the Census Income KDD data set have leading whitespace as in “ No”, “ Yes”. Our model infers the types of these Boolean columns as string since such whitespace is not considered in the Boolean type. In order to avoid optimizing the model on the test sets, we have not addressed such cases. However, they can easily be handled by updating the corresponding PFSMs to include the whitespace. Note that the other methods also detect the column types as string in these cases.
There are cases of non-standard dates we do not currently handle. For example, dates are sometimes divided into multiple columns as day, month and year. Our model detects day and month columns separately as integers. One could develop a model that checks for this pattern, making use of constraints on valid day, month and year values.
5 Summary
We have presented ptype, a probabilistic model for column type inference that can robustly detect the type of each column in a given data table, and label non-type entries in each column. The proposed model is built on PFSMs to represent regular data types (e.g., integers, strings, dates, Booleans, etc.), missing and anomaly types. An advantage of PFSMs over regular expressions is their ability to generate weighted posterior predictions even when a column of data is consistent with more than one type model. We have also presented a discriminative training procedure which helps to improve column type inference. Our experiments have demonstrated that we generally achieve better results than competitor methods on messy data sets.
Future work includes extending the supported data types, such as categorical data, etc.; developing subtypes, e.g., for Booleans expecting either True and False, or yes and no; and improving anomaly detection for string-typed data by addressing semantic and syntactic errors.
Notes
\(\varSigma ^*\) denotes the set of strings a PFSM can generate by emitting multiple symbols.
The transitions from state \(q_1\) allow the emission of a zero, which means that numbers like \(-007\) can be emitted. If this is not desired, then one can adjust the PFSM in Fig. 1 to not emit a leading 0.
obtained with the function pandas_tdda_type().
We assume that a data column where the entries are valued as Yes or No is more likely to be a Boolean column than a string. We have also confirmed these cases with the corresponding metadata whenever available, and have carefully annotated our datasets in terms of data types.
This was using the default settings, but Till Döhmen (pers. comm.) advised against changing the limit.
34899692_0_6530393048033763438.csv.
24036779_0_5608105867560183058.csv.
a subset is available at https://github.com/ropensci/cleanEHR/blob/master/data/sample_ccd.RData.
References
Bahl L, Brown P, De Souza P, Mercer R (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: IEEE ICASSP’86, IEEE, vol 11, pp 49–52
Brown PF (1987) The acoustic-modeling problem in automatic speech recognition. Ph.D. Dissertation, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv (CSUR) 41(3):15
Dasu T, Johnson T (2003) Exploratory data mining and data cleaning: an overview. In: Exploratory data mining and data cleaning, vol 479, 1st edn, John Wiley & Sons, Inc., New York, NY, USA, chap 1, pp 1–16
Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
Döhmen T, Mühleisen H, Boncz P (2017a) hypoparsr. https://github.com/tdoehmen/hypoparsr [Accessed on 29/06/2018]
Döhmen T, Mühleisen H, Boncz P (2017b) Multi-hypothesis CSV parsing. In: Proceedings of the 29th SSDBM
Dupont P, Denis F, Esposito Y (2005) Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms. Pattern Recognit 38(9):1349–1371
Fisher K, Gruber R (2005) PADS: a domain-specific language for processing ad hoc data. In: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation (PLDI’05), ACM, vol 40(6), pp 295–304
Fisher K, Walker D, Zhu KQ, White P (2008) From dirt to shovels: fully automatic tool generation from ad hoc data. In: POPL ’08, ACM, vol 43(1), pp 421–434
Gill A (1962) The basic model. In: Introduction to the theory of finite-state machines, McGraw-Hill Book Company, pp 1–15, 10.2307/2003459
Greenery (2018) greenery. https://github.com/qntm/greenery/ Accessed 31 May 2019
Guo PJ, Kandel S, Hellerstein JM, Heer J (2011) Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In: UIST ’11, ACM, pp 65–74
Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT Press, Cambridge
Hopcroft JE, Motwani R, Ullman JD (2001) Introduction to automata theory, languages, and computation. ACM SIGACT News 32(1):60–65
Jiang H (2010) Discriminative training of HMMs for automatic speech recognition: a survey. Comput Speech Lang 24(4):589–608
Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: CHI ’11, ACM, pp 3363–3372
Limaye G, Sarawagi S, Chakrabarti S (2010) Annotating and searching web tables using entities, types and relationships. VLDB ’10 3(1–2):1338–1347
Lindenberg F (2017) messytables Documentation Release 0.3. https://media.readthedocs.org/pdf/messytables/latest/messytables.pdf Accessed 29 June 2018
Nádas A, Nahamoo D, Picheny MA (1988) On a model-robust training method for speech recognition. IEEE Trans Acoust Speech Signal Process 36(9):1432–1436
Paz A (1971) Introduction to probabilistic automata, vol 78. Academic Press Inc, New York
Pearson RK (2006) The problem of disguised missing data. ACM SIGKDD Explor 8(1):83–92
Petricek T, Guerra G, Syme D (2016) Types from data: making structured data first-class citizens in F#. In: PLDI 2016
Qahtan AA, Elmagarmid A, Castro Fernandez R, Ouzzani M, Tang N (2018) FAHES: a robust disguised missing values detector. In: Proceedings of the 24th ACM SIGKDD, ACM, pp 2100–2109
Quinn JA, Williams CKI, McIntosh N (2009) Factorial switching linear dynamical systems applied to physiological condition monitoring. IEEE Trans Pattern Anal Mach Intell 31(9):1537–1551
Rabin MO (1963) Probabilistic automata. Inf Control 6(3):230–245
Rabin MO, Scott D (1959) Finite automata and their decision problems. IBM J Res Dev 3(2):114–125. https://doi.org/10.1147/rd.32.0114
Raman V, Hellerstein JM (2001) Potter’s wheel: an interactive data cleaning system. In: VLDB ’01, Morgan Kaufmann Publishers Inc., pp 381–390
Solutions Stochastic (2018) Test-driven data analysis. https://tdda.readthedocs.io/en/tdda-1.0.23/constraints.html Accessed 8 April 2019
Trifacta (2018) Trifacta Wrangler. https://www.trifacta.com/ Accessed 27 June 2018
Valera I, Ghahramani Z (2017) Automatic discovery of the statistical types of variables in a dataset. In: Proceedings of the 34th ICML, PMLR, vol 70, pp 3521–3529
Vergari A, Molina A, Peharz R, Ghahramani Z, Kersting K, Velera I (2019) Automatic Bayesian density analysis. In: Proceedings of the 33rd AAAI
Vidal E, Thollard F, de la Higuera C, Casacuberta F, Carrasco RC (2005) Probabilistic finite-state machines—Part I. IEEE Trans Pattern Anal Mach Intell 27(7):1013–1025
Wickham H, Grolemund G (2016) R for data science: import, tidy, transform, visualize, and model data, 1st edn, O’Reilly Media, Inc., chap 8, pp 137–138. http://r4ds.had.co.nz/data-import.html Accessed 24 July 2018
Wickham H, Hester J, Francois R, Jylänki J, Jørgensen M (2017) readr 1.1.1. https://cran.r-project.org/web/packages/readr/readr.pdf Accessed 29 June 2018
Williams CKI, Hinton GE (1991) Mean field networks that learn to discriminate temporally distorted strings. In: Proceedings of the 1990 Connectionist Models Summer School, Morgan Kaufmann Publishers, Inc., pp 18–22
Acknowledgements
TC is supported by a PhD studentship from the Alan Turing Institute, under the EPSRC Grant TU/C/000018. CW and JG would like to acknowledge the funding provided by the UK Government’s Defence & Security Programme in support of the Alan Turing Institute. The work of CW is supported in part by EPSRC Grant EP/N510129/1 to the Alan Turing Institute. We thank the anonymous reviewers for their comments that have helped improve the paper.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Johannes Fürnkranz.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendices
In this Appendices, we discuss the implementation of the Probabilistic Finite State Machines (PFSMs) (“Appendix A”), describe the data sets used (“Appendix B”) and present the derivations for training and inference in our model (“Appendix C and D”, respectively). Moreover, we present additional experimental results (“Appendix F”) and report scalability of the methods (“Appendix G”).
PFSMs for data types
In this work, we use five regular data types including integers, strings, floats, Booleans and dates; and two noisy data types, namely missing and anomaly.
1.1 Integers
Please see Sect. 2.1 for a detailed discussion of the PFSM used to represent integers.
1.2 Floats
A floating-point number usually consists of digits and a full stop character, which is followed by another set of digits. However, they can also be written without any fractional component, i.e., as integer numbers. We also support the representations of floating-point numbers with e or E. Lastly, we support the use of comma for the thousands separator in floating-point numbers, such as 1, 233.15, 1, 389, 233.15, etc.
1.3 Strings
The string PFSM is constructed with one initial state and one final state. Through each transition, either a digit, an alpha character, or a punctuation character is emitted. The punctuation characters considered here are ., ,, -, _, %, : and ;, which are commonly found in real-world data sets to represent columns with the string type.
1.4 Booleans
Our machine supports the following values by assigning them non-zero probabilities: Yes, No, True, False, 1, 0, -1 and their variants yes, Y, y, no, true, false.
1.5 Dates
We categorize date formats into two groups, which are detailed below:
1.5.1 ISO-8601
We support values in YYYY-MM-DDTHH:MM::SS, where T is the time designator to indicate the start of the representation of the time of day component. We also support other ISO-8601 formats such as YYYYMMDD, YYYY-MM-DD, HH:MM and HH:MM:SS.
1.5.2 Nonstandard formats
We treat years in YYYY format as date type. To distinguish years and integers, we restrict this format to the range of [1000-2999]. On the other hand, we do not explicitly constrain the month (MM) and day columns (DD) to valid ranges, and but treat them as integers. We support ranges of years with the formats of YYYY–YYYY, YYYYYYYY, YYYY–YYYY, YYYY–YYYY and YYYY–YYYY. Lastly, We support dates written as MM-DD-YYYY HH:MM:SS AM/PM and months, e.g., January, February, etc.
1.6 Missing
The machine for missing data assigns non-zero probabilities to the elements of this set, including Null, NA and their variants such as NULL, null, “NA ”, NA, “N A”, N/A, “N/ A”, “N /A”, N/A, #NA, #N/A, na, “ na”, “na ”, “n a”, n/a, N/O, NAN, NaN, nan, -NaN and -nan; special characters such as -, !, ?, * and .; integers such as 0, -1, -9, -99, -999, -9999 and -99999; and characters denoting empty cells such as “” and “ ”.
1.7 Anomaly
We use all of the Unicode characters in this machine’s alphabet, including the accented characters. Note that the number of elements in this set is 1,114,112.
Data sets
We share the available data sets and the corresponding annotations at https://goo.gl/v298ER. Here, we briefly describe these data sets, and provide a list in Table 4 which denotes their sources, and whether they are used in the training or testing phase.
-
Accident 2016 information on accidents casualties across Calderdale, including location, number of people and vehicles involved, road surface, weather conditions and severity of any casualties.
-
Accidents 2015 a file from Road Safety data about the circumstances of personal injury road accidents in GB from 1979.
-
Adult a data set extracted from the U.S. Census Bureau database to predict whether income exceeds $50K/yr.
-
Auto a data set consisting of various characteristics of a car, its assigned insurance risk rating and its normalized losses in use.
-
Broadband annual survey of consumer broadband speeds in the UK.
-
Billboard a data set on weekly Hot 100 singles, where each row represents a song and the corresponding position on that week’s chart.
-
Boston Housing a data which contains census tracts of Boston from the 1970 census.
-
BRFSS a subset of the 2009 survey from BRFSS, an ongoing data collection program designed to measure behavioral risk factors for the adult population.
-
Canberra Observations weather and climate data of Canberra (Australia) in 2013.
-
Casualties 2015 a file from Road Safety data about the consequential casualties.
-
Census Income KDD a data set that contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.
-
CleanEHR (Critical Care Health Informatics Collaborative) anonymised medical records.Footnote 9
-
Cylinder Bands a data set used in decision tree induction for mitigating process delays known as “cylinder bands” in rotogravure printing.
-
data.gov 9 CSV files obtained from data.gov, presenting information such as the age-adjusted death rates in the U.S., Average Daily Traffic counts, Web traffic statistics, the current mobile licensed food vendors statistics in the City of Hartford, a history of all exhibitions held at San Francisco International Airport by SFO Museum, etc.
-
EDF Stocks EDF stocks prices from 23/01/2017 to 10/02/2017.
-
El Niño: a data set containing oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.
-
FACA Member List 2015 data on Federal Advisory Committee Act (FACA) Committee Member Lists.
-
French Fries a data set collected from a sensory experiment conducted at Iowa State University in 2004 to investigate the effect of using three different fryer oils on the taste of the fries.
-
Fuel Fuel Economy Guide data bases for 1985-1993 model.
-
Geoplaces2 information about restaurants (from UCI ML Restaurant & consumer data).
-
HES (Household Electricity Survey) time series measurements of the electricity use of domestic appliances (to gain access to the data, please register at https://tinyurl.com/ybbqu3n3).
-
Housing Price a data set containing 79 explanatory variables that describe (almost) every aspect of residential homes in Ames, Iowa.
-
Inspection Outcomes local authority children’s homes in England - inspection and outcomes as at 30 September 2016.
-
Intel Lab: a data set collected from sensors deployed in the Intel Berkeley Research lab, measuring timestamped topology information, along with humidity, temperature, light and voltage.
-
mass.gov 4 CSV files obtained from mass.gov, which is the official website of the Commonwealth of Massachusetts.
-
MINY Vendors information on “made in New York” Vendors.
-
National Characteristics information on the overall, authorised, unauthorised and persistent absence rates by pupil characteristics.
-
One Plus Sessions information on the number of enrollments with one or more session of absence, including by reason for absence.
-
Pedestrian a count data set collected in 2016, that denotes the number of pedestrians passing within an hour.
-
PHM Collection information on the collection of Powerhouse Museum Sydney, including textual descriptions, physical, temporal and spatial data as well as, where possible, thumbnail images.
-
Processed Cleveland a data set concerning heart disease diagnosis, collected at Cleveland Clinic Foundation (from the UCI ML Heart Disease Data Set).
-
Sandy Related Hurricane Sandy-related NYC 311 calls.
-
Reported Taser 2015: a hand-compiled raw data set based on forms filled out by officers after a stun gun was used in an incident, provided by CCSU’s Institute for Municipal and Regional Policy.
-
Rodents the information collected on rodents during a survey.
-
Survey: a data set from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace.
-
TAO a real-time data collected by the Tropical Atmosphere Ocean (TAO) project from moored ocean buoys for improved detection, understanding and prediction of El Niño and La Niña.
-
Tb a tuberculosis dataset collected by the World Health Organisation which records the counts of confirmed tuberculosis cases by “country”, “year” and demographic group.
-
Tundra Traits measurements of the physical characteristics of shrubs in the arctic tundra.
-
T2Dv2 Gold Standard a set of data Web tables to evaluate matching systems on the task of matching Web tables to the DBpedia knowledge base.
-
User Profile information about consumers (from UCI ML Restaurant & consumer data).
-
Vehicles 2015 a file from Road Safety data about the types of vehicles involved in the accidents.
-
83492acc-1aa2-4e80-ad05-28741e06e530: a hypoparsr data set which contains information on expenses.
Note that the data sets from mass.gov and data.gov are obtained from Abdulhakim A. Qahtan and also used in [16].
Derivations for the training
The task is to update the parameters of the PFSMs, given a set of columns X and their column types \(\mathbf{t} \). Since the columns are assumed to be independent, the gradient can be calculated by summing the gradient of each column. In the interest of simplicity, here we only derive the gradient for a column \(\mathbf{x} \) of type k. We would like to maximize the posterior probability of the correct type k given a data column \(\mathbf{x} \), which can be rewritten as follows:
We now present the derivations of the gradients w.r.t. the transition parameters where \(\theta ^\tau _{q,\alpha ,q'}\) denotes the transition parameter from state q to \(q'\) emitting the symbol \(\alpha \) in the \(\tau \)th PFSM. Note that \(\tau \in \{1,\dots ,K\}\) where K is the number of PFSMs.
We now derive these two terms, in section (4.1) and (4.2) respectively.
1.1 Derivative of \(L_c\)
When \(\tau \) is not equal to k, Eq. (2) becomes 0. On the other hand, if \(\tau =k\), then we would need to calculate \(\frac{\partial p(x_i|z_i=\tau )}{\partial \theta _{q,\alpha ,q'}^\tau }\), where \(p(x_i|z_i=\tau )\) can be rewritten as \(\sum _{q_{0:L}} p(x_i,q_{0:L}|z_i=\tau )\) as \(x_i\) is generated by a PFSM. Note that \(q_{0:L}\) denotes the states visited to generate \(x_i\). The derivative can be derived as follows:
Hence, we need to evaluate the joint probability \(p(q_l=q, q_{l+1}=q', x_i|z_i=\tau )\) for each l where \(x_i^l=\alpha \), which can be found by marginalizing out the variables \(q_{0:L \setminus \{l,l+1\}}\):
where \(l'\) denotes \(\{0:L\} \setminus \{l,l+1\}\). This probability can be calculated iteratively via Forward-Backward Algorithm where the forward and backward messages are defined iteratively as follows:
We can then rewrite \(p(q_l=q, q_{l+1}=q', x_i|z_i=\tau )\) as follows:
where \(\bullet \) and \(\odot \) denote respectively outer and element-wise product.
1.2 Derivative of \(L_f\)
Let us now take the derivative of the second term \(L_f\):
Let us now put all the equations together. When we are calculating the derivative of Eq. (1) w.r.t. the correct machine, i.e., \(\tau =k\), the derivative becomes the following:
When we are calculating the derivative of Eq. (1) w.r.t. the wrong machines, i.e., \(\tau \ne k\) this becomes:
Lastly, we ensure the parameters remain positive and normalized using the softmax function. We define \(T_\tau (q,\alpha ,q') = \exp {T_\tau ^z(q,\alpha ,q')}/(\exp {F_\tau ^z(q)} + \sum _{\alpha ',q''} \exp T_\tau ^z(q,\alpha ',q'') )\) and \(I_\tau ^z(q) = \exp {I_\tau ^z(q)}/\sum _{q'} \exp {I_\tau ^z(q')}\). We now update these new unconstrained parameters using the new gradient calculated via the chain rule: \(\partial f/\partial T_\tau ^z(q,\alpha ,q') = (\partial f/ \partial T_\tau (q,\alpha ,q')) (\partial T_\tau (q,\alpha ,q') /\partial T_\tau ^z(q,\alpha ,q'))\).
Derivations for inference
The posterior distribution of column type t can be derived as follows:
Let us assume that \(t=k\) according to \(p(t|\mathbf{x} )\), the posterior distribution of column type. Then we can write the posterior distribution of row type \(z_i\) given \(t=k\) and \(\mathbf{x} \) as:
The outputs of the PADS library
We have mentioned previously that the outputs generated by the PADS library do not directly address our problem. We present a sample from an example test dataset in Table 5, and a part of the corresponding output of the PADS library in Fig. 7.
The outputs are interpreted starting from the bottom. In this case, the data is defined as an array of “struct” type named Struct_194. This is further characterized as a combination of various “union” types. For example, let us consider the first one named Union_19 which consists of a constant string \( \texttt { \& nbsp;}\), another constant string \(\texttt {year}\) and and integer type. However, this can be more complicated as in type Union_165 consisting of two struct types Struct_192 and Struct_164. Note that the former is further divided into a union type, whereas the latter is described as a combination of some constant strings and a float type. As the reader can see, it can become difficult and time-consuming to interpret an output. Moreover, the output becomes more complex when delimiters are inferred correctly, as this can prevent the types from column specific.
Additional experimental results
Table 6 presents the comparisons with hypoparsr.
Figure 8 presents the normalized confusion matrices for the methods, discussed in the paper.
Scalability of the methods
Table 7 denotes the number of rows, columns, unique elements and the time passed to infer column types.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ceritli, T., Williams, C.K.I. & Geddes, J. ptype: probabilistic type inference. Data Min Knowl Disc 34, 870–904 (2020). https://doi.org/10.1007/s10618-020-00680-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-020-00680-1