1 Introduction

Technological advances have allowed storage and analysis of large amounts of data and have given industry and government the opportunity to gain insights from thousands of digital records collected about individuals each day (Matz and Netzer, 2017). These “big behavioral data”—characterized by large volume, variety, velocity and veracity, and defined as data that capture human behavior through the actions and interactions of people (Shmueli, 2010)—have led to predictive modeling applications in areas such as fraud detection (Vanhoeyveld et al., 2019), financial credit scoring (Martens et al., 2007; De Cnudde et al., 2018; Tobback and Martens, 2019), marketing (Verbeke, 2011; Matz and Netzer, 2017; Chen et al., 2017) and political science (Praet et al., 2018). Sources of behavioral data include, but are not limited to, transaction records, search query data, web browsing histories, social media profiles, online reviews, and smartphone sensor data (e.g., GPS location data). Textual data are also increasingly available and used. Example text-based applications include the automatic identification of spam emails (Attenberg et al., 2009), the detection of objectionable web content (Martens and Provost, 2014) and legal document classification (Chhatwal et al., 2019).

Behavioral and textual data are very high-dimensional compared to traditional data, which are primarily structured in a numeric format and are relatively low-dimensional (Moeyersoms et al., 2016; Matz and Netzer, 2017; De Cnudde et al., 2020). Consider the following example to illustrate these characteristics: the prediction of personality traits of users based on the Facebook pages they have “liked” (Kosinski et al., 2013; Matz and Netzer, 2017). A user is represented by a binary feature for each unique Facebook page that exists, with a 1 if that page was liked by the user and 0 otherwise, which results in an enormous feature space. However, each user only liked a relatively small number of pages, which results in an extremely sparse data matrix (almost all elements are zero). In the literature, because of their specific nature, behavioral and textual data are often referred to as “fine-grained” (Martens and Provost, 2014; Martens et al., 2016; De Cnudde et al., 2020). For this reason, in this article we mathematically represent behavioral and textual data as \(X_{FG}\) \(\subset\) \(\mathbb {R}^{n \times m}\), where FG stands for “fine-grained”, and n and m refer to the number of instances and features respectively. These features can be binary (e.g., someone “liked” a Facebook page or not) or numerical (e.g., tf-idf vectorization for text documents).

Learning from behavioral and textual data can result in highly accurate prediction models (Junqué de Fortuny et al., 2013; De Cnudde et al., 2020). A drawback of prediction models trained on these types of data, however, is that they can become very complex. The complexity arises from either the learning technique (e.g., deep learning) or the data, or both. It is essentially impossible to interpret classifications of nonlinear techniques such as Random Forests or deep neural networks without using interpretation techniques like rule-extraction—on which the solution proposed in this article is based—or feature importance methods (e.g., LIME (Ribeiro et al., 2016) or TreeSHAP (Lundberg and Lee, 2017)). For linear models or decision trees, the most common approach to understand the model is to examine the estimated coefficients or to inspect the paths from root to leaf nodes. In the context of behavioral and textual data, however, even linear models are not straightforward to interpret because of the large number (thousands to millions) of features each with their corresponding weight (Martens and Provost, 2014; Moeyersoms et al., 2016). Moreover, one may question the comprehensibility of decision trees with thousands of leaf nodes. Alternatively, for linear models, we could inspect only the features with the highest estimated weights. But for sparse data, this means that only a small fraction of the classified instances are actually explained by these features, because of the low coverage of the top-weighted features (Martens and Provost, 2014; Moeyersoms et al., 2016). Kosinski et al. (2013), for example, explain models that predict personal traits using over 50,000 Facebook “likes” by listing the pages that are most related to extreme frequencies of the target classes. For example, the best predictors for high intelligence include Facebook pages “The Colbert Report”, “Science” and “Curly Fries” (Kosinski et al., 2013). Because of the extreme sparsity of the data (users liked on average 170 out of 55,814 possible pages), these pages are only relevant to a small fraction of users predicted as “highly intelligent”, which questions the practicality of this approach for better understanding (global) model behavior.

It is important to note that the high-dimensional, sparse nature of behavioral and textual data alone does not necessarily lead to complex prediction models. If many behavioral or textual features are irrelevant for the prediction task, applying dimensionality reduction or feature selection prior to modeling, or using strong model regularization, can result in models with high predictive performance that are still interpretable. However, previous research shows that all of these techniques result in worse predictive performance compared to models that exploit the full set of behavioral or textual features for making predictions (Joachims, 1998; Junqué de Fortuny et al., 2013; Clark and Provost, 2015; Martens et al., 2016; De Cnudde et al., 2020). By means of a learning curve analysis on a benchmark of 41 behavioral data sets, De Cnudde et al. (2019) demonstrate that, when mining text or behavior, many features contribute to the predictions. Similar results have been found by Clark and Provost (2015) and Junqué de Fortuny et al. (2013) for behavioral data, and by Joachims (1998) for the analysis of textual data. In other words, the dimensionality and sparsity of the data combined with many relevant features drive the “black-box” nature of any model trained on behavioral and textual data. We represent a classification model trained on behavioral or textual data \(X_{FG}\) as \(C_{BB}\), where BB stands for “black-box”.

Explainability has recently emerged as a key business and regulatory challenge for machine learning adoption. The relevance of global interpretability of classification models is well-argued in the literature (Andrews and Diederich, 1995; Diederich, 2008; Martens et al., 2007; Junqué de Fortuny and Martens, 2015).Footnote 1 In the process of extracting knowledge from data, the predictive performance of classification models alone is not sufficient as human users need to understand the models to trust, accept and improve them (Van Assche and Blockeel, 2007). Both the United States and the European Union are currently pushing towards a regulatory framework for trustworthy Artificial Intelligence, and global organizations such as the OECD and the G20 aim for a human-centric approach (European Commission, 2020). In high-stakes application domains, explanations are often legally required. In the credit scoring domain, for example, legislation such as the Equal Credit Opportunity Act in US Federal Law (US Federal Trade Commission, 2003) prohibits creditors from discrimination and requires reasons for rejected loan applications. Also in lower-stakes applications, such as (psychologically) targeted advertising (Matz and Netzer, 2017; Moeyersoms et al., 2016) or churn prediction (Verbeke et al., 2012), explanations are managerially relevant. Global interpretability allows one to verify the knowledge that is encoded in the underlying models (Andrews and Diederich, 1995; Huysmans et al., 2006). Models trained on big data may learn incorrect trends, overfit the data or perpetuate social biases (Chen et al., 2017). Furthermore, explanations might give users more control over their virtual footprint: Matz et al. (2020) argue that insight into what data is being collected and the inferences that can be drawn from it allows users to make more informed privacy decisions. Moreover, global model explainability can help to induce new insights or generate hypotheses (Andrews and Diederich, 1995; Shmueli, 2010).

Rule-extraction has been proposed in the literature to generate global explanations by distilling a comprehensible set of rules (hereafter referred to as “explanation rules”) from complex classifiers \(C_{BB}\). The complexity of the set of rules is strongly restricted so that the resulting explanation is understandable to humans. Rule-extraction in the context of big behavioral and textual data can be challenging, and, to our knowledge, has thus far received scant attention. Because of the data characteristics, rule-extraction might fail in its primary task (providing insight into the black-box model), as the complex model needs to be replaced by a set of hundreds or even thousands of explanation rules (Huysmans et al., 2006; Sushil et al., 2018).

This article addresses the challenge of using rule-extraction to globally explain classification models trained on behavioral and textual data. Instead of focusing on rule-extraction techniques themselvesFootnote 2, this article leverages an alternative higher-level feature representation \(X_{MF}\) \(\subset\) \(\mathbb {R}^{n \times k}\), where MF stands for “metafeatures” and k represents the number of metafeatures. Metafeatures are expected to improve the fidelity (approximation of the black-box classification model), explanation stability (obtaining the same explanations for slightly different training sessions; a concept we introduce and hereafter refer to simply as stability) and accuracy (correct predictions of the original instances) of the extracted explanation rules. For simplicity, we focus only on classification problems. Our main claim is that metafeatures are more appropriate, in specific ways we discuss, for extracting explanation rules than the original behavioral and textual features that are used to train the model.

This article’s main contributions are threefold: (1) we propose a novel methodology for rule-extraction by exploring how higher-level feature representations (metafeatures) can be used to increase global understanding of classification models trained on fine-grained behavioral or textual data; (2) we define a set of quantitative criteria to assess the quality of explanation rules in terms of fidelity, stability, and accuracy; and empirically study the trade-offs between these; and lastly, (3) we perform an in-depth empirical evaluation of the quality of explanations with metafeatures using nine data sets, and benchmark their performance against the explanation rules extracted with the behavioral or textual features. We aim to answer the following empirical questions:

  • How do explanation rules extracted with metafeatures compare against rules extracted with the fine-grained behavioral and textual features across different evaluation criteria (fidelity, stability, accuracy)?

  • How does the fidelityFootnote 3 of explanation rules vary for different complexity settings?

  • To what extent do the fidelity and stability of explanation rules extracted with metafeatures depend on a key parameter of our metafeatures-based rule-extraction methodology, that is the parameter k that represents the number of metafeatures?

2 Related work

2.1 Rule-extraction

In the Explainable Artificial Intelligence (XAI) literature, rule-extraction falls within the class of post-hoc explanation methods that use “surrogate models” to gain understanding of the learned relationships captured by the trained model (Martens et al., 2007; Murdoch et al., 2019). The idea of surrogate modeling is to train a comprehensible surrogate model (the white-box \(C_{WB}\)) to mimic the predictions of a more complex, underlying black-box modelFootnote 4\(C_{BB}\) (Diederich, 2008). We define a black-box model as a complex model from which it is not straightforward for a human interpreter to understand how predictions are made. In this article, we consider any classification model (linear, rule-based or nonlinear) utilizing a large number of features as a black-box (because of the specific nature of the data described in the Introduction, a large number of features are typically used in the final models); this differs from previous research that only considers highly nonlinear models as black-boxes (Andrews and Diederich, 1995; Martens et al., 2007; Diederich, 2008; Martens et al., 2009).

In the machine learning literature, small decision trees and rule-based models with few rules have been argued to yield the most comprehensible classification models (Van Assche and Blockeel, 2007; Freitas, 2013), making them good candidates to use as white-box models \(C_{WB}\) to extract a set of explanation rules (known as “rule-extraction”).Footnote 5 It is important to note that the complexity of the rules needs to be restricted so that the resulting explanation is comprehensible to humans (Martens et al., 2007, 2009).

Rule-extraction can be used for two purposes. First and foremost, one may be interested in knowing the rationale behind decisions made by a classification model \(C_{BB}\) and verifying whether the results make sense in practice. The goal is then to extract comprehensible rules that closely mimic the black-box, which is measured by what is called “fidelity”. Alternatively, the goal can be to improve the “accuracy”, namely, the generalization performance of a white-box model (e.g., a small decision tree or a concise set of rules) by approximating the black-box (Martens et al., 2009; Huysmans et al., 2006). In this article, we discuss most results in terms of fidelity instead of accuracy as our focus is on developing global explanations that “best mimic the black-box”—but all our analyses can also be done using accuracy as the main metric.

Rule-extraction methods use the mapping of the data to the predicted labels, i.e., the input-output mapping defined by the model \(C_{BB}\) (Andrews and Diederich, 1995; Martens et al., 2007; Huysmans et al., 2006). The idea behind this approach is that the similarity between the black-box and white-box model (the fidelity) can be substantially improved by presenting the labels predicted by the black-box model \(\hat{\mathbf{y}} = \{\hat{y}_{i}\}_{i=1}^n\) to the white-box model, instead of the ground-truth labels \(\mathbf{y} = \{y_{i}\}_{i=1}^n\) (Martens et al., 2009; Junqué de Fortuny and Martens, 2015).

2.2 Challenges of rule-extraction for high-dimensional data

The vast majority of the rule-extraction literature has focused on improving the fidelity and scalability of rule-learning algorithms. However, despite some very impressive and promising work (Andrews and Diederich, 1995; Martens et al., 2007, 2009; Diederich, 2008; Junqué de Fortuny and Martens, 2015), rule-extraction techniques have mostly been validated on low-dimensional, dense data, such as the widely used set of benchmark data from the UCI Machine Learning repository (Bache and Lichman, 2020). These data sets have feature dimensionalities of up to about 50 features. We identify at least three challenges with regard to using rule-extraction to explain classifiers on fine-grained behavioral and textual data:

  1. Complexity of the explanation rules. In the context of high-dimensional data with many relevant features, rule-extraction might fail to provide insight into the black-box model as the black-box model needs to be replaced by a large set of rules (Martens et al., 2007, 2009; Huysmans et al., 2006). Sushil et al. (2018) applied rule-extraction to (real-world) textual data and show that rule learners can closely approximate the underlying model, but at the cost of being very complex (hundreds of rules). A related challenge, stemming from rule-based learners not being very adept at handling high dimensionality, is their high variance, which can result in overfitting (Kotsiantis et al., 2006; De Cnudde et al., 2020).

  2. Computational complexity. Not every existing rule-learning algorithm can readily be applied to high-dimensional data, because the learning task can become computationally too demanding (Andrews and Diederich, 1995; Sushil et al., 2018). Some algorithms, such as Ripper (Cohen, 1995), are not able to computationally deal with problem instances having large-scale feature spaces (Sushil et al., 2018).

  3. Fine-grained feature comprehensibility. Diederich (2008) questions the usefulness of rules extracted from models trained on behavioral or textual data. For example, when rules are learned from a model initially trained on a “bag-of-words” representation of text documents, the antecedents in a rule include individual words taken out of their context. This can reduce the semantic comprehensibility of the explanations. Likewise, for digital trace data, we can question the comprehensibility of a single action (e.g., a single credit card transaction, a single Facebook “like”) taken out of its context, that is, the collection of all behaviors of an individual.

Because of the above challenges, it is questionable whether fine-grained behavioral and textual features are the best representation for extracting global explanation rules to achieve the best explanation quality in terms of fidelity, stability, and accuracy. This motivates our approach to use a metafeatures representation instead. It is not clear a priori whether such a representation will improve the quality of explanations of models on behavioral and textual data, making this a key empirical question that we study in this article.

3 Metafeatures

3.1 Motivation

As previously introduced, behavioral and textual data suffer from high dimensionality and sparsity. For this reason, the features individually may exhibit little discriminatory power to explain the black-box model. Because of the low coverage that characterizes such sparse features, a single feature is not expected to “explain” much of the classifications of the underlying model. The feature will only be active (nonzero) for a small fraction of all data instances, and therefore, the coverage of an explanation rule with a single behavioral or textual feature is likely to be low (Sommer, 1995; Martens and Provost, 2014; Chen et al., 2016; Sushil et al., 2018).Footnote 6

We address the data sparsity by mapping the fine-grained, sparse features \(X_{FG}\) \(\subset\) \(\mathbb {R}^{n \times m}\) onto a higher-level, less-sparse feature representation \(X_{MF}\) \(\subset\) \(\mathbb {R}^{n \times k}\) (to which we refer as “metafeatures”), where m and k respectively represent the dimensionality of the original features and metafeatures. Existing research has experimented with the idea of using higher-level features other than the actual features used by the model to extract explanations  (Ribeiro et al., 2016; Chen et al., 2016; Kim et al., 2018; Lee et al., 2019). In the field of image recognition, for example, the input pixels are not straightforward to interpret, hence researchers have proposed to use a patch of similar pixels (super-pixels) for generating explanations of image classifications (Ribeiro et al., 2016; Wei et al., 2018). Another example stems from the field of natural language processing, where Chen et al. (2016) cope with data sparsity by clustering similar features by their frequency in large data sets. All of these approaches have, however, not been used before in the context of rule-extraction for models on big behavioral and textual data.

3.2 Desired properties

We propose the following set of properties for engineering metafeatures:

  1. Low dimensionality. We want the dimensionality k of the metafeatures to be smaller than the dimensionality m of the original feature space: \(k \ll m\). A lower feature dimension may lead to more stable explanation rules (Alvarez-Melis and Jaakkola, 2018). Moreover, the computational burden for extracting rules with metafeatures is likely to be much lower compared to rule-extraction with high-dimensional data (Andrews and Diederich, 1995; Sushil et al., 2018).

  2. High density. This property relates to the coverage of a metafeature, which we want to be higher than the coverage of the fine-grained features (Chen et al., 2016). In other words, there should be more instances for which a metafeature is active (nonzero) than for the fine-grained features. The higher density (lower sparsity) of the metafeatures is expected to increase the fidelity and accuracy of the explanation rules, because rules predicting the non-default target class (often the “class of interest”) will have a higher coverage.

  3. Faithful to the original feature representation. This property is in line with prior research suggesting that the representation of the original data instances in terms of metafeatures should preserve relevant information to discriminate between the predicted labels \(\hat{\mathbf{y}}\) (Alvarez-Melis and Jaakkola, 2018). The metafeatures should preserve the predictive information of the original features, as the black-box model is trained on the latter. It is important that rules extracted using metafeatures can reach a high level of discriminatory power with regard to the predictions being made by the model, because this will result in a better approximation of the underlying model as measured by fidelity. In the experiments, we use the test fidelity of the explanations as a proxy to measure the faithfulness of the metafeatures \(X_{MF}\) to the original features \(X_{FG}\). In addition, we use the Gini index to measure the predictive information captured by each metafeature.Footnote 7

  4. Semantic comprehensibility. Metafeatures should have a human-comprehensible interpretation (Alvarez-Melis and Jaakkola, 2018). For example, Facebook “likes” can be grouped into semantically meaningful categories (e.g., “Entrepreneurship”) and GPS location data can be categorized into venue types (e.g., “Concert halls”). This property is subjective in nature and depends on the application domain and the expectations of the users who interact with the model and its explanations (Wood, 1986; Campbell, 1988; Huysmans et al., 2006; Huysmans et al., 2011). We do not explicitly measure the comprehensibility of explanations with (meta)features, as this would require experimentation with people, an important research direction to explore if metafeatures indeed improve the quality of explanations in the other dimensions we study here (fidelity, stability, accuracy). In this article, we make the assumption that the resulting metafeatures are semantically meaningful. In Sect. 4.5, we demonstrate how metafeatures generated with a data-driven method (e.g., Non-negative Matrix Factorization) can be interpreted, based on common practices described in the literature (Wang and Blei, 2011; O’Callaghan et al., 2015; Contreras-Pina and Sebastián, 2016; Kulkarni et al., 2018; De Cnudde et al., 2019). Note that the metafeatures that are manually crafted (the “domain-based” metafeatures described in Sect. 4.2) are, by design, comprehensible to humans.

4 Metafeatures-based explanation rules

We introduce and validate a methodology to extract explanation rules from a complex model \(C_{BB}\) trained on behavioral and textual data \(X_{FG}\). The steps of the proposed methodology are summarized in Fig. 1 and discussed below.

Fig. 1 Proposed rule-extraction methodology using metafeatures

4.1 Model building and predicting labels

From the behavioral and textual data \(X_{FG}\) (having n instances and m features) we train and test the black-box model \(C_{BB}\). The model is trained on a subset of \(\alpha \times \beta \times n\) instances (the training set \(X_{FG,train}\) with corresponding labels \(\mathbf{y}_{{\varvec{train}}}\)) and hyperparameters are optimized using a holdout set of \(\alpha \times (1-\beta ) \times n\) instances (the validation set \(X_{FG, val}\) with labels \(\mathbf{y}_{{\varvec{val}}}\)). Finally, the generalization performance of the black-box model is evaluated on an unseen part of the data (the test set \(X_{FG, test}\) with labels \(\mathbf{y}_{{\varvec{test}}}\)) that contains \((1-\alpha ) \times n\) instances.Footnote 8 The trained black-box model is used to make predictions \(\hat{\mathbf{y}}\) for all instances in the data set (training, validation and test data), which are thereafter used to train, fine-tune and test our white-box model \(C_{WB}\) (the explanation rules).
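A minimal sketch of this step with scikit-learn (the library used in our experiments) is given below. The variable names (X_fg, y), the split fractions and the \(\ell_{2}\)-LR hyperparameters are illustrative assumptions, not the exact settings of our experiments.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X_fg: sparse (n x m) behavioral/textual data matrix, y: ground-truth labels
# (both assumed to be defined); alpha and beta are illustrative split fractions.
alpha, beta = 0.8, 0.8

# Split off the test set with (1 - alpha) * n instances ...
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X_fg, y, train_size=alpha, random_state=0)
# ... and a validation set with alpha * (1 - beta) * n instances for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, train_size=beta, random_state=0)

# Train the black-box C_BB (here an l2-regularized logistic regression);
# its regularization strength C would be tuned on the validation set.
c_bb = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
c_bb.fit(X_train, y_train)

# Black-box predicted labels y_hat for all parts of the data: these replace
# the ground-truth labels when training, tuning and testing the white-box
# explanation rules C_WB.
y_hat_train = c_bb.predict(X_train)
y_hat_val = c_bb.predict(X_val)
y_hat_test = c_bb.predict(X_test)
```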

4.2 Generating metafeatures \(X_{MF}\)

We need to specify a feature transformation process to group behavioral and textual features \(X_{FG}\) with similar properties together in metafeatures \(X_{MF}\) (Chen et al., 2016). There are various approaches for generating metafeatures from the original features, either by manually crafting them using domain knowledge (Murdoch et al., 2019) or by automatically obtaining them by means of data-driven feature engineering techniques, such as (un)supervised dimensionality reduction. In the following, we use DomainMF and DDMF as abbreviations for domain-based metafeatures and data-driven metafeatures respectively.

4.2.1 Domain-based metafeatures

One way of generating metafeatures from the original features is to group features together in domain-based categories that are manually crafted by experts (Murdoch et al., 2019; Alvarez-Melis and Jaakkola, 2018). For example, for the Facebook “like” data, individual Facebook pages can be grouped together in predetermined categories, for example, pages related to “Machine Learning”. This human-selected set of metafeatures can then be used to extract simple rules to explain model predictions, which represent the relative importance of these domain-based metafeatures in the prediction model. In this article, we mathematically denote the domain-based metafeatures as \(X_{DomainMF}\) \(\subset\) \(\mathbb {R}^{n \times k}\).

4.2.2 Data-driven metafeatures

Alternatively, metafeatures can be generated via a data-driven approach, such as matrix factorization-based dimensionality reduction.Footnote 9 The idea is to increase density by representing the data in a lower-dimensional space without too much loss of information. The original data matrix \(X_{FG}\) with n unique instances and m unique features is factorized into two matrices \(L_{n \times k}\) and \(R_{k \times m}\) such that \(X_{FG} \approx LR\). The k columns of L represent the metafeatures, and each instance has a representation in the new k-dimensional space. The matrix R represents the relationship between the new metafeatures and the original features (Clark and Provost, 2015).

Metafeatures group together related features. The quality of the metafeatures depends on the number of extracted metafeatures k: a value of k that is set too high results in many highly similar categories, whereas a low value of k tends to generate overly broad metafeatures. The intended goal of generating metafeatures in this article is to use them for rule-extraction, and consequently, we optimize k such that the out-of-sample fidelity of the rules is maximal (we use a validation set to fine-tune the value of k). We consider values of k from 10 up to 1000 (Clark and Provost, 2015). Note that we should not be concerned with generating too many metafeatures because we only need to interpret the ones that are part of the final explanation rules (this is demonstrated in Sect. 4.5).
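The sketch below illustrates how k can be tuned on validation fidelity. The helper build_ddmf is hypothetical here (a possible implementation is sketched after the step-by-step description in the next paragraph), the candidate grid is illustrative, and the tree depth is fixed for simplicity, whereas in the experiments both the depth and k are tuned.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def validation_fidelity(k):
    # build_ddmf is a placeholder for the metafeature construction described
    # below: it returns the metafeature representation of the training and
    # validation instances for a given k.
    X_mf_train, X_mf_val = build_ddmf(X_train, X_val, k)
    rules = DecisionTreeClassifier(max_depth=5, random_state=0)
    rules.fit(X_mf_train, y_hat_train)
    # Fidelity on the validation set: agreement with the black-box labels.
    return accuracy_score(y_hat_val, rules.predict(X_mf_val))

candidate_k = [10, 20, 30, 50, 70, 100, 200, 500, 1000]  # illustrative grid
best_k = max(candidate_k, key=validation_fidelity)
```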

For generating metafeatures based on matrix-factorization-based dimensionality reduction, we first approximate the original training data \(X_{FG}\) by two matrices L and R for a given number of metafeatures k (step 1 in Fig. 13 in “Appendix 1”). Matrix R maps each metafeature to the original fine-grained features. We ensure mutual exclusivity by transforming matrix R into a binary matrix \(R_{\scriptscriptstyle binary}\), where 1 represents the maximum element for each column (fine-grained feature) of R and all other elements are 0 (step 2 in Fig. 13). In other words, each feature belongs to exactly one metafeature. Next, we map the original matrix \(X_{FG}\) to \(X'\) by multiplying \(X_{FG}\) with the transposed binary matrix \(R_{\scriptscriptstyle binary}^{T}\) (step 3 in Fig. 13). Finally, matrix \(X'\) is normalized over the number of active (fine-grained) features per instance (e.g., the total number of behaviors or words) to become matrix \(X_{\scriptscriptstyle DDMF-k}\), which represents the metafeatures per instance (step 4 in Fig. 13). We found that the normalized matrix \(X_{\scriptscriptstyle DDMF-k}\) produced better results (as measured by the test fidelity of the explanations) than the original matrix \(X'\) or a binary matrix derived from \(X'\). We apply the binarization of matrix R to make the resulting explanation rules more interpretable and semantically meaningful. For example, for the Facebook data, the explanation rules with the metafeatures can be interpreted in terms of the percentage of “liked” pages of a category (see Fig. 4). If we were to use the matrix L directly to represent instances in the metafeature space, ignoring the binarization and normalization steps, the explanation rules would contain logical statements that are not immediately comprehensible. For example, it would be difficult to interpret a rule that says something like, “if the value of this metafeature is higher than 0.3, then the model predicts the person as Female”, when the metafeature is not expressed in terms of an actual unit of measurement.Footnote 10
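A minimal sketch of steps 1 to 4, assuming scikit-learn's NMF implementation; the function name, the handling of sparse input and the NMF settings are our own illustrative choices rather than the exact implementation used in the experiments.

```python
import numpy as np
from sklearn.decomposition import NMF

def build_ddmf(X_train, X_other, k):
    """Construct the DDMF representation of X_train and of a second matrix
    (e.g., the validation or test set) expressed in the same feature space."""
    # Step 1: factorize the training data, X_FG ~ L R.
    nmf = NMF(n_components=k, init="nndsvd", max_iter=500)
    L = nmf.fit_transform(X_train)   # n_train x k (not used further here)
    R = nmf.components_              # k x m

    # Step 2: binarize R so that every fine-grained feature is assigned to
    # exactly one metafeature (the one with the maximum weight for that feature).
    R_binary = np.zeros_like(R)
    R_binary[R.argmax(axis=0), np.arange(R.shape[1])] = 1

    def transform(X):
        X = np.asarray(X.todense()) if hasattr(X, "todense") else np.asarray(X)
        # Step 3: map the instances onto the metafeatures.
        X_prime = X @ R_binary.T
        # Step 4: normalize by the number of active fine-grained features per
        # instance (e.g., the total number of likes or words).
        active = np.maximum((X != 0).sum(axis=1, keepdims=True), 1)
        return X_prime / active

    return transform(X_train), transform(X_other)
```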

In this article, we experimented with two well-established dimensionality reduction methods based on matrix factorization: Non-negative Matrix Factorization (NMF) and Singular Value Decomposition (SVD). NMF is applied in multiple domains to decompose a non-negative matrix into two non-negative matrices (Lee and Seung, 2001). In most real-life applications, negative components or subtractive combinations in the representation are physically meaningless. Incorporating the non-negativity constraint thus facilitates the interpretation of the extracted metafeatures in terms of the original data (Wang and Zhang, 2012; Kulkarni et al., 2018; Clark and Provost, 2015). SVD is a popular technique for matrix factorization across a wide variety of domains such as text classification (Husbands et al., 2001) and image recognition (Turk and Pentland, 1991). SVD is computed by optimizing a convex objective function and the solution is equivalent to the eigenvectors of the data matrix. We implemented these dimensionality reduction techniques using Python’s Scikit-learn package (Pedregosa et al., 2011).

An important assumption we make is that the resulting data-driven metafeatures are semantically meaningful. While the obtained metafeatures are not always guaranteed to be interpretable, NMF in particular has been shown to provide interpretable results for fine-grained data applications (Contreras-Pina and Sebastián, 2016; Lee and Seung, 1999), compared to other techniques such as SVD. Usually, metafeatures are interpreted by looking at the top-weighted features (Wang and Blei, 2011; O’Callaghan et al., 2015; Contreras-Pina and Sebastián, 2016; Kulkarni et al., 2018; De Cnudde et al., 2019). It is important to note that only the metafeatures that are part of the final explanation rules need to be interpreted.Footnote 11

4.3 Extracting explanation rules

Both rule and decision tree learning algorithms can be used for rule-extraction. Since trees can be converted to rules, we also use tree algorithms for rule-extraction (Martens et al., 2007; Huysmans et al., 2011; Martens, 2008). A full review of these techniques is beyond the scope of this article, but we briefly describe CARTFootnote 12, as this is the technique used in our experiments.

CART can be used for both classification and regression problems and it uses the Gini index as a splitting criterion, which measures the impurity of nodes. The best split is the one that reduces the impurity the most. We apply CART to the data where the target variable is changed to the black-box predicted class label \(\hat{\mathbf{y }}\) instead of the ground-truth labels \(\mathbf{y}\) (see Sect. 2.1).

The number of extracted explanation rules can be used as a proxy for human comprehensibility.Footnote 13 Restricting the complexity of the rule set is also motivated by research on how people make decisions: based on relatively simple rules to avoid excess cognitive effort (Gigerenzer and Goldstein, 2016; Hauser et al., 2009) due to cognitive limitations (Sweller, 1988). In the context of consumer decision-making, for example, Hauser et al. (2009) argue that decision rules should incorporate “cognitive simplicity”: rule sets should consist of a limited number of rules, each with a small number of antecedents. Finally, it is important to note that the concept of comprehensibility in the context of explanation rules comprises many different aspects, such as the size of the explanation, but also the specific application context and the subjective opinion and expectations of the end user, which makes it difficult to measure comprehensibility in a generic way (Huysmans et al., 2011; Campbell, 1988; Wood, 1986). In line with cognitive simplicity arguments (Hauser et al., 2009; Sweller, 1988), in the experiments in Sect. 6, we restrict the complexity of the explanations to at most 32 rules, each consisting of at most five antecedents (this is equivalent to a tree depth of at most five).
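Continuing the notation of the earlier sketches, the extraction step can be illustrated with scikit-learn's CART implementation (DecisionTreeClassifier), fitted on the black-box predicted labels and restricted to a depth of at most five; the use of export_text to read off the rules and the variable name metafeature_names are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Fit CART surrogates of increasing complexity on the *predicted* labels
# y_hat (not the ground-truth y) and keep the depth with the best
# validation fidelity.
best_depth, best_fid = 1, -1.0
for depth in range(1, 6):  # depth of at most 5, i.e., at most 2^5 = 32 rules
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_mf_train, y_hat_train)
    fid = accuracy_score(y_hat_val, tree.predict(X_mf_val))
    if fid > best_fid:
        best_depth, best_fid = depth, fid

c_wb = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
c_wb.fit(X_mf_train, y_hat_train)
# Every root-to-leaf path of the tree corresponds to one explanation rule.
print(export_text(c_wb, feature_names=metafeature_names))
```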

4.4 Evaluating explanation rules

4.4.1 Fidelity

First and foremost, the explanation rules are evaluated on how well they approximate the classification behavior of the underlying model. Fidelity measures the ability of the rules to mimic the classification behavior of the model from which they are extracted. Let \(\{\mathbf{x}_{i}, y_{i}\}_{i=1}^n\) represent the labeled data instances and \(\mathbf{y}^{WB}\) and \(\hat{\mathbf{y}}\) respectively the white-box and black-box predicted labels. Fidelity is expressed as the fraction of instances for which the label predicted by the explanation rules (the white-box predicted label) equals the black-box predicted label (Craven and Shavlik, 1999; Huysmans et al., 2006):

$$\begin{aligned} fidelity^{WB} = \dfrac{| \{\hat{y}_{i}=y_{i}^{WB} \mid \mathbf {x}_{i}\}_{i=1}^n |}{n} \end{aligned}$$
(1)

While most of our analysis is based on fidelity, we can extend fidelity to “f-score fidelity” (f-fidelity). The f-fidelity is defined as the harmonic mean between the precision and recall of the white-box predictions (w.r.t. the predicted labels \(\hat{\mathbf{y}}\) rather than the true labels \(\mathbf{y}\)). More precisely, the formula of f-fidelity is \(\frac{2 \cdot precision \cdot recall}{precision + recall}\), where the precision of the classifier is the fraction of positively-predicted instances that are correctly classified and the recall is the fraction of positive instances that are correctly classified as positive. Note that f-fidelity is less intuitive than fidelity, but we use it in the experiments as an additional metric to measure to what extent the explanation rules reflect the black-box model, which is especially relevant for imbalanced problems, i.e., prediction tasks where the distribution of the target variable is skewed (e.g., the 20news data in the experimentsFootnote 14).
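Since both metrics are computed against the black-box predicted labels rather than the ground truth, they reduce to standard scores; a minimal sketch, assuming X_mf_test is the metafeature representation of the test instances obtained as in Sect. 4.2 and c_wb is the extracted rule set from the earlier sketch:

```python
from sklearn.metrics import accuracy_score, f1_score

y_wb_test = c_wb.predict(X_mf_test)               # white-box (rule) predictions
fidelity = accuracy_score(y_hat_test, y_wb_test)  # Eq. (1): agreement with C_BB
f_fidelity = f1_score(y_hat_test, y_wb_test)      # harmonic mean of precision
                                                  # and recall w.r.t. y_hat
```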

4.4.2 Stability

A second important factor is explanation stability—which we will call stability from here on. Users, businesses, or regulators may have a hard time accepting explanations that are unstable (meaning that small changes in the data lead to large changes in explanations of the black-box model), even if the explanation has been shown to have high fidelity and comprehensibility (Van Assche and Blockeel, 2007). Turney (1995) distinguishes two types of stability: syntactic and semantic stability. Semantic stability is often measured by estimating the probability that two models learned on different training sets will give the same prediction to an instance. Syntactic stability, on the other hand, measures how similar two explanations are (e.g., the overlap of features in two different explanations), and is more specific to a particular explanation representation (Turney, 1995). We argue that syntactic stability is the most relevant type of stability in the context of explaining classification models. To the best of our knowledge, it remains an open question how to measure syntactic stability for different explanation representations, such as rules and trees. We propose a procedure based on the Jaccard coefficient to measure the syntactic stability of explanation rules; more specifically, we measure the overlap of the features that are part of explanations extracted from slightly different subsets of training data.Footnote 15 To compute the stability of explanation rules extracted using different data representations \(\forall X \in \{X_{FG}, X_{DDMF-k}, X_{DomainMF}\}\) and the black-box predicted labels \(\hat{\mathbf{y}}\), we propose the following procedure (a minimal sketch follows the steps below):

  • Step 1 Generate B samples {\(X_{trainBS,j}\)}\(_{j=1}^B\) from the training data \(X_{train}\) using bootstrapping.Footnote 16

  • Step 2 Extract explanation rules \(C_{WB,j}\) from each bootstrap training sample \(X_{trainBS,j}\) (this can be the fine-grained or metafeature representation) and the corresponding labels \(\hat{\mathbf{y }}_{{\varvec{{trainBS,j}}}}\) predicted by the black-box model. It is important to note that the data-driven metafeatures \(X_{DDMF-k}\) need to be computed again for each bootstrap training sample. Obtain B explanations and keep track of the features that are part of the explanations in B sets of features {\(F_{j}\)}\(_{j=1}^B\).

  • Step 3 Make \(\frac{B!}{2!(B-2)!}\) pairwise comparisons of the extracted explanations using the Jaccard coefficient. For two sets of features \(F_{v}\) and \(F_{w}\) (respectively representing the features in explanations \(C_{WB,v}\) and \(C_{WB,w}\)), the Jaccard coefficient is defined as \(J(F_{v},F_{w}) = |F_{v}\cap F_{w}| / |F_{v} \cup F_{w}|\). The Jaccard coefficient equals 1 if the sets are equal (the explanations have perfect overlap of features) and 0 if they are disjoint (the explanations are completely different). For the data-driven metafeatures, two metafeatures are considered to have the same interpretation when the Jaccard coefficient between their sets of top-weighted features exceeds a cut-off value of c.Footnote 17

  • Step 4 Compute the average (pairwise) Jaccard coefficient over \(\frac{B!}{2!(B-2)!}\) comparisons.
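A minimal sketch of this procedure for the fine-grained or domain-based representations, using scikit-learn's resample for the bootstrap; B and the helper names are illustrative. For the data-driven metafeatures, one would additionally recompute the metafeatures per bootstrap sample and match them via the top-features Jaccard criterion with cut-off c.

```python
from itertools import combinations
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def explanation_feature_set(X, y_hat, max_depth=5):
    """Extract rules from one bootstrap sample and return the set of feature
    indices that actually appear as splits in the tree."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X, y_hat)
    return frozenset(int(f) for f in tree.tree_.feature if f >= 0)  # -2 = leaf

def stability(X_train, y_hat_train, B=10):
    # Steps 1 and 2: bootstrap the training data and collect feature sets.
    feature_sets = [explanation_feature_set(*resample(X_train, y_hat_train,
                                                      random_state=j))
                    for j in range(B)]
    # Steps 3 and 4: average pairwise Jaccard coefficient over B(B-1)/2 pairs.
    jaccards = [len(F_v & F_w) / (len(F_v | F_w) or 1)
                for F_v, F_w in combinations(feature_sets, 2)]
    return float(np.mean(jaccards))
```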

Prior research has shown that explanations that rely on high-dimensional data tend to be less robust than methods that operate on higher-level features (Alvarez-Melis and Jaakkola, 2018). For this reason, we expect the extracted explanation rules with metafeatures to be more stable over different training sessions than the rules with the original behavioral and textual features. It is important to note, however, that for the metafeatures generated with a data-driven approach (\(X_{DDMF-k}\)), the computed stability of the explanations also depends on the metafeature generation method (e.g., NMF). For each bootstrap sample, the data-driven metafeatures are computed again. For the domain-based metafeatures \(X_{DomainMF}\) and the original features \(X_{FG}\), the features do not have to be recomputed for each bootstrap sample, making this part of the rule-extraction process relatively more stable.

4.4.3 Accuracy

Rule-extraction has also been used to increase the generalization performance of white-box models \(C_{WB}\), as measured by accuracy. Martens et al. (2009) show that rules that mimic the behavior of an underlying, better-performing model can become more accurate compared to the rules learned from the original data and the corresponding ground-truth labels \(\mathbf{y}\). Accuracy is defined as the fraction of correctly classified instances by the explanation rules (Huysmans et al., 2006):

$$\begin{aligned} accuracy^{WB} = \dfrac{| \{y_{i}=y_{i}^{WB} \mid \mathbf {x}_{i}\}_{i=1}^n |}{n} \end{aligned}$$
(2)

4.5 Examples of explanation rules

In this subsection, we show examples of the explanation rules extracted from classification models (\(\ell _{2}\)-regularized Logistic RegressionFootnote 18) that predict gender from Facebook “like” data (Praet et al., 2018) and movie viewing data (Harper and Konstan, 2015), and predict whether a news post is about “atheism” (Lang, 1995). Moreover, for the explanation rules based on data-driven metafeatures, we demonstrate how to interpret metafeatures that are part of the explanation. For the other data in the experiments (see Sect. 5.1), the semantic meaning of the features is not publicly available, and for this reason, we do not include examples for these prediction models.

Fig. 2 Example of explanation rules using the fine-grained Facebook pages \(X_{FG}\) to explain predictions of the \(\ell _{2}\)-LR model for Facebook data. The global explanation tells us, for example, that the \(\ell _{2}\)-LR model tends to predict Facebook users as “Female” when they like the magazine “Flair”, “Lou Reed”, and “Angus and Julia Stone”

Fig. 3 Example of explanation rules using the domain-based metafeatures \(X_{DomainMF}\) to explain predictions of the \(\ell _{2}\)-LR model for Facebook data. The explanation tells us, for example, that the \(\ell _{2}\)-LR model tends to predict Facebook users as “Female” when more than 2% of their likes belong to the category “Fashion” and more than 2% of their likes are about “Food”

Fig. 4 Example of explanation rules using the data-driven metafeatures \(X_{DDMF-k}\) (k=70) to explain predictions of the \(\ell _{2}\)-LR model for Facebook data. The explanation tells us, for example, that the \(\ell _{2}\)-LR model tends to predict Facebook users as “Female” when less than 2% of their likes belong to the metafeature “Female media”, less than 6% of their likes are about “Cooking” and more than 11% of their likes are related to “Interior Design”

Explanations of the gender prediction model on Facebook data with the fine-grained features (Facebook pages), the domain-based metafeatures (the manually crafted categories) and the data-driven metafeatures are shown in Figs. 2, 3 and 4 respectively. Table 4 in “Appendix 2” shows the data-driven metafeatures that are part of the explanation in Fig. 4, and the top-20 Facebook pages with the highest weights for each. The cluster names at the bottom show our interpretation of each metafeature. There are four metafeatures that group similar Facebook pages, such as “Female media” and “Interior design”. Note that the process of interpreting metafeatures might require domain-specific knowledge, but we mainly aim to demonstrate how the interpretation of metafeatures is usually done (Wang and Blei, 2011; O’Callaghan et al., 2015; Contreras-Pina and Sebastián, 2016; Kulkarni et al., 2018; De Cnudde et al., 2019).

Fig. 5 Example of explanation rules using the fine-grained words \(X_{FG}\) to explain predictions of the \(\ell _{2}\)-LR model for 20news data. The explanation tells us, for example, that the \(\ell _{2}\)-LR model tends to predict news posts as “Atheism” when the tf-idf values of “atheism”, “atheists” and “morality” are respectively less than 0.01, less than 0.11 and more than 0.09

Fig. 6 Example of explanation rules using the data-driven metafeatures \(X_{DDMF-k}\) (k=30) to explain predictions of the \(\ell _{2}\)-LR model for 20news data. The explanation tells us, for example, that the \(\ell _{2}\)-LR model tends to predict the topic of news posts as “atheism” when the values of the metafeatures “Posts from Bob Beauchaine”, “Afterlife”, and “Israeli-Palestinian conflict” are respectively more than 0.09, more than 0.1 and less than 0.035

Fig. 7 News post from the 20news data with ground-truth topic “atheism”. 7.29% of the posts about “atheism” are from Beauchaine and contain the same quote “They said that Queens could stay, they blew the Bronx away, and sank Manhattan out at sea”. The words in Metafeature 1 in Table 5 are related to the posts of Beauchaine and clearly represent words of the quote. The model overfitted on the posts of Beauchaine, more specifically on his name and signature quote, as shown by the explanation in Fig. 6, which contains Metafeature 1

Explanation rules for the Logistic Regression model on the 20news data to predict the topic “atheism” are shown in Figs. 5 and 6 (explanations with the words in a news post and with data-driven metafeatures, respectively). Table 5 in “Appendix 3” shows the top-20 words with the highest weight per metafeature (only the metafeatures that are part of the explanation in Fig. 6 are shown). For example, there are five metafeatures that group similar words into subtopics such as the “Israeli-Palestinian conflict”. Interestingly, the explanation with the data-driven metafeatures shown in Fig. 6 reveals a problem with the model: it is overfitting on the posts from Bob Beauchaine and his signature quote containing words like “Manhattan” and the “Bronx” (Metafeature 1 in Table 5). Figure 7 shows an example news post from Beauchaine. When we generate an explanation with the words in the news posts (see Fig. 5), we are not able to diagnose the overfitting on the posts of Beauchaine so easily. This nicely illustrates a specific use case of metafeatures-based explanations of models trained on behavior or text that goes beyond the question of whether DDMF-based explanations are more suitable for these models: it shows how DDMF-based explanations can serve as a “tool” for improving the model or gaining insight from it, one that complements, and does not necessarily replace, FG-based explanations.

In “Appendix 4”, the explanations of an \(\ell _{2}\)-LR model on movie viewing data (Movielens1m) to predict gender are shown. The explanation with data-driven metafeatures tells us that the LR model predicts users as “Female” when at least \(2\%\) of the movies they watched have female (lead) characters and at least \(3\%\) are drama movies (see Fig. 15). The interpretation of the data-driven metafeatures is shown in Table 6.

5 Experimental setup

The experiments in this article explore the performance of explanation rules with metafeatures versus the original features on which the model is trained. We make a distinction between domain-based metafeatures (\(X_{DomainMF}\)) and metafeatures generated with a data-driven method (\(X_{DDMF-k}\)). The dimensionality reduction parameter k determines the number of metafeatures; it is fixed for the domain-based metafeatures, but is a hyperparameter for the data-driven metafeatures. We evaluate the performance on a suite of classification tasks using nine behavioral and textual data sets. Figure 16 in “Appendix 5” summarizes the experimental procedure for evaluating the fidelity, f-fidelity and accuracy of explanations using five-fold cross-validation (CV), and the explanation stability using bootstrapping.

5.1 Data sets and models

Our experimental data comprise seven behavioral and two textual data sets. The data sets are summarized in Table 1. The Movielens100 and Movielens1m (Harper and Konstan, 2015) data sets contain movie viewing data from the MovieLens website. We focus on the task of predicting the gender of these users. The Airline dataFootnote 19 contains Twitter data about American airlines, and the task is to predict (positive) sentiment. The Facebook “like” data collected by Praet et al. (2018) (Facebook) contains likes from over 6000 individuals in Flanders (Belgium) and is used to predict gender. The Yahoomovies dataFootnote 20 also contains movie viewing behavior, from which we predict gender. The Tafeng data (Hsu et al., 2004) consists of fine-grained supermarket transactions, where we predict the age of customers (younger or older than 30) from the products they have purchased. The 20news data (Lang, 1995) contains about 20,000 news posts. For this data, the task is to predict whether a post belongs to the topic “atheism”, based on the words of the post. Another behavioral data set is the Libimseti data (Brozovsky and Petricek, 2007), which contains data about profile ratings from users of the Czech social network Libimseti.cz. The prediction task is, again, the gender of the users. Lastly, the Flickr data (Cha et al., 2009) contains millions of Flickr pictures and the target variable is the popularity of a picture (the number of comments it has).

Table 1 Characteristics of the data sets: data type (Type: behavioral(B)/textual(T)), classification task (Target), number of instances (Instances), number of features (Features), number of domain-based metafeatures (DomainMF), balance of the target b (fraction of instances with a positive class label), and sparsity of the data \(\rho\) (fraction of zero feature values in the data \(X_{FG}\))

All data have a high-dimensional feature space with up to hundreds of thousands of features. Movielens1m, Movielens100 and Airline have lower-dimensional feature spaces compared to the other data sets. The large sparsity values \(\rho\) for all data indicate that the number of active features is very small compared to the total number of features.

Table 2 Average performance of black-box classification models (\(\ell _{2}\)-LR and RF) on the test data using five-fold CV

We train Logistic Regression models with \(\ell _{2}\)-regularization (\(\ell _{2}\)-LR)Footnote 21 and Random Forest (RF) models with the Scikit-learn library (Python). For training the classification models, we use 80% of the data, and the remaining 20% of the data is used for testing the models. For fine-tuning the hyperparameters of the models, we use a validation set (20% of the training data). More specifically, the regularization parameter C of the \(\ell _{2}\)-LR model and the number of trees in the RF model are selected based on the accuracy on the validation set. For preprocessing the text data, we remove stopwords and lemmatize tokens using the NLTK package in Python, and then use tf-idfFootnote 22 vectorization (Joachims, 1998; Martens and Provost, 2014).
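A sketch of this preprocessing pipeline, assuming NLTK's stopword list, WordNet lemmatizer and punkt tokenizer together with scikit-learn's TfidfVectorizer; the exact tokenization choices (e.g., keeping only alphabetic tokens) and the variable documents are our own illustrative assumptions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(doc):
    # Lowercase, tokenize, drop stopwords and non-alphabetic tokens, lemmatize.
    tokens = nltk.word_tokenize(doc.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

# `documents` is assumed to be the list of raw news posts or tweets.
vectorizer = TfidfVectorizer()
X_fg = vectorizer.fit_transform(preprocess(d) for d in documents)
```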

Measuring accuracy in practice requires discrete class label predictions, which we obtain by comparing the predicted probabilities to a threshold value t and assigning a positive predicted label to instances with a predicted probability that exceeds this threshold. In practice, the choice of the threshold value t depends on the costs associated with false positives and false negatives. In this article, the exact misclassification costs are unknown, and for this reason we compute the threshold value t such that the fraction of instances that are classified as positive equals the fraction of positives in the training data (Lessmann et al., 2015). Table 2 shows the generalization performance of all models for each data set over five folds. In addition to the accuracy, we also report the f-score, precision and recall.
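One way to implement this thresholding is to take the appropriate quantile of the training-set probabilities, as in the sketch below (an illustrative implementation building on the earlier sketches, not necessarily the exact one used in the experiments):

```python
import numpy as np

# Fraction of positives in the training data.
prior = np.mean(y_train)

# Choose t as the (1 - prior)-quantile of the training probabilities, so that
# roughly a fraction `prior` of instances receives a positive predicted label.
t = np.quantile(c_bb.predict_proba(X_train)[:, 1], 1 - prior)
y_pred_test = (c_bb.predict_proba(X_test)[:, 1] >= t).astype(int)
```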

To extract explanation rules with the CART algorithm, we use the DecisionTree model of the Scikit-learn library. For controlling the complexity of the extracted rules, or equivalently the depth of the tree, we change the value of the \(max\_depth\) parameter. We let the depth of the tree vary from 1 to 5 such that the explanations are cognitively simple (which we motivated in Sect. 4.3).

We extract explanation rules with the original features \(X_{FG}\) (on which the classification models are trained) and the metafeatures \(X_{MF}\), and the predicted black-box labels \(\hat{\mathbf{y }}\). In the experiments, we generate data-driven metafeatures \(X_{DDMF-k}\) based on two approaches (see Sect. 4.2): NMF and SVD. In the experimental results, we mainly discuss the explanations with DDMF generated via NMF (simply denoted by DDMF), that showed the best (fidelity) results. For the Facebook and Movielens1m data, we also extract explanations with domain-based metafeatures.

6 Experimental results

We compare explanation rules for black-box models extracted with metafeatures against those extracted with fine-grained features, across different classification tasks, data sets and evaluation criteria. As mentioned, our main goal is to better understand how metafeatures affect these different criteria and their trade-offs.

6.1 Are metafeatures better than the original features for explaining models on behavioral and textual data with cognitively simple explanation rules?

Table 3 Evaluation of explanation rules for \(\mathbf {\ell _{2}}\)-LR model using fine-grained features (FG) and data-driven metafeatures (DDMF) with optimal dimensionality reduction parameter k in parentheses

Table 3 shows the performance of explanation rules with FG features and metafeatures for the LR models. One of the first key questions related to the performance of the rules is “what is the fidelity”, as we want our explanations to mimic the black-box as closely as possible. Overall, the results indicate that the fidelity is higher for DDMF than for the FG-based rules (on average, across all data sets, by 6.05%). The rules with DDMF achieve a higher number of wins for both the fidelity and f-fidelity (8 wins versus 1). We use a one-tailed Wilcoxon signed-rank test (Demsar, 2006) to make a statistical comparison between the fidelity of rules with DDMF vs FG features. The test is performed with a sample size of 9 data sets. We find a test statistic \(T = 2\) (which is smaller than the critical value \(T_{c} = 3\)), hence the difference in fidelity between DDMF and FG is statistically significant at a \(1\%\) significance level. The difference in f-fidelity is statistically significant at a \(5\%\) level (test statistic of \(T = 5\) compared to a critical value of \(T_{c} = 8\)).
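Such a comparison can be carried out, for example, with SciPy's implementation of the one-sided Wilcoxon signed-rank test on the nine paired fidelity values from Table 3; the function below is an illustrative sketch rather than the exact procedure we used.

```python
from scipy.stats import wilcoxon

def compare_fidelities(fidelity_ddmf, fidelity_fg):
    """One-tailed Wilcoxon signed-rank test over paired per-data-set
    fidelities (H1: DDMF fidelity > FG fidelity)."""
    statistic, p_value = wilcoxon(fidelity_ddmf, fidelity_fg,
                                  alternative="greater")
    return statistic, p_value
```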

One notable exception is the 20news data: the FG-based rules outperform the DDMF-based rules, and the fidelity values are very high while the f-fidelity results are comparatively low. This is likely because of the severe (predicted) class imbalance (\(b = 4.24\%\) in Table 1) compared to the other data. For this reason, the fidelity criterion might be less suitable for this specific data set. Instead, we could have optimized the depth of the tree and the k of the DDMF on the f-fidelity as measured on the validation set.Footnote 23

Another prominent observation that is, at least at first sight, unexpected is that the optimal tree depth (explanation complexity) does not always reach the maximum of 5. For example, for the Flickr data, the optimal complexity of FG-based explanations is a depth of 3. For the FG-based explanations of at most 32 rules, we observe only very small differences in fidelity when varying the complexity (see Fig. 8). However, when we let the complexity of the explanations grow very large, the fidelity also increases. This is in line with what we would expect: because of the data sparsity and the many features being relevant in the model, more complex explanations (larger rule sets) explain a larger fraction of the predictions, resulting in a higher fidelity, as shown for the Flickr data in Fig. 8.

Fig. 8 Fidelity on validation set of explanation rules for varying complexity settings (tree depths ranging from 1 to 200) when explaining the LR model on the Flickr data

In order to better understand what drives some of the differences in the performance of explanations with FG features and DDMF, we conjecture that this relates to the information held by, and the coverage of, the most predictive featuresFootnote 24. We look at the Gini impurity reduction (used by the CART algorithm) for different featuresFootnote 25, which we plot in Figs. 17 and 18 in “Appendix 6”. The results (visually) indicate that the ratio of the Gini impurity reduction of the top-ranked metafeatures to that of the FG features relates to the difference in fidelity between rules using FG and DDMF features. For example, consider again the Flickr data, for which the explanation rules with metafeatures achieve a fidelity 20.46% higher than the FG rules. From Fig. 18c we observe that the top-DDMF holds much more information (larger Gini impurity reduction) than the top-FG feature, which might explain the large fidelity difference between the explanations. Indeed, the correlation coefficient between this ratio (impurity reduction of the top-ranked DDMF vs the top-ranked FG) and the difference in fidelity between explanations with DDMF vs FG (from Table 3) is 0.81.

Secondly, moving to the stability of the explanations, we observe from Table 3 that the rules with DDMF are similar in stability to those with FG features (5 wins versus 4). The difference in stability is not statistically significant. It is important to note that for the DDMF-based explanations there are two sources of instability: computing metafeatures from different bootstrap samples, and extracting explanation rules from different bootstrap samples. If we “fix” the data-driven metafeatures, rather than recomputing them for each bootstrap sample, the stability of the DDMF-based explanations increases and becomes comparable to that of the DomainMF-based explanations. Furthermore, the stability results are closely tied to the parameter k. When the optimal dimensionality k of the DDMF is low (for example, Movielens1m and Tafeng), the same DDMF are likely to appear in the global explanation, resulting in more stable explanations over the bootstrap samples. When the selected value of k is higher (for example, 20news and Airline), the stability of the explanations with DDMF decreases.
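One simple way to quantify this kind of stability, shown below purely as an illustration, is the average pairwise Jaccard similarity between the sets of (meta)features appearing in the rules extracted from different bootstrap samples; this sketch may differ from the exact stability measure used in our experiments.

```python
# Hedged sketch of a feature-overlap stability measure: average pairwise Jaccard
# similarity of the feature sets used by explanations from different bootstrap samples.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def stability(feature_sets):
    """feature_sets: one collection of used (meta)features per bootstrap explanation."""
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# With few metafeatures (low k), the same ones tend to reappear across bootstraps.
print(stability([{"mf1", "mf2"}, {"mf1", "mf2"}, {"mf1", "mf3"}]))  # high
print(stability([{"mf10"}, {"mf42"}, {"mf7"}]))                     # low
```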

Thirdly, when we compare the accuracy of the rules with DDMF and FG features, we observe that the metafeatures-based explanations result in more accurate predictions with respect to the true labels \(\mathbf{y}\) (7 wins versus 2). However, using a Wilcoxon signed-rank test, we find that the difference in accuracy is not significant at the \(5\%\) level. One data set that stands out is Libimseti. For this data set, the DDMF-based explanations have higher fidelity but lower accuracy than the FG-based explanations. More strikingly, the accuracies of the explanations with FG, DDMF and DDMF-SVD (94.38%, 87.94% and 99.63%) are all higher than the accuracy of the black-box model (82.71% in Table 2). Despite the sparsity of this data set, there are features that have a large coverage and that are very predictive of the (predicted) target values. For Libimseti, there exists a prediction model with a small number of features (e.g., a tree-based model with a depth of 5) that is more accurate than models trained on the full set of behavioral features. Consequently, this does not seem to be a problem instance that requires post-hoc explanations via rule-extraction. This example illustrates that one should always first carefully verify that black-box models on the full behavioral or textual data indeed outperform intrinsically interpretable models (e.g., small decision trees or linear models with a small number of features). If not, it may not help to use a black-box model and then compute post-hoc explanations from it (Rudin, 2019). Leaving out the Libimseti data and performing the Wilcoxon test on the eight remaining data sets, we find that the differences in fidelity and accuracy between DDMF and FG explanations are statistically significant at the 5% level.

Instead of generating metafeatures with a data-driven method, we can also rely on domain-based metafeatures crafted by experts. The prominent advantage of this approach is that the resulting metafeatures are, by design, comprehensible. However, they may not always be available: we have such metafeatures for only two of the nine data sets, Facebook and Movielens1m. When comparing DDMF with domain-based metafeatures for these two data sets, we again see that fidelity is higher for the DDMF than for the DomainMF (Table 3 shows that the rules with domain-based metafeatures achieve, at best, test fidelities of \(77.66\%\) for Facebook and \(71.47\%\) for Movielens1m), providing further support for using DDMF when developing global explanations for black-boxes. However, when a straightforward semantic meaning of the metafeatures is key, one might still prefer DomainMF if they also increase the fidelity relative to explanations with FG features (as is the case for the Facebook data).

Finally, we also generate explanations for Random Forest models, for which the performance results are shown in Table 7 in “Appendix 7”. We find similar results when explaining RF models as when explaining LR models, which increases the generalizability of our experimental findings and further supports using metafeatures to explain models on behavioral and textual data. In the experiments, we also compute explanations with data-driven metafeatures based on the SVD approach. For both the LR and RF models, the results in Tables 3 and 7 indicate that, overall, the explanations with DDMF-SVD also achieve higher fidelity and accuracy than the FG-based explanations, but the differences are slightly less pronounced than for the DDMF-based explanations using NMF.
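As an illustration of the two data-driven constructions, the sketch below generates metafeatures from a sparse fine-grained matrix using scikit-learn's NMF and TruncatedSVD as stand-ins for the factorizations; the function name, the value of k and any preprocessing are placeholders, not our experimental settings.

```python
# Illustrative generation of data-driven metafeatures from a sparse fine-grained
# matrix X_fg, with NMF (DDMF) or truncated SVD (DDMF-SVD) as stand-ins.
from sklearn.decomposition import NMF, TruncatedSVD

def make_metafeatures(X_fg, k=100, method="nmf", random_state=0):
    """Return an n-by-k metafeature matrix and the fitted reducer (placeholder sketch)."""
    if method == "nmf":
        reducer = NMF(n_components=k, init="nndsvd", random_state=random_state)
    else:
        reducer = TruncatedSVD(n_components=k, random_state=random_state)
    X_mf = reducer.fit_transform(X_fg)
    return X_mf, reducer
```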

6.2 How does the fidelity of explanation rules vary for different complexity settings?

Fig. 9  Difference in average test fidelity of rules with DDMF and FG features, in percentage points, for varying complexity settings (tree depths from 1 to 5)

Figure 9 plots the difference in average test fidelity between rules with DDMF and rules with FG features against the maximum allowed explanation complexity (equivalent to the tree depth). Points above the horizontal line correspond to data sets for which the rules with DDMF perform better. The graph clearly shows that for the majority of data sets, and for varying complexity settings, the DDMF representation performs better than the FG representation (differences larger than 0). Only for the Tafeng, Airline and 20news data are the differences sometimes not positive, indicating that for those complexity settings the average test fidelity of the rules with FG features is higher. In general, we can conclude from this plot that the findings of Table 3 hold for varying complexity settings, and that fidelity is generally higher for explanations with the DDMF representation than with the FG representation.

Figure 10 plots the average test fidelity against the maximum allowed explanation complexity for FG (10a) and DDMF (10b) explanation rules. We observe that, as one would expect, for all data sets fidelity generally increases when we increase the depth of the decision tree, or equivalently, the complexity of the explanation rules. Interestingly, for some data sets this fidelity-complexity trade-off is less severe. For example, for the 20news and Movielens100 data, the slopes of the curves are relatively flat. These results also indicate that in some cases there may not be much to gain by using a relatively “more complex” explanation. Therefore, once one is willing to trade off fidelity for complexity, in some cases one might as well choose an “extremely” simple explanation. For the 20news data, we already pointed out that f-fidelity is a more suitable measure than fidelity because of the class imbalance, which explains the marginal increase in fidelity when increasing the complexity.

Fig. 10  Average test fidelity of rules with (a) FG features and (b) DDMF for varying complexity settings (tree depths from 1 to 5)

6.3 How does the number of generated data-driven metafeatures (dimensionality reduction parameter k) impact fidelity and stability?

A key parameter in our metafeatures-based rule-extraction methodology is the dimensionality reduction parameter k. For DomainMF, this k is fixed. For DDMF, we have been selecting the value of k that maximizes the fidelity of the explanation rules on the validation data. As k defines the dimensionality of the space in which rule-extraction methods operate, and may therefore affect their performance, we also investigate to what extent the quality—both fidelity and stability—of rules extracted using DDMF depends on this parameter (see footnote 26). Although fidelity can be considered the most important evaluation metric, in practice one may wish to tune parameters such as k on a desired combination of fidelity, stability and accuracy (see footnote 27), depending on the context.
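A minimal sketch of this selection procedure, under simplifying assumptions, is given below: for each candidate k, metafeatures are generated (here with NMF as a stand-in), a surrogate tree is fit to the black-box's predicted labels, and the k with the highest validation fidelity is kept. `X_train`, `X_val` and `blackbox` are placeholders, and the candidate grid is hypothetical.

```python
# Hedged sketch of tuning the number of metafeatures k on validation fidelity.
from sklearn.decomposition import NMF
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def select_k(X_train, X_val, blackbox, candidate_ks=(10, 50, 100, 300, 700), depth=5):
    """Return the k whose DDMF-based surrogate tree best mimics the black-box."""
    yhat_train = blackbox.predict(X_train)
    yhat_val = blackbox.predict(X_val)
    best_k, best_fidelity = None, -1.0
    for k in candidate_ks:
        reducer = NMF(n_components=k, init="nndsvd", random_state=0)
        mf_train = reducer.fit_transform(X_train)
        mf_val = reducer.transform(X_val)
        surrogate = DecisionTreeClassifier(max_depth=depth, random_state=0)
        surrogate.fit(mf_train, yhat_train)
        fidelity = accuracy_score(yhat_val, surrogate.predict(mf_val))
        if fidelity > best_fidelity:
            best_k, best_fidelity = k, fidelity
    return best_k, best_fidelity
```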

Fig. 11  Average test fidelity and stability for rules with DDMF for varying values of k (number of metafeatures) and complexities for the Movielens100, Movielens1m, Airline, Facebook and Yahoomovies data sets

Fig. 12  Average test fidelity and stability for rules with DDMF for varying values of k (number of metafeatures) and complexities for the Tafeng, 20news, Libimseti and Flickr data sets

Figures 11 and 12 show the average fidelity and the stability for varying values of k and varying explanation complexities. Firstly, looking at the panels on the left, we observe that, for most data sets, fidelity increases with the number of metafeatures up to a certain point, after which it decreases again. This turnover point varies per data set and also depends on the complexity of the explanation (the tree depth). An important implication is therefore to select the optimal number of metafeatures on a separate validation set, as we do. Interestingly, fidelity behaves similarly to how (out-of-sample) accuracy typically behaves as model complexity increases: in both cases there is a form of “overfitting” to the black-box training data when too many metafeatures are included.

For stability (the panels on the right), on the other hand, we observe that the stability of the extracted rules generally decreases with higher values of k, especially when allowing for a larger explanation complexity. For example, the stability of rules with DDMF for \(k = 10\) is generally larger than for \(k = 700\). For a lower value of k, the dimensionality and sparsity of the metafeatures are lower, making the same metafeature more likely to appear in explanations from different bootstrap samples (as also explained in Sect. 6.1). Interestingly, these figures also show that there is a fidelity-stability trade-off: while fidelity generally increases (at first) with the number of metafeatures and the explanation complexity, stability does not. This may also affect the choice of the “optimal” number of generated metafeatures k, or, more generally, the parameter selection for any explanation methodology.

7 Conclusion

The fine-grained features that are typically observed in behavioral and textual data sets are of great value for predictive modeling. Feature selection and dimensionality reduction techniques that reduce the data to a smaller set of “metafeatures” have been shown in the literature to lead to lower accuracies for models mining these types of data (Junqué de Fortuny et al., 2013; Clark and Provost, 2015; De Cnudde et al., 2019). On the other hand, we have shown empirically, using a number of data sets and with Logistic Regression and Random Forest models as black-boxes, that these metafeatures are of great value for explaining the complex prediction models built on the fine-grained features. The results indicate that explanation rules extracted with data-driven metafeatures are better able to mimic the black-box models than rules extracted using the behavioral or textual features on which the model was trained. As such, metafeatures help to improve fidelity: concise rule sets that explain a large(r) percentage of the black-box’s predictions can be obtained.

Exploring when our metafeatures-based rule-extraction approach works best, our empirical results strongly indicate that the relative gain of using metafeatures to extract explanations is positively related to the sparsity of the most important fine-grained predictors in the model. When the black-box model is not characterized by the problems we try to address (high dimensionality, sparsity, and many relevant predictors for the classification task at hand), explanation rules with metafeatures will be as good as or worse than explanations with the original features. However, they can still provide the user with different types of insights into the model’s behavior that would not (as easily) be identified from rules extracted with the original data, rendering them valuable in this context as well.

Our empirical results also show important trade-offs between the quality measures of the explanation rules that we considered. For example, more complex explanations (larger rule sets) tend to lead to higher fidelity but lower stability. An interesting implication of our empirical findings is that one should carefully fine-tune the parameters of an explainability method, such as the number of generated metafeatures in our methodology, in order to obtain the desired trade-offs. In our case, increasing the number of generated metafeatures has been shown to result in lower stability of the extracted rules, whereas the impact on fidelity is less straightforward and depends on the data set and the complexity setting.

In this article, we mainly focused on the fidelity of explanation rules with respect to the black-box model. For future research, there are other important directions to explore for evaluating post-hoc explanations of prediction models: the computational cost of generating the explanations, the cost of having an explanation rule set with an accuracy that is lower than that of the black-box model, or the issue of presenting only one rule set as explanation while other rule sets with similar fidelity and accuracy might exist. Although these aspects are implicitly addressed in our article, a more qualitative study of how these “costs” are perceived by users is another interesting avenue for future research. On a methodological level, this study could spur future research on the use of other feature engineering techniques, such as embeddings, in metafeatures-based explanation rules. One interesting approach is to include the fidelity, stability, accuracy and complexity measures explicitly when constructing the metafeatures.

Finally, our metafeatures-based explanation approach for high-dimensional, sparse behavioral and textual data has important practical implications for any setting where such data are available and explainability is an important requirement, be it for model acceptance, validation, insight, or improvement. This article could therefore potentially lead to a wider use of valuable behavioral and textual data in domains such as marketing and fraud detection.