1 Introduction

Because of the growing diversity of data science applications, machine learning methods must adapt to a variety of complicated structured data, for which it is often difficult to obtain standard numerical vector representations of input objects. A standard approach to modeling structured data is to employ graphs. For example, graph-based representations are prevalent in domains such as chemo- and bio-informatics. In this study, we particularly focus on the case in which a data instance is represented as a pair of a graph and its associated class label.

Although numerous machine learning methods explicitly or implicitly depend on how differences between input objects are measured, defining an appropriate distance metric on graphs remains a challenging issue in the field. A widely accepted approach is the graph kernel (Gärtner et al. 2003; Vishwanathan et al. 2010), which makes it possible to apply machine learning methods to graph data without requiring explicit vector representations. Another popular approach is to use neural networks (Atwood and Towsley 2016; Narayanan et al. 2017), with which a suitable representation can be learned while avoiding the need to explicitly define a metric. However, in these approaches, it is difficult to create a metric that explicitly extracts significant sub-structures, i.e., subgraphs. Identifying discriminative subgraphs in an interpretable manner can be insightful for many graph classification tasks. In particular, graph representations are prevalent in scientific data analysis. For example, chemical compounds are often represented by graphs; thus, finding subgraphs that have a strong effect on a target label (e.g., toxicity) is informative. Other examples of graph representations are protein 3D structures and crystalline substances (e.g., Brinda and Vishveshwara 2005; Xie and Grossman 2018), where the automatic identification of important sub-structures is expected to provide insight into the correlation between structures and target labels. Further details of previous studies are discussed in Sect. 2.

We propose a supervised method that obtains a metric for graphs, thereby achieving both high predictive performance and interpretability. Our method, named interpretable graph metric learning (IGML), combines the concept of metric learning (e.g., Weinberger and Saul 2009; Davis et al. 2007) with a subgraph representation, in which each graph is represented by a set of its subgraphs. IGML optimizes a metric that assigns a weight \(m_{i(H)} \ge 0\) to each subgraph H contained in a given graph G. Let \(\phi _H(G)\) be a feature of the graph G that is monotonically non-decreasing with respect to the frequency of the subgraph H in G.

Note that we assume throughout the study that subgraphs are counted without overlapping vertices or edges. We consider the following squared distance between two graphs G and \(G'\):

$$\begin{aligned} d_{\varvec{m}}(G,G^\prime ) :=\sum _{H \in {{\mathcal {G}}}} m_{i(H)} \left( \phi _H(G) - \phi _H(G^\prime ) \right) ^2, \end{aligned}$$
(1)

where \({{\mathcal {G}}}\) is the set of all connected graphs. Although the subgraph approach is known to have strong graph representation capability (e.g., Gärtner et al. 2003), the naïve computation of (1) is obviously infeasible unless the weight parameters have some special structure.

We formulate IGML as a supervised learning problem of the distance function (1) using a pairwise loss function of metric learning (Davis et al. 2007) with a sparse penalty on \(m_{i(H)}\). The resulting optimization problem appears computationally infeasible at first glance, because the number of weight parameters is equal to the number of possible subgraphs, which is usually intractable. We overcome this difficulty by introducing safe screening (Ghaoui et al. 2010) and working set selection (Fan et al. 2008) approaches. Both of these approaches can significantly reduce the number of variables, and further, they can be combined with a pruning strategy on the tree traversal of graph mining. These optimization tricks are inspired by two recent studies (Nakagawa et al. 2016) and (Morvan and Vert 2018), which developed safe screening- and working set-based pruning, respectively, for a linear prediction model with the LASSO penalty. By combining these two techniques, we construct a path-wise optimization method that can obtain a sparse solution of the weight parameters \(m_{i(H)}\) without directly enumerating all possible subgraphs.

To the best of our knowledge, no previous studies can provide an interpretable subgraph-based metric learned in a supervised manner. The advantages of IGML can be summarized as follows:

  • Because IGML is formulated as a convex optimization problem, the global optimum can be found by standard gradient-based optimization.

  • The safe screening- and working set-based optimization algorithms make our problem practically tractable without sacrificing optimality.

  • We can identify a small number of important subgraphs that discriminate different classes. This implies that the resulting metric is easy to compute and highly interpretable, making it useful for a variety of subsequent data analyses. For example, applying nearest-neighbor classification or a decision tree in the learned space would be effective.

Moreover, we propose three extensions of IGML. First, we show that IGML is directly applicable to other structured data, such as itemset and sequence data. Second, its application to a triplet-based loss function is discussed. Third, we extend IGML so that similarity information on vertex-labels can be incorporated. We empirically verify that the prediction performance of IGML is superior or comparable to that of existing graph classification methods (most of which are not interpretable). We also show some examples of extracted subgraphs and data analyses on the learned metric space.

The remainder of this paper is organized as follows. In Sect. 2, we review previous studies on graph data analysis. In Sect. 3, we introduce the formulation of our proposed IGML. Section 4 discusses strategies to reduce the size of the IGML optimization problem. The detailed computational procedure of IGML is described in Sect. 5. Three extensions of IGML are presented in Sect. 6. Section 7 reports our empirical evaluation of the effectiveness of IGML on several benchmark datasets.

Note that this paper is an extended version of a preliminary conference paper (Yoshida et al. 2019a). The source code of the program used in our experiments is available at https://github.com/takeuchi-lab/Learning-Interpretable-Metric-between-Graphs.

2 Related work

Kernel-based approaches have been widely studied for graph data analysis, and they can provide a metric of graph data in a reproducing kernel Hilbert space. In particular, subgraph-based graph kernels are closely related to our study. The graphlet kernel (Shervashidze et al. 2009) creates a kernel through small subgraphs with only about 3–5 vertices, called graphlets. The neighborhood subgraph pairwise distance kernel (Costa and Grave 2010) selects pairs of subgraphs from a graph and counts the number of pairs identical to those in another graph. The subgraph matching kernel (Kriege and Mutzel 2012) identifies common subgraphs based on cliques in the product graph of two graphs. The feature space created by these subgraph-based kernels is easy to interpret. However, because the above approaches are unsupervised, it is fundamentally impossible to eliminate subgraphs that are unnecessary for a specific target classification task. Therefore, for example, to create the entire kernel matrix of the training data, all the candidate subgraphs in the data must be enumerated once, which becomes intractable even for small subgraphs. In contrast, we consider dynamically “pruning” unnecessary subgraphs through a supervised formulation of metric learning. As we demonstrate in our experimental results, this significantly reduces the enumeration cost, allowing our proposed algorithm to handle larger subgraphs than simple subgraph-based kernels.

There are many other kernels, including shortest-path (Borgwardt and Kriegel 2005), random-walk (Vishwanathan et al. 2010; Sugiyama and Borgwardt 2015; Zhang et al. 2018b), and spectrum-based (Kondor and Borgwardt 2008; Kondor et al. 2009; Kondor and Pan 2016; Verma and Zhang 2017) approaches. The Weisfeiler–Lehman (WL) kernel (Shervashidze and Borgwardt 2009; Shervashidze et al. 2011), which is based on the graph isomorphism test, is a popular and empirically successful kernel that has been employed in many studies (Yanardag and Vishwanathan 2015; Niepert et al. 2016; Narayanan et al. 2017; Zhang et al. 2018a). Again, all such approaches are unsupervised, and it is difficult to interpret their results from the perspective of sub-structures of a graph. Although several kernels deal with continuous attributes on vertices (Feragen et al. 2013; Orsini et al. 2015; Su et al. 2016; Morris et al. 2016), we focus only on the case where vertex-labels are discrete, because of the associated interpretability.

Because obtaining a good metric is an essential task in data analysis, metric learning has been extensively studied to date, as reviewed in (Li and Tian 2018). However, due to its computational difficulty, metric learning for graph data has not been widely studied. A few studies have considered edit distance approaches. For example, Bellet et al. (2012) presented a method for learning a similarity function through an edit distance in a supervised manner. Another approach probabilistically formulates the editing process of the graph and estimates the parameters using labeled data (Neuhaus and Bunke 2007). However, these approaches cannot provide any clear interpretation of the resulting metric in terms of subgraphs.

The deep neural network (DNN) is another standard approach to graph data analysis. The deep graph kernel (Yanardag and Vishwanathan 2015) incorporates neural language modeling, where decomposed sub-structures of a graph are regarded as sentences. PATCHY-SAN (Niepert et al. 2016) and DGCNN (Zhang et al. 2018a) convert a graph into a tensor by using the WL kernel and apply convolutions to it. Several other studies have also combined popular convolution techniques with graph data (Tixier et al. 2018; Atwood and Towsley 2016; Simonovsky and Komodakis 2017). These approaches are supervised, but the interpretability of such DNNs is relatively low. Attention enhances the interpretability of deep learning, but extracting important subgraphs is difficult because attention algorithms for graphs (Lee et al. 2018) only provide the significance of vertex transitions on a graph. Another related DNN approach is representation learning. For example, sub2vec (Adhikari et al. 2018) and graph2vec (Narayanan et al. 2017) can embed graph data into a continuous space, but they are unsupervised, and it is difficult to extract substructures that characterize different classes. There are also fingerprint learning methods for graphs based on neural networks (e.g., Duvenaud et al. 2015), in which the contribution of each node to each dimension of the fingerprint can be evaluated. Although such methods can highlight sub-structures for a given input graph, they do not produce important common subgraphs for prediction.

Supervised pattern mining (Cheng et al. 2008; Novak et al. 2009; Thoma et al. 2010) can be used to identify important subgraphs by enumerating patterns with some discriminative score. However, these approaches usually 1) employ a greedy strategy to add patterns, for which global optimality cannot be guaranteed, and 2) do not optimize a metric or representation. A few other studies (Saigo et al. 2009; Nakagawa et al. 2016) have considered optimizing a linear model on the subgraph features with the LASSO penalty using graph mining. A common idea of these two methods is to traverse a graph mining tree with pruning strategies derived from optimality conditions. Saigo et al. (2009) employed a boosting-based approach, which adds the subgraph that violates the optimality condition most severely at every iteration. It was shown that the maximum violation condition can be efficiently identified by pruning the tree without losing the optimality of the final solution. Nakagawa et al. (2016) derived a pruning criterion by extending safe screening (Ghaoui et al. 2010), which can safely eliminate unnecessary features before solving the optimization problem. This approach can also avoid enumerating the entire tree while guaranteeing optimality, and its efficiency compared with the boosting-based approach was demonstrated empirically, mainly because it requires far fewer tree traversals. Further, Morvan and Vert (2018) proposed a similar pruning extension of working set selection for optimizing a higher-order interaction model. Although that study did not target graph data, the technique is applicable to the same subgraph-based linear model as in (Saigo et al. 2009) and (Nakagawa et al. 2016). Working set selection is a heuristic feature subset selection strategy that has been widely used in machine learning algorithms, such as support vector machines (e.g., Hsu and Lin 2002). Unlike safe screening, this heuristic selection may eliminate necessary features in the middle of the optimization, but the optimality of the final solution can be guaranteed by iterating subset selection repeatedly until the solution converges. However, these methods can only optimize a linear prediction model. In this study, we focus on metric learning of graphs. Therefore, unlike the above-mentioned pruning-based learning methods, our aim is to learn a “distance function”. In metric learning, a distance function is typically learned from a loss function defined over relative relations between samples (usually, pairs or triplets), by which a discriminative feature space that is generally effective for subsequent tasks, such as classification and similarity-based retrieval, is obtained. Inspired by (Nakagawa et al. 2016) and (Morvan and Vert 2018), we derive screening and pruning rules for this setting, and further, we combine them to develop an efficient algorithm.

3 Formulation of interpretable graph metric learning

3.1 Optimization problem

Suppose that the training dataset \(\{(G_i,y_i)\}_{i \in [n]}\) consists of n pairs of a graph \(G_i\) and a class label \(y_i\), where \([n] :=\{1, \ldots , n\}\). Let \({{\mathcal {G}}}\) be the set of all connected subgraphs of \(\{ G_i \}_{i \in [n]}\). In each graph, vertices and edges can be labeled. If \(H \in {{\mathcal {G}}}\) is a connected subgraph of \(G \in {{\mathcal {G}}}\), we write \(H \sqsubseteq G\). Further, let \(\#(H \sqsubseteq G)\) be the frequency of the subgraph H in G. Note that we adopt a definition of frequency that does not allow any vertices or edges among the counted subgraphs to overlap. As a representation of a graph G, we consider the following subgraph-based feature representation:

$$\begin{aligned} \phi _H(G) = g \bigl ( \#(H\sqsubseteq G) \bigr ), \text { for } H \in {{\mathcal {G}}}, \end{aligned}$$
(2)

where g is some monotonically non-decreasing and non-negative function, such as the identity function \(g(x) = x\) or indicator function \(g(x) = 1_{x>0}\), which takes the value 1 if \(x > 0\), and 0 otherwise. It is widely known that subgraph-based features can effectively represent graphs. For example, \(g(x) = x\) allows all non-isomorphic graphs to be distinguished. A similar idea was shown in (Gärtner et al. 2003) for a frequency that allows overlaps. However, this feature space is practically infeasible because the possible number of subgraphs is prohibitively large.

We focus on how to measure the distance between two graphs, which is essential for a variety of machine learning problems. We consider the following weighted squared distance between two graphs:

$$\begin{aligned} d_{\varvec{m}}(G,G^\prime ) :=\sum _{H \in {{\mathcal {G}}}} m_{i(H)} \left( \phi _H(G) - \phi _H(G^\prime ) \right) ^2, \end{aligned}$$

where i(H) is the index of the subgraph H for a weight parameter \(m_{i(H)} \ge 0\). To obtain an effective and computable distance metric, we adaptively estimate \(m_{i(H)}\) such that only a small number of important subgraphs have non-zero \(m_{i(H)}\) values.

Let \(\varvec{x}_i \in {\mathbb {R}}^p\) be the feature vector defined by concatenating \(\phi _H(G_i)\) for all \(H \in {{\mathcal {G}}}\) included in the training dataset. Then, we have

$$\begin{aligned} d_{{\varvec{m}}}({\varvec{x}}_i, {\varvec{x}}_j) = ({\varvec{x}}_i-{\varvec{x}}_j)^\top \mathrm {diag}({\varvec{m}}) ({\varvec{x}}_i-{\varvec{x}}_j) ={\varvec{m}}^\top {\varvec{c}}_{ij}, \end{aligned}$$

where \({\varvec{m}}\in {\mathbb {R}}_+^p\) is a vector of \(m_{i(H)}\), and \({\varvec{c}}_{ij} \in {\mathbb {R}}^p\) is defined as \({\varvec{c}}_{ij}:=({\varvec{x}}_i-{\varvec{x}}_j)\circ ({\varvec{x}}_i-{\varvec{x}}_j)\) with the element-wise product \(\circ \).
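For concreteness, the following minimal sketch (NumPy; the toy feature vectors and weights are illustrative assumptions, not values from the paper) computes \({\varvec{c}}_{ij}\) and the weighted squared distance \({\varvec{m}}^\top {\varvec{c}}_{ij}\).

```python
import numpy as np

def c_ij(x_i, x_j):
    """Element-wise squared difference c_ij = (x_i - x_j) o (x_i - x_j)."""
    d = x_i - x_j
    return d * d

def distance(m, x_i, x_j):
    """Weighted squared distance d_m(x_i, x_j) = m^T c_ij, with m >= 0."""
    return float(m @ c_ij(x_i, x_j))

# Toy example with three (hypothetical) subgraph features.
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, 2.0])
m = np.array([0.5, 0.0, 1.0])        # sparse weights: the second subgraph is ignored
print(distance(m, x1, x2))           # 0.5*1 + 0.0*1 + 1.0*0 = 0.5
```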

Let \({\mathcal {S}}_i \subseteq [n]\) and \({\mathcal {D}}_i \subseteq [n]\) be the subsets of indices that are in the same and different classes to \(\varvec{x}_i\), respectively. For each of these sets, we select the K most similar inputs to \(\varvec{x}_i\) by using some default metric, such as the graph kernel (further details are presented in Sect. 3.2). As a loss function for \(\varvec{x}_i\), we consider

$$\begin{aligned} \ell _i(\varvec{m} ; L, U) :=\sum _{l\in {\mathcal {D}}_i}\ell _L({\varvec{m}}^\top {\varvec{c}}_{il})+\sum _{j\in {\mathcal {S}}_i}\ell _{-U}(-{\varvec{m}}^\top {\varvec{c}}_{ij}), \end{aligned}$$
(3)

where \(L, U \in {\mathbb {R}}_+\) are constant parameters satisfying \(U \le L\), and \(\ell _t(x) = [t-x]_+^2\) is the standard squared hinge loss function with threshold \(t \in {\mathbb {R}}\). This loss function is a variant of the pairwise loss functions used in metric learning (Davis et al. 2007). The first term in the loss function yields a penalty if \(\varvec{x}_i\) and \(\varvec{x}_l\) are closer than L for \(l \in {{\mathcal {D}}}_i\), and the second term yields a penalty if \(\varvec{x}_i\) and \(\varvec{x}_j\) are more distant than U for \(j \in {{\mathcal {S}}}_i\).
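For reference, a minimal sketch of the loss (3) under these definitions; the rows of C_D and C_S hold the vectors \({\varvec{c}}_{il}\) (\(l \in {{\mathcal {D}}}_i\)) and \({\varvec{c}}_{ij}\) (\(j \in {{\mathcal {S}}}_i\)), and all variable names are illustrative.

```python
import numpy as np

def sq_hinge(t, x):
    """Squared hinge loss with threshold t: [t - x]_+^2."""
    return max(t - x, 0.0) ** 2

def loss_i(m, C_D, C_S, L, U):
    """ell_i(m; L, U) = sum_l ell_L(m^T c_il) + sum_j ell_{-U}(-m^T c_ij)."""
    loss = sum(sq_hinge(L, float(m @ c)) for c in C_D)     # dissimilar pairs closer than L
    loss += sum(sq_hinge(-U, -float(m @ c)) for c in C_S)  # similar pairs farther than U
    return loss
```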

Let \(R({\varvec{m}})=\Vert {\varvec{m}}\Vert _1+\frac{\eta }{2}\Vert {\varvec{m}}\Vert _2^2={\varvec{m}}^\top \varvec{1}+\frac{\eta }{2}\Vert {\varvec{m}}\Vert _2^2\) be an elastic-net type sparsity-inducing penalty, where \(\eta \ge 0\) is a non-negative parameter. We define our proposed IGML (interpretable graph metric learning) as the following regularized loss minimization problem:

$$\begin{aligned} \begin{aligned} \min _{{\varvec{m}}\ge {\varvec{0}}} P_{\lambda }({\varvec{m}}) :=&\sum _{i\in [n]} \ell _i(\varvec{m} ; L, U) + \lambda R({\varvec{m}}), \end{aligned} \end{aligned}$$
(4)

where \(\lambda > 0\) is the regularization parameter. The solution of this problem can provide not only a discriminative metric but also insight into important subgraphs because the sparse penalty is expected to select only a small number of non-zero parameters.

Let \(\varvec{\alpha }\in {\mathbb {R}}^{2nK}_+\) be the vector of dual variables where \(\alpha _{il}\) and \(\alpha _{ij}\) for \(i \in [n], l \in {{\mathcal {D}}}_i\), and \(j \in {{\mathcal {S}}}_i\) are concatenated. The dual problem of (4) is written as follows (see Appendix A for derivation):

$$\begin{aligned} \max _{{\varvec{\alpha }}\ge \varvec{0}} D_\lambda ({\varvec{\alpha }}):=-\frac{1}{4}\Vert {\varvec{\alpha }}\Vert _2^2+{\varvec{t}}^\top {\varvec{\alpha }} -\frac{\lambda \eta }{2}\Vert {\varvec{m}}_{\lambda } ({\varvec{\alpha }})\Vert _2^2, \end{aligned}$$
(5)

where

$$\begin{aligned} {\varvec{m}}_\lambda ({\varvec{\alpha }}):=\frac{1}{\lambda \eta }[{\varvec{C}}{\varvec{\alpha }}-\lambda {\varvec{1}}]_+, \end{aligned}$$
(6)

\({\varvec{t}}:=[L,\ldots ,L,-U,\ldots ,-U]^\top \in {\mathbb {R}}^{2nK}\) and \({\varvec{C}}:=[\ldots ,{\varvec{c}}_{il},\ldots ,\) \(-{\varvec{c}}_{ij}, \ldots ] \in {\mathbb {R}}^{p\times 2nK}\). Then, from the optimality condition, we obtain the following relationship between the primal and dual variables:

$$\begin{aligned} \alpha _{il} =-\ell '_L({\varvec{m}}^\top {\varvec{c}}_{il}), ~ \alpha _{ij} =-\ell '_{-U}(-{\varvec{m}}^\top {\varvec{c}}_{ij}), \end{aligned}$$
(7)

where \(\ell _t'(x)=-2[t-x]_+\) is the derivative of \(\ell _t\). When the regularization parameter \(\lambda \) is larger than a certain \(\lambda _{\max }\), the optimal solution is \({\varvec{m}} = {\varvec{0}}\). Then, the optimal dual variables are \(\alpha _{il}=-\ell _L'(0)=2L\) and \(\alpha _{ij}=-\ell _{-U}'(0)=0\). By substituting these values into (6), we obtain \(\lambda _{\max }\) as

$$\begin{aligned} \lambda _{\max } = \max _{k}{\varvec{C}}_{k,:}{\varvec{\alpha }}. \end{aligned}$$
(8)
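Since \(\alpha _{ij} = 0\) at \({\varvec{m}} = {\varvec{0}}\), only the \({\varvec{c}}_{il}\) columns contribute to (8), and \(\lambda _{\max }\) reduces to a maximum over feature-wise sums; a small sketch (with an assumed array layout) follows.

```python
import numpy as np

def lambda_max(C_D, L):
    """lambda_max = max_k sum_{i, l in D_i} 2L * c_il[k]; the alpha_ij = 0 terms vanish.

    C_D: array of shape (#dissimilar pairs, p) stacking the vectors c_il row-wise."""
    return float(np.max(2.0 * L * C_D.sum(axis=0)))
```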

3.2 Selection of \({{\mathcal {S}}}_i\) and \({{\mathcal {D}}}_i\)

In the experiments reported later, we used the small value \(K = |{{\mathcal {S}}}_i| = |{{\mathcal {D}}}_i| = 10\) and used a graph kernel to select the samples in \({{\mathcal {S}}}_i\) and \({{\mathcal {D}}}_i\). Although we simply used a pre-determined kernel, selecting the kernel (or its parameter) through cross-validation beforehand is also possible. Using only a small number of neighbors is a common setting in metric learning. For example, Davis et al. (2007), a seminal work on the pairwise approach, used only \(20c^2\) pairs in total, where c is the number of classes. A small K setting has two aims. First, particularly for \({{\mathcal {S}}}_i\), it avoids adding pairs that are too far apart. Even for a pair of samples with the same label, forcing such a distant pair to be close may cause over-fitting (e.g., when one sample is an outlier). Second, a small K reduces the computational cost. Because the number of pairs is \(O(n^2)\), adding all of them into the loss term would require a large computational cost. In fact, these two issues arise not only in the pairwise formulation but also in other relative loss functions, such as the standard triplet loss, for which there exist \(O(n^3)\) triplets. One potential difficulty in selecting \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\) is the discrepancy between the initial and the optimal metric. The loss function is defined through \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\), which are selected based on the neighbors in the initial metric, but optimizing the metric may change the nearest neighbors of each sample. A possible remedy for this problem is to adaptively change \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\) in accordance with the updated metric (Takeuchi and Sugiyama 2011), though the optimality of this approach is still not known. To the best of our knowledge, this is still an open problem in metric learning, which we consider beyond the scope of this paper. In the experiments (Sect. 7), we show that a nearest-neighbor classifier in the metric learned with this heuristic selection of \({{\mathcal {D}}}_i\) and \({{\mathcal {S}}}_i\) shows better or comparable performance to standard graph classification methods, such as a graph neural network.

4 Creating a tractable sub-problem

Because the problems (4) and (5) are convex, any local solution is also globally optimal. However, naïvely solving these problems is computationally intractable because of the high dimensionality of \(\varvec{m}\). In this section, we introduce several useful rules for restricting candidate subgraphs while maintaining the optimality of the final solution. Note that the proofs of all the lemmas and theorems are provided in the appendix.

To make the optimization problem tractable, we work with only a small subset of features during the optimization process. Let \({{\mathcal {F}}}\subseteq [p]\) be a subset of features. By fixing \(m_i = 0\) for \(i \notin {{\mathcal {F}}}\), we define sub-problems of the original primal \(P_{\lambda }\) and dual \(D_{\lambda }\) problems as follows:

$$\begin{aligned}&\min _{{{\varvec{m}}}_{{\mathcal {F}}}\ge {\varvec{0}}} P_{\lambda }^{{\mathcal {F}}}({{\varvec{m}}}_{{\mathcal {F}}}) :=\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\ell _L({{\varvec{m}}}_{{\mathcal {F}}}^\top {{\varvec{c}}_{il}}_{{\mathcal {F}}}) +\sum _{j\in {\mathcal {S}}_i}\ell _{-U}(-{{\varvec{m}}}_{{\mathcal {F}}}^\top {{\varvec{c}}_{ij}}_{{\mathcal {F}}}) \right] +\lambda R({{\varvec{m}}}_{{\mathcal {F}}}), \end{aligned}$$
(9)
$$\begin{aligned}&\max _{{\varvec{\alpha }}\ge \varvec{0}} D_\lambda ^{{\mathcal {F}}}({\varvec{\alpha }}) :=-\frac{1}{4}\Vert {\varvec{\alpha }}\Vert _2^2+{\varvec{t}}^\top {\varvec{\alpha }} -\frac{\lambda \eta }{2}\Vert {\varvec{m}}_{\lambda } ({\varvec{\alpha }})_{{\mathcal {F}}}\Vert _2^2, \end{aligned}$$
(10)

where \(\varvec{m}_{{{\mathcal {F}}}}\), \(\varvec{c}_{ij{{\mathcal {F}}}}\), and \(\varvec{m}_{\lambda } ({\varvec{\alpha }})_{{\mathcal {F}}}\) are sub-vectors specified by \({{\mathcal {F}}}\). If the size of \({{\mathcal {F}}}\) is moderate, these sub-problems are computationally much easier to solve than the original problems.

We introduce several criteria that determine whether the feature k should be included in \({{\mathcal {F}}}\) using the techniques of safe screening (Ghaoui et al. 2010) and working set selection (Fan et al. 2008). A general form of our criteria can be written as

$$\begin{aligned} {\varvec{C}}_{k,:}{\varvec{q}}+r\Vert {\varvec{C}}_{k,:}\Vert _2 \le T, \end{aligned}$$
(11)

where \({\varvec{q}}\in {\mathbb {R}}_+^{2nK}\), \(r\ge 0\), and \(T \in {\mathbb {R}}\) are constants that assume different values for each criterion. If this inequality holds for k, we exclude the k-th feature from \({{\mathcal {F}}}\). An important property is that although our algorithm only solves these small sub-problems, we can guarantee the optimality of the final solution, as shown later.
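In code, the generic rule (11) for a single feature k is a one-line test; q, r, and T are supplied by whichever criterion (safe screening or working set) is being evaluated, and the function name is illustrative.

```python
import numpy as np

def exclude_feature(C_k, q, r, T):
    """Exclude feature k from F if C_{k,:} q + r * ||C_{k,:}||_2 <= T (rule (11))."""
    return float(C_k @ q) + r * float(np.linalg.norm(C_k)) <= T
```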

Fig. 1 Schematic illustration of a tree and features

However, selecting \({{\mathcal {F}}}\) itself is computationally expensive because the evaluation of (11) requires O(n) computations for each k. Thus, we exploit a tree structure of graphs for determining \({{\mathcal {F}}}\). Figure 1 shows an example of such a tree, which can be constructed by a graph mining algorithm, such as gSpan (Yan and Han 2002). Suppose that the k-th node corresponds to the k-th dimension of \(\varvec{x}\) (note that the node index here is not the visiting order). If the graph corresponding to the k-th node is a subgraph of that of the \(k'\)-th node, the node \(k'\) is a descendant of k, which is denoted as \(k' \supseteq k\). Then, the following monotonic relation is immediately derived from the monotonicity of \(\phi _H\):

$$\begin{aligned} x_{i,k'} \le x_{i,k} \text { if } k' \supseteq k. \end{aligned}$$
(12)

Because any parent node is a subgraph of its children in the gSpan tree (Fig. 1), the non-overlapping frequency \(\#(H \sqsubseteq G)\) of the subgraph H in G is monotonically non-increasing while descending the tree. Then, the condition (12) is obviously satisfied because, for a sequence \(H \sqsubseteq H' \sqsubseteq H'' \sqsubseteq \cdots \) along a descending path of the tree, \(x_{i,k(H)} = \phi _H(G_i) = g(\#(H \sqsubseteq G_i))\) is monotonically non-increasing, where \(x_{i,k(H)}\) is the feature corresponding to H in \(G_i\). Based on this property, the following lemma enables us to prune a node during the tree traversal.

Lemma 1

Let

$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r)&:=\sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}q_{il}\max \{x_{i,k},x_{l,k}\}^2 \nonumber \\&\quad +r\sqrt{\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\}^4 +\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}^4\right] } \end{aligned}$$
(13)

be a pruning criterion. Then, if the inequality

$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r) \le T \end{aligned}$$
(14)

holds, for any descendant node \(k' \supseteq k\), the following inequality holds:

$$\begin{aligned} {\varvec{C}}_{k',:}{\varvec{q}}+r\Vert {\varvec{C}}_{k',:}\Vert _2 \le T, \end{aligned}$$

where \({\varvec{q}}\in {\mathbb {R}}_+^{2nK}\) and \(r\ge 0\) are an arbitrary constant vector and scalar variable, respectively.

This lemma indicates that if the condition (14) is satisfied, we can say that none of the descendant nodes are included in \({{\mathcal {F}}}\). Assuming that the indicator function \(g(x) = 1_{x>0}\) is used in (2), a tighter bound can be obtained through the following lemma.

Lemma 2

If \(g(x) = 1_{x>0}\) is set in (2), the pruning criterion (13) can be replaced with

$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r)&:=\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, x_{i,k} \left[ \sum _{l\in {\mathcal {D}}_i}q_{il}-\sum _{j\in {\mathcal {S}}_i}q_{ij}(1-x_{j,k}) \right] \right\} \\&\quad +r\sqrt{\sum _{i\in [n]}\left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\} +\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}\right] }. \end{aligned}$$

By comparing the first terms of Lemmas 1 and 2, we see that Lemma 2 is tighter when \(g(x) = 1_{x>0}\) as follows:

$$\begin{aligned}&\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, x_{i,k} \left[ \sum _{l\in {\mathcal {D}}_i}q_{il}-\sum _{j\in {\mathcal {S}}_i}q_{ij}(1-x_{j,k})\right] \right\} \\&\quad \le \sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, x_{i,k} \sum _{l\in {\mathcal {D}}_i}q_{il}\right\} \\&\quad =\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, \sum _{l\in {\mathcal {D}}_i}q_{il}x_{i,k}\right\} \\&\quad \le \sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}\max \{q_{il}x_{l,k}, q_{il}x_{i,k}\}\\&\quad =\sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}q_{il}\max \{x_{l,k}, x_{i,k}\}. \end{aligned}$$
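For reference, a direct transcription of the pruning criterion (13) of Lemma 1; the data layout (a feature matrix X and neighbor index lists D and S) is an assumption for illustration.

```python
import numpy as np

def prune_bound(k, X, D, S, q_D, r):
    """Prune(k | q, r) from Lemma 1.

    X: (n, p) feature matrix; D[i] / S[i]: dissimilar / similar neighbors of sample i;
    q_D[(i, l)]: entry q_il of the non-negative vector q for l in D[i]."""
    first, inner = 0.0, 0.0
    for i in range(X.shape[0]):
        for l in D[i]:
            m2 = max(X[i, k], X[l, k]) ** 2
            first += q_D[(i, l)] * m2
            inner += m2 ** 2                      # max{x_ik, x_lk}^4
        for j in S[i]:
            inner += max(X[i, k], X[j, k]) ** 4
    return first + r * np.sqrt(inner)

# If prune_bound(k, ...) <= T, rule (11) holds for every descendant k' of k.
```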

A schematic illustration of the optimization algorithm for IGML is shown in Fig. 2 (for further details, see Sect. 5). To generate a subset of features \({\mathcal {F}}\), we first traverse the graph mining tree, during which the safe screening/working set selection procedures and their pruning extensions are performed (Step 1). Next, we solve the sub-problem (9) with the generated \({\mathcal {F}}\) using a standard gradient-based algorithm (Step 2). Safe screening is also performed during the optimization iterations in Step 2, which is referred to as dynamic screening. This further reduces the size of \({\mathcal {F}}\).

Fig. 2 Schematic illustration of the optimization algorithm for IGML

Before moving on to the detailed formulations, we summarize our rules for determining \({{\mathcal {F}}}\) in Table 1. The columns represent the different approaches to evaluating the necessity of features, i.e., the safe and working set approaches. For the safe approaches, there are further ‘single \(\lambda \)’ (described in Sect. 4.1.2) and ‘range of \(\lambda \)’ (described in Sect. 4.1.3) approaches. The single \(\lambda \) approach considers safe rules for a specific \(\lambda \), while the range of \(\lambda \) approach considers safe rules that can eliminate features for a range of \(\lambda \) (not just a specific value). Both the single and range approaches are based on bounds of the region in which the optimal solution exists, whose details are given in Sect. 4.1.1. The rows of Table 1 distinguish rules that remove one specific feature from rules that prune all features in a subtree.

Table 1 Strategies to determine \({{\mathcal {F}}}\)

4.1 Safe screening

Safe screening (Ghaoui et al. 2010) was first proposed to identify unnecessary features in LASSO-type problems. Typically, this approach considers a bounded region of the dual variables in which the optimal solution must exist. Then, we can eliminate dual inequality constraints that are never violated given that the solution exists in that region. The well-known Karush-Kuhn-Tucker (KKT) conditions show that this is equivalent to eliminating primal variables that take the value 0 at the optimal solution. In Sect. 4.1.1, we first derive a spherical bound for our optimal solution, and then in Sect. 4.1.2, a rule for safe screening is shown. Section 4.1.3 extends these rules to ones that are specifically useful for the regularization path calculation.

4.1.1 Sphere bound for optimal solution

The following theorem provides a hyper-sphere containing the optimal dual variable \(\varvec{\alpha }^\star \).

Theorem 1

(DGB) For any pair of \({\varvec{m}}\ge {\varvec{0}}\) and \({\varvec{\alpha }}\ge {\varvec{0}}\), the optimal dual variable \({\varvec{\alpha }}^\star \) must satisfy

$$\begin{aligned} \Vert {\varvec{\alpha }}-{\varvec{\alpha }}^\star \Vert _2^2\le 4(P_\lambda ({\varvec{m}})-D_\lambda ({\varvec{\alpha }})). \end{aligned}$$

This bound is called the duality gap bound (DGB), and the parameters \(\varvec{m}\) and \(\varvec{\alpha }\) used to construct the bound are referred to as the reference solution. This inequality reveals that the optimal \(\varvec{\alpha }^{\star }\) must lie inside the sphere whose center is the reference solution \(\varvec{\alpha }\) and whose radius is \(2 \sqrt{P_\lambda ({\varvec{m}})-D_\lambda ({\varvec{\alpha }})}\), i.e., twice the square root of the duality gap. Therefore, the better the quality of the reference solution \(\varvec{m}\) and \(\varvec{\alpha }\), the tighter the bound. When the duality gap is zero, meaning that \(\varvec{m}\) and \(\varvec{\alpha }\) are optimal, the radius shrinks to zero.
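In code, the DGB sphere requires only the current primal and dual objective values, which are assumed to be computable from the reference solution:

```python
import numpy as np

def dgb_sphere(alpha, primal_val, dual_val):
    """DGB (Theorem 1): center = reference alpha, radius = 2 * sqrt(P(m) - D(alpha))."""
    gap = max(primal_val - dual_val, 0.0)   # guard against small negative round-off
    return alpha.copy(), 2.0 * np.sqrt(gap)
```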

If the optimal solution for \(\lambda _0\) is available as a reference solution to construct the bound for \(\lambda _1\), the following bound, called regularization path bound (RPB), can be obtained.

Theorem 2

(RPB) Let \({\varvec{\alpha }}_0^\star \) be the optimal solution for \(\lambda _0\) and \({\varvec{\alpha }}_1^\star \) be the optimal solution for \(\lambda _1\). Then,

$$\begin{aligned} \left\| {\varvec{\alpha }}_1^\star -\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\| _2^2 \le \left\| \frac{\lambda _0-\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\| _2^2. \end{aligned}$$

This inequality indicates that the optimal dual solution for \(\lambda _1\) (\(\varvec{\alpha }_1^{\star }\)) must lie in the sphere whose center is \(\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \) and whose radius is \(\left\| \frac{\lambda _0-\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\| _2\). However, RPB requires the exact solution, which is difficult to obtain in practice due to numerical errors. The relaxed RPB (RRPB) extends RPB to incorporate an approximate solution as the reference solution.

Theorem 3

(RRPB) Assuming that \({\varvec{\alpha }}_0\) satisfies \(\Vert {\varvec{\alpha }}_0-{\varvec{\alpha }}_0^\star \Vert _2\le \epsilon \), the optimal solution \({\varvec{\alpha }}_1^\star \) for \(\lambda _1\) must satisfy

$$\begin{aligned} \left\| {\varvec{\alpha }}_1^\star -\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0\right\| _2 \le \left\| \frac{\lambda _0-\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0\right\| _2+\Bigl (\frac{\lambda _0+\lambda _1}{2\lambda _0}+\frac{|\lambda _0-\lambda _1|}{2\lambda _0}\Bigr )\epsilon . \end{aligned}$$

In Theorem 3, the reference \(\varvec{\alpha }_0\) is only assumed to be within a distance \(\epsilon \) of \(\varvec{\alpha }_0^\star \) instead of assuming that \(\varvec{\alpha }_0^\star \) itself is available. For example, \(\epsilon \) can be obtained using the DGB (Theorem 1).

Bounds similar to those derived here were previously considered for the triplet screening of metric learning on ordinary numerical data (Yoshida et al. 2018, 2019b). Here, we extend this idea to derive subgraph screening.

4.1.2 Safe screening and safe pruning rules

Theorems 1 and 3 identify the regions where the optimal solution exists using a current feasible solution \({\varvec{\alpha }}\). Further, from (6), when \({\varvec{C}}_{k,:}{\varvec{\alpha }}^\star \le \lambda \), we have \(m_k^\star =0\). This indicates that

$$\begin{aligned} \max _{{\varvec{\alpha }}\in {\mathcal {B}}}{\varvec{C}}_{k,:}{\varvec{\alpha }} \le \lambda \Rightarrow m_k^\star =0, \end{aligned}$$
(15)

where \({{\mathcal {B}}}\) is a region containing the optimal solution \({\varvec{\alpha }}^\star \), i.e., \({\varvec{\alpha }}^\star \in {\mathcal {B}}\). As we derived in Sect. 4.1.1, the sphere-shaped \({{\mathcal {B}}}\) can be constructed using feasible primal and dual solutions. By solving this maximization problem, we obtain the following safe screening (SS) rule.

Theorem 4

(SS Rule) If the optimal solution \({\varvec{\alpha }}^\star \) exists in the bound \({\mathcal {B}}=\{{\varvec{\alpha }} \mid \Vert {\varvec{\alpha }} -{\varvec{q}} \Vert _2^2\le r^2\}\), the following rule holds

$$\begin{aligned} {\varvec{C}}_{k,:}{\varvec{q}}+r\Vert {\varvec{C}}_{k,:}\Vert _2 \le \lambda \Rightarrow m_k^\star =0. \end{aligned}$$
(16)

Theorem 4 indicates that we can eliminate unnecessary features by evaluating the condition shown in (16). Here, the theorem is written in a general form; in practice, \(\varvec{q}\) and r are the center and radius of one of the sphere bounds, respectively. An important property of this rule is that it guarantees optimality, meaning that the sub-problems (9) and (10) have exactly the same optimal solution as the original problem if \({{\mathcal {F}}}\) is defined through this rule. However, it is still necessary to evaluate the rule for all p features, which is intractable. To avoid this problem, we derive a pruning strategy on the graph mining tree, which we call the safe pruning (SP) rule.

Theorem 5

(SP Rule) If the optimal solution \({\varvec{\alpha }}^\star \) is in the bound \({\mathcal {B}}=\{{\varvec{\alpha }} \mid \Vert {\varvec{\alpha }} -{\varvec{q}} \Vert _2^2\le r^2, {\varvec{q}}\ge \varvec{0}\}\), the following rule holds

$$\begin{aligned} \mathrm {Prune}(k|{\varvec{q}}, r)\le \lambda \Rightarrow m_{k'}^\star =0 \text { for } \forall k' \supseteq k. \end{aligned}$$
(17)

This theorem is a direct consequence of Lemma 1. If this condition holds for a node k during the tree traversal, the subtree below that node can be pruned. This means that we can safely eliminate unnecessary subgraphs even without enumerating them. Note that in this theorem \({{\mathcal {B}}}\) has an additional non-negativity constraint \(\varvec{q} \ge \varvec{0}\), but this is satisfied by all the bounds in Sect. 4.1.1 because of the non-negativity constraint in the dual problem.
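Combining the SS rule (16) and the SP rule (17), the feature subset \({{\mathcal {F}}}\) can be collected by a single depth-first traversal of the mining tree. The sketch below is schematic: the node object (with .feature_index and .children(), e.g., produced by gSpan) and the two rule predicates (which could wrap exclude_feature and prune_bound from the earlier sketches with \(T = \lambda \)) are assumptions.

```python
def traverse_safe(node, ss_excludes, sp_prunes, F):
    """Depth-first collection of the feature subset F.

    ss_excludes(k): True if the SS rule (16) removes feature k itself.
    sp_prunes(k):   True if the SP rule (17) removes the whole subtree below k."""
    k = node.feature_index
    if sp_prunes(k):            # safe pruning: skip k and all its descendants
        return
    if not ss_excludes(k):      # safe screening: keep k only if the rule fails
        F.add(k)
    for child in node.children():
        traverse_safe(child, ss_excludes, sp_prunes, F)
```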

4.1.3 Range-based safe screening and safe pruning

The SS and SP rules apply to a fixed \(\lambda \). The range-based extension identifies an interval of \(\lambda \) for which the satisfaction of SS/SP is guaranteed. This is particularly useful for the path-wise optimization or regularization path calculation, where the problem must be solved with a sequence of \(\lambda \). We assume that the sequence is sorted in descending order, as optimization algorithms typically start from the trivial solution \(\varvec{m} = \varvec{0}\). Let \(\lambda =\lambda _1\le \lambda _0\). By combining RRPB with the rule (16), we obtain the following theorem.

Theorem 6

(Range-based Safe Screening (RSS)) For any k, the following rule holds

$$\begin{aligned} \lambda _a\le \lambda \le \lambda _0 \Rightarrow m_k^\star =0, \end{aligned}$$
(18)

where

$$\begin{aligned} \lambda _a :=\frac{\lambda _0(2\epsilon \Vert {\varvec{C}}_{k,:}\Vert _2 +\Vert {\varvec{\alpha }}_0\Vert _2\Vert {\varvec{C}}_{k,:}\Vert _2 +{\varvec{C}}_{k,:}{\varvec{\alpha }}_0)}{2\lambda _0+\Vert {\varvec{\alpha }}_0\Vert _2\Vert {\varvec{C}}_{k,:}\Vert _2 -{\varvec{C}}_{k,:}{\varvec{\alpha }}_0} . \end{aligned}$$

This rule indicates that we can safely ignore \(m_k\) for \(\lambda \in [\lambda _a, \lambda _0]\), while if \(\lambda _a > \lambda _0\), the weight \(m_k\) cannot be removed by this rule. For SP, the range-based rule can also be derived from (17).

Theorem 7

(Range-based Safe Pruning (RSP)) For any \(k' \supseteq k\), the following pruning rule holds:

$$\begin{aligned} \lambda '_a :=\frac{\lambda _0(2\epsilon b+\Vert {\varvec{\alpha }}_0\Vert _2b+a)}{2\lambda _0+\Vert {\varvec{\alpha }}_0\Vert _2b-a}\le \lambda \le \lambda _0 \Rightarrow m_{k'}^\star =0, \end{aligned}$$
(19)

where

$$\begin{aligned} a&:=\sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}{\alpha _0}_{il}\max \{x_{l,k}, x_{i,k}\}^2,\\ b&:=\sqrt{\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\}^4+\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}^4 \right] }. \end{aligned}$$

This theorem indicates that, while \(\lambda \in [\lambda _a',\lambda _0]\), we can safely remove the entire subtree of k. Analogously, if the feature vector is generated from \(g(x) = 1_{x>0}\) (i.e., binary), the following theorem holds.

Theorem 8

(Range-Based Safe Pruning (RSP) for binary features) Assuming \(g(x) = 1_{x>0}\) in (2), a and b in Theorem 7 can be replaced with

$$\begin{aligned} a&:=\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}{\alpha _0}_{il}x_{l,k}, x_{i,k} \left[ \sum _{l\in {\mathcal {D}}_i}{\alpha _0}_{il}-\sum _{j\in {\mathcal {S}}_i}{\alpha _0}_{ij}(1-x_{j,k})\right] \right\} ,\\ b&:=\sqrt{\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\}+\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}\right] }. \end{aligned}$$

Because these constants a and b are derived from the tighter bound in Lemma 2, the obtained range becomes wider than the range in Theorem 7.

Once we calculate \(\lambda _a\) and \(\lambda '_a\) of (18) and (19) for some \(\lambda \), they are stored at each node of the tree. Subsequently, such \(\lambda _a\) and \(\lambda '_a\) can be used for the next tree traversal with different \(\lambda '\). If the conditions of (18) or (19) are satisfied, the node can be skipped (RSS) or pruned (RSP). Otherwise, we update \(\lambda _a\) and \(\lambda '_a\) by using the current reference solution.
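A direct transcription of \(\lambda _a\) in Theorem 6, which can be cached at each tree node as described above (the analogous \(\lambda '_a\) of Theorem 7 replaces \({\varvec{C}}_{k,:}{\varvec{\alpha }}_0\) and \(\Vert {\varvec{C}}_{k,:}\Vert _2\) by a and b):

```python
import numpy as np

def rss_lambda_a(C_k, alpha0, lam0, eps):
    """lambda_a of Theorem 6: m_k is safely zero for lambda in [lambda_a, lambda_0]."""
    ck = np.linalg.norm(C_k)
    a0 = np.linalg.norm(alpha0)
    inner = float(C_k @ alpha0)
    return lam0 * (2 * eps * ck + a0 * ck + inner) / (2 * lam0 + a0 * ck - inner)
```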

4.2 Working set method

Safe rules are strong rules in the sense that they can completely remove features; thus, they are sometimes too conservative to fully accelerate the optimization. In contrast, working set selection is a widely accepted heuristic approach to selecting a subset of features.

4.2.1 Working set selection and working set pruning

The working set (WS) method optimizes the problem with respect to only selected working set features. Then, if the optimality condition for the original problem is not satisfied, the working set is reselected and the optimization on the new working set restarts. This process iterates until optimality on the original problem is achieved.

Besides the safe rules, we use the following WS selection criterion, which is obtained directly from the KKT conditions:

$$\begin{aligned} {\varvec{C}}_{k,:}{\varvec{\alpha }}\le \lambda . \end{aligned}$$
(20)

If this inequality is satisfied, the k-th dimension is predicted to satisfy \(m_k^\star =0\). Hence, the working set is defined by

$$\begin{aligned} {\mathcal {W}}:=\{k\mid {\varvec{C}}_{k,:}{\varvec{\alpha }}>\lambda \}. \end{aligned}$$

Although \(m^\star _i = 0\) for \(i \notin {{\mathcal {W}}}\) is not guaranteed, the final convergence of the procedure is guaranteed by the following theorem.

Theorem 9

(Convergence of WS) Assume that there is a solver for the sub-problem (9) (or equivalently (10)) that returns the optimal solution for a given \({{\mathcal {F}}}\). The working set method, which alternates between optimizing the sub-problem with \({{\mathcal {F}}}= {{\mathcal {W}}}\) and updating \({{\mathcal {W}}}\), returns the optimal solution of the original problem in a finite number of steps.

However, here again, the inequality (20) needs to be evaluated for all features, which is computationally intractable.

The same pruning strategy as for SS/SP can be incorporated into working set selection. The criterion (20) is also a special case of (11), and Lemma 1 indicates that if the following inequality

$$\begin{aligned} \mathrm {Prune}_{\mathrm{WP}}(k) :=\mathrm {Prune}(k|{\varvec{\alpha }}, 0)\le \lambda , \end{aligned}$$

holds, then no \(k' \supseteq k\) is included in the working set. We refer to this criterion as working set pruning (WP).

4.2.2 Relation with safe rules

Note that for the working set method, we may need to update \({{\mathcal {W}}}\) multiple times, unlike in the safe screening approaches, as shown by Theorem 9. On the other hand, the working set method can usually exclude a larger number of features than the safe screening approaches. In fact, when the condition of the SS rule (16) is satisfied, the WS criterion (20) must likewise be satisfied. Because all the spheres (DGB, RPB and RRPB) contain the reference solution \(\varvec{\alpha }\), which is usually the current solution, the inequality

$$\begin{aligned} \varvec{C}_{k,:} \varvec{\alpha }\le \max _{\varvec{\alpha }' \in {{\mathcal {B}}}} \varvec{C}_{k,:} \varvec{\alpha }' \end{aligned}$$
(21)

holds, where \({{\mathcal {B}}}\) is a sphere created by DGB, RPB or RRPB. This indicates that when the SS rule excludes the k-th feature, the WS also excludes the k-th feature. However, to guarantee convergence, WS needs to be fixed until the sub-problem (9)–(10) is solved (Theorem 9). In contrast, the SS rule is applicable anytime during the optimization procedure without affecting the final optimality. This enables us to apply the SS rule even to the sub-problem (9)–(10), where \({{\mathcal {F}}}\) is defined by WS as shown in Step 2 of Fig. 2 (dynamic screening).

For the pruning rules, we first confirm the following two properties:

$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r)&\ge \mathrm{Prune}(k|\varvec{q},0), \\ \mathrm{Prune}(k| C \varvec{q},0)&= C ~ \mathrm{Prune}(k| \varvec{q},0), \end{aligned}$$

where \(\varvec{q} \in {\mathbb {R}}_+^{2 n K}\) is the center of the sphere, \(r \ge 0\) is the radius, and \(C \in {\mathbb {R}}\) is a constant. In the case of DGB, the center of the sphere is the reference solution \(\varvec{\alpha }\) itself, i.e., \(\varvec{q} = \varvec{\alpha }\). Then, the following relation holds between the SP criterion \(\mathrm{Prune}(k|\varvec{q},r)\) and WP criterion \(\mathrm{Prune}_{\mathrm{WP}}(k)\):

$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r) = \mathrm{Prune}(k|\varvec{\alpha }_0,r) \ge \mathrm{Prune}(k|\varvec{\alpha }_0,0) = \mathrm{Prune}_{\mathrm{WP}}(k). \end{aligned}$$

This once more indicates that when the SP rule is satisfied, the WP rule must be satisfied as well. When the RPB or RRPB sphere is used, the center of the sphere is \(\varvec{q} = \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha }_0\). Assuming that the solution for \(\lambda _0\) is used as the reference solution, i.e., \(\varvec{\alpha }= \varvec{\alpha }_0\), we obtain

$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r)&= \mathrm{Prune} \left( k|\frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha },r\right) \\&\ge \mathrm{Prune} \left( k|\frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha },0\right) \\&= \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \mathrm{Prune}(k| \varvec{\alpha },0) \\&= \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \mathrm{Prune}_{\mathrm{WP}}(k). \end{aligned}$$

Using this inequality, we obtain

$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r) - \mathrm{Prune}_{\mathrm{WP}}(k) \ge \frac{\lambda _1 - \lambda _0}{2 \lambda _0} \mathrm{Prune}_{\mathrm{WP}}(k). \end{aligned}$$

From this inequality, if \(\lambda _1 > \lambda _0\), then \(\mathrm{Prune}(k|\varvec{q},r) > \mathrm{Prune}_{\mathrm{WP}}(k)\) (note that \(\mathrm{Prune}_{\mathrm{WP}}(k) \ge 0\) because \(\varvec{\alpha }\ge \varvec{0}\)), indicating that the pruning of WS is always tighter than that of the safe rule. However, in our algorithm presented in Sect. 5, \(\lambda _1 < \lambda _0\) holds because we start from a larger value of \(\lambda \) and gradually decrease it. Then, this inequality does not hold, and \(\mathrm{Prune}(k|\varvec{q},r) < \mathrm{Prune}_{\mathrm{WP}}(k)\) becomes a possibility.

When the WS and WP rules are strictly tighter than the SS and SP rules, respectively, using both of WS/WP and SS/SP rules is equivalent to using WS/WP only (except for dynamic screening). Even in this case, the range-based safe approaches (the RSS and RSP rules) can still be effective. When the range-based rules are evaluated, we obtain the range of \(\lambda \) such that the SS or SP rule is satisfied. Thus, as long as \(\lambda \) is in that range, we do not need to evaluate any safe or working set rules.

5 Algorithm and computations

5.1 Training with path-wise optimization

We employ path-wise optimization (Friedman et al. 2007), in which the optimization starts from \(\lambda = \lambda _{\max }\) and \(\lambda \) is gradually decreased while optimizing \(\varvec{m}\). As can be seen from (8), \(\lambda _{\max }\) is defined by the maximum of the inner products \(\varvec{C}_{k,:}\varvec{\alpha }\). This value can also be found by a tree search with pruning. Suppose that we calculate \(\varvec{C}_{k,:}\varvec{\alpha }\) while traversing the tree and that \({\hat{\lambda }}_{\max }\) is the current maximum value during the traversal. Using Lemma 1, we can derive the pruning rule

$$\begin{aligned} \mathrm {Prune}(k|{\varvec{\alpha }},0)\le {\hat{\lambda }}_{\max }. \end{aligned}$$

If this condition holds, the descendant nodes of k cannot attain the maximum, and thus we can identify \(\lambda _{\max }\) without calculating \(\varvec{C}_{k,:}\varvec{\alpha }\) for all candidate features.

Algorithm 1 shows the outer loop of our path-wise optimization. The TRAVERSE and SOLVE functions in Algorithm 1 are shown in Algorithms 2 and 3, respectively. Algorithm 1 first calculates \(\lambda _{\max }\), which is the minimum \(\lambda \) at which the optimal solution is \(\varvec{m}^{\star } = \varvec{0}\) (line 3). The outer loop in lines 5-14 decreases \(\lambda \) by the decreasing rate R. The TRAVERSE function in line 7 determines the subset of features \({{\mathcal {F}}}\) by traversing the tree with the SS and WS rules. The inner loop (lines 9-13) alternately solves the optimization problem with the current \({{\mathcal {F}}}\) and updates \({{\mathcal {F}}}\) until the duality gap becomes less than the given threshold \(\mathrm{eps}\).

Algorithm 2 shows the TRAVERSE function, which recursively visits tree nodes to determine \({{\mathcal {F}}}\). The variable node.pruning contains \(\lambda '_a\) of RSP, and if the RSP condition (19) is satisfied (line 3), the function returns the current \({{\mathcal {F}}}\) (the node is pruned). The variable node.screening contains \(\lambda _a\) of RSS, and if the RSS condition (18) is satisfied (line 5), this node can be skipped, and the function proceeds to the next node. If these two conditions are not satisfied, the function 1) updates node.pruning and node.screening if update is true, and 2) evaluates the conditions of RSP and WP (line 10), and RSS and WS (line 14). In lines 17-18, gSpan expands the children of the current node, and for each child node, the TRAVERSE function is called recursively.

Algorithm 3 shows a solver for the primal problem with the subset of features \({{\mathcal {F}}}\). Although we employ a simple projected gradient algorithm, any optimization algorithm can be used in this step. In lines 7-10, the SS rule is evaluated every \(\mathrm{freq}\) iterations. Note that this SS is only for the sub-problems (9) and (10) created by the current \({{\mathcal {F}}}\) (not for the original problems).

Algorithm 1 Path-wise optimization of IGML (outer loop)
Algorithm 2 The TRAVERSE function (tree traversal with RSS/RSP and WS/WP)
Algorithm 3 The SOLVE function (projected gradient solver with dynamic screening)
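The following is a compact sketch of the outer loop of Algorithm 1; the traverse, solve, and duality_gap callables stand for Algorithms 2 and 3 and the gap computation, and all names and default values are illustrative.

```python
def pathwise_igml(tree, lam_max, m0, alpha0, traverse, solve, duality_gap,
                  R=0.9, n_steps=50, eps=1e-6):
    """Path-wise optimization (outer loop of Algorithm 1), schematically."""
    lam, m, alpha = lam_max, m0, alpha0            # m0 = 0 is optimal at lambda_max
    path = []
    for _ in range(n_steps):
        lam *= R                                   # decrease the regularization parameter
        while True:
            F = traverse(tree, lam, m, alpha)      # Step 1: RSS/RSP and WS/WP rules
            m, alpha = solve(F, lam, m)            # Step 2: solve the sub-problem on F
            if duality_gap(m, alpha, lam) < eps:   # check optimality on the full problem
                break
        path.append((lam, m))
    return path
```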

5.2 Enumerating subgraphs for test data

To obtain a feature vector for test data, we only need to enumerate the subgraphs with \(m_k\ne 0\). When gSpan is used as the mining algorithm, a unique code, called the minimum DFS code, is assigned to each node. If the DFS code of a node is \((a_1, a_2, \ldots , a_n)\), a child node is represented by \((a_1, a_2, \ldots , a_n, a_{n+1})\). This enables us to prune nodes that cannot generate subgraphs with \(m_k\ne 0\). Suppose that a subgraph \((a_1, a_2, a_3) = (x, y, z)\) must be enumerated, and that we are currently at the node \((a_1) = (x)\). Then, a child with \((a_1, a_2) = (x, y)\) should be traversed, but a child with \((a_1, a_2) = (x, w)\) cannot generate (x, y, z), and consequently we can stop traversing this node.
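A sketch of this prefix-based pruning; DFS codes are represented as tuples, `targets` holds the minimum DFS codes of the subgraphs with \(m_k \ne 0\), and the node interface is an assumption.

```python
def enumerate_selected(node, targets, found):
    """Visit only branches that can still generate a subgraph with m_k != 0."""
    code = node.code                            # tuple (a_1, ..., a_n)
    if not any(t[:len(code)] == code for t in targets):
        return                                  # no target extends this DFS code: prune
    if code in targets:
        found.add(code)
    for child in node.children():               # children have codes (a_1, ..., a_n, a_{n+1})
        enumerate_selected(child, targets, found)
```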

5.3 Post-processing

5.3.1 Learning mahalanobis distance for selected subgraphs

Instead of \(\varvec{m}\), the following Mahalanobis distance can be considered

$$\begin{aligned} d_{\varvec{M}}(\varvec{x}_i,\varvec{x}_j) = (\varvec{x}_i - \varvec{x}_j)^\top \varvec{M} (\varvec{x}_i - \varvec{x}_j), \end{aligned}$$
(22)

where \(\varvec{M}\) is a positive definite matrix. Directly optimizing \(\varvec{M}\) requires \(O(p^2)\) primal variables and a semi-definite constraint, making the problem computationally expensive even for relatively small p. Thus, as optional post-processing, we consider optimizing the Mahalanobis distance (22) over the small number of subgraphs selected by the optimized \(\varvec{m}\). Let \({{\mathcal {H}}}\subseteq {{\mathcal {G}}}\) be the set of subgraphs H with \(m_{i(H)} > 0\), and let \(\varvec{z}_i\) be an \(h :=|{{\mathcal {H}}}|\) dimensional feature vector consisting of \(\phi _H(G_i)\) for \(H \in {{\mathcal {H}}}\). For \(\varvec{M} \in {\mathbb {R}}^{h \times h}\), we consider the following metric learning problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{M} \succeq \varvec{O}}&\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\ell _L(d_{\varvec{M}}(\varvec{z}_i,\varvec{z}_l)) +\sum _{j\in {\mathcal {S}}_i}\ell _{-U}(-d_{\varvec{M}}(\varvec{z}_i,\varvec{z}_j))\right] + \lambda R(\varvec{M}). \end{aligned} \end{aligned}$$

Above, \(R: {\mathbb {R}}^{h \times h} \rightarrow {\mathbb {R}}\) is a regularization term for \(\varvec{M}\), where a typical setting is \(R(\varvec{M}) = \mathrm{tr} \varvec{M} + \frac{\eta }{2} \Vert \varvec{M} \Vert _F^2\), with \(\mathrm{tr}\) denoting the trace of a matrix. This metric can be more discriminative, because it is fitted to the training data with a higher degree of freedom.

5.3.2 Vector representation of a graph

An explicit vector representation of an input graph can be obtained using optimized \({\varvec{m}}\) as follows:

$$\begin{aligned} {\varvec{x}}_i'=\sqrt{{\varvec{m}}}\circ {\varvec{x}}_i \end{aligned}$$
(23)

Unlike the original \(\varvec{x}_i\), the new representation \(\varvec{x}_i'\) is computationally tractable because of the sparsity of \(\varvec{m}\), and at the same time, this space should be highly discriminative. This property is beneficial for further analysis of the graph data. We show an example of applying a decision tree to the learned space later in the paper.

In the case of the general Mahalanobis distance given in Sect. 5.3.1, we can obtain a further transformation. Let \(\varvec{M} = \varvec{V} \varvec{\Lambda }\varvec{V}^\top \) be the eigenvalue decomposition of the learned \(\varvec{M}\). By employing the regularization term \(R(\varvec{M}) = \mathrm{tr} \varvec{M} + \frac{\eta }{2} \Vert \varvec{M} \Vert _F^2\), some of the eigenvalues of \(\varvec{M}\) can be shrunk to 0 because \(\mathrm{tr} \varvec{M}\) is equal to the sum of the eigenvalues. If \(\varvec{M}\) has \(h' < h\) non-zero eigenvalues, \(\varvec{\Lambda }\) can be written as an \(h' \times h'\) diagonal matrix, and \(\varvec{V}\) is an \(h \times h'\) matrix whose columns are the eigenvectors of the non-zero eigenvalues. Then, a representation of the graph is

$$\begin{aligned} \sqrt{\varvec{\Lambda }} \varvec{V}^\top \varvec{z}_i. \end{aligned}$$
(24)

This can be considered as a supervised dimensionality reduction from the h- to the \(h'\)-dimensional space. Although each dimension no longer corresponds to a single subgraph in this representation, the interpretation remains clear because each dimension of the transformed vector is simply a linear combination of the elements of \(\varvec{z}_i\).
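A sketch of the transformation (24) given a learned positive semi-definite \(\varvec{M}\); eigenvalues below a small tolerance are treated as zero.

```python
import numpy as np

def low_dim_representation(M, Z, tol=1e-10):
    """Map each row z_i of Z to sqrt(Lambda) V^T z_i using the non-zero eigenpairs of M."""
    eigvals, eigvecs = np.linalg.eigh(M)   # M = V Lambda V^T (M symmetric PSD)
    keep = eigvals > tol                   # h' non-zero eigenvalues
    V = eigvecs[:, keep]                   # (h, h')
    return (Z @ V) * np.sqrt(eigvals[keep])
```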

6 Extensions

In this section, we consider three extensions of IGML: applications to other data types, employing a triplet loss function, and introducing vertex-label similarity.

6.1 Application to other structured data

In addition to graph data, the proposed method can be applied to itemset and sequence data. For an itemset, the Jaccard index, defined as the size of the intersection of two sets divided by the size of their union, is the most popular similarity measure. Although a few studies have considered kernels for itemsets (Zhang et al. 2007), to the best of our knowledge, it remains difficult to adapt a metric to a given labeled dataset in an interpretable manner. In contrast, there are many kernel approaches for sequence data. The spectrum kernel (Leslie et al. 2001) creates a kernel matrix by enumerating all k-length subsequences in the given sequence. The mismatch kernel (Leslie et al. 2004) enumerates subsequences allowing m discrepancies in a pattern of length k. The gappy kernel (Leslie and Kuang 2004; Kuksa et al. 2008) counts the number of k-mers (subsequences) with a certain number of gaps g that appear in the sequence. These kernels require the hyperparameter k to be specified, although subsequences of various lengths may in fact be relevant. The motif kernel (Zhang and Zaki 2006; Pissis et al. 2013; Pissis 2014) counts the number of “motifs” appearing in the input sequences, where the motifs must be specified by the user. Because these approaches are based on the idea of the ‘kernel’, they are unsupervised, unlike our approach.

By employing an approach similar to that for graph input, we can construct a feature representation \(\phi _H(X_i)\) for both itemset and sequence data. For itemset data, the i-th input is a set of items \(X_i \subseteq {{\mathcal {I}}}\), where \({{\mathcal {I}}}\) is the set of all items, e.g., \(X_1 = \{ a, b \}, X_2 = \{ b, c, e \}, \ldots \) with the candidate items \({{\mathcal {I}}}= \{a, b, c, d, e\}\). The feature \(\phi _H(X_i)\) is defined by \(1_{H \subseteq X_i}\) for \(\forall H \subseteq {{\mathcal {I}}}\). This feature also has the monotonicity \(\phi _{H^\prime }(X_i) \le \phi _{H}(X_i)\) for \(H^\prime \supseteq H\). For sequence data, the i-th input \(X_i\) is a sequence of items, and the feature \(\phi _H(X_i)\) is defined from the frequency of a sub-sequence H in the given \(X_i\). For example, if \(X_i = \langle b, b, a, b, a, c, d \rangle \) and \(H = \langle b, a \rangle \), then H occurs twice in \(X_i\). The monotonicity property is again guaranteed because \(\phi _{H^\prime }(X_i) \le \phi _{H}(X_i)\) when H is a sub-sequence of \(H^\prime \). Because of these monotonicity properties, we can apply the same pruning procedures to both itemset and sequence data. Figure 3 shows examples of trees that can be constructed by itemset and sequence mining algorithms (Agrawal et al. 1994; Pei et al. 2001); a small sketch of these features follows the figure.

Fig. 3 Schematic illustrations of trees and features for (A) itemset and (B) sequence data
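
As a concrete illustration, the following sketch computes \(\phi _H\) for a toy itemset and a toy sequence. The sequence count is implemented as the number of contiguous occurrences, which is one plausible reading consistent with the example in the text and preserves the monotonicity used for pruning.

```python
def itemset_feature(H, X):
    """phi_H(X) for itemset data: 1 if the pattern H is a subset of the itemset X."""
    return 1 if set(H) <= set(X) else 0

def sequence_feature(H, X):
    """phi_H(X) for sequence data: number of contiguous occurrences of pattern H in X,
    consistent with the example in the text (<b,a> occurs twice in <b,b,a,b,a,c,d>)
    and monotonically non-increasing under pattern growth."""
    h = len(H)
    return sum(1 for s in range(len(X) - h + 1) if list(X[s:s + h]) == list(H))

# monotonicity check: growing the pattern can only decrease the feature value
X = ['b', 'b', 'a', 'b', 'a', 'c', 'd']
print(sequence_feature(['b', 'a'], X))               # 2
print(sequence_feature(['b', 'a', 'c'], X))          # 1  (<= 2)
print(itemset_feature({'b', 'c'}, {'b', 'c', 'e'}))  # 1
```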

6.2 Triplet loss

We formulated the loss function of IGML using the pairwise loss (3). Triplet loss functions are also widely used in metric learning (e.g., Weinberger and Saul 2009):

$$\begin{aligned} \sum _{(i,j,l)\in {\mathcal {T}} }\ell _1({\varvec{m}}^\top {\varvec{c}}_{il}-{\varvec{m}} ^\top {\varvec{c}}_{ij}), \end{aligned}$$

where \({{\mathcal {T}}}\) is an index set of triplets (i, j, l) satisfying \(y_i=y_j, y_i\ne y_l\). This loss incurs a penalty when the distance between samples in the same class is larger than the distance between samples in different classes. Because the loss is defined on a ‘triplet’ of samples, this approach can be more time-consuming than the pairwise approach. In exchange, the relative evaluation \(d_{{\varvec{m}}}({\varvec{x}}_i,{\varvec{x}}_j) < d_{{\varvec{m}}}({\varvec{x}}_i,{\varvec{x}}_l)\) (the j-th sample must be closer to the i-th sample than the l-th sample) can capture higher-order relations among input objects, rather than merely penalizing pairwise distances.

A pruning rule can be derived even for the case of triplet loss. By defining \({\varvec{c}}_{ijl}:={\varvec{c}}_{il}-{\varvec{c}}_{ij}\), the loss function can be written as

$$\begin{aligned} \sum _{(i,j,l)\in {\mathcal {T}} }\ell _1({\varvec{m}}^\top {\varvec{c}}_{ijl}). \end{aligned}$$

Because this has the same form as the pairwise loss with \(L=1\) and \(U=0\), the optimization problem reduces to the same form as in the pairwise case. A slight modification of Lemma 1 is required because the constant coefficients change (i.e., from \({\varvec{c}}_{ij}\) to \({\varvec{c}}_{ijl}\)). Equation (13) is changed to

$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r):=\sum _{(i,j,l)\in {{\mathcal {T}}}}q_{ijl}\max \{x_{i,k},x_{l,k}\}^2+r\sqrt{\sum _{ijl}\max \{x_{i,k},x_{l,k}\}^4 }. \end{aligned}$$
(25)

This is easily proven using

$$\begin{aligned} c_{ijl, k'}=(x_{i,k'}-x_{l,k'})^2-(x_{i,k'}-x_{j,k'})^2\le \max \{x_{i,k}, x_{l,k}\}^2, \forall k' \supseteq k, \end{aligned}$$

which is an immediate consequence of the monotonicity inequality (12).
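
A minimal sketch of the resulting bound (25) is given below. It assumes the dual values \(q_{ijl}\) and the radius r of the dual ball are already available, as in the pairwise case of Lemma 1 (not reproduced here); the names and the data layout are illustrative.

```python
import numpy as np

def prune_triplet(k, q, r, X, triplets):
    """Upper bound Prune(k | q, r) of Eq. (25) for the triplet loss (a sketch).

    k        : column index of the current pattern in the feature matrix X
    q        : dict mapping a triplet (i, j, l) to its dual value q_{ijl}
    r        : radius of the dual feasible ball (assumed precomputed)
    X        : n x d feature matrix whose columns are monotone along the mining tree
    triplets : list of (i, j, l) index triplets
    """
    b = np.array([max(X[i, k], X[l, k]) ** 2 for (i, j, l) in triplets])
    qv = np.array([q[t] for t in triplets])
    # if this upper bound is small enough (below the screening threshold of
    # Lemma 1, not shown here), pattern k and all its descendants can be skipped
    return float(qv @ b + r * np.sqrt((b ** 2).sum()))
```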

Fig. 4 Dissimilarity matrix among vertex-labels

6.3 Considering vertex-label similarity

Because IGML is based on the exact matching of subgraphs to create the feature \(\phi _H(G)\), it is difficult to provide a prediction for a graph that does not exactly match many of the selected subgraphs. Typically, this happens when the test dataset has a different distribution of vertex-labels. For example, when predicting on a group of chemical compounds whose atomic compositions differ largely from those in the training dataset, exact matches cannot be expected as often as on the training data. Therefore, we consider incorporating similarity/dissimilarity information of graph vertex-labels to relax this exact-matching constraint. A toy example of vertex-label dissimilarity is shown in Fig. 4: the ‘red’ label is similar to the ‘green’ label, while it is dissimilar to the ‘yellow’ label. Such a table can be created using prior domain knowledge (e.g., chemical properties of atoms), and even when no prior information is available, a similarity matrix can be inferred using any embedding method (e.g., Huang et al. 2017).

Fig. 5 Re-labeling and inclusion relationship. \((X, [\,])\) is abbreviated as X, where \(X\in \{A,B,C\}\). a In each step, all vertices are re-labeled by combining a vertex-label and neighboring labels at the previous step. b Example of inclusion relationship, defined by (26) and (27). The relation \(L_P(u_2,3) \sqsubseteq L_G(v_2,3)\) is satisfied between \(u_2\) and \(v_2\) at Step 3

Because it is difficult to directly incorporate similarity information into our subgraph isomorphism-based feature \(\phi _H(G)\), we first introduce a relaxed evaluation of inclusion of a graph P in a given graph G. We assume that P is obtained from the gSpan tree of the training data. Our approach is based on the idea of ‘re-labeling’ graph vertex-labels in the Weisfeiler-Lehman (WL) kernel (Shervashidze et al. 2011), which is a well-known graph kernel with an approximate graph isomorphism test. Figure 5a shows an example of the re-labeling procedure, which is performed in a fixed number of recursive steps. The number of steps is denoted as T (\(T = 3\) in the figure) and is assumed to be pre-specified. In step h, each graph vertex v has a level h hierarchical label \(L_G(v,h) :=(F^{(h)}, S^{(h)} = [S^{(h)}_1, \ldots , S^{(h)}_n])\), where \(F^{(h)}\) is recursively defined by the level \(h - 1\) hierarchical label of the same vertex, i.e., \(F^{(h)} = L_G(v,h-1)\), and \(S^{(h)}\) is a multiset created by the level \(h - 1\) hierarchical labels \(L_G(v',h-1)\) from all neighboring vertices \(v'\) connected to v. Note that a multiset, denoted by ‘[, ]’, is a set where duplicate elements are allowed. For example, in the graph G shown on the right side of Fig. 5a, the hierarchical label of the vertex \(v_1\) on level \(h = 3\) is \(L_G(v_1,3) = ((A, [B]), [(B, [A,C])])\). In this case, \(F^{(3)} = (A, [B])\), which is equal to \(L_G(v_1,2)\), and \(S^{(3)}_1 = (B, [A,C])\), which is equal to \(L_G(v_2,2)\). The original label A can also be regarded as a hierarchical label \((A,[\,])\) on the level \(h = 1\), but it is shown as ‘A’ for simplicity.
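
The re-labeling procedure can be sketched as follows. The code assumes a simple dictionary-based graph representation with multisets stored as sorted tuples, and it reproduces the example \(L_G(v_1,3) = ((A, [B]), [(B, [A,C])])\) from the text on a toy path graph; names and data structures are illustrative.

```python
def wl_relabel(adj, labels, T):
    """Compute hierarchical labels L_G(v, h) for h = 1, ..., T (as in Fig. 5a).

    adj    : dict mapping each vertex to a list of its neighboring vertices
    labels : dict mapping each vertex to its original vertex-label
    Returns a list 'levels' where levels[h-1][v] is the level-h label of v,
    represented as (F, S) with S a sorted tuple standing for the multiset.
    """
    current = {v: (labels[v], ()) for v in adj}        # level 1: (original label, [ ])
    levels = [current]
    for _ in range(2, T + 1):
        nxt = {}
        for v in adj:
            S = tuple(sorted((current[u] for u in adj[v]), key=repr))
            nxt[v] = (current[v], S)                   # F^(h) = L_G(v, h-1), S^(h) = neighbor labels
        current = nxt
        levels.append(current)
    return levels

# toy path graph v1(A) -- v2(B) -- v3(C)
adj = {'v1': ['v2'], 'v2': ['v1', 'v3'], 'v3': ['v2']}
labels = {'v1': 'A', 'v2': 'B', 'v3': 'C'}
print(wl_relabel(adj, labels, T=3)[2]['v1'])
# nested tuple corresponding to ((A, [B]), [(B, [A, C])])
```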

We define a relation of the inclusion ‘\(\sqsubseteq \)’ between two hierarchical labels \(L_{P}(u,h) = (F^{(h)}, S^{(h)} = [S^{(h)}_1, \ldots , S^{(h)}_m])\) and \(L_{G}(v,h) = (F^{\prime (h)}, S^{\prime (h)} = [S^{\prime (h)}_1, \ldots , S^{\prime (h)}_n])\), which originate from the two vertices u and v in graphs P and G, respectively. We say that \(L_{P}(u,h)\) is included in \(L_{G}(v,h)\) and denote this by

$$\begin{aligned} L_{P}(u,h) \sqsubseteq L_{G}(v,h) \end{aligned}$$
(26)

when the following recursive condition is satisfied:

$$\begin{aligned} L_{P}(u,h) \sqsubseteq L_{G}(v,h) \;:\Leftrightarrow \; {\left\{ \begin{array}{ll} F^{(h)} = F^{\prime (h)}, &{} \text {if } S^{(h)} = S^{\prime (h)} = [\,], \qquad \text {(27a)} \\ F^{(h)} \sqsubseteq F^{\prime (h)} \,\wedge \, \exists \sigma \left( \wedge _{i \in [m]} S^{(h)}_i \sqsubseteq S^{\prime (h)}_{\sigma (i)} \right) , &{} \text {otherwise}, \qquad \text {(27b)} \end{array}\right. } \end{aligned}$$
(27)

where \(\sigma : [m] \rightarrow [n]\) is an injection from [m] to [n] (i.e., \(\sigma (i) \ne \sigma (j)\) when \(i \ne j\)), and \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)})\) indicates that there exists an injection \(\sigma \) that satisfies \(S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)}\) for \(\forall i \in [m]\). The first condition (27a) is for the case of \(S^{(h)} = S^{\prime (h)} =[\,]\), which occurs at the first level \(h=1\); in this case, it simply evaluates whether the two hierarchical labels are equal, i.e., \(F^{(h)} = F^{\prime (h)}\). Note that when \(h = 1\), the hierarchical label is simply (X, []), where X is one of the original vertex-labels. In the other case (27b), both conditions \(F^{(h)} \sqsubseteq F^{\prime (h)}\) and \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)})\) are recursively defined. Suppose that we have already evaluated the level \(h - 1\) relation \(L_{P}(u,h-1) \sqsubseteq L_{G}(v,h-1)\) for all pairs \(\forall (u,v)\) from P and G. Because \(F^{(h)} = L_{P}(u,h-1)\) and \(F^{\prime (h)} = L_{G}(v,h-1)\), the condition \(F^{(h)} \sqsubseteq F^{\prime (h)}\) is equivalent to \(L_{P}(u,h-1) \sqsubseteq L_{G}(v,h-1)\), which is assumed to have already been obtained in the level \(h-1\) computation. Because \(S^{(h)}_i\) and \(S^{\prime (h)}_i\) also stem from hierarchical labels on level \(h - 1\), the condition \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i \sqsubseteq S^{\prime (h)}_{\sigma (i)})\) is also recursive. From the results of the level \(h - 1\) evaluations, we can determine whether \(S^{(h)}_i \sqsubseteq S^{\prime (h)}_{j}\) holds for \(\forall (i,j)\). Then, the evaluation of the condition \(\exists \sigma (\wedge _{i \in [m]} S^{(h)}_i \sqsubseteq S^{\prime (h)}_{\sigma (i)})\) is reduced to a matching problem from \(i \in [m]\) to \(j \in [n]\). This problem can be transformed into a maximum bipartite matching problem between \(\{ S^{(h)}_1, \ldots , S^{(h)}_m \}\) and \(\{ S^{\prime (h)}_1,\ldots ,S^{\prime (h)}_n \}\), where edges exist on the set of pairs \(\{ (i,j) \mid S^{(h)}_i\sqsubseteq S^{\prime (h)}_j \}\). When the maximum matching size is equal to m, there exists an injection \(\sigma (i)\) satisfying \(\wedge _{i \in [m]} S^{(h)}_i\sqsubseteq S^{\prime (h)}_{\sigma (i)}\). It is well known that maximum bipartite matching can be reduced to the maximum flow problem, which can be solved in polynomial time (Goldberg and Tarjan 1988). An example of the inclusion relationship is shown in Fig. 5b.
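
A minimal sketch of this matching step is given below. It checks whether an injection \(\sigma \) exists, given the set of pairs (i, j) with \(S^{(h)}_i \sqsubseteq S^{\prime (h)}_j\) already determined at level \(h-1\), and uses Kuhn's augmenting-path algorithm rather than an explicit max-flow solver; the function and variable names are illustrative.

```python
def exists_injection(edges, m, n):
    """Check whether an injection sigma: [m] -> [n] exists such that
    S_i ⊑ S'_{sigma(i)} for all i, given edges = {(i, j) | S_i ⊑ S'_j}.
    Uses Kuhn's augmenting-path algorithm for maximum bipartite matching."""
    adj = [[j for (a, j) in edges if a == i] for i in range(m)]
    match_right = [-1] * n                       # match_right[j] = left node matched to j

    def try_augment(i, seen):
        for j in adj[i]:
            if not seen[j]:
                seen[j] = True
                if match_right[j] == -1 or try_augment(match_right[j], seen):
                    match_right[j] = i
                    return True
        return False

    matched = sum(try_augment(i, [False] * n) for i in range(m))
    return matched == m                          # all left vertices matched => injection exists

# toy usage: S_0 fits S'_0 or S'_1, S_1 only fits S'_1 -> an injection exists
print(exists_injection({(0, 0), (0, 1), (1, 1)}, m=2, n=2))   # True
```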

Let |P| and |G| be the numbers of vertices in P and G, respectively. Then, multisets of the level T hierarchical labels of all the vertices in P and G are written as \([L_{P}(u_i,T)]_{i \in [|P|]} :=[L_{P}(u_1,T), L_{P}(u_2,T), \ldots , L_{P}(u_{|P|},T)]\) and \([L_{G}(v_i,T)]_{i \in [|G|]} :=[L_{G}(v_1,T), L_{G}(v_2,T), \ldots , L_{G}(v_{|G|},T)]\), respectively. For a feature of a given input graph G, we define the approximate subgraph isomorphism feature (ASIF) as follows:

$$\begin{aligned} x_{P \sqsubseteq G} :={\left\{ \begin{array}{ll} 1, &{} \text { if } \exists \sigma (\wedge _{i \in [ |P| ]} L_{P}(u_i,T) \sqsubseteq L_{G}(v_{\sigma (i)},T) ), \\ 0, &{} \text { otherwise. } \end{array}\right. } \end{aligned}$$
(28)

This feature approximately evaluates the existence of a subgraph P in G using the level T hierarchical labels. ASIF satisfies the monotone decreasing property (12), i.e., \(x_{P'\sqsubseteq G}\le x_{P\sqsubseteq G}\) if \(P' \sqsupseteq P\), because the number of conditions in (27) only increases when P grows.

To incorporate label dissimilarity information (as shown in Fig. 4) into ASIF, we first extend the label inclusion relation (26) by using the concept of optimal transportation cost. As a label similarity-based relaxed evaluation of \(L_{P}(u,h) \sqsubseteq L_{G}(v,h)\), we define an asymmetric cost between \(L_P(u,h)\) and \(L_G(v,h)\) as follows:

$$\begin{aligned} \mathrm {cost}_{h}(L_{P}(u,h) \rightarrow L_{G}(v,h)) :={\left\{ \begin{array}{ll} \mathrm {dissimilarity}(F^{(h)}, F^{\prime (h)}), &{} \text {if } S^{(h)} = S^{\prime (h)} = [\,], \qquad \text {(29a)} \\ \mathrm {cost}_{h-1}(F^{(h)} \rightarrow F^{\prime (h)}) + \mathrm {LTC}(S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1}), &{} \text {otherwise}, \qquad \text {(29b)} \end{array}\right. } \end{aligned}$$
(29)

where the second term of (29b) is

$$\begin{aligned} \mathrm {LTC}(S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1}) :=\min _{ \sigma \in {{\mathcal {I}}}} \sum _{i \in [m]} \mathrm {cost}_{h-1}(S^{(h)}_i \rightarrow S^{\prime (h)}_{\sigma (i)}), \end{aligned}$$
(30)

which we refer to as the label transportation cost (LTC), representing the optimal transportation from the multiset \(S^{(h)}\) to another multiset \(S^{\prime (h)}\) among the set of all injections \({{\mathcal {I}}}:=\{ \forall \sigma : [m] \rightarrow [n] \mid \sigma (i) \ne \sigma (j) \text { for } i \ne j \}\). Equation (29) has a recursive structure similar to that of (26). The first case (29a) occurs when \(S^{(h)} = S^{\prime (h)} =[\,]\), which is at the first level \(h = 1\). In this case, \(\mathrm {cost}_1\) is defined by \(\mathrm {dissimilarity}(F^{(1)},F^{\prime (1)})\), which is directly obtained as a dissimilarity between original labels since \(F^{(1)}\) and \(F^{\prime (1)}\) stem from the original vertex-labels. In the other case (29b), the cost is recursively defined as the sum of the cost from \(F^{(h)}\) to \(F^{\prime (h)}\) and the optimal-transport cost from \(S^{(h)}\) to \(S^{\prime (h)}\). Although this definition is recursive, as in the case of ASIF, the evaluation can be performed by computing sequentially from \(h = 1\) to \(h = T\). Because \(F^{(h)} = L_{P}(u,h-1)\) and \(F^{\prime (h)} = L_{G}(v,h-1)\), the first term \(\mathrm {cost}_{h-1}(F^{(h)} \rightarrow F^{\prime (h)})\) represents the cost between hierarchical labels on level \(h-1\), which is assumed to have already been obtained. The second term \(\mathrm {LTC}(S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1})\) evaluates the best match between \([S^{(h)}_1, \ldots , S^{(h)}_m]\) and \([S^{\prime (h)}_1, \ldots , S^{\prime (h)}_n]\), as defined in (30). This matching problem can be seen as an optimal transportation problem, which minimizes the cost of transporting m items to n warehouses under the cost matrix specified by \(\mathrm {cost}_{h-1}\). The values of \(\mathrm {cost}_{h-1}\) for all the pairs in [m] and [n] are also available from the computation at level \(h - 1\). For the given cost values, the problem \(\mathrm {LTC}(S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1})\) can be reduced to a minimum-cost-flow problem on a bipartite graph with a weight \(\mathrm {cost}_{h-1}(S^{(h)}_i \rightarrow S^{\prime (h)}_j)\) between \(S^{(h)}_i\) and \(S^{\prime (h)}_j\), which can be solved in polynomial time (Goldberg and Tarjan 1988).
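
Once the level \(h-1\) costs are arranged in a matrix, the LTC computation (30) can be sketched as an assignment problem. The sketch below uses SciPy's Hungarian-style solver (linear_sum_assignment) instead of a general minimum-cost-flow routine, which is sufficient when all edges are present; the cost values are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ltc(cost_matrix):
    """Label transportation cost LTC(S -> S', cost_{h-1}) of Eq. (30).

    cost_matrix[i, j] holds cost_{h-1}(S_i -> S'_j) for the m elements of S
    and the n >= m elements of S'; the optimal injection is an assignment problem.
    """
    C = np.asarray(cost_matrix, dtype=float)
    row, col = linear_sum_assignment(C)       # rectangular case matches every row
    return C[row, col].sum()

# toy usage: transport 2 child labels of P onto 3 child labels of G
print(ltc([[0.0, 0.4, 1.0],
           [0.3, 0.1, 0.9]]))                 # 0.0 + 0.1 = 0.1
```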

We define an asymmetric transport cost for two graphs P and G, which we call the graph transportation cost (GTC), as LTC from all level T hierarchical labels of P to those of G:

$$\begin{aligned} \mathrm {GTC}(P \rightarrow G) :=\mathrm {LTC}( [L_{P}(u_i,T)]_{i \in [|P|]} \rightarrow [L_{G}(v_i,T)]_{i \in [|G|]}, \mathrm {cost}_{T} ). \end{aligned}$$

Then, as a feature of the input graph G, we define the following sim-ASIF:

$$\begin{aligned} x_{P \rightarrow G} :=\exp \{-\rho \,\mathrm {GTC}(P \rightarrow G)\}, \end{aligned}$$
(31)

where \(\rho > 0\) is a hyperparameter. This sim-ASIF can be regarded as a generalization of (28) based on the vertex-label similarity. When \(\mathrm {dissimilarity}(F^{(1)}, F^{\prime (1)}) :=\infty \times 1_{F^{(1)} \ne F^{\prime (1)}}\), the feature (31) is equivalent to (28). Similarly to ASIF, \(\mathrm {GTC}(P\rightarrow G)\) satisfies the monotonicity property

$$\begin{aligned} \mathrm {GTC}(P \rightarrow G) \le \mathrm {GTC}(P' \rightarrow G)~\text {for}~P' \sqsupseteq P \end{aligned}$$

because the number of vertices to transport increases as P grows. Therefore, sim-ASIF (31) satisfies the monotonicity property, i.e., \(x_{P'\rightarrow G}\le x_{P\rightarrow G}\) if \(P'\sqsupseteq P\).

From the definition (31), sim-ASIF always takes a positive value \(x_{P \rightarrow G} > 0\) except when \(\mathrm {GTC}(P \rightarrow G) = \infty \), which may not be suitable for identifying a small number of important subgraphs. Further, in sim-ASIF, the bipartite graph in the minimum-cost-flow calculation \(\mathrm {LTC}(S \rightarrow S', \mathrm {cost}_{h-1})\) is always a complete bipartite graph, in which every vertex in S is connected to every vertex in \(S'\). Because the efficiency of most standard minimum-cost-flow algorithms depends on the number of edges, this may entail a large computational cost. As an extension to mitigate these issues, a threshold can be introduced into sim-ASIF as follows:

$$\begin{aligned} x:={\left\{ \begin{array}{ll} \exp \{-\rho \,\mathrm {GTC}(P \rightarrow G)\}, &{}\exp \{-\rho \,\mathrm {GTC}(P \rightarrow G)\}>t\\ 0, &{}\exp \{-\rho \,\mathrm {GTC}(P \rightarrow G)\}\le t \end{array}\right. }, \end{aligned}$$
(32)

where \(t > 0\) is a threshold parameter. In this definition, \(x = 0\) when \(\exp \{-\rho \,\mathrm {GTC}(P \rightarrow G)\}\le t\), i.e., \(\mathrm {GTC}(P \rightarrow G) \ge -(\log t)/\rho \). This indicates that if a cost is larger than \(-(\log t)/\rho \), we can regard the cost as \(\infty \). Therefore, at any h, if the cost between \(S^{(h)}_i\) and \(S^{\prime (h)}_j\) is larger than \(-(\log t)/\rho \), the edge between \(S^{(h)}_i\) and \(S^{\prime (h)}_j\) is not necessary. Then, the number of matching pairs can be less than m in \(\mathrm {LTC}(\cdot )\) because of the lack of edges, and in this case, the cost is regarded as \(\infty \). Furthermore, if \(\mathrm {cost}_h (F^{(h)} \rightarrow F^{\prime (h)})\) is larger than \(-(\log t)/\rho \) in (29b), the computation of \(\mathrm {LTC} (S^{(h)} \rightarrow S^{\prime (h)}, \mathrm {cost}_{h-1})\) is not required because \(x=0\) is determined.
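
A minimal sketch of the thresholded feature (32) and the induced cutoff cost \(-(\log t)/\rho \) is given below; the parameter values are only illustrative.

```python
import math

def sim_asif_feature(gtc_value, rho=1.0, t=0.7):
    """Thresholded sim-ASIF of Eq. (32): exp(-rho * GTC) if it exceeds t, else 0.
    Equivalently, any GTC value of at least -log(t)/rho can be treated as infinite,
    which also allows high-cost edges to be dropped in the LTC bipartite graphs."""
    x = math.exp(-rho * gtc_value)
    return x if x > t else 0.0

print(-math.log(0.7) / 1.0)      # cutoff cost, about 0.357
print(sim_asif_feature(0.2))     # about 0.819, kept
print(sim_asif_feature(1.0))     # 0.0, zeroed out
```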

Note that transportation-based graph metrics have been studied (e.g., Titouan et al. 2019), but the purpose of such studies was to evaluate the similarity between two graphs (not inclusion). Our (sim-)ASIF provides a feature with the monotonicity property as a natural relaxation of subgraph isomorphism, by which the optimality of our pruning strategy can be guaranteed. In contrast, there have been many studies on inexact graph matching (Yan et al. 2016) such as eigenvector- (Leordeanu et al. 2012; Kang et al. 2013), edit distance- (Gao et al. 2010), and random walk-based (Gori et al. 2005; Cho et al. 2010) methods. Some of these methods provide a score for the matching, which can be seen as a similarity score between a searched graph pattern and a matched graph. However, they do not guarantee the monotonicity of the similarity score for pattern growth. If the similarity score satisfies monotonicity, it can be combined with IGML. Although we only consider vertex-labels, edge-labels can also be incorporated into (sim-)ASIF. A simple approach is to transform a labeled-edge into a labeled-node with two unlabeled edges, such that (sim-)ASIF is directly applicable.

7 Experiments

We evaluate the performance of IGML using the benchmark datasets shown in Table 3, which are available from Kersting et al. (2016). We did not use edge labels because the implementations of the compared methods cannot deal with them, and the largest connected component was used when a graph was not connected. Note that IGML currently cannot directly deal with continuous attributes, so we did not use them; a possible approach would be to perform discretization or quantization before the optimization, such as taking grid points or applying clustering in the attribute space, and building a more elaborate approach, such as dynamically determining the discretization, is a possible future direction. The #maxvertices column in the table indicates the size (number of vertices) of the maximum subgraph considered by IGML. To fully identify important subgraphs, a large value of #maxvertices is preferred, but this causes a correspondingly large memory requirement for storing the gSpan tree. For each dataset, we set the largest value for which IGML could finish with a tractable amount of memory. The sets \({\mathcal {S}}_i\) and \({\mathcal {D}}_i\) were selected as the ten nearest neighbors of \({\varvec{x}}_i\) (\(K=|{\mathcal {S}}_i|=|{\mathcal {D}}_i|=10\)) using the WL-Kernel. A sequence of regularization coefficients was created by equally spacing 100 grid points on a logarithmic scale between \(\lambda _{\max }\) and \(0.01\lambda _{\max }\). We set the minimum support in gSpan to 0, meaning that all subgraphs in a given dataset satisfying the #maxvertices constraint were enumerated. The gSpan tree is mainly traversed at the beginning of each \(\lambda \), as shown in Algorithm 1 (in the case of the WS-based approaches, the tree is also traversed at every working set update). Note that the tree is dynamically constructed during this traversal without constructing the entire tree beforehand. In the working-set method, the tree must be traversed again after convergence to confirm overall optimality; if the termination condition is not satisfied, optimization with a new working set must be performed. The termination condition for the optimization is that the relative duality gap is less than \(10^{-6}\). In the experiments, we used \(g(x) = 1_{x > 0}\) in \(\phi _H(G)\) with Lemma 2 unless otherwise noted. Each dataset was randomly divided with the ratio \(\mathrm {train}:\mathrm {validation}:\mathrm {test} = 0.6:0.2:0.2\), and the experimental results were averaged over 10 runs.

7.1 Evaluating computational efficiency

In this section, we confirm the effect of the proposed pruning methods. We evaluated four settings: safe screening and pruning (“SSP”), range-based safe screening and pruning (“RSSP”), working set selection and pruning (“WSP”), and the combination of WSP and RSSP (“WSP+RSSP”). Each method performed dynamic screening with the DGB at every update of \({\varvec{m}}\). We used the AIDS dataset here, with #maxvertices = 30. For this dataset, when we fully traversed the gSpan tree without safe screening/working set selection, the number of tree nodes exceeded \(9.126 \times 10^7\), at which point our gSpan-based implementation stopped because it ran out of memory.

Figure 6a shows the size of \({{\mathcal {F}}}\) after the first traversal at each \(\lambda \), together with the number of non-zero \(m_k\) after the optimization as a baseline. We first observe that both approaches significantly reduced the number of features: even in the largest case, where approximately 200 features were finally selected through \(m_k\), fewer than 1000 features remained. WSP exhibited significantly smaller values than SSP. However, WSP may need to perform the entire tree search again because it cannot guarantee the sufficiency of the current \({{\mathcal {F}}}\), whereas SSP does not need to search the tree again because it guarantees that \({{\mathcal {F}}}\) contains all \(m_k \ne 0\).

The number of visited nodes in the first traversal at each \(\lambda \) is shown in Fig. 6b, where we added RSSP and WSP+RSSP, which are not shown in Fig. 6a. Note that the #remaining dimensions is the same for SSP and RSSP, and for WSP and WSP+RSSP, because RSSP is derived from SSP and does not change the set of screened features. As discussed in Sect. 4.2.2, WSP removes more features than RSSP, though it is not safe. The #visited nodes of SSP was the largest, but it was still less than approximately 27000 (\(27000 / (9.126 \times 10^7) \approx 0.0003\)). Comparing SSP and WSP, we see that WSP pruned a larger number of nodes. In contrast, the #visited nodes of RSSP was less than 6000; the difference between SSP and RSSP indicates that a larger number of nodes can be skipped by the range-based method. Therefore, by combining the node skipping of RSSP with the stronger pruning of WSP, the #visited nodes was reduced even further. RSSP and WSP+RSSP had larger values at \(\lambda _0\) than at the subsequent \(\lambda _i\). This is the effect of range-based screening and pruning: at \(\lambda _0\), every visited node in the tree calculates the ranges in which the screening and pruning rules are satisfied (i.e., the RSS and RSP rules), and as a result, some nodes can be skipped while \(\lambda _i\) lies in those ranges. At every \(\lambda _i\) for \(i > 0\), the ranges are updated only at the (non-skipped) visited nodes, and thus the range-based rules take effect at every \(\lambda \) except \(\lambda _0\).

Fig. 6 a Size of \({{\mathcal {F}}}\), and b number of visited nodes. Both were evaluated at the first traversal of each \(\lambda \), where the index is shown on the horizontal axis. The dataset employed here was AIDS

The total time of the path-wise optimization is shown in Table 2. RSSP and WSP+RSSP were fast in terms of traversal time, and WSP and WSP+RSSP were fast in terms of solving time. Note that because the tree is dynamically constructed during the traversal, the ‘Traverse’ time includes the time spent on tree construction. In total, WSP+RSSP was the fastest. These results indicate that our method took only approximately 1 minute to solve an optimization problem with more than \(9.126 \times 10^7\) variables. We also show the computational cost evaluation for other datasets in Appendix I.

Although we have confirmed that IGML works efficiently on several benchmark datasets, completely elucidating the general complexity of IGML remains future work. The practical complexity depends at least on the graph size in the training data, #maxvertices, #samples, and the pruning rate. In terms of graph size, traversing a dataset of large graphs with gSpan can be intractable because all matched subgraphs must be maintained at each tree node. Therefore, applying IGML to large graphs, e.g., graphs with more than thousands of vertices, would be difficult. Meanwhile, the scalability of IGML depends not only on the sizes of the graphs but also strongly on the performance of the pruning. However, we do not yet have a general analytic evaluation of the pruning rate that would rule out exponential worst-case computations. In fact, we observed that there exist datasets on which the efficiency of the pruning is not sufficient. For example, on the IMDB-BINARY and IMDB-MULTI datasets, which are also from Kersting et al. (2016), a large number of small subgraphs are shared across all the different classes and instances (i.e., \(x_{i,k} = 1\) for \(\forall i\)). Our upper bound in the pruning is based on the fact that \(x_{i,k'} \le x_{i,k}\) for any descendant node \(k'\) in the mining tree. This bound becomes tighter when \(x_{i,k} = 0\) for many i, because 0 is the lower bound of \(x_{i,k}\). In contrast, when many instances have \(x_{i,k} = 1\), the bound can be loose, making the traversal intractable. This is an important open problem common to predictive mining methods (Nakagawa et al. 2016; Morvan and Vert 2018).

Table 2 Total time in path-wise optimization (sec) on AIDS dataset

7.2 Predictive accuracy comparison

In this section, we compare the prediction accuracy of IGML with those of the Graphlet Kernel (GK) (Shervashidze et al. 2009), Shortest-Path Kernel (SPK) (Borgwardt and Kriegel 2005), Random-Walk Kernel (RW) (Vishwanathan et al. 2010), Weisfeiler-Lehman Kernel (WL) (Shervashidze et al. 2011), and Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al. 2018a). We used the implementations available at the URLs in Footnote 1. Note that we mainly compared methods for obtaining a metric between graphs. The graph kernel approach is one of the most important existing approaches to defining a metric space for non-vector structured data; although kernel functions are constructed in an unsupervised manner, their high prediction performance has been widely demonstrated. In particular, the WL kernel is known for classification performance comparable to recent graph neural networks (e.g., Niepert et al. 2016; Morris et al. 2019). Meanwhile, DGCNN can provide a vector representation of an input graph by using the outputs of an intermediate layer, which can be interpreted as a metric space obtained through supervised learning. We did not compare with (Saigo et al. 2009; Nakagawa et al. 2016; Morvan and Vert 2018) as they only focus on specific linear prediction models rather than building a general discriminative space. We employed the k-nearest neighbor (k-nn) classifier to directly evaluate the discriminative ability of the feature spaces constructed by IGML and each kernel function. A graph kernel can be seen as an inner product \(k(G_i,G_j) = \langle \varphi (G_i), \varphi (G_j) \rangle \), where \(\varphi \) is a projection from a graph to a reproducing kernel Hilbert space. Then, the distance can be written as \(\Vert \varphi (G_i) - \varphi (G_j) \Vert = \sqrt{k(G_i,G_i) - 2 k(G_i,G_j) + k(G_j,G_j)}\). The value of k for the k-nn classifier was selected from \(k=1, 3, 5, 7, \ldots , 49\), the hyperparameters of each method were selected using the validation data, and the prediction accuracy was evaluated on the test data. For GK, graphlets of size up to 6 were used. The parameter \(\lambda _{\mathrm{RW}}\) for RW was set to the recommended \(\lambda _{\mathrm{RW}}=\max _{i\in {\mathbb {Z}}: 10^i<1/d^2}10^i\), where d denotes the maximum degree. The loop parameter h of WL was selected from \(0, 1, 2, \ldots , 10\) using the validation data. For DGCNN, the number of hidden units and the sort-pooling ratio were also selected using the validation data from \(\{64, 128, 256\}\) and \(\{40\%, 60\%, 80\%\}\), respectively.
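
The following sketch shows how a precomputed kernel matrix can be converted into the distances used by a k-nn classifier. The kernel matrix and labels here are synthetic stand-ins, and scikit-learn's precomputed-metric k-nn is used purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def kernel_to_distance(K):
    """Convert a kernel (Gram) matrix into pairwise distances via
    ||phi(G_i) - phi(G_j)|| = sqrt(k_ii - 2 k_ij + k_jj)."""
    diag = np.diag(K)
    sq = diag[:, None] - 2.0 * K + diag[None, :]
    return np.sqrt(np.maximum(sq, 0.0))        # clip tiny negatives from rounding

# toy usage: k-nn on a precomputed kernel matrix K with labels y
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = X @ X.T                                    # a stand-in linear kernel
y = (X[:, 0] > 0).astype(int)
D = kernel_to_distance(K)
clf = KNeighborsClassifier(n_neighbors=3, metric="precomputed").fit(D, y)
print(clf.score(D, y))   # training accuracy (each point counts itself as a neighbor)
```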

The micro-F1 score for each dataset is shown in Table 3. “IGML (Diag)” indicates IGML with the weighted squared distance (1), and “IGML (Diag\(\rightarrow \)Full)” indicates IGML with post-processing using the Mahalanobis distance (22). “IGML (Diag)” yielded the best score, or a score comparable to the best, on seven out of the nine datasets. This result is impressive because IGML uses a much simpler metric than the other methods. Among these seven datasets, “IGML (Diag\(\rightarrow \)Full)” slightly improved the mean accuracy on four, but the difference was not significant; this may suggest that the diagonal weighting already provides sufficient performance in many practical settings. The WL kernel also exhibited superior performance, showing the best or comparable-to-the-best accuracy on six datasets. DGCNN showed high accuracy on the DBLP_v1 dataset, which has a large number of samples, while its accuracy was lower on the other datasets.

Table 3 Comparison of micro-F1 score. OOM means out of memory

7.3 Illustrative examples of selected subgraphs

Figure 7 shows an illustrative example of IGML on the Mutagenicity dataset, in which mutagenicity is predicted from a graph representation of molecules. Figure 7a is a graphical representation of the selected subgraphs, each of which has a weight shown in (b). For example, we can clearly see that subgraph #2 is estimated to be an important sub-structure for discriminating between the classes. Figure 7c shows a heatmap of the transformation matrix \(\sqrt{{\varvec{\Lambda }}}{\varvec{V}}^\top \) optimized for the thirteen features, which contains three non-zero eigenvalues. For example, the subgraphs #10 and #12 have similar columns in the heatmap, indicating that these two similar subgraphs (#10 contains #12) are shrunk to almost the same representation by the regularization term \(R(\varvec{M})\).

Fig. 7 Examples of selected subgraphs. a: Illustrations of subgraphs. b: Learned weights of subgraphs. c: Transformation matrix heatmap (24)

Fig. 8 Examples of paths on the decision tree constructed from the selected subgraphs. #samples indicates the number of samples satisfying all preceding conditions

As another example of graph data analysis on the learned representation, we applied the decision tree algorithm to the obtained feature (23) on the Mutagenicity dataset. Although there has been a study constructing a decision tree directly for graph data (Nguyen et al. 2006), it requires a severe restriction on the patterns to be considered for computational feasibility. In contrast, because (23) is a simple vector representation with a reasonable dimension, it is quite easy to apply the decision tree algorithm. We selected two paths from the obtained decision tree as shown in Fig. 8. For example, in the path (a), if a given graph contains “\({\varvec{O}}= {\varvec{N}}\)”, and does not contain “\({\varvec{H}}- {\varvec{O}}- {\varvec{C}}- {\varvec{C}}= {\varvec{C}}- {\varvec{C}}- {\varvec{H}}\)”, and contains “\({\varvec{N}}- {\varvec{C}}= {\varvec{C}}- {\varvec{C}}= {\varvec{C}} < \begin{array}{l}{\varvec{C}}\\ {\varvec{C}}\end{array}\)”, the given graph is predicted as \(y=0\) with probability 140/146. Both rules clearly separate the two classes, which is highly insightful as we can trace the process of the decision based on the subgraphs.
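
A minimal sketch of this post-hoc analysis is given below. The subgraph features, weights, and labels are synthetic stand-ins, and scikit-learn's decision tree is applied to the learned representation (23).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# synthetic stand-ins: X_sel holds binary "subgraph contained" features, m_hat the weights
rng = np.random.default_rng(0)
X_sel = rng.integers(0, 2, size=(200, 5)).astype(float)
m_hat = np.array([1.5, 0.0, 0.7, 0.3, 0.0])
y = (X_sel[:, 0] * (1 - X_sel[:, 2]) > 0).astype(int)       # toy rule-based labels

Z = np.sqrt(m_hat) * X_sel                   # learned representation of Eq. (23)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Z, y)
print(export_text(tree, feature_names=[f"subgraph#{k}" for k in range(Z.shape[1])]))
```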

7.4 Experiments for three extensions

In this section, we evaluate the performance of the three extensions of IGML described in Sect. 6.

First, we evaluated the performance of IGML on itemset and sequence data using the benchmark datasets shown in the first two rows of Tables 4 and 5. These datasets can be obtained from Dua and Graff (2017) and Chang and Lin (2011), respectively. We set the maximum pattern size considered by IGML to 30. Table 4 lists the micro-F1 scores on the itemset datasets. We used k-nn with the Jaccard similarity as a baseline, where k was selected using the validation set, as described in Sect. 7.2. The scores of both IGML (Diag) and IGML (Diag\(\rightarrow \)Full) were superior to those of the Jaccard similarity on all datasets. Table 5 lists the micro-F1 scores on the sequence datasets. Although IGML (Diag) did not outperform the mismatch kernel (Leslie et al. 2004) on the promoters dataset, IGML (Diag\(\rightarrow \)Full) achieved a higher F1-score than the kernel on all datasets. Figure 9 shows an illustrative example of sequences identified by IGML on the promoters dataset, where the task was to predict whether an input DNA sequence stems from a promoter region. Figure 9a is a graphical representation of the selected sequences, and the corresponding weights are shown in (b). For example, sub-sequence #1 in (a) can be considered an important sub-sequence for discriminating between the classes.

Table 4 Micro-F1 scores on itemset datasets
Table 5 Micro-F1 scores on sequence datasets
Fig. 9 Examples of a selected sequences and b their weights for the promoters dataset

Second, we show the results of the triplet formulation described in Sect. 6.2. To create the triplet set \({\mathcal {T}}\), we followed the approach of Shen et al. (2014), in which, for each \({\varvec{x}}_i\), k neighbors \({\varvec{x}}_j\) in the same class and k neighbors \({\varvec{x}}_l\) in different classes were sampled (\(k=4\)). IGML with the pairwise loss is referred to as ‘IGML (Pairwise)’, and IGML with the triplet loss as ‘IGML (Triplet)’. Table 6 compares the micro-F1 scores of IGML (Pairwise) and IGML (Triplet). IGML (Triplet) showed higher F1-scores than IGML (Pairwise) on three of the nine datasets, but it was not computable on two datasets because it ran out of memory (OOM). This is because the pruning rule in the triplet case (25) is looser than that in the pairwise case.

Table 6 Comparison of IGML (Pairwise) and IGML (Triplet) in terms of micro-F1 score

Finally, we evaluated sim-ASIF (32). We set the scaling factor of the exponential function to \(\rho =1\), the threshold of the feature to \(t=0.7\), and the number of re-labeling steps to \(T=3\). We employed a simple heuristic to create a dissimilarity matrix among vertex-labels from the labeled graphs in the given dataset. Suppose that the set of possible vertex-labels is \({{\mathcal {L}}}\), and \(f(\ell ,\ell ')\) is the frequency with which \(\ell \in {{\mathcal {L}}}\) and \(\ell ' \in {{\mathcal {L}}}\) are adjacent across all graphs of the dataset. By concatenating \(f(\ell ,\ell ')\) for all \(\ell ' \in {{\mathcal {L}}}\), we obtained a vector representation of a label \(\ell \), which we normalized to have unit L2 norm. By calculating the Euclidean distances between these normalized representations, we obtained the dissimilarity matrix of vertex-labels. We are particularly interested in the case where the distribution of the vertex-label frequency differs largely between the training and test datasets, because in this case the exact matching of IGML may not be suitable for providing a prediction. We synthetically emulated this setting by splitting the training and test datasets using a clustering algorithm. Each input graph was transformed into a vector of the frequencies of each vertex-label \(\ell \in {{\mathcal {L}}}\) contained in that graph. Subsequently, we applied k-means clustering to split the dataset into two clusters, whose assigned data points are denoted by \({{{\mathcal {C}}}}_1\) and \({{{\mathcal {C}}}}_2\), respectively, with \(|{{\mathcal {C}}}_1| \ge |{{\mathcal {C}}}_2|\). We used \({{\mathcal {C}}}_1\) for the training and validation datasets and \({{\mathcal {C}}}_2\) as the test dataset. Following the same partitioning policy as in the above experiments, the size of the validation set was set equal to \(|{{\mathcal {C}}}_2|\), so that the size of the training set was \(|{{\mathcal {C}}}_1| - |{{\mathcal {C}}}_2|\). Table 7 compares the micro-F1 scores on the AIDS, Mutagenicity, and NCI1 datasets; we did not consider the other datasets because their training set sizes created by the above procedure were too small. We fixed the #maxvertices of sim-ASIF to 8, which is less than the value used in the original IGML evaluation (Table 3), because sim-ASIF is more time-consuming than the feature without vertex-label similarity. For the original IGML, we show the results both for the setting in Table 3 and for #maxvertices = 8. As shown in Table 7, IGML with sim-ASIF was superior to the original IGML under both #maxvertices settings on the AIDS and NCI1 datasets, despite its smaller #maxvertices. On the Mutagenicity dataset, sim-ASIF was inferior to the original IGML reported in Table 3, but under the same #maxvertices value their performances were comparable. These results suggest that when exact matching of subgraphs is not appropriate, sim-ASIF can improve the prediction performance of IGML.
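
The heuristic construction of the vertex-label dissimilarity matrix can be sketched as follows; the graph representation and the toy input are illustrative.

```python
import numpy as np

def label_dissimilarity(graphs, label_set):
    """Heuristic vertex-label dissimilarity used for the sim-ASIF experiment:
    represent each label by its adjacency-frequency vector over all graphs,
    L2-normalize, and take Euclidean distances between the normalized vectors.

    graphs : list of (adj, labels) pairs, adj = {v: neighbors}, labels = {v: label}
    """
    idx = {l: i for i, l in enumerate(label_set)}
    F = np.zeros((len(label_set), len(label_set)))
    for adj, labels in graphs:
        for v, neigh in adj.items():
            for u in neigh:
                F[idx[labels[v]], idx[labels[u]]] += 1.0     # f(l, l') adjacency counts
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    F = F / np.maximum(norms, 1e-12)                         # unit L2 norm per label
    diff = F[:, None, :] - F[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))                 # pairwise Euclidean distances

# toy usage: one small labeled graph
adj = {'v1': ['v2'], 'v2': ['v1', 'v3'], 'v3': ['v2']}
labels = {'v1': 'A', 'v2': 'B', 'v3': 'A'}
print(label_dissimilarity([(adj, labels)], ['A', 'B']))
```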

Table 7 Evaluation of sim-ASIF with micro-F1 score

7.5 Performance on frequency feature

In this section, we evaluate IGML with \(g(x)=\log (1+x)\) instead of \(g(x)=1_{x>0}\). Note that because computing the non-overlapping frequency \(\#(H\sqsubseteq G)\) is NP-complete (Schreiber and Schwöbbermeyer 2005), we evaluated, in addition to the exact count, the feature defined by an upper bound of \(\#(H\sqsubseteq G)\) (see Appendix J for details). We employed the \(\log \) transformation because the scale of the frequency x varies widely. Based on the results in Sect. 7.1, we used WSP+RSSP in this section. The #maxvertices for each dataset followed Table 3.

The comparison of micro-F1 scores for the exact \(\#(H\sqsubseteq G)\) and the approximation of \(\#(H\sqsubseteq G)\) is shown in Table 8. The exact \(\#(H\sqsubseteq G)\) did not complete on five datasets, mainly because of the computational difficulty of the frequency counting, whereas the approximate \(\#(H\sqsubseteq G)\) completed on all datasets. Overall, for both the exact and approximate frequency features, the micro-F1 scores were comparable with those for \(g(x)=1_{x>0}\) shown in Table 3.

Table 9 lists the total times of the path-wise optimization for the exact \(\#(H\sqsubseteq G)\) and the approximation of \(\#(H\sqsubseteq G)\). On the AIDS dataset, the exact \(\#(H\sqsubseteq G)\) did not complete within a day, while the traversal time using the approximate \(\#(H\sqsubseteq G)\) was only 8.6 sec. On the BZR dataset, the traversal time using the exact \(\#(H\sqsubseteq G)\) was seven times that using the approximation. The solving time for the approximation was lower because \(|\mathcal {F}|\) after the traversal was significantly smaller than that for the exact \(\#(H\sqsubseteq G)\) in this case. Because the approximate \(\#(H\sqsubseteq G)\) is an upper bound of the exact \(\#(H\sqsubseteq G)\), the variation of the values of the exact \(\#(H\sqsubseteq G)\) was smaller than that of the approximation, which resulted in higher correlations among the features created by the exact \(\#(H\sqsubseteq G)\). It is known that elastic-net regularization tends to select correlated features simultaneously (Zou and Hastie 2005), and therefore \(| {{\mathcal {F}}}|\) in the exact case becomes larger than in the approximate case.

Figure 10 shows the number of visited nodes, size of the feature subset \(|{{\mathcal {F}}}|\) after traversal, and the number of selected features on the AIDS dataset with the approximate \(\#(H\sqsubseteq G)\). This indicates that IGML keeps the number of subgraphs tractable even if \(g(x)=\log (1+x)\) is used as the feature. The #visited nodes was less than 3500, and \(|\mathcal {F}|\) after traversal was sufficiently close to \(|\{k\mid {\hat{m}}_k>0\}|\). We see that #visited nodes at \(\lambda _0\) is larger than many subsequent \(\lambda _i\)s, and this is the effect of range-based rules, as shown in the case of Fig. 6b.

Table 8 Micro-F1 scores for \(g(x)=\log (1+x)\)
Table 9 Total time of path-wise optimization (sec) for \(g(x)=\log (1+x)\)
Fig. 10 Results of IGML with \(g(x)=\log (1+x)\) on AIDS dataset

8 Conclusions

In this paper, we proposed an interpretable metric learning method for graph data, named interpretable graph metric learning (IGML). To avoid computational difficulty, we built an optimization algorithm that combines safe screening, working set selection, and their pruning extensions. We also discussed three extensions of IGML: (a) applications to other structured data, (b) a triplet loss-based formulation, and (c) incorporating vertex-label similarity into the feature. We empirically evaluated the performance of IGML compared with existing graph classification methods. Although IGML was the only method with clear interpretability, it showed prediction performance superior or comparable to other state-of-the-art methods. The practicality of IGML was further demonstrated through illustrative examples of identified subgraphs. Although IGML optimized the metric within a tractable time in the experiments, the subgraphs were restricted to moderate sizes (up to 30 vertices), and a current major bottleneck for extracting larger subgraphs is the memory requirement of the gSpan tree. Therefore, mitigating this memory consumption is an important future direction for applying IGML to a wider class of problems.