1 Introduction

The approximation of complex concepts using exclusively sensor data sets often proves difficult, owing to the intricate nature of real-world processes and the presence of direct and indirect relations and interactions between the objects involved in those processes. Numerous approaches have been developed for using domain knowledge in classifier construction in order to take these phenomena into account. Domain knowledge is predominantly used to narrow down the search space and to facilitate the interpretation of results. Such knowledge is thus applied mainly in data preparation, to eliminate irrelevant features, select the most valuable ones or derive new features. The literature records a material favourable effect of such use of domain knowledge on the performance of certain data exploration methods. For instance, Sinha and Zhao [29] and Zhao, Sinha and Ge [36] analyse the effect of using such knowledge on the efficiency of the following classifiers: logistic regression, artificial neural networks, the k-NN (k-nearest neighbour) method, naive Bayes classifiers, decision trees and the SVM (support vector machine) method. It has been observed, though, that the use of domain knowledge for feature selection proved least effective in the case of decision trees and the k-NN method.

For methods such as k-NN, it is of key importance to evaluate the distance between, or in other words the similarity of, two objects (e.g. patients). This requires data to be analysed at numerous levels of abstraction. However, given the semantic distance between complex concepts and sensor data, this is not feasible for classic modelling methods based on measured features. The definition of such metrics (distance measures) thus remains a major challenge in data exploration. There exist methods of defining a similarity relation by building a metric (distance function) based on simple strategies of aggregating local similarities of the objects being compared (see Bazan [4] for more details). A chosen distance formula is then optimized by tuning the local similarity features and the parameters used to aggregate them. However, the main challenge in approximating the metric is the selection of the local similarities and of the way they are aggregated, while domain knowledge shows that there are usually numerous different aspects of similarity between the elements being compared. Each aspect should be examined specifically, in line with the domain knowledge. Furthermore, the aggregation of the various aspects into a global similarity or distance should also be based on that knowledge. Therefore, the authors propose to define a semantic metric (for measuring the distance between objects) founded on a concept ontology (based on the domain knowledge) and to use it in the k-NN classifier. An ontology is understood here as a finite set of concepts arranged in a hierarchy, equipped with relations between concepts from different hierarchy levels.

2 Related work

For a review of existing approaches to measuring the distance between concepts, the reader is referred to Pedersen et al. [25] and Taieb et al. [31]. Measures of semantic similarity and relatedness are there divided into types such as: based on paths in a concept ontology, based on information content and based on context vectors. Rada et al. [27] define the semantic distance as the length of the shortest path connecting two concepts in the ontology: the longer the path, the more semantically distant the concepts. A measure of semantic similarity between concepts based on the length and depth of the path was proposed by Wu and Palmer [35]. This approach uses the number of “is-a” edges from the concepts to their nearest common LCS (lowest common subsumer) and the number of edges from the LCS to the root of the taxonomy. Leacock and Chodorow [21] proposed a measure of semantic similarity based on the shortest path in the lexical database WordNet [34]. The path length is scaled by the maximum taxonomy depth to a value between 0 and 1, and the similarity is calculated as the negative logarithm of this value. A measure of similarity based on the notion of information content (IC) was presented by Resnik [28]. IC, a measure of the specificity of a concept, is calculated for each concept in the hierarchy from the frequency of occurrence of this concept in a broader context. Using IC, Resnik proposes a measure in which the semantic similarity of two concepts is proportional to the amount of information they share. Lin et al. [22] extended Resnik’s work by scaling the information content of the LCS by the information content of the individual concepts. Hsu et al. [20] provided a representation of such a distance in the form of a distance hierarchy, enhancing concept classification by assigning weights to inter-concept links (edges). The distance between two values of a (categorical or numeric) feature is there measured as the total weight of the edges along the path connecting the two nodes (concepts), with the weights defined by an expert based on domain knowledge. All the methods of measuring semantic distance with the use of domain knowledge described in the above-mentioned articles relate to the comparison of concepts or feature values, which makes them useful in, for instance, the discretization of features. The method of constructing an ontology-based metric proposed by the authors takes another approach: it is designed to determine the similarity of objects covered by the respective denotations of concepts, and not of the concepts themselves or their features.
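To illustrate the path-based family of measures, the sketch below computes the Wu–Palmer similarity on a toy “is-a” taxonomy; the taxonomy, concept names and Python formulation are ours and serve only as an illustration of the cited approach, not as part of the proposed method.

```python
# Illustrative sketch (not from the paper): Wu-Palmer similarity on a toy "is-a" taxonomy.
# sim_WP(c1, c2) = 2 * depth(LCS) / (depth(c1) + depth(c2)), with depth(root) = 1.

PARENT = {                       # toy taxonomy: child -> parent (invented)
    "tachycardia": "arrhythmia",
    "bradycardia": "arrhythmia",
    "arrhythmia": "heart disease",
    "ischaemia": "heart disease",
    "heart disease": "disease",
}

def path_to_root(concept):
    """List of concepts from `concept` up to the root of the taxonomy."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lcs(c1, c2):
    """Lowest common subsumer: the first ancestor of c2 that also subsumes c1."""
    ancestors_of_c1 = set(path_to_root(c1))
    for c in path_to_root(c2):
        if c in ancestors_of_c1:
            return c

def wu_palmer(c1, c2):
    depth = lambda c: len(path_to_root(c))          # depth(root) == 1
    return 2.0 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

print(wu_palmer("tachycardia", "ischaemia"))        # LCS = "heart disease" -> 4/7 ~ 0.57
```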

3 Construction of a classifier

The similarity function proposed for the purposes of exploration of a set of actual medical data enables patients to be compared in terms of the acuteness of coronary disease and thus to be evaluated for the risk of health- and life-threatening consequences. The more acute the disease, the greater the risk of heart incidents [dangerous rhythm disturbances, acute myocardial ischaemia or sudden cardiac death (SCD)]. Experimental data were provided by the Second Department of Internal Medicine of the Jagiellonian University Medical College. Two data sets were collected containing ECGs recorded with the Holter method and supplemented with clinical data of patients suffering from stable myocardial ischaemia (with sinus rhythm). From the first set (HOLTER_I), 19 features of 70 patients tested in 2006–2009 with the use of Aspel’s three-channel HolCARD 24W system were used. From the second set (HOLTER_II), 20 features of 200 patients tested in 2015–2016 with the use of 12-channel R12 monitor of the BTL CardioPoint-Holter H600 v2-23 system were used. Table 1 presents the key profile and angiographic data of both sets. Our research was designed to develop an efficient k-NN classifier with the use of the proposed similarity measure as the metric. The occurrence and non-occurrence of stable coronary disease (binary decision) were chosen as decision classes.

Table 1 Clinical profile of tested populations (the HOLTER_I and HOLTER_II sets)
Fig. 1 CHD ontology with expert-assigned weights for the HOLTER_I set (for comparison, the values in parentheses show the weights determined with the Monte Carlo method)

In the first stage of the similarity function construction, a hierarchical ontology was defined containing concepts referring to stable myocardial ischaemia. At the bottom level, sensor features (sourced directly from the data set) were placed. They were selected from the entire data set so as to correspond to recognized SCD prognostic factors [10]. Then, at each level of the ontology, the materiality of a given concept with respect to the higher-level concept was expressed by assigning it an appropriate weight. A domain expert chose all the weights arbitrarily as numbers from the (0,1) interval. The resulting ontology is presented in Fig. 1. To benchmark prognostic efficiency, an ontology of the same structure, but with weights determined with a Monte Carlo method, was also used at the experimental stage.

The next step consisted in defining an algorithm for computing the value of the function measuring the similarity of objects with the use of the defined ontology and the assigned weights.

The final stage was the construction of a k-NN classifier using the developed metric of semantic similarity of patients.

3.1 Construction of ontology

Table 2 Prognostic SCD factors in anamnesis
Table 3 Prognostic SCD factors in supplementary tests

Determination of the ontology-based distance requires predefining a concept ontology covered by the term which defines the decision problem. In line with the construction plan for such an ontology proposed by Noy and McGuinness [24], medical sciences were chosen as the domain and cardiology as the field. Then concepts were identified indicating the advancement of myocardial ischaemia, such as: alterations in the anamnesis, alterations in supplementary tests, epidemiological risks, coexisting diseases, alterations in electrophysiological tests or deviations in laboratory tests. These notions served as a basis for defining the following ontology concepts: CHD (coronary heart disease), anamnesis, supplementary tests, epidemiology, coexisting diseases, electrophysiological tests, laboratory tests, ECG, HRV, QT, tachycardia and ST. Then, using a top-down approach, the concepts were arranged hierarchically into a tree-like structure, starting from the most general concept (at the top) down to the most specific ones (at the bottom). Each concept was assigned a property in the form of a weight, a number in the (0,1) interval, reflecting the concept's materiality with respect to the concept preceding it in the tree (one level up), with the proviso that the sum of the weights assigned to all successors (one level down along the tree paths) of a given concept is 1. The last stage of the construction consisted in defining instances of individual concepts in the form of recognized SCD prognostic factors [1, 12, 14, 26] corresponding to appropriate data set features. Selected risk factors are presented in Tables 2 and 3. In the CHD ontology thus developed (see Fig. 1), 19 risk factors were used, to which experts assigned weights in proportion to their relative importance in the denotation of the respective concept. At the bottom level, there are concept instances taken directly from the data set. The proposed ontology includes only selected concepts present in the data sets; it may, though, be easily extended to include further elements. We should mention here that the literature does not specify a required number of risk factors: the larger the number of risk factors, the greater the risk of heart incidents [dangerous rhythm disturbances, acute myocardial ischaemia or sudden cardiac death (SCD)]. The OWL technology [3] was used to record and store the developed ontology.
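As an illustration of the weight constraint described above (the weights of all direct successors of a concept must sum to 1), the following sketch encodes a small fragment of such a weighted hierarchy and validates it; the concept names echo Fig. 1, but the weight values are placeholders rather than the expert-assigned ones.

```python
# Illustrative sketch: a fragment of a weighted concept hierarchy in which the weights
# of the direct successors of every concept sum to 1. The concept names echo Fig. 1;
# the weight values are placeholders, not the expert-assigned weights.

CHD_ONTOLOGY = {
    "CHD": {"anamnesis": 0.4, "supplementary tests": 0.4, "epidemiology": 0.2},
    "anamnesis": {"age": 0.3, "MI": 0.4, "coexisting diseases": 0.3},
    "supplementary tests": {"electrophysiological tests": 0.6, "laboratory tests": 0.4},
    # leaves (sensor features) have no successors and therefore no entry here
}

def check_weights(ontology, tol=1e-9):
    """Verify the construction constraint: successors' weights sum to 1 for each concept."""
    for concept, successors in ontology.items():
        total = sum(successors.values())
        assert abs(total - 1.0) < tol, f"weights under '{concept}' sum to {total}"

check_weights(CHD_ONTOLOGY)      # passes silently for a correctly weighted hierarchy
```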

It should be noted that the threshold values shown in the “Description” column of the mentioned tables represent current medical knowledge, but they were not used to determine the values of any symbolic attributes in the constructed ontology. The only parameters that take symbolic values are HA, MI, DM, stimulants and gender. Each of them records a simple fact: the presence of a disease, the use of a stimulant or the patient's gender. Such facts are inherently dichotomous, so no thresholding of their values is required. All other parameters are numeric and, when determining the similarity between patients, were compared directly using formula (3), so no thresholding mechanism involving arbitrarily chosen threshold values was required either.

3.2 Determination of the ontology-based distance value

Based on the ontology thus built, we can measure the distance between objects within the denotation of the concepts described by the ontology. Each concept of the ontology describes a differentiation among the objects considered, patients in the case in question. The metric (distance) proposed hereinafter has been designed to help answer the question: “how similar (or dissimilar) are two patients diagnosed with myocardial ischaemia?”.

Standard metric-based techniques use metrics such as the Euclidean distance or, more generally, the Minkowski distance (p-norm), defined by formulas (1) and (2), respectively.

$$\begin{aligned} d_\mathrm{Euclidean}(x,y) = \sqrt{\sum _{i=1}^{m}(x_i-y_i)^2}, \end{aligned}$$
(1)
$$\begin{aligned} d_\mathrm{Minkowski}(x,y) = \left( \sum _{i=1}^{m}|x_i-y_i|^p\right) ^{1/p}, \end{aligned}$$
(2)

where m is the number of conditional features in the decision table, while \(x = [x_{1}, x_{2},\ldots , x_{m}]\) and \(y = [y_{1}, y_{2},\ldots , y_{m}]\) are the values of those m features for two objects. The parameter p is a positive integer.
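For reference, a minimal sketch of formulas (1) and (2) as code (our illustration, not part of the original method):

```python
# Minimal sketch of the standard metrics (1) and (2); p = 2 gives the Euclidean distance.
def minkowski(x, y, p=2):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

print(minkowski([1.0, 2.0], [4.0, 6.0]))     # Euclidean distance: 5.0
```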

However, these metrics take into consideration exclusively data collected by sensors, with no regard to the interrelations among concepts at higher levels of abstraction. Moreover, they can process numeric features only. The ontology-based metric proposed herein, on the other hand, handles the hierarchies and meanings of the described concepts and is free from such limitations. Its computation is a multi-stage process. In the first stage, distances are computed between feature values from sensor readings, that is, at the bottom level of the ontology. In subsequent stages, at a given ontology level, the distance between two objects in the denotation of a given concept is defined using the distances between these objects in the denotations of the concepts subordinate to that concept (one level down) and the respective weights measuring their impact on the higher-level concept.

The similarity function measuring distance between two objects \(u_{i}\) and \(u_{j}\) with respect to a numeric sensor-monitored feature a at the bottom ontology level is defined by formula (3) [33]:

$$\begin{aligned} d_\mathrm{num}(u_i,u_j,a)=\frac{|a(u_i)-a(u_j)|}{R_a} \quad \text { for }i,j \in \{1,\ldots ,n\}, \end{aligned}$$
(3)

where n stands for the number of objects and \(R_{a}\) is the range of the feature values. The range may be defined as the difference between the greatest and the least values of the feature in a given data set, or it may be known from the domain knowledge. Given the lack of accurately determined extreme values for certain SCD risk factors, the former approach is used herein. The similarity function measuring the distance with respect to a symbolic (non-numeric) sensor-monitored feature (attribute) a is defined with the use of the value difference metric (VDM) method [30], in accordance with formula (4):

$$\begin{aligned} d_\mathrm{symb}(u_i,u_j,a)=\sum _{d_c \in D} |P(dec=d_c|a(u_i)=v)-P(dec=d_c|a(u_j)=w)|, \end{aligned}$$
(4)

where D stands for the set of decision classes, P is the probability distribution on the set of decision values (see formula 5), and \(v = a(u_i)\) and \(w = a(u_j)\) are values from \(V_{a}\), the domain of the feature a.

$$\begin{aligned} P(dec=d_c|a(u)=v)=\frac{|\{u\in U:dec(u)=d_c \wedge a(u)=v\}|}{|\{u \in U:a(u)=v\}|}, \end{aligned}$$
(5)

where U is a non-empty finite set (the “universe”) whose elements are called objects, \(U = \{u_{1}, u_{2}, \ldots , u_{n}\}\), and dec(u) is the value of the decision feature for an object u.
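The sketch below illustrates formulas (3)–(5) on a small hypothetical decision table; the patient records and decision values are invented for the example and are not taken from the HOLTER data sets.

```python
# Illustrative sketch of formulas (3)-(5) on a hypothetical decision table.
# The patient records and decisions below are invented; they are not HOLTER data.

patients = [
    {"age": 45, "gender": "M", "dec": 0},
    {"age": 70, "gender": "M", "dec": 1},
    {"age": 63, "gender": "F", "dec": 1},
    {"age": 52, "gender": "F", "dec": 0},
]

def d_num(ui, uj, a, data):
    """Formula (3): absolute difference scaled by the range R_a taken from the data set."""
    values = [u[a] for u in data]
    return abs(ui[a] - uj[a]) / (max(values) - min(values))

def d_symb(ui, uj, a, data, decisions=(0, 1)):
    """Formulas (4)-(5): value difference metric (VDM) for a symbolic feature."""
    def p(dec_class, value):                         # P(dec = d_c | a(u) = v), formula (5)
        with_value = [u for u in data if u[a] == value]
        return sum(u["dec"] == dec_class for u in with_value) / len(with_value)
    return sum(abs(p(dc, ui[a]) - p(dc, uj[a])) for dc in decisions)

print(d_num(patients[0], patients[1], "age", patients))      # |45-70| / (70-45) = 1.0
print(d_symb(patients[0], patients[2], "gender", patients))  # |0.5-0.5| + |0.5-0.5| = 0.0
```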

Finally, the distance between two objects \(u_{i}\) and \(u_{j}\) with respect to a concept C placed at a higher ontology level is defined in accordance with formula (6):

$$\begin{aligned} d_\mathrm{onto}(u_i,u_j,C) = {\left\{ \begin{array}{ll} \sum \limits _{s \in S} w_s \cdot d_\mathrm{onto}(u_i,u_j,C_s) &{} {\text { for subordinate concepts } C_s} \\ \sum \limits _{a \in A} w_a \cdot d_\mathrm{num}(u_i,u_j,a) &{} {\text { for numeric attributes } a} \\ \sum \limits _{a \in A} w_a \cdot d_\mathrm{symb}(u_i,u_j,a) &{} {\text { for symbolic attributes } a} \end{array}\right. }, \end{aligned}$$
(6)

where S stands for the set of concepts subordinate to C (lying in its denotation) and A for the set of sensor-monitored features one level down (if the successors of C are features rather than concepts), \(w_{s}\) and \(w_{a}\) stand for the weights of a given subordinate concept s or feature a (at the bottom level), and \(C_{s}\) represents a concept subordinate to C, one level down.
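A minimal sketch of the recursion in formula (6) follows; it reuses the leaf distances d_num and d_symb and the patients table from the previous sketch, while the toy hierarchy, its weights and the marking of leaves as numeric or symbolic are placeholders, not the authors' implementation.

```python
# Illustrative sketch of formula (6): the distance with respect to a concept is the
# weighted sum of the distances with respect to its direct successors (sub-concepts
# or, at the bottom level, sensor features). Reuses d_num, d_symb and `patients`
# from the previous sketch; weights and leaf types below are placeholders.

TOY_ONTOLOGY = {
    "CHD": {"anamnesis": 0.7, "epidemiology": 0.3},
    "anamnesis": {"age": 1.0},
    "epidemiology": {"gender": 1.0},
}
LEAF_TYPE = {"age": "num", "gender": "symb"}         # how each sensor feature is compared

def d_onto(ui, uj, concept, data):
    successors = TOY_ONTOLOGY.get(concept)
    if successors is None:                           # bottom level: a sensor feature
        leaf = d_num if LEAF_TYPE[concept] == "num" else d_symb
        return leaf(ui, uj, concept, data)
    return sum(w * d_onto(ui, uj, child, data) for child, w in successors.items())

print(d_onto(patients[0], patients[1], "CHD", patients))     # 0.7*1.0 + 0.3*0.0 = 0.7
```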

It is easy to prove that the proposed similarity function satisfies the three classic conditions known as the metric axioms (the identity and symmetry axioms follow directly from the properties of the absolute value; the triangle inequality may be proved by induction) [10]. Thus, this function can be used in the k-NN method as a distance measure.
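As a sketch of the inductive step for the triangle inequality (assuming, as the induction hypothesis, that every subordinate distance \(d_\mathrm{onto}(\cdot ,\cdot ,C_s)\) already satisfies it and that all weights \(w_s\) are non-negative):

$$\begin{aligned} d_\mathrm{onto}(u_i,u_k,C)&=\sum _{s \in S} w_s \, d_\mathrm{onto}(u_i,u_k,C_s) \\&\le \sum _{s \in S} w_s \left[ d_\mathrm{onto}(u_i,u_j,C_s)+d_\mathrm{onto}(u_j,u_k,C_s)\right] \\&=d_\mathrm{onto}(u_i,u_j,C)+d_\mathrm{onto}(u_j,u_k,C). \end{aligned}$$

The same argument applies at the bottom level, since \(d_\mathrm{num}\) and \(d_\mathrm{symb}\) inherit the triangle inequality from the absolute value.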

4 Experiments

Using the proposed ontology-based metric, experiments were performed with k-NN classifiers. For the HOLTER_I data set, the myocardial ischaemia ontology presented in Fig. 1 was used. For testing the HOLTER_II data set, the ontology was slightly modified to adapt its concepts to the features available in that set (see Fig. 3). The modification was necessary because two different ECG monitors, generating slightly different parameters, were used to collect the data for the two sets. The SOFA (Simple Ontology Framework API) Java library [2] was used to represent the ontology models.

To compare the efficiency of the classic k-NN classifier with that of the classifier using the proposed similarity metric, four types of tests were run: E1, E2, E3 and E4, described in Tables 4 and 5 for the data sets HOLTER_I and HOLTER_II, respectively. In the experiments, the implementation of k-NN was provided by the WEKA system [15], with the authors' adaptation to the ontology-based metric. The parameter k (the number of neighbours taken into consideration) was set at 3 for the HOLTER_I set and 5 for the larger HOLTER_II set. These values were chosen experimentally to give the best results. However, taking into account that k should be an odd number and that a widely known rule of thumb [17] suggests \(k = \sqrt{n}\) (where n is the number of samples) as a reasonable value, the search for the optimal value was started from \(k=7\) for the HOLTER_I set and \(k=13\) for the HOLTER_II set.
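The experiments themselves were run in WEKA (Java); purely as an illustration of how a custom metric plugs into k-NN, the sketch below gives a simple majority-vote k-NN in Python that accepts an arbitrary distance function, such as the d_onto sketched earlier. It is our illustration, not the authors' adaptation of WEKA.

```python
# Illustrative majority-vote k-NN with a pluggable distance function (our sketch,
# not the authors' WEKA adaptation).
from collections import Counter

def knn_predict(test_obj, train_set, k, dist):
    """Return the majority decision among the k training objects closest under `dist`."""
    neighbours = sorted(train_set, key=lambda u: dist(test_obj, u))[:k]
    return Counter(u["dec"] for u in neighbours).most_common(1)[0][0]

# Example use with the ontology-based metric sketched earlier (k = 3, as for HOLTER_I):
# prediction = knn_predict(new_patient, patients, k=3,
#                          dist=lambda x, y: d_onto(x, y, "CHD", patients))
```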

Fig. 2 Diagram of the nested cross-validation

The individual tests differed in the metric used (the Euclidean distance or the metric based on the ontologies from Figs. 1 and 3, comprising 31 concepts) and in the method of determining the ontology concept weights (defined by an expert or randomly generated with a Monte Carlo method). Given the low number of items in the HOLTER_I set, the classification quality was evaluated with n-fold cross-validation known as leave-one-out (LOO), where the number of iterations equals the total number of objects [16, 18]. For the larger HOLTER_II set, the standard tenfold cross-validation (10-CV) [9] was used, except in the last experiment E4, where nested cross-validation (nested CV) [32] was used. With the nested technique, external validation was performed with the LOO method (for HOLTER_I) and 10-CV (for HOLTER_II). In each training set, 100 ontology models with randomly defined weights were generated; subsequently, the model with the highest accuracy (ACC), determined with the inner 10-CV technique, was selected for external testing. The final result is the average over all tests. Figure 2 presents the diagram of the nested cross-validation performed. The structure and results of the experiments are set forth in Tables 4 and 5 for the data sets HOLTER_I and HOLTER_II, respectively.
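A schematic sketch of the E4 procedure described above (an outer CV in which, for each training part, 100 randomly weighted ontology models are generated and the best one under an inner 10-CV is passed to the external test) follows; the helper functions passed as arguments are hypothetical placeholders for the actual WEKA/SOFA-based implementation.

```python
# Schematic sketch of the nested cross-validation used in experiment E4. The helper
# functions passed in are hypothetical placeholders for the WEKA/SOFA-based code:
#   outer_folds(data)               -> yields (train, test) pairs (LOO or 10-CV externally)
#   random_model()                  -> an ontology with randomly drawn weights
#   inner_score(model, train)       -> inner 10-CV accuracy of k-NN with that model
#   outer_score(model, train, test) -> accuracy of k-NN trained on `train`, tested on `test`

def nested_cv(data, outer_folds, random_model, inner_score, outer_score, n_models=100):
    scores = []
    for train, test in outer_folds(data):
        candidates = [random_model() for _ in range(n_models)]
        best = max(candidates, key=lambda m: inner_score(m, train))
        scores.append(outer_score(best, train, test))
    return sum(scores) / len(scores)                 # final result: average over outer folds
```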

Table 4 Results of experiments run with the use of the proposed ontology-based similarity metric for the prediction of coronary stenosis in CHD—the HOLTER_I data set
Table 5 Results of experiments run with the use of the proposed ontology-based similarity metric for the prediction of coronary stenosis in CHD—the HOLTER_II data set

5 Conclusions

For both data sets, the k-NN method supported by the ontology-based metric gives significantly higher accuracy than the same method supported by the Euclidean metric. For the HOLTER_I set and the ontology-based metric, an interesting observation is the minor difference in accuracy between the procedure with expert-defined weights and the one with randomly generated weights, which suggests that the domain knowledge-based selection of the concept ontology is much more important than the selection of the weights assigned to the concepts. On the other hand, the results of experiment E4 (where weights are repeatedly drawn at random and only the best ones are used in the model) indicate that, apart from the appropriate selection of concepts, appropriate weight allocation may additionally improve the accuracy of classification. The superiority of the automated (random) weight allocation is most probably attributable to the difficulty a human expert faces when trying to simultaneously and accurately evaluate numerical weights for such a large number of ontology concepts (here 31). The exact values of, and differences between, the weights determined by the expert and those generated automatically are shown in Figs. 1 and 3. Moreover, arbitrarily assuming 0.3 as a threshold for the difference between weight values, it can be observed that for both sets the expert overestimated “electrophysiological tests” and HRV and underestimated “laboratory tests”, ST and “stimulants”. The logical conclusion is therefore to reduce the weight of “electrophysiological tests” in favour of “laboratory tests” and the weight of HRV in favour of ST. As seen in Tables 4 and 5, for the larger data set the accuracy of the proposed classifier is 24% higher than the accuracy of the classic k-NN classifier. Moreover, when compared with other classification methods examined by the authors [10], the method proposed herein is the most effective in the prognosis of the occurrence of material coronary stenosis in myocardial ischaemia (see Table 6).

Fig. 3 CHD ontology with expert-assigned weights for the HOLTER_II set (for comparison, the values in parentheses show the weights determined with the Monte Carlo method)

Table 6 Classification accuracy comparison for selected methods and the data sets HOLTER_I and HOLTER_II [10]

Given the relatively high computational complexity of the k-nearest neighbour method, its use with the ontology-based metric is only feasible if classification is based on a low number of objects. However, compared with the classic approach (using, for instance, the Euclidean distance), it proves better owing to a significant reduction in the number of features: the ontology developed by a domain expert enabled the number of features to be reduced from the 595 available in the set to just 20, thus materially shortening the computation time. Apart from computational complexity and memory requirements, another shortcoming of the proposed method is, as for now, the lack of a mechanism for verifying ontology quality. An interesting direction for further research also appears to be the use of the proposed ontology-based semantic metric in clustering problems, with tools such as the c-means or hierarchical methods.

Since machine learning methods, and especially the latest deep learning approaches, lack the desired property of explainability [13, 19], we think that the presented concept ontology could also prove useful in the process of building self-explanatory artificial neural networks supported by domain knowledge.

Another interesting idea would be to use a fuzzy ontology [11]. That way, domain knowledge based on threshold values (as in Tables 2 and 3) could be safely introduced into the model. Those values could be used to divide the numerical attributes into intervals, creating a set of symbolic values. Without the fuzzy approach, this could lead to classification errors for samples with values close to the interval borders; with a fuzzy ontology there is no such risk. For example (see Table 2), a patient who is 64 years old would be treated by the model differently from one who is 40, even though a crisp discretization would label both with the same symbolic value of low risk. Such an approach seems worth attention because it can increase the scope of the domain knowledge used and thus further increase the accuracy of the prediction.
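To make the fuzzy idea concrete, the sketch below shows a simple piecewise-linear membership function for a “low risk” age interval; the boundary values (55 and 65 years) are illustrative assumptions, not thresholds taken from Table 2.

```python
# Illustrative sketch of the fuzzy idea: instead of a crisp cut-off, membership in the
# "low risk" age interval decreases gradually near the border. The boundary values
# (55 and 65 years) are illustrative assumptions, not thresholds from Table 2.

def low_risk_age_membership(age, full_until=55.0, zero_from=65.0):
    """Degree (0..1) to which `age` belongs to the 'low risk' interval."""
    if age <= full_until:
        return 1.0
    if age >= zero_from:
        return 0.0
    return (zero_from - age) / (zero_from - full_until)   # linear slope in between

print(low_risk_age_membership(40))   # 1.0 -> clearly low risk
print(low_risk_age_membership(64))   # 0.1 -> barely low risk, treated differently by the model
```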