
Knowledge-Based Systems, Volume 206, 28 October 2020, 106374

Fast anytime retrieval with confidence in large-scale temporal case bases

https://doi.org/10.1016/j.knosys.2020.106374

Abstract

This work is about speeding up retrieval in Case-Based Reasoning (CBR) for large-scale case bases (CBs) comprised of temporally related cases in metric spaces. A typical example is a CB of electronic health records, where the consecutive sessions of a patient form a sequence of related cases. k-Nearest Neighbors (kNN) search is a widely used algorithm in CBR retrieval. However, brute-force kNN is prohibitively expensive for large CBs. As a contribution to efforts for speeding up kNN search, we introduce an anytime kNN search methodology and algorithm. Anytime Lazy kNN finds exact kNNs when allowed to run to completion, with a remarkable gain in execution time achieved by avoiding unnecessary neighbor assessments. For applications where the gain in exact kNN search does not suffice, it can be interrupted earlier, returning best-so-far kNNs together with a confidence value attached to each neighbor. We describe the algorithm and the methodology to construct a probabilistic model that we use both to estimate confidence upon interruption and to automate interruption at desired confidence thresholds. We present the results of experiments conducted with publicly available datasets. The results show superior gains compared to brute-force search: we reach an average gain of 87.18% at 0.98 confidence and of 96.84% at 0.70 confidence.

Introduction

Industrial-scale machine learning (ML) systems have to deal with larger amounts of digital data every day due to the exponential growth of both its generation and its availability [1]. As members of the instance-based learning subdivision of the larger ML family, many Case-Based Reasoning (CBR) systems are not exempt from this laborious opportunity either. Reminiscent of human thinking, CBR is based on two assumptions observed in the real world: that similar problems have similar solutions, and that problems are likely to recur [2]. Hence, it stores past problem-solving experiences as cases in its case base (CB), and when a new query is made to the system, it retrieves similar past problems from its CB and reuses their solutions by adapting them to the query [3], [4]. This type of reasoning is known as lazy learning in the ML literature, since a CBR system does not build a model prior to a query – as opposed to eager learning methods, which do so – and generalizes its cases every time a query is posed to the system. This behavior is an advantage of CBR for continuously changing large CBs, since it eliminates the need to re-train learned models on the updated data. However, due to its lazy nature, the efficiency of the retrieval phase affects overall system performance. In practice, a growing CB eventually causes the so-called swamping utility problem [5], [6], which emerges when adding new cases to a CB degrades system efficiency instead of improving it.

Being simple and effective, k-Nearest Neighbor (kNN) search is a widely used algorithm in CBR retrieval in particular and in instance-based learning in general. The naïve approach to finding the kNNs of a query is to perform a brute-force search in the CB, evaluating the similarity of each case to the given query and returning the k most similar cases. The runtime complexity of this method may be acceptable for small CBs, but it implies an excessive execution time for large-scale CBs due to expensive similarity calculations and is likely to evolve into the above-mentioned utility problem. There has been significant research on speeding up nearest neighbor search (NNS), some of which we review in the next section. For occasions where even the sped-up exact search is not computationally feasible, some efforts have resorted to approximation methods that find sufficiently approximate neighbors instead.
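To make the baseline concrete, here is a minimal Python sketch of brute-force kNN retrieval (our own naming; `sim` stands in for an arbitrary, possibly expensive, similarity metric over the case domain):

```python
import heapq
from typing import Callable, List, Sequence, Tuple

def brute_force_knn(query, case_base: Sequence, k: int,
                    sim: Callable[[object, object], float]
                    ) -> List[Tuple[float, object]]:
    """Assess every case in the CB and return the k most similar ones."""
    # O(|CB|) similarity calculations, each of which may be expensive.
    scored = ((sim(query, case), case) for case in case_base)
    # Key on the similarity alone so that cases need not be comparable.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

Every case is assessed exactly once here; speed-up techniques, including ALK, aim precisely at avoiding as many of these assessments as possible.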

As a contribution to these efforts, and to address both exact and approximate NNS, this article introduces an anytime kNN search algorithm, Anytime Lazy kNN (ALK). We base our algorithm on the fast exact kNN algorithm Lazy kNN [7] and extend it to a fully-fledged anytime algorithm. ALK finds exact kNNs when allowed to run to completion; otherwise, if interrupted, it returns best-so-far kNNs together with a confidence value attached to each neighbor. Confidence values reflect the expected qualities of the approximate kNNs in terms of their similarities to the query compared to those of the exact kNNs. The proposed algorithm is also resumable, and the confidence values of approximate results increase over allocated time. Furthermore, we provide a means of confidence prediction to automate the interruption of the algorithm by trading time for confidence in the output.
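The anytime contract itself can be pictured with a generator-style sketch (a simplification under our own naming: it illustrates only interruptibility and resumability, not ALK's actual strategy for skipping assessments nor its confidence model, which Section 3 details):

```python
import heapq
from typing import Callable, Iterator, List, Sequence, Tuple

def anytime_knn(query, case_base: Sequence, k: int,
                sim: Callable[[object, object], float]
                ) -> Iterator[List[Tuple[float, int]]]:
    """Yield best-so-far kNNs after every assessment; the caller may stop
    iterating (interrupt) at any point and pick up later (resume)."""
    best: List[Tuple[float, int]] = []  # min-heap of (similarity, case index)
    for i, case in enumerate(case_base):
        s = sim(query, case)
        if len(best) < k:
            heapq.heappush(best, (s, i))
        elif s > best[0][0]:
            heapq.heapreplace(best, (s, i))
        yield sorted(best, reverse=True)  # most similar first
```

A caller with a time budget simply advances the iterator until the budget expires; driving it to exhaustion yields the exact kNNs.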

ALK is as efficient in exact kNN search as the original Lazy kNN in terms of avoiding the unnecessary distance calculations that a standard brute-force search would carry out, and it can save significant execution time. Additionally, and as the main contribution of this work, we show that it reaches superior gains even when it is interrupted at very high confidence thresholds, implying that we are very close to the exact kNNs. ALK thus gives the expert both the option to wait for the completion of the algorithm to obtain exact kNNs and the option to interrupt the search – manually and/or automatically – any time a prompter response is needed and get best-so-far kNNs instead. In the latter case, the expert may also opt to resume the algorithm to obtain an output with higher confidence in the approximate results.

Anytime Lazy kNN, like its predecessor, excels specifically in domains where the CB can be organized as sequences of temporally related cases and the similarity metric takes into account the evolution of a sequence instead of treating each case individually. A good example of such a domain is healthcare, where the electronic health record of a patient represents the sequence of his/her consecutive sessions, and each new session is typically an update to this sequence. A search for patients with similar medical histories should therefore consider their session sequences. Depending on whether the whole medical history or only a part of it is queried, the similarity metric would use a time window encompassing the complete sequence or a subsequence of the health record, respectively. Another natural example is a time series (TS) dataset, where each instance is a sequence of temporally observed data. Here, each data point is essentially an update to the sequence, and to assess the similarity between a query and a TS sequence in the dataset, the time window can cover the sequence fully or partially. A simplified sketch of such a windowed sequence metric follows.
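The sketch below assumes a per-case similarity `case_sim` and averages it over the last `window` updates of each sequence; this is an illustrative metric of our own, not the specific one used in the paper:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CaseSequence:
    """A temporally related group of cases, e.g. one patient's sessions."""
    cases: List[object] = field(default_factory=list)  # ordered by time

    def update(self, case) -> None:
        # Each new session/data point is an update extending the sequence.
        self.cases.append(case)

def windowed_similarity(query: CaseSequence, candidate: CaseSequence,
                        window: int,
                        case_sim: Callable[[object, object], float]) -> float:
    """Compare the evolution of two sequences inside a time window."""
    q_win, c_win = query.cases[-window:], candidate.cases[-window:]
    n = min(len(q_win), len(c_win))
    if n == 0:
        return 0.0
    # Align the windows on their most recent n cases and average.
    return sum(case_sim(q, c) for q, c in zip(q_win[-n:], c_win[-n:])) / n
```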

This article is organized as follows. Section 2 gives the background for our proposal. In Section 2.1, we first briefly review the main approach in the CBR community to overcoming the utility problem, and then mention various methods developed so far to speed up NNS in instance-based learning in general. Section 2.2 clarifies the concepts we use throughout this paper, describing the organization of a CB for our domain of interest. In Section 2.3, we present information on the components and desired characteristics of anytime algorithms. We present the details of our proposed anytime kNN search methodology and the Anytime Lazy kNN algorithm itself in Section 3. Section 4 describes how we evaluated our algorithm, and Section 5 gives the highly encouraging results of the experiments we conducted with real-world, small- to large-scale time series datasets. Finally, we discuss the outcomes and future work in Section 6.

Section snippets

Related work

Besides the obligation to deal with the large-scale data that comes with ever-growing CBs, the current availability of tools to interpret big data is also encouraging CBR researchers to work on systems that could benefit from hundreds of millions of cases (e.g. [8], [9]). Working with CBs of this scale could not be imagined until recently. Quite the contrary, until today the main approach in the CBR community to tackle this problem has been to control CB growth via case base maintenance (CBM)…

Anytime Lazy kNN

The main difficulty in converting an exact kNN search to an anytime algorithm lies in the quality assessment of the best-so-far neighbors. When the search is interrupted, we would like to compare the similarities of the approximate and the exact kNNs to the query. However, it is impossible to build an accurate quality measure for such an assessment. The reason is obvious: the exact kNNs remain unknown until the end of the search, and even though we might have already found them earlier, we cannot be aware of this…
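Quality can therefore only be measured post hoc, once the search has been allowed to finish; at interruption time it must be estimated, which is the role of the probabilistic model introduced in this section. A sketch of one plausible post-hoc quality measure, reflecting our reading of quality as the similarity of a best-so-far neighbor relative to that of the corresponding exact neighbor (similarities assumed to lie in [0, 1]):

```python
from typing import List

def post_hoc_quality(best_so_far_sims: List[float],
                     exact_sims: List[float]) -> List[float]:
    """Quality of the i-th best-so-far neighbor w.r.t. the i-th exact NN.

    Computable only after the search completes; during the search these
    values can only be *estimated* by the fitted probabilistic model."""
    return [b / e if e > 0 else 1.0
            for b, e in zip(best_so_far_sims, exact_sims)]
```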

Evaluation methodology

There have been three main goals in the development of ALK: (1) to be able to interrupt the kNN search and get best-so-far kNNs when exact kNN search is not feasible; (2) to attach to the best-so-far kNNs confidence values that indicate how much we can trust each one of them in the reasoning process; and (3) to be able to automate interruption upon reaching given confidence thresholds. The previous section detailed the steps of how we developed such an anytime algorithm and a confidence measure…
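Goal (3) admits a compact sketch: drive an interruptible search until the estimated confidence of every best-so-far neighbor clears the threshold. Here `search` and `estimate_confidence` are hypothetical stand-ins for ALK's interruptible iterator and its fitted probabilistic model:

```python
from typing import Callable, Iterator, List, Tuple

def run_until_confident(search: Iterator[List[Tuple[float, int]]],
                        estimate_confidence: Callable[[Tuple[float, int]], float],
                        threshold: float) -> List[Tuple[float, int]]:
    """Automate interruption at a given confidence threshold."""
    best: List[Tuple[float, int]] = []
    for best in search:
        # Interrupt as soon as every neighbor is trusted at least `threshold`.
        if best and all(estimate_confidence(nbr) >= threshold for nbr in best):
            break
    return best  # best-so-far kNNs; exact if the search ran to completion
```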

Results

In this section, we provide the results of the average gains in similarity assessments upon interruption and the average efficiency of confidence estimation along the experiments, to assess whether we met our design goals for ALK. Precisely, at each interruption, the similarities of the best-so-far kNNs to the query, and the confidence μ together with its deviation σ for each member of the kNNs, were recorded. Finally, we let the algorithm finish and obtained the similarities of the exact kNNs to the query…
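Gain here is measured against brute force, i.e. as the fraction of similarity assessments avoided. As a worked example of the averages reported below (the CB size of 10,000 is an arbitrary figure of ours, chosen only to make the arithmetic concrete):

```python
def gain(assessments_made: int, brute_force_assessments: int) -> float:
    """Fraction of similarity assessments avoided vs. brute-force search."""
    return 1.0 - assessments_made / brute_force_assessments

# 1,282 of 10,000 assessments -> 0.8718, i.e. the 87.18% average gain
# reported at the 0.98 confidence threshold; 316 of 10,000 -> 0.9684,
# matching the 96.84% average at the 0.70 threshold.
print(gain(1_282, 10_000), gain(316, 10_000))
```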

Discussion and conclusions

In this work, we introduced Anytime Lazy kNN (ALK), an anytime exact and approximate kNN algorithm to boost CBR retrieval in large-scale CBs of temporally related cases, e.g. a CB of electronic health records of patients. Our algorithm is based on Lazy kNN [7], an effective exact kNN algorithm in the CBR literature for such domains. However, for some applications, the notable speed-up provided by this algorithm may not suffice and the execution time for exact kNN search may still be intolerable.

CRediT authorship contribution statement

Mehmet Oğuz Mülâyim: Development and design of the methodology, implementation of the proposed algorithm, writing and revising the manuscript. Josep Lluís Arcos: Development and design of the methodology, implementation of the proposed algorithm, writing and revising the manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank all contributors and maintainers of the UEA & UCR Time Series Classification Repository. This work has been funded by the project Playing and Singing for the Recovering Brain: Efficacy of Enriched Social-Motivational Musical Interventions in Stroke Rehabilitation (Play&Sing), Spain, 201729.31, Fundació La Marató de TV3, Spain; and by the project Innobrain, Spain, COMRDI-151-0017 (RIS3CAT comunitats), and Feder, Spain funds. Mehmet Oğuz Mülâyim is a Ph.D. student of the doctoral…

References (48)

• B. Smyth et al., The utility problem analysed: A case-based reasoning perspective
• M.O. Mülâyim et al., Perks of being lazy: Boosting retrieval performance
• V. Jalali et al., Harnessing hundreds of millions of cases: Case-based prediction at industrial scale
• B. Smyth et al., Remembering to forget: A competence-preserving case deletion policy for case-based reasoning systems
• D.B. Leake et al., Introduction to the special issue on maintaining case-based reasoning systems, Comput. Intell. (2001)
• J.M. Juarez et al., Maintenance of case bases: Current algorithms after fifty years
• D.C. Wilson et al., Maintaining case-based reasoners: Dimensions and directions, Comput. Intell. (2001)
• A.S. Arefin et al., GPU-FS-kNN: A software tool for fast and scalable kNN computation using GPUs, PLoS ONE (2012)
• V. Garcia et al., Fast k nearest neighbor search using GPU
• J. Kolodner, Retrieving events from a case memory: A parallel implementation, in: Proc. of 1988 Case-Based Reasoning, ...
• S. Wess et al., Using k-d trees to improve the retrieval step in case-based reasoning
• P.N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces
• A.M. Kibriya et al., An empirical comparison of exact nearest neighbour algorithms
• R. Bellman, Dynamic Programming (1957)