Improving biomedical signal search results in big data case-based reasoning environments

https://doi.org/10.1016/j.pmcj.2015.09.006

Abstract

Time series subsequence matching is important in a variety of areas of healthcare informatics, including case-based diagnosis and treatment and the discovery of trends among patients. However, few medical systems employ subsequence matching because of its high computational and memory complexity. This paper proposes a randomized Monte Carlo sampling method that broadens search criteria with minimal increases in computational and memory complexity over R-NN indexing. Information gain improves while result sets approximate the theoretical result space; the number of query results increases by several orders of magnitude; and recall improves with no significant degradation of precision relative to R-NN matching.

Introduction

Medical case-based reasoning (CBR) is a well-studied approach to medical diagnosis and treatment [1], [2], [3]. These systems rely on data mining and machine learning techniques to derive decisions on new cases from a knowledge base of previous cases. One major drawback of CBR systems is that the knowledge base must contain relevant cases to decide a current case correctly [3], [4]. This is especially difficult for high dimensional measured signals, such as electrocardiogram (ECG) [5] and accelerometer signals, which exhibit high variability due to noise, sensor displacement, misuse, and other factors that are difficult to control [6]. Systems aim to use such information to classify signals and patients accurately for medical diagnosis [7], dealing with these complexities as well as the potential issue of missing data [8]. A CBR system based on high dimensional measured signals must be extremely large to account not only for the variability of patients but also for the variability of the signal type. The memory and computational complexity of such systems can limit the information gain they provide. Indeed, as medical devices produce larger quantities of data more frequently, efficient search becomes paramount for identifying important information and retrieving useful, related signals. This work investigates not only the quality of information presented by the signal search engine developed, but also the speed with which the system returns results, for usefulness in a case-based reasoning environment.

High dimensional subsequence matching, or R nearest neighbor (R-NN) search, is the process of finding similar segments within a database of high dimensional measured signals. A match is defined as any two segments $u, v \in S$ such that $\mathrm{dist}(u, v) \le R$, where $S$ is the search space of biomedical signals, $\mathrm{dist}$ is a measure of distance between two signals (such as Euclidean distance), and $R$ is a predefined threshold. In practice, $R$ tends to be relatively small, leading to homogeneous result sets. While results may be precise, they offer little information gain. Of course, result sets can be enlarged by increasing $R$. However, arbitrarily increasing $R$ can destabilize the result set, rendering it meaningless [9].
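To make the definition concrete, the following minimal Python sketch tests whether two equal-length segments match under the Euclidean distance and a threshold $R$. The function name `is_match` and the example data are illustrative, not from the paper:

```python
import numpy as np

def is_match(u: np.ndarray, v: np.ndarray, R: float) -> bool:
    """True if u and v are an R-NN match: dist(u, v) <= R under the l2 norm."""
    return bool(np.linalg.norm(u - v) <= R)

# Example: two noisy copies of the same sine segment match for a modest R.
t = np.linspace(0, 2 * np.pi, 128)
u = np.sin(t)
v = u + np.random.default_rng(0).normal(0.0, 0.05, t.size)
print(is_match(u, v, R=1.0))  # True: the noise keeps v within distance R of u
```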

This paper presents a randomized Monte Carlo approach for improving R-NN search results. This approach enlarges search results while ensuring precision and yielding higher relative information gain. The method is built upon two assumptions: time series databases are extremely large [10], and result sets follow a Gaussian distribution [11], [12]. The proposed method consists of two steps. First, a query segment $q$ undergoes $m$ randomizations, constructing a set $Q$ of query segments where $|Q| = m$. Next, an R-NN search for each $u \in Q$ is performed using the $\ell_2$ norm (Euclidean distance). The Euclidean distance between $q$ and all segments $u \in Q$ follows a Gaussian distribution with mean $\mu_Q$ and standard deviation $\sigma_Q$ determined by the randomization.
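A minimal sketch of this two-step procedure is given below, assuming Gaussian perturbations with standard deviation `sigma_r` and an abstract `r_nn_search` callable standing in for the underlying R-NN index; both names, and the `(segment_id, segment)` return convention, are assumptions for illustration rather than the authors' API:

```python
import numpy as np

def monte_carlo_query(q, r_nn_search, R, m=50, sigma_r=0.1, rng=None):
    """Pool the R-NN results of m Gaussian randomizations of the query q.

    r_nn_search(u, R) is assumed to return an iterable of
    (segment_id, segment) pairs whose distance to u is at most R.
    """
    rng = np.random.default_rng() if rng is None else rng
    pooled = {}
    for _ in range(m):
        u = q + rng.normal(0.0, sigma_r, size=q.shape)  # step 1: randomize query
        for seg_id, seg in r_nn_search(u, R):           # step 2: R-NN search
            pooled[seg_id] = seg                        # de-duplicate across searches
    return pooled
```

Pooling over the $m$ randomized queries is what enlarges the result set relative to a single R-NN search on $q$ alone.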

Several R-NN methods exist in the literature, including spatial indexes [13], [14], [15] and Locality Sensitive Hashing (LSH) [16], [17], [18]. This paper utilizes LSH as the underlying hash-based nearest neighbor search algorithm, but the theoretical contributions are applicable to most R-NN methods. The optimizations proposed here, however, were designed predominantly for an LSH scheme.
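For reference, a bare-bones hash function of the standard p-stable family commonly used for $\ell_2$ LSH is sketched below. This is a generic illustration of the kind of scheme cited above; the paper's actual index construction and parameters are not reproduced here:

```python
import numpy as np

class L2Hash:
    """One p-stable LSH function h(x) = floor((a . x + b) / w) for the l2 norm."""

    def __init__(self, dim, w=4.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.a = rng.normal(size=dim)   # random projection direction
        self.b = rng.uniform(0.0, w)    # random offset within one bucket width
        self.w = w                      # bucket width

    def __call__(self, x):
        return int(np.floor((self.a @ x + self.b) / self.w))

# Segments that are close in l2 distance collide in the same bucket
# with high probability, so a bucket lookup approximates an R-NN search.
h = L2Hash(dim=128, rng=np.random.default_rng(0))
```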

Results are demonstrated both theoretically and experimentally. Experiments are run on both synthetic random walk and real-world publicly available datasets. The randomized approach increased the number of search results by several orders of magnitude over LSH alone while maintaining similar precision. Experimental databases contained tens of millions of indexed subsequences, demonstrating both correctness and scalability. Moreover, the proposed algorithm is highly parallelizable, potentially allowing for databases of a much larger scale. The results provide important information for a case-based reasoning system.

This paper extends [19] by further investigating case-based reasoning environments and by extending the scope and evaluation of the method to better represent the performance improvements; the results generated in this work also consider the wall-clock time taken to perform the tasks outlined. The rest of the paper is organized as follows: Section 2 presents the motivation and related work in signal searching; Section 3 provides the proposed method together with its theoretical proof; Section 3.4 describes the experimental set-up, with the results and related discussion in Section 4; and conclusions are given in Section 5.


Biomedical signals

Many different strategies exist for searching efficiently through biomedical time series, particularly ECG and EEG signals. The authors of [20] approach the problem from a multi-dimensional angle. However, their search uses a dynamic time warping approach to create exact matches for plantar pressure. The proposed method accounts for an extension beyond dynamic time warping, which is not truly adaptable to a large database for case-based reasoning due to its time-complexity and …

Method

The following solution, based upon previous work in [19], is founded on two assumptions:

1. Time series databases are extremely large; and
2. Result sets follow a normal distribution.

The first assumption implies that only a subset of true matches is required by a query, and therefore an accurate sampling is sufficient. The second assumption results from the finding that, in terms of classification, classes are composed of multiple sub-groupings that are Gaussian distributed [11], [12]. This …

Results

Example Monte Carlo query results are shown for the ECG, GAIT, and SYN datasets in Fig. 3. Five segments are randomly extracted and displayed. The greatest variation between segments is seen in the ECG and SYN datasets. This is unlike LSH, where each result set contained almost no variability, as shown in Fig. 4. In fact, LSH consistently retrieves a result set of size 1 (an exact match) for the SYN dataset. The poor performance of LSH for SYN is due to the randomization added during the …

Conclusions

This paper presents a Monte Carlo approximation technique for subsequence matching. The number of results from the Monte Carlo approximation is significantly increased over standard Locality Sensitive Hashing (LSH). The technique adds minimal computational complexity and ensures the precision of results. The proposed technique takes a subsequence as input, which is randomized $m$ times using a normal distribution with standard deviation $\sigma_r$. The resulting $m$ randomized …

Acknowledgment

This publication was partially supported by Grant Number T15 LM07356 from the NIH/National Library of Medicine Medical Informatics Training Program.

References (42)

  • K. Beyer et al., When is "nearest neighbor" meaningful?
  • J. Woodbridge et al., Salient segmentation of medical time series signals
  • K. Bennett et al., Density-based indexing for approximate nearest-neighbor queries
  • M. Houle et al., Can shared-neighbor distances defeat the curse of dimensionality?
  • C. Faloutsos et al., Fast subsequence matching in time-series databases
  • E. Keogh et al., Dimensionality reduction for fast similarity search in large time series databases, Knowl. Inf. Syst. (2001)
  • Y. Cai et al., Indexing spatio-temporal trajectories with Chebyshev polynomials
  • A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the International...
  • Y. Tao et al., Quality and efficiency in high dimensional nearest neighbor search
  • Q. Lv et al., Multi-probe LSH: efficient indexing for high-dimensional similarity search
  • J. Woodbridge, B. Mortazavi, M. Sarrafzadeh, A. Bui, A Monte Carlo approach to biomedical time series search, in:...