Improving biomedical signal search results in big data case-based reasoning environments
Introduction
Medical case-based reasoning (CBR) is a well-studied area for medical diagnosis and treatment [1], [2], [3]. These systems rely on data mining and machine learning techniques to derive decisions on new cases from a knowledge base of previous cases. One major drawback of CBR systems is that the knowledge base must contain relevant cases to correctly make decisions on a current case [3], [4]. This is especially difficult for high dimensional measured signals, such as electrocardiogram (ECG) [5] and accelerometer signals, which exhibit high variability due to noise, sensor displacement, misuse, and other factors that are difficult to control [6]. Systems hope to use such information to accurately classify signals and patients for an accurate medical diagnosis [7], dealing with these complexities as well as the potential issues of missing data [8]. A CBR system based on high dimensional measured signals must be extremely large to account not only for the variability among patients, but also for the variability of the signal type. The memory and computational complexity of such systems can limit the information gain provided. Indeed, as medical devices produce larger quantities of data more frequently, efficient search becomes paramount in identifying important information and retrieving useful, related signals. This work investigates not only the quality of information presented by the developed signal search engine, but also the speed with which such a system returns results, to assess its usefulness in a case-based reasoning environment.
High dimensional subsequence matching, or ε-nearest neighbor (ε-NN) search, is the process of finding similar segments within a database of high dimensional measured signals. A match is defined as any two segments x, y ∈ S such that d(x, y) ≤ ε, where S is the search space of biomedical signals, d is a measure of distance between two signals (such as Euclidean distance), and ε is a predefined threshold. In practice, ε tends to be relatively small, leading to homogeneous result sets. While results may be precise, they offer little information gain. Of course, result sets can be enlarged by increasing ε. However, arbitrarily increasing ε can destabilize the result set, rendering it meaningless [9].
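The ε-match criterion above can be sketched as follows (a minimal illustration; the function name and array representation are ours, not the paper's):

```python
import numpy as np

def is_match(x, y, eps):
    """Two equal-length segments match when their Euclidean
    distance d(x, y) falls within the threshold eps."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y)) <= eps
```

With a small ε, only near-identical segments satisfy the predicate, which is exactly why small thresholds yield precise but homogeneous result sets.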
This paper presents a randomized Monte Carlo approach for improving ε-NN search results. This approach enlarges search results while ensuring precision and yielding higher relative information gain. The method is built upon two assumptions: time series databases are extremely large [10] and result sets follow a Gaussian distribution [11], [12]. The proposed method consists of two steps. First, a query segment q undergoes n randomizations, constructing a set of query segments Q = {q_1, …, q_n} where q_i ∈ S. Next, an ε-NN search for each q_i is performed using the L2 norm (Euclidean distance). The Euclidean distance between q and each randomized segment q_i follows a Gaussian distribution with a mean and standard deviation determined by the randomization.
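The two-step procedure can be sketched as below. This is a simplified illustration, assuming additive Gaussian noise as the randomization and a brute-force Euclidean scan in place of the hash-based ε-NN search; all names and parameters are ours:

```python
import numpy as np

def randomized_queries(q, n, sigma, seed=None):
    """Step 1: perturb the query segment q with i.i.d. Gaussian noise
    n times, yielding the randomized query set Q = {q_1, ..., q_n}."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q, dtype=float)
    return [q + rng.normal(0.0, sigma, size=q.shape) for _ in range(n)]

def monte_carlo_search(q, database, n, sigma, eps):
    """Step 2: run an eps-NN search (brute force here) for each
    randomized query and pool the unique matching segment indices."""
    hits = set()
    for qi in randomized_queries(q, n, sigma):
        for idx, seg in enumerate(database):
            if np.linalg.norm(qi - np.asarray(seg, dtype=float)) <= eps:
                hits.add(idx)
    return hits
```

Because each q_i lands in a slightly different neighborhood of the search space, the pooled result set is larger and more varied than a single ε-NN query around q alone.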
Several ε-NN methods exist in the literature, including spatial indexes [13], [14], [15] and Locality Sensitive Hashing (LSH) [16], [17], [18]. This paper utilizes LSH as the underlying hash-based nearest neighbor search algorithm, but its theoretical contributions are applicable to most ε-NN methods. However, the optimizations proposed by this paper were designed predominantly for an LSH scheme.
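As a rough illustration of the hash-based lookup, the sketch below uses a standard random-projection (p-stable) LSH family for Euclidean distance; it is not the paper's exact scheme, and the class name and parameters are ours:

```python
import numpy as np

class EuclideanLSH:
    """p-stable LSH for L2: each hash is h(x) = floor((a.x + b) / w).
    Segments with small Euclidean distance tend to share buckets,
    so a query only needs to scan its own bucket."""

    def __init__(self, dim, n_hashes=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_hashes, dim))   # random projections
        self.b = rng.uniform(0.0, w, size=n_hashes)  # random offsets
        self.w = w
        self.buckets = {}

    def _key(self, x):
        x = np.asarray(x, dtype=float)
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

    def index(self, idx, x):
        """Insert segment idx into the bucket for its hash key."""
        self.buckets.setdefault(self._key(x), set()).add(idx)

    def query(self, x):
        """Return candidate segment indices sharing x's bucket."""
        return self.buckets.get(self._key(x), set())
```

In practice, multiple such hash tables are combined to trade precision against recall; the Monte Carlo randomization of this paper sits on top of whichever ε-NN backend is used.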
Results from this paper are shown both theoretically and experimentally. Experiments are run on both synthetic random walk and real-world publicly available datasets. The randomized approach increased the number of search results by several orders of magnitude over LSH alone while maintaining similar precision. Experimental databases contained tens of millions of indexed subsequences, demonstrating both correctness and scalability. Moreover, the proposed algorithm is highly parallelizable, potentially allowing for databases of a much larger scale. The results provide important information for a case-based reasoning system.
This paper is an extension of [19]: it presents a further investigation of case-based reasoning environments, and the scope and evaluation of the method are extended to better represent the performance improvements; the results generated in this work also consider the wall-clock time taken to perform the tasks outlined. The rest of the paper is organized as follows: Section 2 presents the motivation and related work in signal searching; Section 3 provides the proposed method as well as its theoretical proof; Section 3.4 describes the experimental set-up, with the results and related discussion in Section 4. Conclusions are given in Section 5.
Biomedical signals
Many different strategies for searching efficiently through biomedical time series exist, particularly for ECG and EEG signals. The authors of [20] approach the problem from a multi-dimensional angle; however, their search uses a dynamic time warping approach to produce exact matches for plantar pressure signals. The proposed method accounts for an extension beyond dynamic time warping, which is not truly adaptable to a large database for case-based reasoning due to its time complexity and
Method
The following solution, based upon previous work in [19], is founded on two assumptions:
1. Time series databases are extremely large; and
2. Result sets follow a normal distribution.
The first assumption implies that only a subset of true matches is required by a query, and therefore, an accurate sampling is sufficient. The second assumption results from the finding that, in terms of classification, classes are composed of multiple sub-groupings that are Gaussian distributed [11], [12]. This
Results
Example Monte Carlo query results are shown for the ECG, GAIT, and SYN datasets in Fig. 3. Five segments are randomly extracted and displayed. The greatest variation between segments is seen in the ECG and SYN datasets. This is unlike LSH, where each result set contained almost no variability, as shown in Fig. 4. In fact, LSH consistently retrieves a result set of size 1 (an exact match) for the SYN dataset. The poor performance of LSH for SYN is due to the randomization added during the
Conclusions
This paper presents a Monte Carlo approximation technique for subsequence matching. The number of results for the Monte Carlo approximation is significantly increased over standard Locality Sensitive Hashing (LSH). This technique adds minimal computational complexity and ensures the precision of results. The proposed technique takes a subsequence as input. The subsequence is randomized n times using a Normal distribution with standard deviation σ. The resulting randomized
Acknowledgment
This publication was partially supported by Grant Number T15 LM07356 from the NIH/National Library of Medicine Medical Informatics Training Program.
References (42)
- et al., Synergistic case-based reasoning in medical domains, Expert Syst. Appl. (2014)
- et al., Classification of physiological signals for wheel loader operators using multi-scale entropy analysis and case-based reasoning, Expert Syst. Appl. (2014)
- et al., Respidiag: A case-based reasoning system for the diagnosis of chronic obstructive pulmonary disease, Expert Syst. Appl. (2014)
- et al., Combining case-based reasoning with bee colony optimization for dose planning in well differentiated thyroid cancer treatment, Expert Syst. Appl. (2013)
- et al., Efficient processing of similarity search under time warping in sequence databases: an index-based approach, Inf. Syst. (2004)
- et al., Inductive learning for case-based diagnosis with multiple faults, Adv. Case-Based Reason. (2002)
- et al., Case-based reasoning in care-partner: Gathering evidence for evidence-based medical practice, Adv. Case-Based Reason. (1998)
- M. Nilsson, M. Sollenborn, Advancements and trends in medical case-based reasoning: An overview of systems and system...
- et al., Time invariant multi electrode averaging for biomedical signals
- et al., Clinical decision support model of heart disease diagnosis based on Bayesian networks and case-based reasoning
- When is "nearest neighbor" meaningful?
- Salient segmentation of medical time series signals
- Density-based indexing for approximate nearest-neighbor queries
- Can shared-neighbor distances defeat the curse of dimensionality?
- Fast subsequence matching in time-series databases
- Dimensionality reduction for fast similarity search in large time series databases, Knowl. Inf. Syst.
- Indexing spatio-temporal trajectories with Chebyshev polynomials
- Quality and efficiency in high dimensional nearest neighbor search
- Multi-probe LSH: efficient indexing for high-dimensional similarity search