Feature selection for unsupervised machine learning of accelerometer data physical activity clusters – A systematic review

doi:10.1016/j.gaitpost.2021.08.007

Gait & Posture

Volume 90, October 2021, Pages 120-128

https://doi.org/10.1016/j.gaitpost.2021.08.007

Highlights

• 13 studies clustering physical activity were identified through systematic review.
• Popular feature selection techniques included PCA and correlation.
• Cluster quality evaluation methods were diverse.
• Only four studies had more than 25 participants.
• Multiple feature selection methods should be assessed with large cohort data.

Abstract

Background

Identifying clusters of physical activity (PA) from accelerometer data is important for quantifying the levels of sedentary behaviour and PA associated with risks of serious health conditions, and the time spent engaging in healthy PA. Unsupervised machine learning models can capture PA in everyday free-living activity without the need for labelled data. However, there is scant research addressing the selection of features from accelerometer data. The aim of this systematic review is to summarise feature selection techniques applied in studies of unsupervised machine learning of PA measured with accelerometer-based devices, and to identify the features commonly selected through these techniques. Feature selection methods can reduce the complexity and computational burden of these models by removing less important features, and can assist in understanding the relative importance of feature sets and individual features in clustering.

Method

We conducted a systematic search of the PubMed, Medline, Google Scholar, Scopus, arXiv and Web of Science databases to identify studies published before January 2021 which used feature selection methods to derive PA clusters using unsupervised machine learning models.

Results

A total of 13 studies were eligible for inclusion within the review. The most popular feature selection techniques were Principal Component Analysis (PCA) and correlation-based methods, with k-means frequently used in clustering accelerometer data. Cluster quality evaluation methods were diverse, including both external measures (e.g. cluster purity) and internal measures (most frequently the silhouette score). Only four of the 13 studies had more than 25 participants and only four studies included two or more datasets.

Conclusion

There is a need to assess multiple feature selection methods upon large cohort data consisting of multiple (3 or more) PA datasets. The cut-off criteria (e.g. number of components, pairwise correlation value, explained variance ratio for PCA) should be expressly stated, along with any hyperparameters used in clustering.

Introduction

Accelerometer-based devices that measure raw accelerations are popular tools for physical activity researchers given their convenience and capacity to record data for several days to several weeks at a time. However, the data can be difficult to interpret in terms of meaningful outcomes. Identifying clusters from accelerometer data could facilitate quantification of types (modes) of specific physical behaviours associated with risk of serious health conditions, such as high levels of sedentary behaviour and/or low levels of purposeful physical activity, which might contribute to increased risk of chronic disease such as type 2 diabetes and cardiovascular disease [1]. This quantification would enable identification of the time spent and/or energy expended in target behaviours associated with risk of developing type 2 diabetes [2], other lifestyle diseases or conditions whose occurrence or progression is modulated by physical activity [e.g. 3], and premature all-cause and cardiovascular mortality [4].

Machine learning concerns the use of mathematical algorithms to find patterns in data (for example, habitual physical activity), to divide data into clusters (for example, by intensity of physical activity) or to identify data associated with a given output (for instance, accelerometer data associated with walking or sitting down). Machine learning algorithms are automated, learning from training data. Supervised machine learning, as the name implies, requires some human supervision, typically labelled data that it can learn from. For instance, all examples of walking in some accelerometer training data might be labelled with the same integer, so that the algorithm can use this label to learn the typical accelerometer data properties associated with it.

In contrast, unsupervised machine learning algorithms don’t rely on human input and can work without the need for any labelled data or information on outcomes [5]. An unsupervised machine learning algorithm works on its own to uncover hidden patterns or clusters (grouping similar data together). Unsupervised machine learning techniques have the advantage of learning from accelerometer data to identify meaningful clusters representing different types/categories of physical activities. This avoids the need for provision of training data labelled with activity type, which can be both time-consuming and expensive to create.

K-means [6] is a popular clustering algorithm in which the number of clusters k is either known, presumed or indicated beforehand (choosing k is the main drawback of this algorithm, although a number of techniques exist to derive possible values for k [7], [8], [9]). The centres of the clusters (centroids) are usually either all initialised randomly, or the first centroid is initialised randomly and the remaining centroids are chosen so as to be spread out as much as possible (known as k-means++ [10]). K-means then associates each data point x with the closest centroid. The set of points belonging to cluster i, with centroid μ_i, is denoted C_i: $C_{i} = \{x : ‖ x - μ_{i} ‖ \leq ‖ x - μ_{j} ‖ \ \text{for all } j\} .$

The algorithm's second step recalculates the centroid of each cluster to minimise the sum of squared Euclidean distances from the data points of that cluster to its centroid: $μ_{j} = \frac{1}{|C_{j}|} \sum_{x \in C_{j}} x$ where |C_j| is the number of points in cluster j. Repeating these two steps until the assignments stop changing minimises (locally) the sum of squared Euclidean distances from each data point to the nearest centroid.
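The two steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters and the simple random initialisation (not k-means++) are assumptions made for this example.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Two-step k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Random initialisation: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Step 1 (assignment): each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move again
        labels = new_labels
        # Step 2 (update): each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

Because the result depends on the starting centroids, such a sketch would normally be run from several seeds and the solution with the lowest sum of squared distances retained.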

DBSCAN [11] is another example of an unsupervised algorithm, which works by forming clusters around dense neighbourhoods. A point X with at least a minimum number of neighbours (N_min) within a given radius epsilon (ε), i.e. within the point's epsilon neighbourhood N_ε(X), is known as a core point (CP). The first such point is assigned to a new cluster. $N_{ε} (X) = \{j : ‖ X^{j} - X ‖ \leq ε\}, \quad X \text{ is a CP if } |N_{ε} (X)| \geq N_{\min}$

Points which are density reachable (with a neighbour which itself is a neighbour of another core point) are also added to the same cluster. Core points which aren’t density reachable are assigned to a new cluster and those points with sparse neighbourhoods (<N_min) are treated as noise. As with the choice of k for K-means, there remains the difficulty of choosing optimal values for parameters like ε or N_min although various methods have been proposed [12], [13].
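The procedure can be illustrated with a compact standard-library sketch. It assumes Euclidean distance, counts a point as a member of its own epsilon neighbourhood, and uses a brute-force neighbour search; the names `dbscan`, `eps` and `n_min` are chosen for this example only.

```python
import math

def dbscan(points, eps, n_min):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbourhood(i):
        # Epsilon neighbourhood of point i (the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = neighbourhood(i)
        if len(neighbours) < n_min:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        cluster += 1  # i is a core point that starts a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbourhood(j)
            if len(j_neighbours) >= n_min:
                queue.extend(j_neighbours)  # j is a core point, so keep expanding
    return labels
```

Unlike k-means, the number of clusters is not supplied; it follows from the choice of `eps` and `n_min`, which is why the selection of these two parameters matters so much.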

The molecular complex detection algorithm (MCODE) [14], as its name suggests, was originally intended to cluster networks of interacting biomolecules and is density-based like DBSCAN. It weights each vertex using a coefficient (c_i) that measures local neighbourhood density, where k_i is the number of vertices in the neighbourhood of vertex i and n is the number of edges in that neighbourhood (i.e. the density of the neighbourhood of vertex i, not including i itself). $c_{i} = \frac{2 n}{k_{i} (k_{i} - 1)}$

Vertices are weighted based on the highest k-core of the vertex neighbourhood, where a k-core is a graph of minimal degree k (a graph g such that deg(v) ≥ k for all v in g). The highest k-core of a graph is the central, most densely connected subgraph. Using the vertex weighted graph as an input, the algorithm recursively moves outward from the seed vertex, including vertices whose weight is above a given threshold, given as a percentage away from the weight of the seed vertex (the vertex weight percentage (VWP) parameter).
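As a small illustration of the density coefficient c_i = 2n / (k_i (k_i − 1)) only (not of the full MCODE procedure with k-cores and the VWP threshold), the sketch below computes it for one vertex of an undirected graph. The adjacency-set representation and the function name `neighbourhood_density` are assumptions made for this example, as is the convention of returning zero when a vertex has fewer than two neighbours.

```python
from itertools import combinations

def neighbourhood_density(graph, v):
    """c_i = 2n / (k_i (k_i - 1)) for vertex v of an undirected graph.

    graph maps each vertex to the set of its neighbours; v itself is
    not counted as part of its own neighbourhood.
    """
    neighbours = graph[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pair of neighbours exists; zero is an assumed convention
    # n = number of edges that connect two neighbours of v.
    n = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2 * n / (k * (k - 1))

# Triangle a-b-c with an extra vertex d attached to a.
example_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

Here vertex a has three neighbours joined by a single edge (b to c), so its coefficient is 2·1 / (3·2) = 1/3, whereas b, whose two neighbours are connected to each other, scores 1.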

An accelerometer feature is a numerical representation or function of the raw accelerometer values. There are hundreds of possible accelerometer features to choose from, for example, the dominant frequency from an accelerometer signal or its mean or maximum value for a given period of time. If there are not enough features, the model will perform poorly when trying to separate the data into clusters – for example, different physical activity intensities or types scattered evenly across clusters, so that we derive no utility from this sub-division of data. However, if too many features are used, clustering models can over-fit, limiting generalisability when applied to other datasets [15]. If a model has tailored itself too much to the data it learned from, this can negatively impact its performance on one or more new datasets (for instance if sedentary behaviour is now mixed up with moderate activity). A good model learns general widely applicable principles, which is why choice of feature is so important [16]. Too many features may also diminish the ability of the model to differentiate between physical activity clusters (the so-called ‘curse of dimensionality’ [17]). Therefore, a strategy for choosing features is necessary, firstly to reduce the complexity and computational burden of the clustering model – no-one wants to wait hours or days to process physical activity data, or not to be able to process it at all given insufficient processing power. Finding appropriate features can minimise the length of time required and the processing power needed by improving the efficiency of the machine learning model. It can also facilitate generalisation – applying the same model to other physical activity datasets and achieving similar clustering results. Finally, it enables us to better understand the role played by each feature and provides better model interpretability [18].
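To make the notion of a feature concrete, the following sketch computes the three examples mentioned above (mean, maximum and dominant frequency) for one window of single-axis samples. The function name, the dictionary keys and the use of a naive discrete Fourier transform are illustrative assumptions; real pipelines would normally use an FFT routine and many more features.

```python
import cmath
import math

def window_features(signal, sample_rate_hz):
    """Mean, maximum and dominant frequency (Hz) of one window of samples."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [s - mean for s in signal]  # remove the constant offset before the DFT
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):  # positive frequencies up to Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return {
        "mean": mean,
        "max": max(signal),
        "dominant_frequency_hz": best_bin * sample_rate_hz / n,
    }

# Synthetic 2 s window sampled at 50 Hz: a 2 Hz oscillation around an offset of 1.
example_window = [1.0 + math.sin(2 * math.pi * 2 * t / 50) for t in range(100)]
example_features = window_features(example_window, sample_rate_hz=50)
```

Each window of raw data is thereby reduced to a short feature vector, and it is these vectors, not the raw samples, that are passed to the clustering algorithm.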

Feature selection methods aim to choose, from the full set of original features, a small subset found to be most useful in clustering physical activity. One example is a correlation matrix, which maps the relationships between features, or between features and an output (for example cluster membership) [19]. The idea is that an appropriate feature subset contains features highly correlated with cluster membership yet relatively uncorrelated with each other.
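One simple way to act on pairwise correlations is a greedy filter that discards any feature too strongly correlated with one already kept. The sketch below is an assumption-laden illustration: the helper names and the 0.9 cut-off are hypothetical, and it covers only the "uncorrelated with each other" half of the idea, not correlation with cluster membership.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    features maps a feature name to its values across windows; features are
    considered in insertion order, so earlier ones take priority.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature values for five windows.
example = {
    "mean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "max": [2.0, 4.0, 6.0, 8.0, 10.5],      # almost a rescaled copy of "mean"
    "kurtosis": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to "mean"
}
selected = drop_correlated(example)
```

The outcome depends on both the threshold and the order in which features are visited, which is one reason for stating cut-off values explicitly when reporting results.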

Feature extraction also aims to reduce the number of features (dimensionality reduction) but does so by creating brand new features derived from the original feature set, for example by projecting those original features onto a new space with lower dimensionality [18] and then using these for clustering. An example of this is principal component analysis [20] which attempts to transform a large set of features into a smaller one whilst retaining as much of their information as possible. Each new feature is derived from a number of original features and the best new features (the principal components) are chosen.
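The sketch below illustrates the idea of keeping only as many principal components as are needed to reach a chosen explained variance ratio. To stay within the Python standard library it finds eigenvectors of the covariance matrix by power iteration with deflation, a technique substituted here for the eigendecomposition or SVD routine a numerical library would provide. All names and the 0.95 cut-off are assumptions for illustration, and the data are assumed not to be constant.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
            for a in range(d)]

def principal_components(rows, min_explained=0.95, n_iter=500, seed=0):
    """Return (direction, explained_variance_ratio) pairs, largest first,
    stopping once the cumulative ratio reaches min_explained."""
    rng = random.Random(seed)
    cov = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[j][j] for j in range(d))  # trace = total variance
    components, explained = [], 0.0
    while explained < min_explained and len(components) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random starting direction
        for _ in range(n_iter):  # power iteration towards the top eigenvector
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(d) for b in range(d))
        components.append((v, eigenvalue / total))
        explained += eigenvalue / total
        # Deflation: remove the direction just found before the next pass.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return components
```

Projecting each centred feature vector onto the returned directions yields the new, lower-dimensional features used for clustering. Note that each new feature mixes several original ones, so interpretability in terms of the original features is reduced.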

The aim of this study is to perform a systematic review of studies which discuss accelerometer feature selection, for the purpose of assessing physical activity and sedentary behaviour. Our review includes peer-reviewed papers aiming to cluster accelerometer data into meaningful clusters of human physical activity type e.g. walking, running or by intensity (vigorous, moderate, light, sedentary) which employed a mathematical technique to either reduce the numbers of features selected or to evaluate subsets of features against each other. Additionally, we will catalogue the numbers of participants and datasets, and the type of accelerometer device used and in what wear location (e.g. wrist, hip, back, etc.).

Section snippets

Eligibility criteria

Studies were included in this review if they fulfilled the following selection criteria:

i. Theme: studies concerning unsupervised machine learning of human physical activity.
ii. Device: raw accelerations captured at multiple time points per second on at least three orthogonal axes using wearable accelerometer-based devices (regardless of accelerometer wear-site), or mobile phones.
iii. Types of machine learning: All unsupervised methods were included, examples of unsupervised methods include k-means

Number of participant datasets and study sizes

The number of datasets per study varied between 1 dataset [22], [25], [26], [27], [28], [29], [30], [31], [33] and 5 datasets [23], with multiple datasets analysed in four studies [21], [23], [24], [32] which consisted of 2 [21], [32], 3 [24] and 5 datasets [23] (Table 2). Multiple datasets were employed to test features used for clustering physical activity across different cohorts of participant e.g. child, adult [23], [24], within different environments e.g. laboratory or free-living [23],

Discussion

The aim of this systematic review was to summarise feature selection techniques applied in studies concerned with unsupervised machine learning of accelerometer-based device obtained physical activity, and to identify commonly used features identified through these techniques. Features were usually shortlisted on the basis of previous supervised machine learning studies rather than their actual performance in clustering. Typically, they included mean, standard deviation, skewness, kurtosis,

Conclusions

Through this systematic review, we have identified a core set of features that studies to date have shortlisted for inclusion in unsupervised machine learning models for clustering physical activity data derived from raw acceleration accelerometers. The most popular methods for dimensionality reduction are PCA and correlation (either pairwise or autocorrelation). Cluster evaluation methods remain diverse, often consisting of methods used in classification which rely upon labels, while internal

Conflict of interest

The authors attest that they have no conflicts of interest to disclose. This paper reflects the viewpoints of the study authors only.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgements

The authors acknowledge the assistance of the University of Leicester Library Service in obtaining some of the papers necessary to carry out this review. We also thank the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre, and the NIHR Collaboration for Leadership in Applied Health Research and Care–East Midlands for their consent and encouragement for this review.

References (37)

K. Okkersen et al. Cognitive behavioural therapy with optional graded exercise therapy in patients with severe fatigue with myotonic dystrophy type 1: a multicentre, single-blind, randomised trial. Lancet Neurol. (2018)
P. Rousseeuw. A graphical aid to the interpretation and validation of cluster analysis. Comput. Appl. Math. (1987)
A.L. Blum et al. Selection of relevant features and examples in machine learning. Artif. Intell. (1997)
H. He et al. A wavelet tensor fuzzy clustering scheme for multi-sensor human activity recognition. Eng. Appl. Artif. Intell. (2018)
P. Jones et al. FilterK: A new outlier detection method for k-means clustering of physical activity. J. Biomed. Inform. (2020)
I.P. Machado et al. Human activity data discovery from triaxial accelerometer sensor: non-supervised learning sensitivity to feature extraction parametrization. Inf. Process. Manage. (2015)
Y. Kwon et al. Unsupervised learning for human activity recognition using smartphone sensors. Expert Syst. Appl. (2014)
B.J. Jefferis et al. Longitudinal associations between changes in physical activity and onset of type 2 diabetes in older British men: the influence of adiposity. Diabetes Care (2012)
G. Hu et al. Occupational, commuting, and leisure-time physical activity in relation to risk for type 2 diabetes in middle-aged Finnish men and women. Diabetologia (2003)
U. Ekelund et al. Dose-response associations between accelerometry measured physical activity and sedentary time and all-cause mortality: Systematic review and harmonised meta-analysis. Br. Med. J. (2019)
S. Raschka. Python Machine Learning (2016)
S.P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory (1982)
R.L. Thorndike. Who belongs in the family? Psychometrika (1953)
T. Calinsky et al. A dendrite method for cluster analysis. Commun. Stat. (1972)
D. Arthur et al. K-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Soc. Ind. Appl. Math. (2007)
M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise
A. Sawant. Adaptive methods for determining DBSCAN parameters. IJISET (2014)
J. Esmaelnejad et al. A novel method to find appropriate ε for DBSCAN. ACIIDS: Part 1. LNAI (2010)
