Review
Feature selection for unsupervised machine learning of accelerometer data physical activity clusters – A systematic review
Introduction
Accelerometer-based devices that measure raw accelerations are popular tools for physical activity researchers given their convenience and capacity to record data for several days to several weeks at a time. However, the data can be difficult to interpret in terms of meaningful outcomes. Identifying clusters from accelerometer data could facilitate quantification of types (modes) of specific physical behaviours associated with risk of serious health conditions, such as high levels of sedentary behaviour and/or low levels of purposeful physical activity, which might contribute to increased risk of chronic disease such as type 2 diabetes and cardiovascular disease [1]. This quantification would enable identification of the time spent and/or energy expended in target behaviours associated with risk of developing type 2 diabetes [2], other lifestyle diseases or conditions whose occurrence or progression is modulated by physical activity [e.g. 3], and premature all-cause and cardiovascular mortality [4].
Machine learning concerns the use of mathematical algorithms either to find patterns in data (for example, habitual physical activity), divide data into clusters (for example by intensity of physical activity) or to identify data associated with a given output (for instance, accelerometer data associated with walking or sitting down). Machine learning algorithms are automated, learning from training data. Supervised machine learning, as the name implies, requires some human supervision – typically labelled data that it can learn from - for instance, all examples of walking in some accelerometer training data might be labelled with the same integer to enable the algorithm to use this label to learn the typical accelerometer data properties associated with it.
In contrast, unsupervised machine learning algorithms don’t rely on human input and can work without the need for any labelled data or information on outcomes [5]. An unsupervised machine learning algorithm works on its own to uncover hidden patterns or clusters (grouping similar data together). Unsupervised machine learning techniques have the advantage of learning from accelerometer data to identify meaningful clusters representing different types/categories of physical activities. This avoids the need for provision of training data labelled with activity type, which can be both time-consuming and expensive to create.
K-means [6] is a popular clustering algorithm in which the number of clusters, k, is either known, presumed or indicated beforehand. Choosing k is the main drawback of this algorithm, although a number of techniques exist to derive possible values for k [7], [8], [9]. The cluster centres (centroids) are usually either all initialised randomly, or the first centroid is initialised randomly and the remaining centroids are chosen so as to be spread out as much as possible (known as k-means++ [10]). K-means then associates each data point $x$ with its closest centroid. The set of points belonging to cluster $i$, whose centroid is $\mu_i$, is denoted $C_i$:
$$C_i = \{\, x : \lVert x - \mu_i \rVert^2 \le \lVert x - \mu_j \rVert^2 \ \text{for all } j \,\}$$
The algorithm's second step recalculates the centroid of each cluster to minimise the sum of squared Euclidean distances from the data points of this cluster to the cluster centroid:
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
where $|C_i|$ is the number of points in cluster $i$. Alternating these two steps reduces the sum of squared Euclidean distances from each data point to its nearest centroid until it reaches a (local) minimum.
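The two steps described above can be illustrated with a minimal sketch in plain Python. It is not taken from any of the reviewed studies: the function name `kmeans`, the random initialisation from the data points, the handling of empty clusters and the six example feature windows are all assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Two-step k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Random initialisation (assumption): use k distinct data points as centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Step 1: assign each point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # centroids are stable, so assignments cannot change
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-feature windows (e.g. mean and standard deviation of acceleration).
windows = [(0.02, 0.01), (0.03, 0.02), (0.01, 0.01),
           (0.90, 0.40), (1.00, 0.50), (0.95, 0.45)]
labels, centroids = kmeans(windows, k=2)
```

With two well-separated groups such as these, the low-movement and high-movement windows end up in different clusters; on real accelerometer data the result depends on the initialisation and on the choice of k.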
DBSCAN [11] is another unsupervised algorithm, which works by forming clusters around dense neighbourhoods. A point (X) with at least a minimum number of neighbours (Nmin) within a given radius epsilon (ε), i.e. within the point's epsilon neighbourhood Nε(X), is known as a core point (CP). The first such point is assigned to a new cluster.
Points which are density reachable from it (those with a neighbour which is itself a neighbour of another core point) are added to the same cluster. Core points which are not density reachable from an existing cluster are assigned to a new cluster, and points with sparse neighbourhoods (<Nmin) that are not reachable from any core point are treated as noise. As with the choice of k for K-means, there remains the difficulty of choosing optimal values for parameters such as ε or Nmin, although various methods have been proposed [12], [13].
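The procedure above can be sketched in plain Python as follows. This is a simplified illustration rather than the reference implementation of [11]: the names `dbscan`, `eps` and `n_min` are hypothetical, the neighbourhood search is a brute-force scan, and a point is assumed to count as a member of its own epsilon neighbourhood.

```python
import math

NOISE = -1


def dbscan(points, eps, n_min):
    """Label each point with a cluster id, or NOISE for unreachable sparse points."""
    def neighbours(i):
        # Epsilon neighbourhood of point i (assumption: the point itself is included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < n_min:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1  # an unassigned core point starts a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # sparse point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= n_min:
                queue.extend(nbrs)  # j is also a core point, so keep expanding
    return labels


# Hypothetical 2-D feature points: two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
result = dbscan(pts, eps=1.5, n_min=3)
```

In this toy example the two dense groups form two clusters and the isolated point is labelled as noise; no value of k has to be supplied, but the outcome is sensitive to `eps` and `n_min`.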
The molecular complex detection algorithm (MCODE) [14], as its name suggests, was originally intended to cluster networks of interacting biomolecules and, like DBSCAN, is density-based. It weights each vertex using a clustering coefficient $c_i$, which is useful for determining local neighbourhood density:
$$c_i = \frac{2n}{k_i (k_i - 1)}$$
where $k_i$ is the vertex size of the neighbourhood of vertex $i$ and $n$ is the number of edges in that neighbourhood (i.e. the neighbourhood density of the vertex, not including the vertex itself).
Vertices are weighted based on the highest k-core of the vertex neighbourhood, where a k-core is a graph of minimal degree k (a graph g such that, for all v in g, deg(v) ≥ k). The highest k-core of a graph is its central, most densely connected subgraph. Using the vertex-weighted graph as an input, the algorithm recursively moves outward from the seed vertex, including vertices whose weight is above a given threshold, expressed as a percentage away from the weight of the seed vertex (the vertex weight percentage (VWP) parameter).
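As an illustration of the two graph quantities described above, the following sketch computes a vertex's clustering coefficient and the highest k-core of a small graph. It does not reproduce the full MCODE weighting or its recursive cluster expansion; the function names and the five-vertex example graph are hypothetical.

```python
def clustering_coefficient(adj, v):
    """2n / (k (k - 1)) over the neighbours of v, not including v itself."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # n: edges joining two neighbours of v, each counted once.
    n = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * n / (k * (k - 1))


def highest_k_core(adj):
    """Return (k, vertices) for the highest k-core of an undirected graph."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    k, best = 0, (0, set(remaining))
    while remaining:
        k += 1
        # Repeatedly peel off vertices whose degree is below k.
        while True:
            low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
            if not low:
                break
            for v in low:
                for u in remaining.pop(v):
                    if u in remaining:
                        remaining[u].discard(v)
        if remaining:
            best = (k, set(remaining))
    return best


# Hypothetical graph: a fully connected group a-b-c-d plus a pendant vertex e.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
core = highest_k_core(graph)
```

Here `core` is the 3-core formed by vertices a to d, while the pendant vertex e is peeled away; vertex d, whose neighbourhood includes e, has a lower clustering coefficient than vertex a.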
An accelerometer feature is a numerical representation or function of the raw accelerometer values. There are hundreds of possible accelerometer features to choose from, for example, the dominant frequency from an accelerometer signal or its mean or maximum value for a given period of time. If there are not enough features, the model will perform poorly when trying to separate the data into clusters – for example, different physical activity intensities or types scattered evenly across clusters, so that we derive no utility from this sub-division of data. However, if too many features are used, clustering models can over-fit, limiting generalisability when applied to other datasets [15]. If a model has tailored itself too much to the data it learned from, this can negatively impact its performance on one or more new datasets (for instance if sedentary behaviour is now mixed up with moderate activity). A good model learns general widely applicable principles, which is why choice of feature is so important [16]. Too many features may also diminish the ability of the model to differentiate between physical activity clusters (the so-called ‘curse of dimensionality’ [17]). Therefore, a strategy for choosing features is necessary, firstly to reduce the complexity and computational burden of the clustering model – no-one wants to wait hours or days to process physical activity data, or not to be able to process it at all given insufficient processing power. Finding appropriate features can minimise the length of time required and the processing power needed by improving the efficiency of the machine learning model. It can also facilitate generalisation – applying the same model to other physical activity datasets and achieving similar clustering results. Finally, it enables us to better understand the role played by each feature and provides better model interpretability [18].
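To make the notion of a feature concrete, the sketch below summarises one window of acceleration samples by its mean, maximum, standard deviation and dominant frequency. The function name, the choice of these four features, the naive discrete Fourier transform and the synthetic 50 Hz signal are assumptions for illustration only.

```python
import cmath
import math
import statistics


def window_features(signal, sample_rate_hz):
    """Summarise one window of acceleration samples as a small feature dict."""
    n = len(signal)
    mean = statistics.fmean(signal)
    centred = [s - mean for s in signal]
    # Naive discrete Fourier transform magnitudes for the positive frequency bins.
    magnitudes = [
        abs(sum(c * cmath.exp(-2j * math.pi * f * t / n)
                for t, c in enumerate(centred)))
        for f in range(1, n // 2 + 1)
    ]
    # Bin numbering starts at 1 because the zero-frequency bin is skipped.
    peak_bin = 1 + max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return {
        "mean": mean,
        "max": max(signal),
        "std": statistics.pstdev(signal),
        "dominant_frequency_hz": peak_bin * sample_rate_hz / n,
    }


# Hypothetical 2-second window sampled at 50 Hz: a 2 Hz oscillation around 1 g.
rate = 50
window = [1.0 + 0.3 * math.sin(2 * math.pi * 2.0 * t / rate)
          for t in range(2 * rate)]
features = window_features(window, rate)
```

Each window of raw data is thereby reduced to a handful of numbers, and it is these feature vectors, rather than the raw samples, that are passed to the clustering algorithm.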
Feature selection methods aim to choose, from the full set of original features, a small subset found to be most useful in clustering physical activity. One example is a correlation matrix, which maps the relationships between features or between features and an output (for example cluster membership) [19]. The idea is that an appropriate feature subset contains features that are highly correlated with cluster membership yet relatively uncorrelated with each other.
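One simple way to act on pairwise correlations is a greedy redundancy filter, sketched below. It addresses only the second half of the idea above (keeping features that are relatively uncorrelated with each other); the function names, the 0.9 threshold and the example values are hypothetical, and the sketch assumes no feature is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumption: neither sequence is constant


def drop_redundant(features, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature values for five windows; "max" is exactly twice "mean".
candidate_features = {
    "mean": [0.1, 0.2, 0.9, 1.0, 0.5],
    "max": [0.2, 0.4, 1.8, 2.0, 1.0],
    "dominant_frequency": [2.0, 0.5, 1.5, 0.4, 3.0],
}
selected = drop_redundant(candidate_features)
```

Because the filter is greedy, the result depends on the order in which features are considered; here "max" is dropped as redundant with "mean".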
Feature extraction also aims to reduce the number of features (dimensionality reduction) but does so by creating brand new features derived from the original feature set, for example by projecting those original features onto a new space with lower dimensionality [18] and then using these for clustering. An example of this is principal component analysis [20] which attempts to transform a large set of features into a smaller one whilst retaining as much of their information as possible. Each new feature is derived from a number of original features and the best new features (the principal components) are chosen.
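The projection idea can be sketched without external libraries by extracting only the leading principal component with power iteration. This is a simplified stand-in for a full principal component analysis: the function name, the fixed number of iterations, the starting vector and the four example rows are assumptions, and the method would fail if the starting vector happened to be orthogonal to the leading component.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the leading principal component of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred features.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each row's score is its projection onto the new axis.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
    return v, scores


# Hypothetical feature rows: the second feature is roughly twice the first,
# and the third varies very little.
rows = [(1.0, 2.0, 0.1), (2.0, 4.1, 0.0), (3.0, 5.9, 0.1), (4.0, 8.0, 0.0)]
direction, scores = first_principal_component(rows)
```

The single column of `scores` replaces the three original features while retaining most of their variance, since the new axis combines the two strongly related features.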
The aim of this study is to perform a systematic review of studies which discuss accelerometer feature selection, for the purpose of assessing physical activity and sedentary behaviour. Our review includes peer-reviewed papers aiming to cluster accelerometer data into meaningful clusters of human physical activity type e.g. walking, running or by intensity (vigorous, moderate, light, sedentary) which employed a mathematical technique to either reduce the numbers of features selected or to evaluate subsets of features against each other. Additionally, we will catalogue the numbers of participants and datasets, and the type of accelerometer device used and in what wear location (e.g. wrist, hip, back, etc.).
Eligibility criteria
Studies were included in this review if they fulfilled the following selection criteria:
- i. Theme: studies concerning unsupervised machine learning of human physical activity.
- ii. Device: raw accelerations captured at multiple time points per second on at least three orthogonal axes using wearable accelerometer-based devices (regardless of accelerometer wear-site), or mobile phones.
- iii. Types of machine learning: all unsupervised methods were included; examples of unsupervised methods include k-means
Number of participant datasets and study sizes
The number of datasets per study varied between 1 dataset [22], [25], [26], [27], [28], [29], [30], [31], [33] and 5 datasets [23], with multiple datasets analysed in four studies [21], [23], [24], [32] which consisted of 2 [21], [32], 3 [24] and 5 datasets [23] (Table 2). Multiple datasets were employed to test features used for clustering physical activity across different cohorts of participant e.g. child, adult [23], [24], within different environments e.g. laboratory or free-living [23],
Discussion
The aim of this systematic review was to summarise feature selection techniques applied in studies concerned with unsupervised machine learning of accelerometer-based device obtained physical activity, and to identify commonly used features identified through these techniques. Features were usually shortlisted on the basis of previous supervised machine learning studies rather than their actual performance in clustering. Typically, they included mean, standard deviation, skewness, kurtosis,
Conclusions
Through this systematic review, we have identified a core set of features that studies to date have shortlisted for inclusion in unsupervised machine learning models for clustering physical activity data derived from raw acceleration accelerometers. The most popular methods for dimensionality reduction are PCA and correlation (either pairwise or autocorrelation). Cluster evaluation methods remain diverse, often consisting of methods used in classification which rely upon labels, while internal
Conflict of interest
The authors attest that they have no conflicts of interest to disclose. This paper reflects the viewpoints of the study authors only.
Declaration of Competing Interest
The authors report no declarations of interest.
Acknowledgements
The authors thank the University of Leicester Library Service for assistance in obtaining some of the papers necessary to carry out this review. We also thank the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre, and the NIHR Collaboration for Leadership in Applied Health Research and Care–East Midlands for their consent and encouragement for this review.
References (37)
- et al. Cognitive behavioural therapy with optional graded exercise therapy in patients with severe fatigue with myotonic dystrophy type 1: a multicentre, single-blind, randomised trial. Lancet Neurol. (2018)
- A graphical aid to the interpretation and validation of cluster analysis. Comput. Appl. Math. (1987)
- et al. Selection of relevant features and examples in machine learning. Artif. Intell. (1997)
- et al. A wavelet tensor fuzzy clustering scheme for multi-sensor human activity recognition. Eng. Appl. Artif. Intell. (2018)
- et al. FilterK: A new outlier detection method for k-means clustering of physical activity. J. Biomed. Inform. (2020)
- et al. Human activity data discovery from triaxial accelerometer sensor: non-supervised learning sensitivity to feature extraction parametrization. Inf. Process. Manage. (2015)
- et al. Unsupervised learning for human activity recognition using smartphone sensors. Expert Syst. Appl. (2014)
- et al. Longitudinal associations between changes in physical activity and onset of type 2 diabetes in older British men: the influence of adiposity. Diabetes Care (2012)
- et al. Occupational, commuting, and leisure-time physical activity in relation to risk for type 2 diabetes in middle-aged Finnish men and women. Diabetologia (2003)
- et al. Dose-response associations between accelerometry measured physical activity and sedentary time and all-cause mortality: Systematic review and harmonised meta-analysis. Br. Med. J. (2019)
- Python Machine Learning
- Least squares quantization in PCM. IEEE Trans. Inf. Theory
- Who belongs in the family? Psychometrika
- A dendrite method for cluster analysis. Commun. Stat.
- K-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Soc. Ind. Appl. Math.
- A density-based algorithm for discovering clusters in large spatial databases with noise
- Adaptive methods for determining DBSCAN parameters. IJISET
- A novel method to find appropriate ε for DBSCAN. ACIIDS: Part 1. LNAI