Introduction

In music education, individual practice is an essential skill to teach from the earliest stage of study, and it accompanies the music student, and later the concert musician, throughout his or her career. In [1], researchers indicate that expert music performers base their work on methodical rehearsal planning and a systematic approach to problem identification, strategy selection, and evaluation, namely all components of self-regulated thinking. Consequently, the same approach is used in music education. Self-regulated learning is a cognitive model whose efficiency rests on effective planning and execution and on the review of strategies to enhance learning. Moreover, this model also includes meta-strategies that require knowledge of the nature and benefit of each component of the learning process, and the ability to understand when to use it. These strategies include, but are not limited to, goal planning, mental rehearsal, and the understanding of errors. Practice, whether individual or in a group, is a core element of music education and of a musical career, and it represents the way musicians learn outside the classroom. It is often believed that a music lesson provides enough information to understand and successfully face the issues that may arise during rehearsals. The literature, however, offers ample indication that beginner music students do not demonstrate any systematic plan during their individual practice sessions [2, 3]. Hence, often, no self-learning occurs, which inhibits the musical development and progress of students [4]. Furthermore, music students learn how to play in specific contexts (e.g., the music classroom) in which they have a vast set of resources to use but, at the same time, they face challenges in communicating with teachers, as well as a highly competitive and stressful environment in which to develop their studies and careers. 
Moreover, the traditional education model is resistant to change and difficult to adapt to students’ needs. For example, one challenge related to the traditional pedagogical method is that students have to maintain a coherent perspective between their own proprioceptive feedback and their teachers’ suggestions. Traditional music teaching, indeed, is mostly based on a dyadic relationship between teacher and student, in which the time lag between the student’s performance and the teacher’s feedback dissociates the feedback from the student’s auditory perception, as discussed in [5]. This element is even more important because the time in which this relationship takes place is often limited to weekly lessons [6]. The long periods of self-study can be harsh and lead to a solitary process that can also cause a high rate of abandonment [7]. In order to best address these challenges of music learning, it is particularly useful to consider reflective thinking and the cognitive dimension of learning. In [8], the author analyzed “how we think,” distinguishing four forms of thinking that are naturally present in the human mind. The fourth type was identified as reflective thinking, which became a pillar of what is now regarded as part of metacognition, namely self-regulated thinking [9]. In the words of [8]:

Reflection involves not simply a sequence of ideas, but a consequence – a consecutive ordering in such a way that each determines the next as its proper outcome, while each outcome in turn leans back on, or refers to, its predecessors. (Dewey 1933, p. 4)

Metacognition is generally understood as “thinking about one’s own thoughts” [10], involving “active monitoring and consequent regulation and orchestration” [11] to complete a task. Metacognition, as part of the learning process, can therefore help identify what students are thinking while they engage in a learning task. In music education curricula, traditionally, only overt behaviors have been used as a means of assessment, e.g., the final performance of a given repertoire. Evaluation of this kind is based on students’ ability to perform the repertoire correctly, taking care of pitches, rhythms, articulations, dynamics, accentuation, expressive phrasing, and so on. Rarely is the student asked to demonstrate her/his understanding of the music, or how s/he came to be able to perform a given repertoire. Nevertheless, it is important for students to be aware of what they know and to self-assess their learning process. This important aspect of music teaching is often neglected, and students are used to practicing on their own without understanding how to organize and evaluate their progress while rehearsing alone [12]. Research has demonstrated that when students have the chance to foster metacognitive skills, learning outcomes improve [12, 13]. In music education research, self-regulation strategies have been investigated as well, giving new insights into how students learn and master music materials. Nielsen [14] studied how two college musicians used self-regulation strategies to monitor their learning outcomes. The author used observation of practice behaviors, verbal reports during the practice session, and retrospective debriefing reports after the practice session to analyze self-learning. With the same purpose of studying practice habits, Hallam [1] collected interviews with master musicians and beginners. The novices were also recorded playing a new piece after a 10-min practice session. 
In their research, both authors identified students’ application of self-learning strategies during their practice sessions. The frequent repetition of segments or single notes during practice was also reported. In [2], the authors examined the practice strategies used by beginner instrumental students, from 7 to 9 years old, over a 3-year period. The studies briefly presented here provide useful insights into self-learning in music practice. Nevertheless, one more aspect has to be taken into account when investigating whether and how music technology can be efficiently applied to educational settings, namely the biomechanical aspect of music playing.

The biomechanical skills necessary for an accurate and safe performance, indeed, are often limited to subjective and vague perception and based on oral transmission of content between the teacher and the students [15]. It therefore seems reasonable to suppose that more quantitative methodologies, already tested and useful in other contexts such as sports medicine, could also be applied to understanding and teaching the biomechanical skills of music performance [16].

In particular, the background knowledge that can be helpful in understanding the motor skills needed in music performance includes motor learning theory and technology-based systems for analyzing, monitoring, and evaluating learning efficiency [17]. In motor learning theories [18], three elements are presented as essential to success: the characterization of the skills to be acquired, skills transfer between dissimilar systems, and skills acquisition without injuries. Acquiring the characterizing skills requires a scientific analysis that identifies motor patterns, such as the coordination of the neural and musculoskeletal systems. Moreover, the motor behaviors of professional players can be used as a reference model to facilitate the understanding of the essential skills to be transferred to music students. By directing attention to specific motor behaviors, learners can efficiently and effectively assimilate the skills appropriate to their technique. Following these findings, an emerging literature is increasingly interested in investigating how full-body and motion analysis technologies may enhance music performance and learning outcomes while minimizing the risk of injuries [19,20,21].

An emerging literature is focusing on how movement analysis technologies can be used to inform music performance by enhancing learning outcomes and preventing risk of injuries [19, 22]. People are often prone to making mistakes during analysis or, possibly, when trying to establish relationships between multiple features. Machine learning (ML) can often be successfully applied to these problems to enhance these cognitive processes.

In this study, we present a system able to perform automated classification of performance recordings of highly professional musicians and of students, based on the analysis of motion features for selected violin techniques and repertoire. Our aim is to understand which motion features can be exploited to efficiently distinguish a professional performance and to use such information for real-time student assistive technologies. In the literature, many ML models are able to classify information from time series and, in particular, from motion capture (MOCAP) data. These data provide a representation of the complex spatio-temporal structure of human motion. During a traditional MOCAP session, the locations of characteristic parts of the human body, such as joints, are recorded over time using appropriate devices. The literature offers several ways to handle MOCAP data, which can be grouped into two families of approaches [23, 24]. The first one, comprising traditional ML methods, needs an initial phase where the features must be manually and carefully extracted from the data [25,26,27,28,29]. The second family, which includes deep learning methods, automatically learns both features and models from the data [30, 31]. MOCAP data studies choose between the two families based on the cardinality of the sample. For small-cardinality datasets, deep learning-based methods cannot be employed since they require a huge amount of data to be reliable and to outperform traditional ML models with context-specific, experience-based engineered features. For this reason, in most studies, traditional ML techniques are employed. These techniques have already been successfully applied in the field of cognitive computation in many applications [32,33,34,35]: from sequential learning [36] to sentiment analysis, as well as data management [37] and classification [38, 39].

In this work, we exploited the recordings of four internationally renowned violin performers, selected by the Royal College of Music of London, and three novice students. The music selected encompassed 41 exercises, chosen by teachers of the Royal College of Music of London, from various sources of classical violin pedagogy literature. The exercises focused on several techniques typical of a traditional pedagogical violin program, such as handling the instrument, technique of the right and left hands, articulation studies, and repertoire pieces. We then propose combining carefully crafted features with random forest (RF) to distinguish between the two skill levels of the violinists. RF [40] is considered a state-of-the-art learning algorithm for classification purposes since it has been shown to be one of the most effective tools in this context [41, 42]. From a cognitive point of view, RF implements the wisdom of crowds principle, namely the aggregation of information in groups, resulting in decisions that are often better than those that could have been made by any single member of the group [43,44,45]. The main requirement behind this principle, which yields better results, is that there should be significant differences or diversity among the models. Many examples of the use of this principle in cognitive computation exist in the literature [44,45,46,47,48,49,50]. In accordance with the intrinsic hierarchy present in the dataset, we will study two different scenarios: extrapolation with respect to different exercises and with respect to different violinists. Furthermore, we will study which features are the most predictive of the quality of a violinist to corroborate the significance of the results. The results, both in terms of accuracy and of insight into the problem, will support the use of the proposed technique as a support tool for students to monitor and enhance their home study and practice.

The rest of the paper is organized as follows. The description of the problem and related data is reported in “Data Description”. The description of the proposed data-driven methodology is presented in “Methods.” The results of applying the methods presented in “Methods” on the problem described in “Data Description” are reported in “Experimental Results”. Finally, “Conclusions” concludes the paper.

Data Description

The data employed in the present work were collected during the H2020 ICT-TELMI Project (Footnote 1). The project studied how we learn violin playing and how technology should be designed to effectively support and enhance music instrument learning outcomes. For this reason, the collected corpus of data was designed as a collection of exercises that follows the learning path of classical violin conservatoire programs. It included several sources of data: MOCAP of the performer, violin, and bow (see Fig. 1), ambient and instrument audio and video, physiological data (electromyography) captured with a Myo sensor (Footnote 2), and Kinect data. The recordings took place at the Casa Paganini - InfoMus research center of the University of Genova (Footnote 3).

Fig. 1 The violinist MOCAP setup: markers are placed over subject, bow, and violin

The recorded material was post-processed, uploaded into the repoVizz repository, and made publicly available (Footnote 4). For a comprehensive description of the entire TELMI multimodal archive, refer to [51].

As previously mentioned, we recorded four internationally famous soloists, selected by the Royal College of Music of London, and three beginner violinists. As for the chosen literature, modern violin students have hundreds of years of pedagogical material at their disposal, much of which is freely available online, via school and public libraries, and through their teachers. This is why the music selected for the TELMI project encompassed different exercises, chosen by teachers of the Royal College of Music of London, from various sources of classical violin pedagogy literature. The exercises focused on different techniques: handling the instrument, technique of the right and left hands, articulation studies, and repertoire pieces. The exercises take three forms: (1) those sourced from the standard published catalog of exercises, including those of Schradieck, Ševčík, and Kreutzer, which a survey conducted by the TELMI consortium during the project found to be the most popular; (2) those sourced or adapted from the Associated Board of the Royal Schools of Music (ABRSM) examination syllabus; and (3) customized exercises developed by Madeleine Mitchell to address specific techniques, with a specific focus on the capabilities offered by non-notated feedback (e.g., the bowing exercises). The use of both custom and pre-existing exercises was deliberate. Due to the national and international popularity of the ABRSM system, a subset of TELMI exercises was drawn from the scales and exercises for Grades 6 to 8 to represent the intermediate level of technical development and to allow students to prepare for exams in which they may already be involved. Intonation, or tuning of the individual notes, is a core technique of violin playing. This skill is tested in every exercise that involves a notated score, and especially in the scales for beginning and intermediate students. 
As a supplement to this material, one of the most popular exercises comes from Schradieck’s School of Violin Technics (1899), in particular the first exercise from Book I: Exercises for Promoting Dexterity in the Various Positions. This exercise comprises a series of repeating scale patterns designed for careful and even control of note production. For shifting of the left hand between the seven positions on the neck of the violin, Yost’s Exercises for Changes of Position (1928) was chosen. This collection systematically tests changes of every interval on every possible shift between the seven positions (1st to 2nd, 1, etc.) on each of the four strings. The transition from one string to the next requires a rotational motion of the bow in the right hand and a shift of the fingers in the left. A classic exercise to develop this technique is Kreutzer’s String Crossing from Etude No. 13 for Solo Violin. Kreutzer’s Etude No. 14 for Solo Violin was chosen as a representative trill exercise. For the articulation exercises, the TELMI repertoire list proposed Martelé from Kreutzer’s Etude No. 7. The Sautillé technique is emphasized in several of Ševčík’s Violin Studies, Op. 3; for TELMI, Variation No. 16 was chosen. Spiccato articulation is likewise emphasized in several of Ševčík’s Violin Studies, Op. 3; for TELMI, Variation No. 34 was chosen. For the Staccato, Kreutzer’s Etude No. 4 was selected. Finally, Flesch’s System of Scales was chosen for the Arpeggios, and two repertoire pieces were chosen, one from the Romantic literature and one by a contemporary composer, namely Elgar’s (1889) Salut d’Amour, Op. 12, and Michael Nyman’s (2007) Taking it as Read.

From this archive of data, we selected one scale, one study, and one repertoire piece to start investigating automatic classification from motion capture data, namely an ABRSM Scale in G Major played detaché over 3 octaves, the String Crossing from Kreutzer’s Etude No. 13, and Salut d’Amour, Op. 12, by Elgar. On these data, we computed 14 low-level features using the EyesWeb XMI platform (Footnote 5) [52]. The computed features are as follows: mean shoulders’ velocity, shoulder low back asymmetry, upper body kinetic energy, left/right, bow–violin incidence, distance low/middle/upper bow–violin, hand–violin left/right head inclination, and left/right wrist roundness [21]. Based on the knowledge of expert players and teachers, these features fully describe the movements of the violinists. The final dataset included the 14 raw features described above for the 3 exercises played by each of the violinists. In particular, 7 violinists were recorded, of whom 4 are experts and 3 are beginners. Nineteen files were made available, containing information on the exercise performed by each musician (one exercise per class was missing). The dataset is summarized in Table 1.

Table 1 Description of the available dataset

In order to avoid noise due to the initial and final moments of the recording where musicians do not play, only timestamps where a piece of music is played are chosen.

Methods

The problem described in the previous sections can be easily mapped into a binary classification problem [23]. Let \(\mathcal {X} \subseteq \mathbb {R}^{d}\) be the input space (namely the features engineered from the different measurements), consisting of \(d\) features (see “Features Engineering”), and let \(\mathcal {Y} = \{ 0, 1 \}\) (namely expert and beginner) be the output space. Let \(\mathcal {D}_{n} = \{ (X_{1},Y_{1}), \dots , (X_{n},Y_{n}) \}\), where \(X_{i} \in \mathcal {X}\) and \(Y_{i} \in \mathcal {Y}\) \(\forall i \in \{1, \dots , n\}\), be a sequence of \(n \in \mathbb {N}^{*}\) samples drawn from \(\mathcal {X} \times \mathcal {Y}\). Let us consider a model (function) \(f: \mathcal {X} \rightarrow \mathcal {Y}\) chosen from a set \(\mathcal {F}\) of possible hypotheses. An algorithm \({\mathscr{A}}_{{\mathscr{H}}}: \mathcal {D}_{n} \times \mathcal {F} \rightarrow f\), characterized by its hyperparameters \({\mathscr{H}}\), selects a model inside a set of possible ones based on the available dataset. Note that many algorithms for solving binary classification problems exist in the literature [23], but random forest (RF) has been shown to be one of the most powerful ones [40,41,42] (see “Random Forests” and Appendix A). The error of \(f\) in approximating \(\mathbb {P}\{Y \mid X \}\) is measured by a prescribed metric \(M: \mathcal {F} \rightarrow \mathbb {R}\). Note also that many different metrics that may provide insights into the performance of the model are available in the literature for binary classification [53], but the confusion matrix together with the accuracy (since in our case the classes are balanced) is surely the most informative one. In order to tune the performance of \({\mathscr{A}}_{{\mathscr{H}}}\), namely to select the best set of hyperparameters, and to estimate the performance of the final model according to the desired metrics, model selection (MS) and error estimation (EE) phases need to be performed [54] (see “Model Selection and Error Estimation”). 
Moreover, in order to understand, from a cognitive point of view, how the algorithm exploits the derived features to make a prediction, a feature ranking phase is also performed (see “Feature Ranking”). The entire violinist skill-level classification pipeline is shown in Fig. 2.

Fig. 2 The violinist skill-level classification pipeline

Features Engineering

In this section, we describe how the features have been extracted and engineered from the raw data described in “Data Description”.

The time series of the different measurements (left head inclination, left wrist roundness, etc.) were sampled in fixed-width sliding windows of 10 s. One second of the original time series acquired from MOCAP contains 1000 rows. Each fixed-width sliding window therefore contains 1000 rows times 10 s, for a total of 10,000 rows, which is sufficient to capture cycles in human activities. Note that consecutive windows overlap by 50% in time. This heuristic has already been successfully employed in many different works in the literature [55,56,57]. In order to analyze clean data and avoid noise due to the initial and final moments of the recording, only timestamps where a piece of music is played are chosen. From each sampled window, a vector of features was obtained by computing standard measures previously employed in the literature to describe human actions [57,58,59,60], such as the mean, the signal-pair correlation, and the signal magnitude area, for both the time and frequency domains (see Table 2). The fast Fourier transform was employed to find the frequency components of each window. A new set of features was also employed in order to improve the learning performance, including the energy of different frequency bands, frequency skewness, and frequency kurtosis. Table 2 contains the list of all the measures applied to the time and frequency domain signals. This results in a total of 332 features. The resulting dataset has been made freely available to the research community (Footnote 6).

Table 2 List of measures for computing feature vectors
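
The windowing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the released dataset: the input is synthetic noise standing in for two raw MOCAP features, the helper names (`sliding_windows`, `window_features`) are hypothetical, and only a handful of the measures of Table 2 are computed, using one common definition of the signal magnitude area.

```python
import numpy as np

SAMPLE_RATE_HZ = 1000   # rows per second, as stated in the text
WINDOW_SECONDS = 10     # window width, as stated in the text
OVERLAP = 0.5           # 50% overlap between consecutive windows


def sliding_windows(signal, width, overlap):
    """Yield fixed-width windows over the rows of `signal` (rows = time)."""
    step = int(width * (1 - overlap))
    for start in range(0, len(signal) - width + 1, step):
        yield signal[start:start + width]


def window_features(window):
    """A few illustrative measures for one window of shape (rows, 2 channels)."""
    spectrum = np.abs(np.fft.rfft(window, axis=0))         # magnitude spectrum per channel
    return np.concatenate([
        window.mean(axis=0),                               # time-domain mean per channel
        [np.abs(window).sum(axis=1).mean()],               # signal magnitude area (assumed definition)
        [np.corrcoef(window[:, 0], window[:, 1])[0, 1]],   # signal-pair correlation
        (spectrum ** 2).sum(axis=0) / len(spectrum),       # spectral energy per channel
    ])


# Synthetic stand-in for two raw features recorded for 30 s (not real MOCAP data).
rng = np.random.default_rng(0)
raw = rng.normal(size=(30 * SAMPLE_RATE_HZ, 2))

width = SAMPLE_RATE_HZ * WINDOW_SECONDS                    # 10,000 rows per window
features = np.array([window_features(w) for w in sliding_windows(raw, width, OVERLAP)])
```

With 30 s of data, a 10 s window, and a 5 s step, the sketch produces five windows, each summarized by a short feature vector; the full feature set of the paper is obtained by applying all the measures of Table 2 to every raw feature.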

Random Forests

A powerful algorithm for classification, both in terms of theoretical properties and practical effectiveness [41, 42], is RF, first developed in [40]. In order to fully understand RF, we need to recall how a binary decision tree (DT) [61] is defined and constructed. A binary DT for classification is a recursive binary tree structure in which each node represents a check on a particular feature, each branch defines the outcome of the check, and the leaf nodes represent the final classification. A particular path of exploration from the root of the tree to one of its leaves represents a classification rule. Based on a recursive schema, a DT is grown until it reaches a desired depth \(n_{d}\). Each node of the DT (both the root and the internal nodes) is constructed by choosing the feature and the check that most effectively separate the data satisfying the partial rule into two subsets, based on the information gain (or possibly other metrics, such as the classification accuracy). Given this definition of DT, it is then possible to understand RF and the learning phase of each of the \(n_{t}\) DTs which compose the forest. From \(\mathcal {D}_{n}\), a bootstrap sample (sample with replacement) \(\mathcal {D}^{\prime }\) of size \(n_{b}\) is extracted. Then, a DT is learned based on \(\mathcal {D}^{\prime }\), but the best check/cut at each node is selected among a subset of \(n_{v}\) features randomly chosen from the \(n_{f}\) possible ones. \(n_{d}\) is set to infinity, namely the DT is grown until every sample of \(\mathcal {D}^{\prime }\) is correctly classified. In the forward phase, namely the phase in which a previously unseen \(X\) needs to be labeled, each DT composing the RF is exploited to classify \(X\); the final classification is taken by majority vote. Note that \(n_{b}\), \(n_{v}\), \(n_{d}\), and \(n_{t}\) are the hyperparameters of the RF. 
If \(n_{b} = n\), \(n_{v} = \sqrt {d}\), and \(n_{d} = \infty \), we obtain the original RF formulation [40], where \(n_{t}\) is usually chosen to trade off accuracy and efficiency [62], since the larger it is, the better.
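
The learning and forward phases described above can be illustrated with a compact, standard-library-only sketch. It is an assumption-laden toy, not the implementation used in the experiments: the function names are hypothetical, the data are two synthetic clusters, splits are scored by weighted child entropy (equivalent to maximizing information gain), and a node whose sampled features admit no split falls back to a majority leaf.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def grow_tree(rows, labels, n_v, rng):
    """Grow an unpruned tree (n_d = infinity) on the given samples."""
    if len(set(labels)) == 1:
        return labels[0]                                   # pure leaf
    best = None
    for j in rng.sample(range(len(rows[0])), n_v):         # random subset of n_v features
        for threshold in sorted({r[j] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[j] <= threshold]
            right = [i for i, r in enumerate(rows) if r[j] > threshold]
            score = (len(left) * entropy([labels[i] for i in left])
                     + len(right) * entropy([labels[i] for i in right]))
            if best is None or score < best[0]:            # lowest weighted entropy wins
                best = (score, j, threshold, left, right)
    if best is None:                                       # sampled features are constant
        return Counter(labels).most_common(1)[0][0]
    _, j, threshold, left, right = best
    return (j, threshold,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], n_v, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], n_v, rng))


def predict_tree(tree, x):
    while isinstance(tree, tuple):                         # internal node: follow the check
        j, threshold, left, right = tree
        tree = left if x[j] <= threshold else right
    return tree


def fit_forest(rows, labels, n_t, n_v, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_t):
        idx = [rng.randrange(n) for _ in range(n)]         # bootstrap sample with n_b = n
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], n_v, rng))
    return forest


def predict_forest(forest, x):
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]                      # majority vote


# Toy data: two well-separated clusters with d = 4 features (not the violin dataset).
data_rng = random.Random(1)
rows = [[data_rng.gauss(c, 0.3) for _ in range(4)] for c in (0, 2) for _ in range(20)]
labels = [0] * 20 + [1] * 20
forest = fit_forest(rows, labels, n_t=25, n_v=round(math.sqrt(4)))
```

Here `n_t`, `n_v`, and the bootstrap size play the roles of \(n_{t}\), \(n_{v}\), and \(n_{b}\) in the text; a production setting would normally rely on an established library rather than on a hand-written forest.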

The Metrics

As for the metrics \(M(f)\) exploited to evaluate the performance of a model \(f\) learned from the data with the methods described above, we recall that many different metrics are available in the literature [53]. In this work, we report just the most common ones. In order to define them, let us first consider a subset of the available data \(\mathcal {T}_{t}\), also called the test set, coming from the same distribution \(\mu \) over \(\mathcal {X} \times \mathcal {Y}\) but different from \(\mathcal {D}_{n}\), since the data used to learn \(f\) should be different from the data exploited to evaluate its performance so as to avoid overfitting [54]. Let us define the elements of the confusion matrix on these data: the true positives (\(\text {TP}(f) = \sum _{(X,Y) \in \mathcal {T}_{t}: Y = 1} \mathbb {1}\{f(X) = 1\}\)), the true negatives (\(\text {TN}(f) = \sum _{(X,Y) \in \mathcal {T}_{t}: Y = 0} \mathbb {1}\{f(X) = 0\}\)), the false positives (\(\text {FP}(f) = \sum _{(X,Y) \in \mathcal {T}_{t}: Y = 0} \mathbb {1}\{f(X) = 1\}\)), and the false negatives (\(\text {FN}(f) = \sum _{(X,Y) \in \mathcal {T}_{t}: Y = 1} \mathbb {1}\{f(X) = 0\}\)), where \(\mathbb {1}\{\cdot \}\) equals 1 if its argument is true and 0 otherwise. Then, we can also define the accuracy as:

$$ \text{accuracy}(f) = \frac{ \text{TP}(f) + \text{TN}(f)}{ \text{TP}(f) + \text{FN}(f) + \text{TN}(f) + \text{FP}(f)}, \tag{1} $$

the precision as

$$ \text{precision}(f) = \frac{\text{TP}(f)}{\text{TP}(f) + \text{FP}(f)}, \tag{2} $$

the recall as

$$ \text{recall}(f) = \frac{\text{TP}(f)}{\text{TP}(f) + \text{FN}(f)}, \tag{3} $$

and the area under the receiver operating characteristic curve (ROC-AUC), which is the area under the TP(f) rate against the FP(f) rate curve.
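
As a worked illustration of Eqs. (1) to (3) and of the ROC-AUC, the short sketch below computes the four quantities on a made-up test set of eight samples. The labels, the scores, and the 0.5 decision threshold are assumptions chosen only for the example; the ROC-AUC is computed through its equivalent pairwise form, namely the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with the positive class encoded as 1."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return tp, tn, fp, fn


def roc_auc(y_true, scores):
    """Pairwise form of the ROC-AUC; tied scores count as half a win."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up test set: 1 = beginner, 0 = expert, as in the labeling used in the experiments.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.8, 0.1, 0.6, 0.2, 0.3, 0.7]    # hypothetical model scores
y_pred = [int(s >= 0.5) for s in scores]              # assumed decision threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fn + tn + fp)            # Eq. (1)
precision = tp / (tp + fp)                            # Eq. (2)
recall = tp / (tp + fn)                               # Eq. (3)
auc = roc_auc(y_true, scores)
```

In this toy case one beginner and one expert are misclassified, so accuracy, precision, and recall all equal 0.75, while the ROC-AUC is 15/16 because only one of the sixteen (positive, negative) pairs is ranked in the wrong order.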

Model Selection and Error Estimation

MS and EE address the problem of tuning and assessing the performance of a learning algorithm [54]. In this work, we exploit resampling techniques, which rely on a simple idea: \(\mathcal {D}_{n}\) is resampled many (\(n_{r}\)) times, with or without replacement, and three independent datasets called learning, validation, and test sets, respectively \({\mathscr{L}}^{r}_{l}\), \(\mathcal {V}^{r}_{v}\), and \(\mathcal {T}^{r}_{t}\), with \(r \in \{1, \cdots , n_{r}\}\), are defined. Note that \({\mathscr{L}}^{r}_{l} \cap \mathcal {V}^{r}_{v} = \emptyset \), \({\mathscr{L}}^{r}_{l} \cap \mathcal {T}^{r}_{t} = \emptyset \), \(\mathcal {V}^{r}_{v} \cap \mathcal {T}^{r}_{t} = \emptyset \), and \({\mathscr{L}}^{r}_{l} \cup \mathcal {V}^{r}_{v} \cup \mathcal {T}^{r}_{t} = \mathcal {D}_{n}\) for all \(r \in \{1, \cdots , n_{r}\}\).

Then, to select the optimal configuration of hyperparameters \({\mathscr{H}}\) of the algorithm \({\mathscr{A}}_{{\mathscr{H}}}\) in a set of possible ones \(\mathfrak {H} = \{ {\mathscr{H}}_{1}, {\mathscr{H}}_{2}, {\cdots } \}\), namely to perform the MS phase, the following procedure has to be applied:

$$ \mathscr{H}^{*}: \quad \arg \min_{\mathscr{H} \in \mathfrak{H}} \ \sum_{r = 1}^{n_{r}} M(\mathscr{A}_{\mathscr{H}}(\mathscr{L}^{r}_{l}),\mathcal{V}^{r}_{v}), \tag{4} $$

where \({\mathscr{A}}_{{\mathscr{H}}}({\mathscr{L}}^{r}_{l})\) is a model learned by \({\mathscr{A}}\) with the hyperparameters \({\mathscr{H}}\) based on the data in \({\mathscr{L}}^{r}_{l}\) and where \(M(f,\mathcal {V}^{r}_{v})\) is a desired metric. Since the data in \({\mathscr{L}}^{r}_{l}\) are independent from the ones in \(\mathcal {V}^{r}_{v}\), the intuition is that \({\mathscr{H}}^{*}\) should be the configuration of hyperparameters which achieves optimal performance, according to the desired metric, on a set of data that is independent of, namely previously unseen with respect to, the training set.

Then, in order to evaluate the performance of the optimal model, namely the model learned with the optimal hyperparameters based on the available data, which is \(f^{*}_{{\mathscr{A}}} = {\mathscr{A}}_{{\mathscr{H}}^{*}}(\mathcal {D}_{n})\) or, in other words, to perform the EE phase, the following procedure has to be applied:

$$ M(f^{*}_{\mathscr{A}}) = \frac{1}{n_{r}} \sum_{r = 1}^{n_{r}} M(\mathscr{A}_{\mathscr{H}^{*}}(\mathscr{L}^{r}_{l} \cup \mathcal{V}^{r}_{v}),\mathcal{T}^{r}_{t}). \tag{5} $$

Since the data in \({\mathscr{L}}^{r}_{l} \cup \mathcal {V}^{r}_{v}\) are independent from the ones in \(\mathcal {T}^{r}_{t}\), \(M({\mathscr{A}}_{{\mathscr{H}}^{*}}({\mathscr{L}}^{r}_{l} \cup \mathcal {V}^{r}_{v}),\mathcal {T}^{r}_{t})\) will be an unbiased estimator of the true performance of the final model [54].

In this paper, the complete k-fold cross-validation is exploited [54, 63] since, together with the bootstrap, it represents a state-of-the-art approach to the problem of MS and EE. Then, we need to set \(n_{r} \leq \binom {n}{k} \binom {n - \frac {n}{k}}{k}\), \(l = (k-2) \frac {n}{k}\), \(v = \frac {n}{k}\), and \(t = \frac {n}{k}\), and the resampling must be done without replacement [63].
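
The MS and EE phases of Eqs. (4) and (5) can be sketched with a toy learner. Everything specific in the block is an assumption made for illustration: the learner is a 1-D k-nearest-neighbours classifier whose only hyperparameter is the number of neighbours, the metric \(M\) is the misclassification rate, the data are synthetic, and the helper names are hypothetical. Only the structure (disjoint learning, validation, and test sets of sizes \(l\), \(v\), and \(t\), resampled without replacement) follows the text.

```python
import random
from collections import Counter


def resample(data, k, n_r, seed=0):
    """Return n_r (learning, validation, test) partitions drawn without replacement."""
    rng = random.Random(seed)
    fold = len(data) // k                          # v = t = n / k, l = (k - 2) n / k
    splits = []
    for _ in range(n_r):
        shuffled = rng.sample(data, len(data))     # random permutation of the data
        splits.append((shuffled[2 * fold:], shuffled[:fold], shuffled[fold:2 * fold]))
    return splits


def fit_knn(k_neighbors, train):
    """Toy learner: k-nearest neighbours on 1-D points (x, label)."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k_neighbors]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def error_rate(model, data):
    return sum(model(x) != y for x, y in data) / len(data)


def model_selection(candidates, splits):
    """Eq. (4): hyperparameters with the lowest summed validation error."""
    return min(candidates,
               key=lambda h: sum(error_rate(fit_knn(h, learn), valid)
                                 for learn, valid, _ in splits))


def error_estimation(best, splits):
    """Eq. (5): mean test error of models refit on learning plus validation data."""
    return sum(error_rate(fit_knn(best, learn + valid), test)
               for learn, valid, test in splits) / len(splits)


data = [(x, int(x >= 10)) for x in range(20)]      # toy 1-D dataset, n = 20
splits = resample(data, k=5, n_r=30)
candidates = [1, 3, 5]                             # hypothetical hyperparameter grid
best = model_selection(candidates, splits)
estimate = error_estimation(best, splits)
```

The test sets are never touched during model selection, which is what allows the final average to be read as an estimate of the performance on unseen data.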

Feature Ranking

Once a model is built, namely once the learning, MS, and EE phases have been performed, it is often necessary to understand how the model exploits, combines, and extracts information. The goal is to assess, by comparing the knowledge of the experts with the information learned by the model, whether the learning process also has cognitive meaning, namely whether it captures the underlying phenomena rather than just spurious correlations [64, 65]. One way to reach this goal is to perform the feature ranking (FR) phase, which allows detecting whether the features that are known to be relevant from a physical perspective are also appropriately taken into account, namely ranked as highly important, by the learned models. The failure of the learned model to properly account for the features that are relevant from a cognitive point of view might indicate poor quality in the measurements, poor learning ability of the model, or spurious correlations. FR therefore represents a fundamental phase of model checking and verification, since it should generate results consistent with the available knowledge of the phenomena under examination provided by the experts.

FR methods based on RF are among the most effective FR techniques, as shown in many studies [66, 67]. Several measures and approaches are available for FR in RF. One method is based on the permutation test combined with the mean decrease in accuracy (MDA) metric, where the importance of each feature is estimated by removing the association between the feature and the outcome of the model. For this purpose, the values of the feature are randomly permuted [68] and the resulting increase in error is measured. In this way, the influence of the correlated features is also removed. Note that, in our case, by feature we do not mean a particular engineered feature but a particular raw feature (left head inclination, left wrist roundness, etc.), namely all the engineered features derived from that raw feature. More rigorously, for every DT, two main quantities are evaluated: the error on the out-of-bag samples as they are used during prediction, and the error on the out-of-bag samples after randomly permuting the values of the features derived from a particular raw feature. The difference between these two values is then averaged over the different trees in the ensemble, and this quantity represents the raw importance score for the variable under examination.
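
The idea of permuting all the engineered features derived from one raw feature at once can be sketched as follows. This is a simplified illustration and deliberately differs from the procedure described above in one respect: instead of working tree by tree on out-of-bag samples, it measures the drop in accuracy of a generic predictor on a single evaluation set. The predictor, the column grouping, and the data are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def grouped_permutation_importance(predict, rows, labels, groups, n_repeats=20, seed=0):
    """Mean decrease in accuracy when the columns of each group are permuted jointly."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importance = {}
    for name, columns in groups.items():
        drops = []
        for _ in range(n_repeats):
            donor = rng.sample(range(len(rows)), len(rows))   # one permutation shared by the group
            permuted = [list(r) for r in rows]
            for i, src in enumerate(donor):
                for c in columns:
                    permuted[i][c] = rows[src][c]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importance[name] = sum(drops) / n_repeats
    return importance


# Toy data: columns 0-1 carry the label, columns 2-3 are pure noise.
data_rng = random.Random(1)
labels = [i % 2 for i in range(40)]
rows = [[2 * y - 1, 2 * y - 1, data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for y in labels]

# Hypothetical grouping of engineered columns by the raw feature they derive from.
groups = {"left wrist roundness": [0, 1], "left head inclination": [2, 3]}


def predict(row):
    """Stand-in for a trained model; it only looks at columns 0 and 1."""
    return int(row[0] + row[1] > 0)


importance = grouped_permutation_importance(predict, rows, labels, groups)
```

Because the stand-in model ignores the noise columns, permuting them leaves the accuracy unchanged and their importance is zero, whereas permuting the informative group brings the accuracy close to chance level.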

Scenarios

In our experiments, two modeling scenarios have been studied in order to understand the extrapolation capability of the data-driven models:

  • Leave One Person Out (LOPO): In this scenario, the model is trained on all the subjects except one, who is used to test the resulting model;

  • Leave One Exercise Out (LOEO): In this scenario, the model is trained on all the exercises except one, which is used to test the resulting model.

The two scenarios differ only in the definitions of \({\mathscr{L}}_{l}\), \(\mathcal {V}_{v}\), and \(\mathcal {T}_{t}\), namely the subsets of data used for building, tuning, and testing the models, respectively.

In the LOEO scenario, \({\mathscr{L}}_{l}\), \(\mathcal {V}_{v}\), and \(\mathcal {T}_{t}\) have been created by randomly selecting data from one exercise to be inserted in \(\mathcal {T}_{t}\), from another exercise to be inserted in \(\mathcal {V}_{v}\), and from the remaining ones to be inserted into \({\mathscr{L}}_{l}\).

For the LOPO scenario, we follow the same procedure as in the LOEO scenario, but considering the violinists instead of the exercises.
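A single random draw of the three subsets can be sketched as below. This is an illustration under stated assumptions: each sample is assumed to carry a group identifier (the exercise for LOEO, the violinist for LOPO), and the function name `leave_one_group_out` and the example labels are hypothetical.

```python
import random


def leave_one_group_out(groups, rng):
    """Return (learn, val, test) index lists for one random draw.

    groups[i] is the group id of sample i: the exercise in the LOEO
    scenario, the violinist in the LOPO scenario.
    """
    ids = sorted(set(groups))
    if len(ids) < 3:
        raise ValueError("need at least three distinct groups")
    # One group is held out for testing, another one for tuning.
    test_id, val_id = rng.sample(ids, 2)
    learn = [i for i, g in enumerate(groups) if g not in (test_id, val_id)]
    val = [i for i, g in enumerate(groups) if g == val_id]
    test = [i for i, g in enumerate(groups) if g == test_id]
    return learn, val, test


# Hypothetical exercise labels, for illustration only.
exercise = ["scales", "technique", "repertoire"] * 4
learn, val, test = leave_one_group_out(exercise, random.Random(0))
```

The same function covers both scenarios: only the content of `groups` changes, which mirrors the fact that the scenarios differ only in how the three subsets are defined.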

Experimental Results

This section reports the results of applying the methodology presented in “Methods” to the problem described in “Data Description”, using the data described in that same section. In all the experiments, we set \(n_{r} = 100\), \(n_{t} = 1000\), \(n_{d} = \infty \), \(n_{b} = n\), and \(n_{v} \in \{n_{f}^{1/3}, n_{f}^{1/2}, n_{f}^{3/4}\}\). Experts (violinists 1 to 4) are labeled with Y = 0 and beginners (violinists 5 to 7) with Y = 1.

Recognition Performances for LOEO and LOPO

Let us first present the recognition performance for the LOPO scenario. Table 3 reports the accuracy on each of the violinists, the overall accuracy, precision, recall, ROC, and ROC-AUC, and the overall confusion matrix. Table 3a shows that some violinists are easier to classify as beginners or experts than others. Nonetheless, on average, the recognition results are quite high (> 70%). In more detail, violinists 1, 2, and 5 are mostly correctly classified, with a mean accuracy score very close to 100%. Other violinists, such as violinists 3 and 6, are seldom wrongly classified. Violinists 4 and 7, instead, are mostly wrongly classified. In “A Comment on the Recognition Performances”, we give more insight into the reasons for this behavior. The confusion matrix of Table 3b is reported for completeness.
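For reference, the scalar metrics mentioned above can be computed from the predictions as in the following sketch. It is a plain illustration, not the evaluation code of the paper. Treating the beginner class (Y = 1) as the positive class is an assumption made here, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred, positive=1):
    """Overall accuracy, precision, and recall (0.0 when undefined)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def accuracy_per_subject(subjects, y_true, y_pred):
    """Accuracy computed separately for each subject id."""
    hits, counts = {}, {}
    for s, t, p in zip(subjects, y_true, y_pred):
        counts[s] = counts.get(s, 0) + 1
        hits[s] = hits.get(s, 0) + (1 if t == p else 0)
    return {s: hits[s] / counts[s] for s in counts}
```

Computing the accuracy per subject, as in `accuracy_per_subject`, is what exposes violinists who are systematically misclassified even when the overall accuracy is high.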

Table 3 (a) Accuracy

Let us now present the recognition performance for the LOEO scenario. Table 4a, b, c, and d are the counterparts, for the LOEO scenario, of Table 3a, b, c, and d for the LOPO scenario. As one can expect, recognition accuracies in the LOEO scenario are higher than in the LOPO one. In fact, the training phase of the LOEO scenario has more information available than that of the LOPO one: we have to predict whether an exercise was performed by an expert or a beginner violinist, but the model has already seen the same violinists play other exercises. In particular, the scale exercises of violinist 3 are very hard to predict correctly, while for the other two exercises we obtain a very high accuracy. The expert violinist 4 is easy to classify when analyzing the scale exercise, but he is very hard to distinguish from beginners in the other two exercises. The accuracy on the scale exercise of violinist 6, instead, has a wide variance. The prediction accuracy for violinist 7 improves with respect to the LOPO scenario, but low accuracy scores are still obtained on the technique exercise.

Table 4 (a) LOEO scenario

A Comment on the Recognition Performances

As we have just observed in “Recognition Performances for LOEO and LOPO”, some violinists are much more difficult to classify correctly than others. Expert violinists 1 and 2 are always well classified as expert musicians. The same happens for violinist 5, who is mostly correctly classified as a beginner musician. Instead, we have a very low accuracy for violinist 4 and for violinist 7 in both the LOPO and LOEO scenarios. In order to understand these results, we performed a further analysis, which consists in observing the original video data together with an expert and checking what the violinist is doing, and how, when the algorithm classifies that violinist correctly or wrongly.

From the videos, it is possible to observe that the two expert violinists 1 and 2 often remain in the same position, with small movements of the pelvis and with controlled breathing. This behavior led the classifier to associate this particular style of execution with expert violin players. The beginner violinist 5, instead, shows some hesitation and uncertainty during the performance.

From the videos, it is also possible to see that expert violinists 3 and 4 have a different style with respect to the other experts. In fact, they put much interpretation into their performance, with accentuated body sway and strong breaths used to emphasize the beginning of the musical verse. Moreover, violinist 4 has a very peculiar style: he goes up and down on his toes with noticeable vibrations of the head. Violinist 7, on the other hand, is a real novice when compared with the other two beginners. The difficulties in predicting his level are due to the fact that he is concerned about playing the song correctly. From the video, in fact, it is possible to observe how focused he is on the execution of the tracks, with no body sway, with controlled breathing, and with small neck and pelvis movements.

Therefore, the final classifier captures fragility, uncertainties, and hesitations in movements, associating these phenomena with the class of less experienced musicians. Although this is reasonable, the algorithm has difficulty in understanding the essential properties of the execution of a repertoire piece, such as breathing and the emphasis on the strong beats of the piece, which are characteristics that distinguish a more experienced musician. That said, without sound information, even an expert observer may be unable to tell these actions apart. These differences explain the mistakes of the algorithm very well, and the videos have been made freely available (Footnote 7) for the convenience of the reader.

A further consideration that can be drawn from Table 4a is that no exercise type is more difficult than the others to predict correctly or, in other words, better at distinguishing experts from beginners. In fact, the accuracies achieved considering only one exercise (technique, repertoire, or scales) are significant enough to allow us to conclude that none of them leads to a more precise classification between experts and beginners. This may seem counter-intuitive, because the analysis of the most technical tracks alone should be able to allow this distinction. However, since the analysis only includes movement features, this expectation no longer holds and all types of exercises have the same weight in the final prediction.

Feature Ranking

In both the LOPO and LOEO scenarios, we trained our model with features extracted and engineered from the 14 raw data sources discussed in “Data Description”. In order to understand which features are more relevant in the skill classification of violin players, we applied the method discussed in “Feature Ranking”. Table 5 reports the results of this process, and it is easy to notice that the upper body kinetic energy is the most informative feature for our model. This result makes sense, since this feature summarizes the whole kinetic part of the upper body and therefore incorporates a lot of information (see “Data Description” for details). The remaining features relate to parts of the body with progressively less influence on the violin, as expected from the comments we reported in the introduction. Observing Table 5, another interesting consideration can be drawn: the left part of the body provides less information than the right one. For example, the right wrist roundness is more informative than the left wrist roundness. This happens for all features except the left shoulder height. This fact is quite reasonable, since the right part of the body is the one involved in the movement of the bow, namely the most dynamic part, while the left side is more responsible for the quality of the sound but is less dynamic. In fact, the bow–violin incidence, which provides further information on the angle formed between the two parts of the instrument and has low dynamics but a large effect on the quality of the sound, is not considered very important. Furthermore, bow–violin incidence is much more music-dependent than skill-dependent, as it is directly affected by the notes that have to be played.
Since in this work we try to understand the violin players’ skill level based on their movements, completely ignoring the sound, it is quite reasonable that the parts of the body with higher dynamics are easier to exploit for this purpose than the ones with reduced dynamics. These results support our proposal, indicating that the model is not just learning spurious correlations but is actually capturing the process under examination. A further consideration is that the feature ranking highlights features that are more sensitive to fragilities, uncertainties, and hesitations. This corroborates the considerations discussed in Section 1 on the appropriateness of the algorithm.

Table 5 Feature ranking of the original raw features (from most to least important)

Conclusions

In this work, we present a computational approach applied to music education. In particular, we propose a method for classifying the skill level of 7 violinists starting from data collected from their performances in three exercises, chosen so as to better distinguish skill and familiarity in playing the instrument. We based our approach on the state of the art of the music education literature and on cognitive theories of self-learning, metacognition, and motor learning. This grounding lets us address the use of a computational approach in a real context. We exploited a state-of-the-art machine learning pipeline that includes a feature engineering phase guided by experts in the subject, combined with random forests, a state-of-the-art algorithm for classification. Exploiting the intrinsic hierarchy in the dataset, we considered two different extrapolation scenarios, namely on exercises and on violinists, to understand the potential of the method. Results show the potential of the method but also the need to increase the cardinality of the dataset in terms of both exercises and violinists. In fact, even if the method works as expected and perceives the peculiar characteristics of each exercise and violinist, it sometimes fails and is deceived by the violinists’ interpretation or focus. We observed that this difficulty is due to each player’s different way of playing the instrument or simply to atypical movements in the execution. Moreover, a bigger dataset would allow us to better capture these peculiarities in the movements of each person, improving the final prediction and providing knowledge about the motor signature of each individual. Increasing the cardinality would also allow the use of more sophisticated tools, such as deep learning models able to automatically extract the best set of features for this classification task.

Nevertheless, this is a first step forward in understanding which motion features can be exploited to efficiently distinguish a professional performance and in using such information for real-time student assistive technologies.