Introduction

The application of data analytics to educational data has surged in recent years, facilitated by the adoption of online learning platforms.1 However, in parallel to this increased access to detailed information, it is crucial to identify both the right type of data and the analytical approaches that will allow us to gain interpretable insights into online engagement and learning patterns.2 The process of learning extends over time, and the analysis of temporal data is thus an important focus for educational data analytics. In this work, we describe a methodology for the study of time-series data collected from the engagement of learners with the tasks and stages of online courses. The analysis of temporal statistics has been shown to provide a fruitful avenue to identify learners at risk of failure,3 predict performance,4 detect dropout from a course,5,6,7,8 or identify learner behaviours.9 Despite such developments, a recent review of the field suggested that temporal analyses are currently insufficient in number, and that additional methodologies are required.10

Temporal analytics has been used in the educational context to investigate massed versus distributed study modes, i.e., to compare the performance of learners who study the material ‘massed’ (or ‘crammed’) into a single study period with that of learners who ‘distribute’ their study of the material across a number of shorter study periods. The general conclusion has been that distributed practice is the more effective strategy.11 The benefits of this ‘spacing effect’12 have been shown over differing periods and within different contexts,13 although other reports have noted that the effect does not apply to all learning contexts.14 However, a feature of previous data analyses is that they generally allocate subjects in advance to one of the two pre-determined study modes. Indeed, pre-allocation is also an inherent restriction of supervised machine learning approaches, where labels are assigned a priori to train an algorithm.

Recent studies have collected time-series of learners’ behaviours and used them to cluster learners according to pre-selected features of the data (e.g., task focus or resource usage) chosen to describe different approaches to problem solving. However, such methods are highly dependent both on the temporal features chosen as descriptors, which are based on specific knowledge of the data, and on the number of groups obtained by the clustering. For example, a recent study extracted particular features from learners following a blended course (i.e., on two platforms: face-to-face and online) and identified four behavioural groups separated according to their differing levels of engagement across the two platforms.15 Such studies exemplify how the combination of temporal analytics and cluster analysis can provide insights of use to educators, course designers, and researchers in learning analytics.10,16

Here, we present an unsupervised methodology that allows the direct analysis of raw time-series gathered from the engagement of learners as they complete tasks of online courses, without imposing a priori either the statistical descriptors of the time-series or the number and type of groups of learners to be detected. Hence the obtained learner clusters are not pre-determined or identified subjectively based on prior features, but are detected algorithmically during the data analysis. To exemplify our approach, we analysed in detail the time-series (i.e., time-stamped data of task completion) of 81 learners as they undertook the six compulsory online courses that form the first year of a 2-year part-time post-graduate management degree. The courses extended over three terms, and the patterns of task completion differed greatly across the learner group. Three examples of such highly distinct time-series are shown in Fig. 1, illustrating a variety of behaviours: from steady completion, to highly massed behaviour, to sporadic patterns. To highlight its applicability, we also applied the method to two additional data sets: a different set of time-series of task completion collected from the same degree programme but from a different year cohort, and a set of time-series of online interactions (not of task completion) collected by a different university and therefore with distinct characteristics.

Fig. 1

The time-series of task completion times are plotted for three example learners. The academic year is divided into three terms, with two courses completed during each term. The average task-completion trajectory for the entire cohort of learners is plotted for comparison (black line). Learner 65 (left, light green) completes tasks consistently earlier than average, whereas learner 48 (middle, purple line) tends to finish tasks after the average, particularly in term 3. Learner 43 (right, dark green) exhibits rather sporadic behaviour during the first and second terms, whereby many tasks are finished late or during massed learning sessions. The time-series correspond to discrete events; the lines are guides to the eye conveying the temporal progression in task completion

The methodology is summarised in Fig. 2. We use the raw, time-stamped series of online actions from each learner and employ a dynamic time warping (DTW) kernel17 to calculate a similarity score between all pairs of learner time-series. Although several alternative methods exist to measure the similarity between two time-series (e.g., Euclidean distance, Fourier coefficients, auto-regressive models, edit distance, or minimum jump models),18 DTW has been shown to outperform a variety of measures in classification tasks19 and provides a principled way to use the full, raw information of the time-series without preselecting features or functional representations.20 From the ensuing DTW similarity matrix, we construct a similarity graph, where nodes are learners and weighted links represent similarities between learners. This graph construction step is carried out using the Relaxed Minimum Spanning Tree (RMST) algorithm,21 which aims to encapsulate the locally strong and globally relevant similarities in the data set. RMST has been shown to perform well in conjunction with the multiscale, unsupervised graph partitioning methodology of Markov Stability,22,23 which we apply to our graph to obtain clusters of learners with similar temporal behaviours. Alternative methods to cluster time-series data, with and without the creation of graphs, have been proposed in other contexts and applications.24,25,26,27 Instead of finding one particular clustering, our algorithm produces a multiscale description: a set of consistent clusterings of different coarseness, obtained by robustly optimising across all levels of resolution in an unsupervised manner, without pre-imposing the number or type of clusters (see Fig. 3a for an example). Clusterings of different coarseness can then be used by the analyst according to their needs. If no robust clusterings are found, the algorithm will signal a lack of natural clusters in the data. Details of the computational analysis are given in the Methods section.
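As an overview of how these steps compose, the pipeline can be sketched in a few lines of Python. This is a schematic only: the helper functions dtw_distance, rmst_sparsify and markov_stability_scan are hypothetical placeholders for the components detailed in Methods, not functions from a published package.

```python
import numpy as np

def cluster_learner_time_series(series, sigma=1.0):
    """Schematic of the pipeline in Fig. 2. `series` is a list of 1-D arrays,
    one per learner, holding time-stamped task completions; the three helpers
    are placeholders for the steps detailed in Methods."""
    n = len(series)
    A = np.zeros((n, n))                                # pair-wise similarities
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(series[i], series[j])      # DTW distance
            A[i, j] = A[j, i] = np.exp(-d / sigma**2)   # similarity kernel, Eq. (1)
    graph = rmst_sparsify(A)                 # sparsified similarity graph (RMST)
    return markov_stability_scan(graph)      # robust partitions across Markov times
```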

Fig. 2

A summary of the time-series clustering methodology. a A time-series for each learner is compiled from their time stamps of interactions with tasks while undertaking the online course. b The learner time-series are compared pair-wise using a dynamic time warping kernel (see Methods),17 which calculates a measure of similarity between two time-series. c We construct an adjacency matrix A where each element Aij describes the pair-wise similarity between learners i and j. Similarity is measured between 0 and 1 (blue to red), where a higher value indicates a higher similarity. d The adjacency matrix A encodes a network where each node represents a learner and the edges are weighted by the similarity between pairs of learners. Community detection can then be applied to cluster learners according to their similar time-series behaviours

Fig. 3

Multiscale partitioning and characterisation of learner clusters. a Multiscale unsupervised partitioning of the similarity graph of learners using Markov Stability (MS). The nodes of the graph are learners (N = 81) and edges between nodes are weighted according to the dynamic time warping similarity (1) between temporal task completions. We obtain a quasi-hierarchy of graph partitions of increasing coarseness (number of clusters indicated by red line), which optimise the MS cost function (6) over different spans of Markov time t. The learners within a cluster exhibit similar online temporal behaviours at different levels of resolution. We seek partitions that optimise MS (6) and are robust in two ways: consistent under the optimisation and persistent across scales. These are evaluated with the Variation of Information (VI), a normalised metric of distance between clusterings akin to Mutual Information.42 A low value (or dip) of VI(t) (9) (cyan line) signals a partition that is consistently found by 100 repeated runs of the greedy Louvain algorithm. A long plateau block (dark grey in the heatmap) of VI(t, t′) (10) indicates a partition that is persistent across scales. Robust partitions according to both criteria include the 10-, 6- and 2-way partitions. See Supplementary Fig. 4 for the graph layouts of the partitions, including the 8- and 3-way partitions. b For each cluster in the 6-way partition we display the average GPR trajectory for learners in the cluster and the average GPR trajectory of the full cohort (black line). Each community is given a descriptive tag (e.g., ‘Early Birds’) following post-hoc evaluation of the average time-series

When applied to our case study data set, our analysis identifies a set of clusterings of learners at different levels of resolution. The clusters of learners reflect their differing temporal engagement as they progress through the online course. In particular, our data-driven clusters capture behaviours associated with massed learning (i.e., completion of a large number of tasks within a short time period) and distributed learning, as well as finer behaviours that differentiate these learning types into subgroups. For instance, at a coarse level, the algorithm identifies a cluster of learners that follow the course in a sequential and distributed manner; yet, at a finer resolution, this cluster is sub-divided into two clusters, which differ by 1–2 weeks in the average completion times of tasks (i.e., ‘Early Birds’ and ‘On time’). Our approach also finds sporadic learners that skip a large number of tasks or exhibit irregular massed learning on particular courses or at different times of the year. Similar outcomes are observed for the other two data sets, with differences reflecting the particularities of the data. We then used exam grades a posteriori to establish whether particular online engagement behaviours can negatively affect learner performance, and we compared our groupings against classifications based on statistical features computed from the time-series data.

Results

Unsupervised clustering reveals clusters of learners with differing online engagement

To find groups of learners with similar online engagement in an unsupervised manner, we follow the procedure summarised in Fig. 2. We first create a similarity matrix between learners using a dynamic time warping kernel. This matrix is transformed into a similarity graph using a sparsification based on the Relaxed Minimum Spanning Tree,21 a procedure that retains global network connectivity while discarding weak similarities that can be explained through longer chains of strong similarities. Through this process, we create a graph where the nodes are learners linked by edges weighted according to their time-course similarity. Hence, two learners that complete the tasks of the course in a similar manner will be linked by a strong edge.

The constructed similarity graph is then analysed using Markov Stability (MS), a multiscale graph partitioning algorithm that uses a Markov process to scan the graph across Markov time in order to find optimised and robust partitions of the graph at any level of resolution.22,23 The partitions are found by maximising a resolution-dependent cost function (the Markov Stability) at all levels of resolution, as given by the Markov time, t. We then select partitions that are robust in the following sense: (i) they are persistent across scales (i.e., optimal over an extended Markov time t, as given by a plateau with a low value of VI(t, t′)), and (ii) they are robust to small changes in the optimisation (i.e., consistently found as a good partition over those scales, as given by a relative dip in VI(t)). Such robust partitions identify clusters of learners that exhibit similar online temporal patterns. The definitions of the different measures and details of the Markov Stability framework are given in Methods.

Figure 3a summarises the results of our multiscale clustering method applied to the time-series of task completion of six online courses by 81 learners pursuing a post-graduate part-time Management degree at Imperial College Business School over one year. See Methods for further details about the data. As the Markov time is increased, the level of resolution is decreased and the method reveals robust partitions of decreasing granularity. In Fig. 3a, we illustrate the partitions found from ten clusters down to two clusters, with a notably robust partition into six clusters. Note the quasi-hierarchical aggregation of the finer clusters into coarser ones, a feature that is intrinsic to the data and not imposed by our clustering algorithm. (For a more detailed view of the multiscale clustering structure, see Supplementary Fig. 1). The quasi-hierarchical organisation across levels of resolution reflects the fact that subtle temporal details characterise the finer clusters, but broader similarities of the time profiles define the coarser clusters. Hence, our computational framework allows for adjustable granularity, which can be tailored to the needs of the analyst.

To exemplify the characterisation of the results in our data set, we focus mainly on the 6-cluster partition, which contains four large groups and two single learners that remain unclustered due to their highly individual sporadic behaviour. The 6-cluster partition exhibits the largest relative drop in VI(t) and a long plateau in VI(t, t′). The 10-cluster and 8-cluster partitions are also of interest, providing a more refined clustering consistent with the 6-way partition, as seen in Fig. 3a. The coarser 2-cluster partition is of interest too: its two clusters separate learners that exhibit distributed and massed learning. In the rest of the paper, we concentrate on a more detailed description of the behaviours emerging from the 6-cluster partition, as it provides a nuanced, data-driven level of resolution on the data.

Characterisation of the clusters of online learners

As shown in Fig. 3b, the 6-cluster partition is robust, and the data-driven groupings it provides have an appropriate level of resolution to gain meaningful insight into the observed patterns of online learners. Two of the clusters contain only one learner each, with highly individual and sporadic behaviour. For each of the other four clusters, we use Gaussian Process Regression (GPR)28 to compute the average engagement trajectory of the group of learners, and compare it with the average GPR trajectory for the whole set of 81 learners. The computed GPRs allow us to quantify statistically the differences in the temporal patterns of the different clusters using Bayes factors of the processes. In particular, we found that the trajectories of each cluster are statistically more likely to be derived from separate processes defined within their own cluster, as follows. A GPR was fitted to the entire set of trajectories and the log-likelihood of the entire set of trajectories was calculated. Equally, the log-likelihood of each separate cluster of trajectories under that same Gaussian Process was calculated. The Bayes factor, calculated as the sum of the log-likelihoods of each separate cluster minus the log-likelihood of the entire set of trajectories (Eq. (12)), was found to be large (K = 3.37 × 10¹⁰). This indicates that the behaviours of each cluster are statistically different from each other and are derived from different behavioural processes. This computation was repeated for the differences between each pair of neighbouring clusters. The Bayes factors were: K = 0.38 × 10¹⁰ between the ‘Early Birds’ and ‘On time’ clusters; K = 1.52 × 10¹⁰ between the ‘On time’ and ‘Low Engagers’ clusters; and K = 0.17 × 10¹⁰ between the ‘Low Engagers’ and ‘Crammers’ clusters. These values provide statistical evidence of the differences between the obtained clusters.

Each of the clusters in this partition has been given a descriptive title that encapsulates the group behaviour. The learners in the ‘Early Birds’ group (green cluster) generally exhibit a highly sequential and ordered approach to their learning and tend to complete their tasks earlier than the cohort average, with a systematic 1–2 week advance. The behaviour of learners in the ‘On time’ group (cyan cluster) is similar to that of the ‘Early Birds’, except that they finish tasks closer to the average. Hence both the green and cyan groups present a similar ‘distributed learning’ behaviour, distinguished only by a slight delay, which explains why both groups are agglomerated into a single cluster in the coarser 2-way partition (Fig. 3a). The learners in the ‘Low Engagers’ (orange) cluster also exhibit a relatively distributed work flow (similar to the cyan and green clusters) but with less anticipation in the second half of the year (and especially in the third term). Furthermore, this group left a high number of tasks uncompleted. The ‘Crammers’ cluster (magenta) contains learners that exhibit massed learning (indicated by the presence of plateaux in their time-series, suggesting many tasks completed within a short period of time), low task completion, and an ordering of task completion that deviates from the proposed course sequence. Finally, the outliers (learners 43 and 46), which form their own single-learner clusters, exhibit highly sporadic learning behaviours, with tasks completed at later dates without following the sequential layout of the course.

To further characterise our results, we computed standard time-series metrics for each learner. Figure 4 shows the graph of learners coloured according to two such statistical metrics derived directly from the time-series: the mean massed session length (a measure of so-called binge learning) and the percentage of completed tasks. Figure 4a shows the mean massed session length, i.e., the length of the plateaux in the time-series of task completions, calculated via an isotonic regression (see Methods). This measure captures events where a learner has completed a large number of tasks within a short time frame. We find that the ‘Crammers’ cluster has a higher mean massed session length. Figure 4b shows the graph of learners coloured according to the percentage of tasks completed relative to the total number of available tasks. In general, the ‘Crammers’ cluster shows the lowest mean task completion (66%), followed by the ‘Low Engagers’ group (80%), with higher mean task completion rates in the ‘On time’ (86%) and ‘Early Birds’ (90%) clusters.

Fig. 4

Deriving standard time-series metrics for individual learners. a The graph of learners with nodes (learners) coloured according to the mean learning session length. The colourmap extends from red (many tasks completed within a short time frame) to blue (few tasks completed during each learning session). b The percentage of tasks completed by each learner is mapped onto the graph of learners

Cluster analysis identifies groups of learners at risk of low performance

We have also carried out an a posteriori evaluation of our behavioural clusters with respect to the performance of the learners. Figure 5a shows the mapping of the final average marks on the learner graph, where we have also highlighted high-performing (>70%, top 15%, ‘Distinction’) and low-performing (<60%, bottom 7.5%, below ‘Merit’) learners. Figure 5b shows that 6 out of 7 low-performance learners lie in the ‘Crammers’ cluster associated with massed learning and reduced task completion. One learner (77, cyan cluster) attained a low grade yet did not exhibit time-series behaviours indicative of low performance. The high performers tend to be distributed across all other clusters, suggesting that the learning behaviour of a high performer is not as critical to their success. Still, 9 out of the 13 high-performing learners are found in the ‘Early Birds’ or ‘On time’ clusters, characterised by a sequential approach to learning with minimal massed learning sessions. The sporadic learners in single clusters (43 and 46) attained neither a low performance nor a distinctly high one.

Fig. 5

Evaluation of learner performance with respect to behavioural clusters. a A posteriori comparison of the clustering with the obtained marks shows that 6/7 learners that attained <60% (red circles) concentrate in the ‘Crammers’ massed learning cluster, while those that attained >70% (blue circles) and the mid-range performers are distributed across all the clusters, with a higher concentration in the ‘Early Birds’ distributed learning cluster. A single low-performing outlier (learner 77) attained a low grade, yet displayed regular online patterns distinct from the other six low-performing learners. b The grades are mapped directly onto each learner in the network structure. c Classification of learners using an SVM (RBF kernel) and a Decision Tree (depth 4) trained on two statistical measures from the time-series: the mean binge session length and the percentage task completion (both normalised to a z-score, i.e., zero mean and unit variance). The results show poor classification accuracy: only 3/7 low-performance learners are accurately classified using the SVM and only 4/7 using the decision tree

Although our method captures information congruent with time-series statistical metrics (e.g., those shown in Fig. 4 related to massed learning and task completion rates), the data-driven clusters we obtain encompass global time-series information beyond such pre-determined standard statistical measures. To test this idea, we compared our data-driven clusters to standard machine-learning classification methods based on statistical features. Figure 5c illustrates the classification map obtained by training two common machine-learning algorithms on the two statistical features in Fig. 4: a support vector machine (SVM) with a radial basis function kernel, and a decision tree with a depth of 4 branches29 (see Methods). For both methods, we find that the accuracy of learner classification against performance is low: only 3–4 out of 7 low-performance learners were accurately predicted. This result suggests that using a finite set of pre-determined time-series features reduces the information available to differentiate the behaviours relevant to performance. In contrast, our graph construction and clustering methodology utilises the full content of the time-series (including attributes that are not evident from inspection of particular statistical metrics), thus providing a more comprehensive grouping of learners with similar temporal behaviours.

Testing the methodology on two additional data sets

We have applied the methodology to task completion time-series data from a second cohort of 46 learners taking the online management course at Imperial College Business School. The results we obtain are similar, as shown in the multiscale clustering presented in Supplementary Fig. 2 and the detailed analysis of the 6-cluster partition in Fig. 6a. In this case, we identified a robust 9-cluster partition (with four major clusters and five single-learner clusters) and a robust 6-cluster partition (with three major clusters and three single outliers). The major clusters in the 6-way partition (shown in Fig. 6a) exhibit behaviours similar to those observed in the first cohort: the green cluster in Fig. 6a corresponds to the ‘Early Birds’ and ‘On time’ groups in Fig. 3; the blue cluster is similar to the task-skipping ‘Low Engagers’ group in Fig. 3; and the purple cluster exhibits traits similar to the ‘Crammers’ cluster in Fig. 3. Within this 6-cluster partition, of the 8 low-performance learners, 4/8 were located in the massed learning cluster, 2/8 were sporadic outliers, and 1/8 was in the low engagement cluster; only 1/8 was located in the distributed learning cluster. Moreover, comparison against standard classification procedures (Supplementary Fig. 3) showed that our methodology was superior at grouping learners of similar performance. These findings highlight the consistency of the methodology across cohorts, while remaining attuned to the particularities of the data.

Fig. 6

Validation of the methodology against two additional data sets. a Application of our methodology to a second cohort of learners (N = 46) undertaking a similar online management course at Imperial College Business School with a different set of tasks (No. tasks = 376) over a 1-year period. The temporal behaviours associated with the 6-way partition are similar to those of the cohort in Fig. 3. We also find that 6/8 of the low-performance students (red circles) are found in the ‘Crammers’ cluster or displayed sporadic engagement behaviour, and only 1/8 is found in the green cluster of distributed learning behaviours. The high performers, on the other hand, are diverse in their patterns of online engagement, as also observed in the first cohort. For the full analysis of this second cohort see Supplementary Fig. 2A. b Analysis of a cohort of students undertaking an anonymised course at the Open University (OULAD data set) over 250 days.30 In this case, the time-stamped data correspond to ‘page clicks’ rather than task completions. The Markov Stability analysis finds a robust 3-way partition consistently across all levels of sparsification of the data set (full analysis in Supplementary Fig. 4). Within the 3-way partition, there are two major clusters (purple and orange) and one minor cluster (green). The purple cluster corresponds to massed learning and contains 6/7 of the low performers, with a significantly lower grade (paired t-test at 5%) relative to the orange cluster of distributed learning behaviours. This difference exists at all levels of network sparsification. The minor cluster (green) includes outlier behaviours

The types of temporal engagement data collected from learners will differ across educators and institutions depending on the particularities of the Learning Management System. To test the methodology on a different kind of data, we studied a set of 100 learners undertaking an anonymised course at the Open University (OULAD data set30). The OULAD data set differs from ours in several ways: (i) the time-stamped data in OULAD correspond to page clicks and not necessarily to task completion; (ii) the time stamps were coarse-grained to days; (iii) pages could be revisited. The results of applying our methodology to the OULAD data set in Fig. 6b (and Supplementary Fig. 4 of the Supplementary Information) show that the multiscale clustering is robust to the sparsification implicit in the graph creation step. A robust 3-way partition is consistently found in our analysis, with two major clusters and a minor cluster of outliers. The two major clusters correspond to a separation between learners who exhibited higher massed learning and lower task engagement and learners with distributed learning. We found that 6/7 of the low-performance learners (<60%) were located in the cluster associated with massed learning, while one low-performance learner was located in the minor outlier cluster and none were in the distributed learning group.

Discussion

We have described an approach for the analysis of temporal data of online learning behaviours, in which distinct clusters of learners are obtained algorithmically without using a priori statistical information about individual behaviours or about the number or type of expected behaviours across the cohort. The mathematical framework is general, and can be applied broadly to any time-series data in physical or social sciences to identify distinct temporal behaviours. In the context of learning analytics, we showcased the method through three data sets of online learner activity of different types and origins.

Our method uses a dynamic time warping similarity kernel to generate a sparsified similarity graph between learners, to which we apply a multiscale graph partitioning algorithm in order to find optimised and robust clusters of learners with similar temporal behaviours at any level of resolution in an unsupervised manner. As our method uses the full time-series, it inherently encompasses richer temporal information than standard methods based on selecting statistical features of the time-series.

In the data sets analysed here, we obtained a quasi-hierarchy of robust partitions, from finer to coarser, which provide different levels of information, as required by the analyst. For instance, in our main case study in Fig. 3, we found robust partitions into 10, 6 and 2 clusters. The 6-way partition consists of four large learner clusters (‘Early Birds’, ‘On time’, ‘Low Engagers’ and ‘Crammers’) and two single unclustered learners (‘Sporadic outliers’), which were shown to be statistically different from each other according to the GPR Bayes factor (Eq. (12)). A posteriori comparison with learner performance indicates good correspondence with the obtained clustering: low performers are generally located (6 out of 7) in the ‘Crammers’ cluster (associated with massed learning and low task completion) and are generally absent from the ‘Early Birds’ and ‘On time’ clusters (associated with distributed learning and high task completion). On the other hand, high performers are distributed across several clusters, albeit with higher prevalence in the clusters associated with distributed learning. These results provide an improved characterisation as compared to common machine-learning classification algorithms trained on two statistical measures from the time-series. The analysis could be enhanced by the use of finer partitions (e.g., the 10-way partition has clusters with over-representation of low performers (purple cluster, hypergeometric p-value = 0.00023) and high performers (charcoal cluster, hypergeometric p-value = 0.026), as seen in Supplementary Fig. 1). Similar general behaviours and classifications are obtained for the two additional data sets presented in Fig. 6 and the Supplementary Information.

The fact that low performers tend to concentrate in the massed learning cluster and to be absent from the distributed learning clusters is in agreement with previous studies, which found that learners who ‘crammed’ retained less information when tested at a later date,31 and underlines the risks associated with this behaviour. On the other hand, the fact that high performers are distributed across several clusters, albeit with higher prevalence in the clusters with high task completion and distributed learning, suggests that they follow a host of diverse learning patterns, in agreement with a latent class model that suggested that the ‘spacing effect’ is less prominent for high performers.32 These observations were found to be consistent when testing our methodology on a second cohort of learners within the same institution and online degree, and broadly in agreement with a different type of data (‘page clicks’) from a set of online learners at the Open University, where we found a strong distinction between a low-performing ‘massed learning’ cluster and a ‘distributed learning’ cluster.

Clearly, temporal behaviours do not fully account for learner performance, and this methodology is not intended as a diagnostic tool, but rather as a means to explore and identify learner engagement behaviours with the purpose of aiding intervention and course design. Combining the temporal analysis introduced here with established ‘early warning system’ analyses33 could aid in such tasks. Although educators might encourage learners to pursue a distributed study behaviour, our results suggest a nuanced approach for high performers, with flexibility provided in course design so that high-performing learners may pursue the study strategies they personally find effective.

Future work within different learning contexts, coupled with additional dependent variables of interest (e.g., learner satisfaction, career success, interruption and withdrawal rates), could be important to provide broader support for the initial results reported here. We remark that the methodology is scalable to larger data sets through adjustments to the computation of both the DTW kernel and the Markov Stability cost function (see Methods). Further improvements of the similarity kernel using constrained DTW34 and end point invariance35 could also improve the sensitivity and accuracy of the method in representing the different temporal behaviours. Together with the study of how online behaviour changes over time for each learner, these directions will constitute areas of further research.

Methods

This section describes the data and the unsupervised mathematical pipeline used to analyse the trajectories of learners. The research was performed without any a priori knowledge or allocation of the learners, making it akin to a blind investigation.

Temporal data

The main case study of this research was based on task completion data from 81 post-experience learners pursuing a post-graduate part-time management degree at Imperial College London. These learners formed part of a cohort of 87 learners. Data from the remaining six learners was not included here as these learners either interrupted their studies or withdrew from the programme. Subjects ranged in age from 28 to 53 years old, with a gender split of 57 males to 24 females, and they resided in 18 geographically disparate countries. The data correspond to interactions with six online courses, which together comprised the first academic year of the 2-year degree programme. Although the subjects met face-to-face at the start of each academic year, the six courses were studied completely online. Subjects proceeded in a lock-step manner through the academic year, which was split into three 10-week terms each containing two of the six courses. The anticipated study load was 5 to 7 h per week for each course, i.e., 10 to 14 h in total for the two concurrent courses. The courses were assessed via a combination of coursework and exams; however, participation in these separate assessed activities was not included in the data set analysed here, and only the final 2-year grade was used as an indication of performance.

To highlight the applicability of the method, we also applied the analysis to two additional data sets: (i) time-stamped task completion series from a second cohort of 46 post-experience learners pursuing the post-graduate part-time management degree at Imperial College London; (ii) time-stamped data of ‘page-clicks’ (not equal to task completion) from 100 learners undertaking Open University courses (OULAD data set30). For further details on these data sets, see the Supplementary Information.

Ethical approval from the Education Ethics Review Process (EERP) at Imperial College London was attained (EERP 1718-032b) and a waiver for informed consent was granted for this study.

Construction of the learner similarity graph using a dynamic time warping kernel and RMST sparsification

Creating a similarity matrix between learners using dynamic time warping

To compute the similarity between the task completion time traces of every two learners i and j, we use a similarity kernel, i.e., a generalised inner product. Common approaches for sequence analysis use Lp norms (when p = 2 we obtain the Euclidean norm), which are fast to compute and easy to index. However, their one-to-one matching often ignores sequential patterns that are non-linearly misaligned. Instead, our approach uses a dynamic time warping (DTW) kernel, which provides an elastic matching of two time sequences incorporating both the sequential ordering of the trajectory and the absolute values of time.17 The DTW similarity kernel is defined as:

$$k_l(x,y) = e^{ - D_l(x,y)/\sigma ^2},$$
(1)

where Dl denotes the DTW distance. The distance Dl is calculated by constructing an n × m matrix, where n and m are the lengths of the two vectors we wish to compare. Using the pair-wise cost cost(xi, yj) = ‖xi − yj‖², we minimise the overall cost of the path from (i, j) = (1, 1) to (i, j) = (n, m), where each cell (i, j) along the path contributes cost(xi, yj) to the cumulative cost (summed over the path). This method implicitly stretches both sequences to obtain a single dynamic time warping match between the two vectors, i.e., we find the cost required to match the time-series trajectories of two learners. The higher the cost, the larger the distance in Hilbert space, and therefore the lower the similarity between the learners.

For N learners we produce an N × N similarity matrix A where each element Aij is the DTW similarity (1) between learners i and j. For longer time-series and larger numbers of learners N, for which the DTW calculations may become computationally expensive, dimensionality reduction methods can be implemented to improve the speed of similarity calculations,36 or segmented dynamic time warping algorithms with speeds comparable to Euclidean distances can be used.37
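As a concrete illustration, a minimal NumPy implementation of the DTW distance and of a single entry of the similarity matrix might look as follows. This is an unoptimised sketch without the warping-window constraints mentioned in the Discussion; sigma is the kernel bandwidth of Eq. (1) and the toy data are illustrative.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance D_l between two 1-D sequences: the minimal cumulative
    pair-wise cost ||x_i - y_j||^2 over a warping path through the n x m
    cost matrix, from cell (1, 1) to cell (n, m)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],      # stretch y
                                 D[i, j - 1],      # stretch x
                                 D[i - 1, j - 1])  # match both
    return D[n, m]

# One entry of the N x N similarity matrix A, Eq. (1):
x = np.array([1.0, 2.0, 4.0, 4.5])   # toy completion times, learner i
y = np.array([1.5, 2.0, 2.5, 5.0])   # toy completion times, learner j
sigma = 1.0                          # kernel bandwidth (illustrative value)
A_ij = np.exp(-dtw_distance(x, y) / sigma**2)
```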

Creating a similarity graph using RMST sparsification

The similarity matrix A can be thought of as the adjacency matrix of a fully connected, weighted graph, where every learner is connected to every other learner in the network with a strength given by their pair-wise similarity. The high redundancy present in this full similarity matrix both increases the computation time and reduces the effectiveness of many clustering algorithms. We therefore sparsify the similarity matrix to produce a similarity graph with a reduced number of edges. To do this, we employ a pruning algorithm (the Relaxed Minimum Spanning Tree, or RMST), based on geometric graph heuristics, which preserves edges according to both their strength and their relevance to long paths within the graph. RMST has been shown to balance the local and global structure of data sets and to perform well under multiscale graph clustering methods.21,38 Supplementary Fig. 4 shows that the community structure is relatively stable when the sparsification parameter of RMST is varied.
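For the exact algorithm, see refs 21,38. As an illustrative sketch, one common formulation of RMST keeps a direct edge (i, j) whenever its distance beats the largest step on the MST path between i and j, relaxed by a local nearest-neighbour term; the relaxation parameter gamma below is an assumed illustrative value, not one used in this study.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def rmst_adjacency(D, gamma=0.5):
    """Sketch of an RMST-style sparsification of a distance matrix D
    (here, DTW distances between learners); see refs 21, 38 for the
    exact algorithm used in the paper."""
    N = D.shape[0]
    mst = minimum_spanning_tree(D).toarray()
    mst = np.maximum(mst, mst.T)                    # symmetrise the MST

    # mlink[i, j]: largest edge weight on the MST path from i to j,
    # filled by a depth-first traversal rooted at every node.
    mlink = np.zeros((N, N))
    for root in range(N):
        visited, stack = {root}, [root]
        while stack:
            u = stack.pop()
            for v in np.nonzero(mst[u])[0]:
                if v not in visited:
                    visited.add(v)
                    mlink[root, v] = max(mlink[root, u], mst[u, v])
                    stack.append(v)

    D_off = D.copy()
    np.fill_diagonal(D_off, np.inf)
    d_nn = D_off.min(axis=1)                        # nearest-neighbour distances

    # Keep (i, j) if the direct distance beats the relaxed MST-path bound.
    keep = D < mlink + gamma * (d_nn[:, None] + d_nn[None, :])
    np.fill_diagonal(keep, False)
    return (keep | (mst > 0)).astype(float)         # retained edges can then be
                                                    # weighted by the DTW similarity
```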

Visualisations and layouts of the similarity graphs for the different data sets were produced using Gephi with the Force Atlas setting.39

Finding clusters of learners using Markov Stability graph partitioning

Community detection methods for graphs aim to partition the nodes of a graph into subgraphs (communities) that are well-connected within themselves and weakly connected to each other. There are multiple ways to define communities, and many methods and criteria to score the resulting partitions.40 Such methods are also related to graph partitioning problems.

Markov Stability (MS) is a generalised method for identifying communities in graphs at all scales. MS employs a random walk on the graph to define a time-dependent cost function that measures the probability that a random walker remains contained within a subgraph over a time scale t. If the random walker becomes trapped in particular subgraphs over that timescale, this identifies a good partition. As the time scale of the Markov process increases, the method identifies larger subgraphs, leading to coarser partitions. Hence MS has the ability to identify intrinsically relevant communities at all scales by using the dynamic scanning provided by the diffusive process. For a detailed description of the method, see refs 22,23.

The random walk is governed by the N × N transition matrix Q = D⁻¹A, where N is the number of nodes in the graph, A is the adjacency matrix, and D = diag(A1) is the degree matrix, where 1 is a vector of ones. Q defines the probability of the random walk transitioning from node i to node j, as given by the discrete-time process:

$${\mathbf{p}}_{t + 1} = {\mathbf{p}}_t\,Q,$$
(2)

where pt is a 1 × N vector describing the probability of the random walker being at each node at time t. An associated continuous-time diffusive process in terms of the graph combinatorial Laplacian L = D − A has the time-dependent solution:

$${\mathbf{p}}(t) = {\mathbf{p}}(0)\,e^{ - tL}.$$
(3)

The time t is denoted the Markov time and is distinct from any real time. The Markov time can be understood as a dimensionless quantity related to the diffusive process, and it acts as a resolution parameter that allows the exploration of the graph at different scales: as the Markov time increases, the partitions become coarser.

A partition of the graph into c communities is encoded in an N × c membership matrix establishing the correspondence between the nodes and the clusters:

$$H_{ic} = \left\{ {\begin{array}{*{20}{l}} 1 \hfill & {{\mathrm{if}}\,{\mathrm{node}}\,i\,{\mathrm{belongs}}\,{\mathrm{to}}\,{\mathrm{community}}\,c} \hfill \\ 0 \hfill & {{\mathrm{otherwise}}} \hfill \end{array}} \right.$$
(4)

The goodness of the partition encoded by H at time t under the dynamics governed by L is defined in terms of the c × c block auto-covariance matrix:

$$R(t;H) = H^T({\mathrm{{\Pi}}}e^{ - tL} - {\mathrm{\pi }}^T{\mathrm{\pi }})H,$$
(5)

where π is the stationary solution of (3) and Π = diag(π). The meaning of this matrix is clear: the element of the matrix [R(t; H)]αβ encodes the probability that a random walker starting in community α will be at community β after time t, and the diagonal elements, [R(t, H)]αα, indicate the probability of remaining contained in community α over time scale t. Hence a good partition H will maximise the sum of the diagonal elements, i.e., the trace of R(t, H). This leads us to our definition of the cost function, the Markov Stability of the partition:

$$r(t,H) = \mathop {{\min }}\limits_{\tau < t} \,{\mathrm{Tr}}\,\left[ {R(\tau ,H)} \right],$$
(6)

which is to be maximised at every time t by searching in the space of partitions H:

$$r^ \ast (t) = \mathop {{\max }}\limits_H r(t,H)\,{\mathrm{and}}\,H^ \ast (t) = \mathop {{\arg \max }}\limits_H r(t,H).$$
(7)
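To make Eqs. (4)–(6) concrete, the Markov Stability of a given partition can be evaluated directly as in the following sketch. It follows the combinatorial-Laplacian form of Eq. (3) literally, for which the stationary solution π is uniform (implementations often use the random-walk normalised Laplacian instead, for which π is proportional to the node degrees), and it omits the running minimum over τ < t of Eq. (6) and the search over partitions of Eq. (7).

```python
import numpy as np
from scipy.linalg import expm

def markov_stability(A, H, t):
    """Trace of the block autocovariance R(t; H) of Eq. (5) for adjacency
    matrix A and N x c membership matrix H (Eq. (4)). With the combinatorial
    Laplacian L = D - A of Eq. (3), the stationary distribution is uniform."""
    N = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A          # combinatorial Laplacian
    pi = np.full(N, 1.0 / N)                # stationary solution of Eq. (3)
    R = H.T @ (np.diag(pi) @ expm(-t * L) - np.outer(pi, pi)) @ H
    return np.trace(R)                      # to be maximised over H, Eq. (7)

# Toy example: a 4-node graph split into communities {0, 1, 2} and {3}.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1, 0], [1, 0], [1, 0], [0, 1]], dtype=float)
print(markov_stability(A, H, t=1.0))
```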

Owing to the optimisation (7) being non-convex and NP-hard, we use an efficient greedy algorithm known as the Louvain algorithm,41 which has been shown to perform well in practice and against benchmarks. Given its greedy nature, the optimised partition found by Louvain is not always the same, as it depends on the initialisation of the optimisation algorithm. Therefore, we repeat the optimisation \(\ell = 100\) times using different starting points for the algorithm. For each Markov time we thus obtain 100 optimised partitions \(H_i^ \ast (t)\) and we pick the one with maximal Markov Stability (6) in the set as the optimal partition at t:

$$\widehat H(t) = \mathop {{\arg \max }}\limits_{H \in \{ H_i^ \ast (t)\} _{i = 1}^\ell } \,r(t,H).$$

To identify the important partitions across time, we use the following two robustness criteria:23

Consistency of the optimised partition

A relevant partition should be a robust outcome of the optimisation, i.e., the ensemble of \(\ell\) optimised solutions should be similar. To assess this consistency, we employ an information-theoretical distance between partitions: the normalised variation of information between two partitions H and H′, defined as:42

$$VI(H,H\prime ) = \frac{{2\,{\mathrm{\Omega }}(H,H\prime ) - {\mathrm{\Omega }}(H) - {\mathrm{\Omega }}(H\prime )}}{{{\mathrm{log}}(N)}},$$
(8)

where \({\mathrm{\Omega }}(H) = - \mathop {\sum}\nolimits_{\cal{C}} p ({\cal{C}}){\mathrm{log}}\,p({\cal{C}})\) is a Shannon entropy, with \(p({\cal{C}})\) given by the relative frequency of finding a node in community \({\cal{C}}\) in the partition H, and Ω(H, H′) is the Shannon entropy of the joint probability. The variation of information VI(H, H′) ∈ [0, 1] is a true metric distance between two partitions based on information theory, and VI(H, H′) = 0 indicates that the two partitions are identical.

A measure of the robustness to the optimisation, at a given Markov time t, is given by the average variation of information of the ensemble of solutions obtained from the \(\ell\) Louvain runs:

$$VI(t) = \frac{1}{{\ell (\ell - 1)}}\mathop {\sum}\limits_{i \ne j} {VI} (H_i^ \ast (t),H_j^ \ast (t)).$$
(9)

If all runs of the optimisation return similar partitions, then VI(t) is small, indicating robustness of the partition to the optimisation. Hence, we select partitions with low values (or dips) of VI(t).
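For reference, the normalised variation of information of Eq. (8) and its ensemble average of Eq. (9) can be computed from label vectors as in this sketch:

```python
import numpy as np
from collections import Counter

def variation_of_information(labels1, labels2):
    """Normalised variation of information between two partitions, Eq. (8).
    `labels1` and `labels2` are length-N sequences of community labels."""
    N = len(labels1)
    p_joint = {k: v / N for k, v in Counter(zip(labels1, labels2)).items()}
    p1 = {k: v / N for k, v in Counter(labels1).items()}
    p2 = {k: v / N for k, v in Counter(labels2).items()}
    entropy = lambda p: -sum(q * np.log(q) for q in p.values())
    return (2 * entropy(p_joint) - entropy(p1) - entropy(p2)) / np.log(N)

def vi_ensemble(partitions):
    """Average pair-wise VI over the ell optimised partitions found at one
    Markov time, Eq. (9)."""
    ell = len(partitions)
    total = sum(variation_of_information(partitions[i], partitions[j])
                for i in range(ell) for j in range(ell) if i != j)
    return total / (ell * (ell - 1))
```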

Persistence of the partition across levels of resolution

Relevant partitions should also be optimal across stretches of Markov time. Such persistence is indicated both by a plateau in the number of communities over t and a low value plateau of the cross-time variation of information:

$$VI(t,t^{\prime} ) = VI(\widehat H(t),\widehat H(t^{\prime} )).$$
(10)

This provides a second measure of robustness of a partition across resolution scales, and is commonly visualised via a heatmap where blocks along the diagonal indicate partitions that are persistent. Within a time-block of persistent partitions, we choose the most robust partition, i.e., the one with the lowest VI(t).

The Markov Stability code is available at github.com/michaelschaub/PartitionStability. When the computation of the matrix exponential in (5) becomes costly for moderately large N, the linearisation of e^{−tL} provides an efficient approximate method to analyse very large graphs within the same framework.

Isotonic regression

An isotonic regression is a model that identifies the optimal least squares fit to a data set given the constraint that the model must be a non-decreasing function. The optimisation is:

$$\mathop {{\arg \min }}\limits_x \left| {y - x} \right|^2,$$
(11)

where x must be non-decreasing, i.e., x0 ≤ x1 ≤ ... ≤ xn. The algorithm looks for violations of monotonicity and adjusts the estimate to fit within the constraints.
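As an illustration of how plateau (massed-session) lengths can be extracted from a learner's trace, the following sketch fits sklearn's IsotonicRegression to the completion time of each task in course order and measures runs of tasks with essentially the same fitted completion time. The tolerance tol and the toy data are assumptions for illustration, not values from this study.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def mean_massed_session_length(completion_times, tol=1e-6):
    """Mean plateau length of the isotonic fit to a learner's time-series of
    completion times (one value per task, in course order). A plateau, i.e.,
    a run of consecutive tasks with essentially the same fitted completion
    time, is read as one massed session; `tol` is an assumed tolerance."""
    x = np.arange(len(completion_times), dtype=float)
    fit = IsotonicRegression().fit_transform(x, completion_times)
    runs, current = [], 1
    for step in np.diff(fit):
        if step < tol:            # no increase: still inside the same session
            current += 1
        else:                     # jump in completion time: session ends
            runs.append(current)
            current = 1
    runs.append(current)
    return float(np.mean(runs))

# Example: five tasks finished on day 3, then three spread-out tasks.
print(mean_massed_session_length([3, 3, 3, 3, 3, 10, 17, 24]))
```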

Gaussian process regression

The Gaussian process regression (GPR) was implemented using the sklearn Python package. The implementation is based on Algorithm 2.1 of Gaussian Processes for Machine Learning (GPML) by Rasmussen and Nickisch.28

A GPR model can be thought of as defining a distribution over functions, with inference undertaken directly in the space of functions. As such, a mean and variance that model the data can be calculated. Given that the GPR is probabilistic, we can calculate the log-likelihood of any set of trajectories being derived from a GPR optimised on another set of trajectories. Bayes factors are a method of Bayesian model comparison that quantifies the support for one model over another. The Bayes factor K for two models M1 and M2 given some data D is:

$$K = \frac{{Pr(M_1|D)}}{{Pr(M_2|D)}}\frac{{Pr(M_2)}}{{Pr(M_1)}}$$
(12)
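A sketch of one way to carry out the cluster-versus-cohort comparison using sklearn follows. The kernel choice and the data layout (a list of per-learner (t, y) trajectory arrays and a mapping from cluster labels to learner indices) are illustrative assumptions, not the exact specification used in the study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def log_bayes_factor(trajectories, clusters):
    """Log Bayes factor comparing 'one GP per cluster' against 'one GP for
    the whole cohort'. `trajectories` is a list of (t, y) array pairs, one
    per learner; `clusters` maps cluster labels to lists of learner indices.
    The kernel and data layout are illustrative assumptions."""
    kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel()

    def fitted_ll(indices):
        t = np.concatenate([trajectories[i][0] for i in indices])[:, None]
        y = np.concatenate([trajectories[i][1] for i in indices])
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)
        return gp.log_marginal_likelihood_value_   # log-likelihood of the fit

    log_k = (sum(fitted_ll(idx) for idx in clusters.values())
             - fitted_ll([i for idx in clusters.values() for i in idx]))
    return log_k   # log K > 0 favours separate per-cluster processes, cf. Eq. (12)
```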

Additional classification algorithms

To classify learners into high-, medium- and low-performance groups, we used an SVM and a Decision Tree. Both algorithms are commonly used in classification tasks and were implemented using the scikit-learn Python package.29

  • An SVM is a non-probabilistic classifier that attempts to find a hyperplane in a high- or infinite-dimensional feature space that maximises the margin between data points of differing classes. We implemented the SVM with a radial basis function kernel.

  • The Decision Tree attempts to find optimal branches (decisions) that represent conjunctions of features leading to accurate prediction of class labels. We implemented a Decision Tree with a depth of four branches; increasing the number of branches did not improve the classification accuracy.

Instead of using regression analysis between continuous dependent variables (performance) and independent variables (temporal features), we implemented classification algorithms to provide a closer comparison to our clustering results.
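For completeness, this baseline can be reproduced along the following lines; the feature values and labels below are random placeholders standing in for the two z-scored per-learner features (mean massed session length and percentage task completion) and the performance classes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((81, 2))              # placeholder for the two per-learner features
y = rng.integers(0, 3, size=81)      # placeholder performance classes (low/mid/high)

X_std = StandardScaler().fit_transform(X)                  # z-score the features

svm = SVC(kernel="rbf").fit(X_std, y)                      # RBF-kernel SVM
tree = DecisionTreeClassifier(max_depth=4).fit(X_std, y)   # depth-4 decision tree

print(svm.predict(X_std[:5]), tree.predict(X_std[:5]))
```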

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.