Process discovery with context-aware process trees

https://doi.org/10.1016/j.is.2020.101533

Highlights

  • Context-aware process trees (CaT) are defined.

  • An inductive context-aware discovery algorithm (CaDi) is proposed.

  • CaDi produces trees that are context consistent, deterministic, and complete.

  • CaDi discovers models of higher quality than the state-of-the-art.

  • CaDi can be tuned to provide CaTs that are more explainable.

Abstract

Discovery plays a key role in data-driven analysis of business processes. The vast majority of contemporary discovery algorithms aim at the identification of control-flow constructs. The increase in data richness, however, enables discovery that incorporates the context of process execution beyond the control-flow perspective. A “control-flow first” approach, where context data serves for refinement and annotation, is limited and fails to detect fundamental changes in the control-flow that depend on context data. In this work, we thus propose a novel approach for combining the control-flow and data perspectives under a single roof by extending inductive process discovery. Our approach provides criteria under which context data, handled through unsupervised learning, take priority over control-flow in guiding process discovery. The resulting model is a process tree, in which some operators carry data semantics instead of control-flow semantics. We show that the proposed approach produces trees that are context consistent, deterministic, and complete, and that can be made explainable without a major quality reduction. We evaluate the approach using synthetic and real-world datasets, showing that the resulting models are superior to state-of-the-art discovery methods in terms of measures based on multi-perspective alignments.

Introduction

Process discovery plays a key role in descriptive, predictive, and prescriptive data-driven process analyses [1]. From depicting the underlying processes [2], [3], through providing a basis for predictive monitoring [4] and conformance checking [5], to evidence-based resource scheduling [6], discovered models serve as the workhorse of process-aware data analytics.

In recent years, the focus of process discovery has been on extending the control-flow perspective to additional perspectives that represent the context of process execution, e.g., related to time and costs. Most multi-perspective discovery methods proceed in two phases [1]: (1) eliciting the control-flow, and (2) enriching the discovered models with context information. This separation may result in accuracy loss when data-based quality measures are concerned. To demonstrate the challenge, consider a business process from the life of a university faculty member, Prof. X, who works 7:00–20:00, and his efficient assistant, Wolverine (W), who works 9:00–12:00 (an event log snippet is presented in Table 1). Prof. X, being a research traditionalist, scribbles his research notes on paper (activity A) and sends them to his assistant (activity B), who types the notes using a computer (activity C). When Wolverine heads home, Prof. X must perform voice-to-text typing on his own (activity D).

Using a control-flow-oriented discovery algorithm, e.g., the inductive miner [3], for this scenario yields the process tree in Fig. 1(a). Clearly, the algorithm fails to uncover the dependencies between control-flow and execution context, such as time and resources, being guided solely by the activity labels of the events in the log. The alternative we offer in this work is a context-aware approach, which yields a process tree as shown in Fig. 1(b). Our algorithm first splits a log based on context information, discovering distinct processes at different times of the day (colored yellow and gray in Table 1). This is reflected in Fig. 1(b), where a node in the tree uses a dedicated operator that carries data semantics: It describes processes that are either executed before or after 14:05. Processing the sublogs further, potential data-based splits (e.g., based on the resource information, colored dark and light gray in Table 1) turn out to be insufficiently “significant” to take precedence over the control-flow perspective. Hence, control-flow dependencies govern the construction of the subtrees.
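To make the idea of a context-based split more concrete, the following minimal sketch clusters event start times into two groups and derives a time threshold between them. It is an illustration only and not the CaDi algorithm: the timestamps are hypothetical (they are not the values of Table 1), the helper names `to_minutes` and `two_means_1d` are invented for this example, and placing the threshold at the midpoint of the gap between the two clusters is an assumption.

```python
from statistics import mean

def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def two_means_1d(values, iterations=20):
    """Lloyd's algorithm with k=2 on scalars; returns (low_cluster, high_cluster)."""
    lo_center, hi_center = min(values), max(values)
    for _ in range(iterations):
        # Assign each value to the nearer of the two centers.
        low = [v for v in values if abs(v - lo_center) <= abs(v - hi_center)]
        high = [v for v in values if abs(v - lo_center) > abs(v - hi_center)]
        if not high:  # all values identical: nothing to split
            break
        lo_center, hi_center = mean(low), mean(high)
    return sorted(low), sorted(high)

# Hypothetical event start times; NOT the values from Table 1.
start_times = ["09:10", "09:40", "10:30", "11:15", "15:20", "16:45", "18:05"]
minutes = [to_minutes(t) for t in start_times]

early, late = two_means_1d(minutes)

# Assumption: report the split point as the midpoint of the gap between clusters.
threshold = (max(early) + min(late)) / 2
print(f"split at {int(threshold // 60):02d}:{int(threshold % 60):02d}")
```

A split of this kind only describes where the data separates. Whether it should take precedence over a control-flow cut is a separate decision that depends on how well separated the clusters are.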

The above observation is also important for performance-oriented process discovery. Based on the tree in Fig. 1(a) and the timestamps recorded in Table 1, we may arrive at the conclusion that the average duration of the process is 13.75 min. However, based on Fig. 1(b), significant differences are observed for executions before 14:05 (average duration of 8.3 min) and later (average duration of 30 min). Hence, the ability to differentiate the two scenarios increases the accuracy of the discovered model when it is used for performance analysis. In Section 2.3, we further show that splitting the log using context results in higher-quality splits.
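The relation between the three averages can be checked with a small worked example. The number of cases in each partition is not stated in this excerpt, so the sketch below assumes three cases before 14:05 and one case after; this is one combination that reproduces the reported figures. The individual case durations are hypothetical, and only the three averages come from the text.

```python
# Assumed partition sizes: 3 cases before 14:05 and 1 case after.
# Individual durations (minutes) are hypothetical; only the averages are from the text.
durations_before = [5.0, 10.0, 10.0]
durations_after = [30.0]

def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

avg_before = average(durations_before)   # 25 / 3, roughly 8.3 min
avg_after = average(durations_after)     # 30 min
avg_overall = average(durations_before + durations_after)  # 55 / 4 = 13.75 min

print(round(avg_before, 1), avg_after, avg_overall)
```

Under these assumptions the single overall average hides the fact that no case in either partition takes close to 13.75 minutes.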

To improve the accuracy of context-aware process discovery, we propose an inductive discovery technique (termed CaDi) that elicits both the control-flow and the data perspective simultaneously. To this end, we combine the divide-and-conquer approach of the inductive miner [3] with ideas from unsupervised learning. For discovering context, we define context-aware process trees (termed CaT), a representational bias that captures both control-flow and data-driven decisions. We further show that our algorithmic discovery solution produces CaTs with three desirable properties, namely, consistency, completeness, and determinism. Finally, we discuss the explainability of such models and describe a methodology to enhance it.

We use multi-perspective alignments [7] to measure the quality of discovered models (via precision, fitness, and generalization) and demonstrate empirically that our model yields higher quality than a “control-flow first” approach. In addition, we discuss the complexity and explainability of the discovered models as established from the input to CaDi.

Our main contribution is fourfold, as follows.

  1. Defining context-aware process trees and setting desiderata for fundamental properties of such trees (Section 3).

  2. Introducing an inductive context-aware discovery algorithm to create context-aware process trees that are consistent, complete, and deterministic (Section 4).

  3. Enhancing the explainability of context-aware process trees (Section 5).

  4. Providing an empirical evaluation based on synthetic and real-world datasets to demonstrate the value of our approach (Section 6).

In addition, we discuss the relation of our work to state-of-the-art literature in Section 7 and conclude in Section 8.

This paper extends our earlier work [8], setting forward desiderata for fundamental properties of context-aware trees (Section 3.2) and proving that these properties are maintained by our discovery algorithm CaDi (Section 4.2). In this work, we also address the issue of explainability and describe a methodology to enhance it (Section 5). Additionally, we experiment with a further dataset (see Section 6.2) and expand the empirical evaluation by analyzing the effect of predefined parameters on the complexity and explainability of the discovered model in Section 6.3 and Section 6.5, respectively.

Section snippets

Background

We first define our data model (event logs) and the adopted process modeling formalism (process trees), and briefly outline the approach of the inductive miner [3] in discovering process trees from event data. Subsequently, we discuss log clustering and the silhouette evaluation measure.
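For readers unfamiliar with the silhouette measure mentioned above, the following sketch computes it in its standard form: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the coefficient is (b - a) / max(a, b). The restriction to one-dimensional points with absolute distance and the function name `silhouette_score` are choices made for this illustration and are not taken from the paper.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points.

    `labels` must be a list holding at least two distinct cluster labels.
    """
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances to the other members of the same cluster.
        same = [abs(p - q)
                for j, (q, other_lab) in enumerate(zip(points, labels))
                if other_lab == lab and j != i]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # Smallest mean distance to any other cluster.
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two well-separated groups.
points = [1.0, 2.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])   # clusters match the groups
bad = silhouette_score(points, [0, 1, 0, 1])    # clusters cut across the groups
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate a clustering that does not reflect the structure of the data.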

Context-aware process trees

This section introduces context-aware process trees (CaT) as our representational bias, i.e., as the target formalism for our discovery algorithm. After we give a formal definition of a context-aware process tree (Section 3.1), we discuss several desirable properties for such a tree (Section 3.2). We later revisit these properties when introducing a new discovery algorithm.
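As an informal illustration of a process tree in which one operator carries data semantics, the sketch below models a tiny tree with a guarded node over a context attribute. It is not the formal definition of Section 3.1, which is not reproduced in this excerpt: the class names (`Activity`, `Sequence`, `ContextChoice`), the guard representation, and the subtree after 14:05 (assumed here to be A followed by D, based on the running example) are all assumptions. The check that exactly one guard holds is loosely inspired by the determinism and completeness properties named in the highlights, not a restatement of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

@dataclass
class Activity:
    label: str

@dataclass
class Sequence:
    children: List["Node"]

@dataclass
class ContextChoice:
    """Hypothetical data-semantics operator: one guarded branch per context region."""
    branches: List[Tuple[str, Callable[[Dict[str, float]], bool], "Node"]]

Node = Union[Activity, Sequence, ContextChoice]

def expected_activities(node: Node, context: Dict[str, float]) -> List[str]:
    """Return the activity sequence the tree prescribes for a given context."""
    if isinstance(node, Activity):
        return [node.label]
    if isinstance(node, Sequence):
        result: List[str] = []
        for child in node.children:
            result.extend(expected_activities(child, context))
        return result
    # ContextChoice: exactly one guard is expected to hold.
    matching = [sub for _, guard, sub in node.branches if guard(context)]
    if len(matching) != 1:
        raise ValueError("guards must select exactly one branch")
    return expected_activities(matching[0], context)

CUTOFF = 14 * 60 + 5  # 14:05 expressed as minutes since midnight

# Assumed tree for the running example; the late branch (A, D) is a guess.
tree = ContextChoice([
    ("time < 14:05", lambda c: c["minute"] < CUTOFF,
     Sequence([Activity("A"), Activity("B"), Activity("C")])),
    ("time >= 14:05", lambda c: c["minute"] >= CUTOFF,
     Sequence([Activity("A"), Activity("D")])),
])

print(expected_activities(tree, {"minute": 10 * 60}))  # morning case
print(expected_activities(tree, {"minute": 15 * 60}))  # afternoon case
```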

Inductive context-aware process discovery

We now introduce our algorithm for inductive context-aware discovery (dubbed CaDi). It combines inductive control-flow discovery and clustering of events to construct a CaT, a context-aware process tree as introduced in Section 3.

Explainability vs. optimality

Explainability and interpretability have become mandatory ingredients in any machine learning-based solution [16]. The two terms are often used interchangeably despite their differing roles in presenting the outcome of machine learning solutions. Interpretability is about the extent to which a cause and effect can be observed within a system (“I understand that if time < 14:05 I should execute A, B, C”). Explainability is the extent to which the internal mechanisms of a system can

Empirical evaluation

In this section, we present an empirical evaluation of CaDi. Our main results show that

  • CaDi yields higher-quality models when considering the data perspective than a “control-flow first” (and data second) approach;

  • CaDi makes fewer data-oriented errors than a “control-flow first” approach; and

  • the explainable version of CaDi maintains explainable CaTs without a considerable reduction in data split quality.

First, we present the experimental setup, including performance measures,

Related work

Process discovery (see [29] for a review) has been an active research area for many years. The inductive miner is one of the well-established discovery algorithms, guaranteeing soundness and fitness for the discovered models [3]. Alternatively, the split miner [30] operates on the directly-follows graph induced by an event log and combines a technique to filter infrequent behavior. Our proposed algorithm follows the divide-and-conquer approach of the inductive miner, as it also splits the event

Conclusions and future work

In this work, we proposed an algorithm for context-aware process discovery (CaDi), which considers both control-flow and context, simultaneously. The algorithm discovers context-aware process trees, which we define as a suitable representational bias to capture both control-flow and data perspectives. Specifically, CaDi combines inductive mining and K-means clustering to generate process models that capture context information explicitly through a constraint operator.

CaDi creates a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (46)

  • Mannhardt F. et al. Balanced multi-perspective checking of process conformance. Computing (2016)
  • Shraga R. et al. Inductive context-aware process discovery
  • Leemans S.J.J. et al. Discovering block-structured process models from event logs containing infrequent behaviour
  • MacQueen J. et al. Some methods for classification and analysis of multivariate observations, in: Proceedings of the...
  • Xu R. et al. Survey of clustering algorithms. IEEE Trans. Neural Netw. (2005)
  • Rozinat A. et al. Decision Mining in Business Processes (2006)
  • Business Process Model and Notation (BPMN) Version 2.0. Tech. rep. (2011)
  • Došilović F.K. et al. Explainable artificial intelligence: A survey
  • Ribeiro M.T. et al. Why should I trust you?: Explaining the predictions of any classifier
  • Dy J.G. et al. Feature subset selection and order identification for unsupervised learning
  • Buijs J.C. et al. Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplicity. Int. J. Coop. Inf. Syst. (2014)
  • De Leoni M. et al. Data-aware process mining: discovering decisions in processes using alignments
  • Mannhardt F. et al. The multi-perspective process explorer. BPM (Demos) (2015)