Introduction

The rapid development of information systems leads to a growing amount of information [1] that obtains massive stored data [2, 3]. In the presence of various types of data, events that are happened in the systems are stored in so-called event log [4]. Monitoring business processes from a massive event log brings a challenge related to Big Data. Process mining is a discipline of gathering the event log and processing the log into a process model for monitoring, including capturing anomalies [5, 6] or bottlenecks [7]. The technique of constructing a process model by process mining is called process discovery.

There are several types of relationships which are formed by process discovery algorithms. The sophisticated relationship is invisible tasks in non-free choice. This condition occurs when choice activities which involve invisible tasks are not a choice-free relation (their execution depends on previous activities). This research presents two business processes to give usage overview of invisible tasks in non-free choice.

A port implements two processes, i.e., handling imported goods and handling exported goods. In handling imported goods, several cranes lift containers on to the truck based on the types of containers. Cranes can put dry containers in the truck directly. On the other hand, there are previous tasks of moving reefer and uncontainer. For cranes to carry out appropriate activities, the process model must have makers for mapping container determination and tasks to put on the truck. Those markers are called non-free choice. Each non-free choice relationship connects an activity with another activity; however, dry containers did not have a specific task before lifting on the truck. Therefore, an invisible task is added to raise a non-free choice relationship for each type of containers. This condition is called invisible tasks in non-free choice (denoted as a black circle that relates to a non-free choice relationship in Fig. 1).

Fig. 1
figure 1

A process model of port container handling

Another example is shown in Fig. 2. An invisible task should be added before finish transaction activity, so a non-free choice that connects payment with e-facture to print tax invoice and a non-free choice that connects direct payment to finish transaction are depicted in a process model of purchasing medical equipment. A black circle with a non-free choice relationship is an implementation of invisible tasks in non-free choice.

Fig. 2
figure 2

A process model of purchasing medical equipment

The process discovery algorithms have been developed rapidly to discover several types of relations [8,9,10,11,12,13]; however, seldom algorithms concern with invisible tasks in non-free choice. \(\alpha ^{\$ }\) [14] is the development of \(\alpha\) algorithm that discovers a new issue, i.e., invisible tasks in non-free choice. \(\alpha ^{\$ }\) stores the dependencies of activities in the form of pairs of activities and collects traces of activities based on the event log and combines \(\alpha ^{{ + + }}\) rules and \(\alpha ^{\# }\) rules to construct invisible task in non-free choice. \(\alpha ^{\$ }\) has several rules for ordering relations. The rules check all pairs of activities because there is no information of relations of those activities. The checking process produces high computing time.

Graph-based process discovery algorithm was developed to minimize checking all pairs of activities for discovering a process model. This algorithm utilizes capabilities of a graph database to store activities and their relationships. Instead of observing pairs of activities, the graph-based process discovery algorithm observes their relationships directly. This method is effective because a relationship can be detected from other relationships. For example, the graph-based algorithm discovers parallel relationships based on occurrences of sequence relationships [15]. An invisible task is detected based on parallel relationships. Storing relationships can accelerate the computing time for discovery.

Existing graph-based process discovery algorithms have several drawbacks. First, they did not concern to discover invisible tasks in non-free choice. Secondly, those algorithms do not directly store event logs of the system in a graph database. Those algorithms need to convert the event log to graph database.

According to the difficulty of existing graph-based process discovery algorithms, the main contribution of this research is to propose GIT, the expansion of Graph-based Invisible Task, for discovering invisible tasks in non-free choice by extending rules of existing graph-based algorithms and integrating an Enterprise Resource Planning (ERP) system with a graph database to store the event logs directly in the graph database.

GIT algorithm is evaluated based on the quality of obtained process models and the scalability of algorithm. GIT algorithm is compared with \(\alpha ^{{^{\$ } }}\) and \(\alpha ^{{ + + }}\) algorithms based on fitness, precision, generalization, and simplicity [16] to measure the quality of the process models. Purchasing medical equipment processes are used to evaluate the quality of the model.

The time complexity and computing time are used to measure the scalability. There are several steps to measuring the computing time. First, a graph database that is used by GIT is compared with other databases, i.e., SQL and MongoDB, to stores streaming event logs and batch event logs. Secondly, the computing time of GIT algorithm is compared with the time of \(\alpha ^{{^{\$ } }}\) and \(\alpha ^{{ + + }}\) based on several event logs with different number of activities. The aim of using several event logs to gain the capacity of processes which can be handled by those algorithms.

Related works

Existing methods of process discovery

Nowadays, many methods of process discovery are developed. There are two well-know methods, i.e. \(\alpha\) [17] and Heuristic Miner [8].

\(\alpha\) algorithm forms all unique sequences of activities into a process model. This algorithm is expanded to other algorithms to handle several issues in discovering relationships of activities. \(\alpha ^{ + }\) [18] add rules of \(\alpha\) for handling looping activities. Furthermore, \(\alpha ^{{ + + }}\) [14] and \(\alpha ^{\# }\) [19] expands rules of \(\alpha ^{ + }\) with different purposes. \(\alpha ^{{ + + }}\) focuses to depict non-free choice; however, \(\alpha ^{\# }\) appends additional activities, called invisible prime tasks, in a process model to discover particular situations, i.e. skip, redo and switch. Thereafter, \(\alpha ^{{^{\$ } }}\) [4] introduces invisible tasks in non-free choice and combines \(\alpha ^{{ + + }}\) with \(\alpha ^{\# }\) to overcome that recent issue. \(\alpha\) and its expended algorithms do not consider noises, so those algorithms are suitable for noise-free event logs.

Heuristic Miner [8] is process discovery algorithm that modifies \(\alpha\) by adding a threshold to filter activities which will be formed in a process model. There are two aims of adding a threshold. First, the action can simplify an obtained process model based on logs containing huge activities and relationships. Activities which have appearance times less than the threshold will be abandoned, so the obtained process model only discover frequently executed activities. Secondly, the disposal of activities with occurrences less than the threshold can eliminates noises because noises raise rarely. Fodina [11] improves Heuristic Miner to create robust heuristic process discovery algorithm. In addition, Fodina also develops rules to handle invisible prime tasks.

\(\alpha\) algorithm, as a pioneer process discovery algorithm, inspires the formation of new methods. HMM-Parallel Tasks [20] and CHMM-Invisible Tasks [21] modifies rules of \(\alpha\) in the form of Hidden Markov Models. There are also graph-based \(\alpha\) algorithm, such as Graph-based Parallel [15] and Graph-based Invisible Task [22]. Other algorithms are developed, such as Inductive Miner [9] and RPST [10]. Table 1 describes the existing algorithms which can discover types of relations depicted by the columns. The tick signs explain that the algorithms can discover the related relations depicted by the columns.

Table 1 Algorithms of process discovery

According to Table 1, existing graph-based process discovery algorithms had not concern in invisible task in non-free choice. GIT algorithm as proposed algorithm of this research handles this type of relation.

Event log

An event log consists of several cases, which identify the executed processes. Each case has several attributes, such as identification of case (CaseId), name of activities (Activity), the execution time of the activities (Timestamp), and actors who carried out activities (Resource) [23]. For example, an event log \(\varepsilon\) = { {C1, \({\text{A}}\), 2018-08-05-15:54, Admin}, {C1, \({\text{B}}\), 2018-08-05-16:54, Customer}, {C1, \({\text{D}}\), 2018-08-05-16:54, Customer}, {C2, \({\text{A}}\), 2018-08-06-15:54, Admin}, {C2, \({\text{B}}\), 2018-08-06-16:54, Customer}, {C2, \({\text{D}}\), 2018-08-06-16:54, Customer}, {C3, \({\text{A}}\), 2018-08-07-15:54, Admin}, {C3, \({\text{C}}\), 2018-08-07-16:54, Customer}, {C3, \({\text{D}}\), 2018-08-07-16:54, Customer}. The event log \(\varepsilon\) has three cases with id C1, C2, and C3. Both C1 and C2 have the same sequence of activities, which is \({\text{A}} \to {\text{B}} \to {\text{D}}\) and C3 has \({\text{A}} \to {\text{C}} \to {\text{D}}\).

An event log contains several traces, which are unique sequences of activities. The event log \(\varepsilon\) has \(\left( {{\text{A}} \to {\text{B}} \to {\text{D}}} \right)_{2}\) and \(\left( {{\text{A}} \to {\text{C}} \to {\text{D}}} \right)_{1}\), where 2 and 1 identify the occurrence numbers. There are two traces of \(\varepsilon\), i.e., \(\left( {{\text{A}} \to {\text{B}} \to {\text{D}}} \right)\) and \(\left( {{\text{A}} \to {\text{C}} \to {\text{D}}} \right).\)

Control-flow patterns

The process model consists of two types of relations, namely sequence relations and parallel relations. Sequence relations are used to describe processes that run straight in sequence or sequentially, whereas the parallel relations are used to describe a branched process. There are three types of control-flow patterns [24] to represent parallel relations, namely AND, OR, and XOR [20, 25]. Table 2 shows examples of relations and differences between control-flow patterns in graph models.

Table 2 Control-flow patterns in process model

Methodology

This research proposes an event log storage algorithm into a graph-database for direct process discovery without exporting event logs from the ERP database and transforming .csv file into .mxml or .xes. The storage algorithm will make the discovery process more efficient than previous methods. Figure 3 shows a comparison of the previous method with the proposed method. The first one shows the steps of process discovery proposed by Aalst [17, 26]. The event log needs to be extracted from the database in the form of .csv. Then, the .csv file needs to be converted into .mxml or .xes by using Disco Tool. .Mxml file is supported by ProM 5, and .xes file is supported by ProM 6. Then, the .mxml or .xes file is imported to ProM Tool to conduct process discovery. The second one shows process discovery steps using graph-database Neo4J proposed by Darmawan et al. [27]. The event log needs to be extracted from the database in the form of .csv. Then, the .csv file is imported to Neo4J to conduct process discovery. The last one shows the steps for discovering the process model of the proposed method. By integrating Neo4J and ERP using Laravel, the event log stored in the database ERP is directly processed to execute process discovery.

Fig. 3
figure 3

The comparison process discovery methods. a The pioneer of process discovery method by Aalst et al. [17, 26], b the process discovery using graph-database [27], and c GIT process discovery model

The integration stage

The integration stage is connecting the ERP system with Neo4J, a graph database platform, in storing an event log of the system directly in the form of a graph database. The used ERP system in this research is built by Laravel. This research uses ClientBuilderfrom the GraphAwareLibrary to start the integration and ClientBuilder to make a connection between the system and the graph database.

Process discovery

  1. a.

    The algorithm of logging stage

    The logging stage in a graph-database starts when the user logs into the ERP system until the user logs out. The recording is done using the algorithm in Table 3. First, the algorithm checks the log that has been saved in Neo4J. If the last log is found in the graph-database, the login activity is entered in the last case + 1. However, if the last log is not found, then the login activity is recorded as the first case. The Neo4J connection is made using the GraphAwarelibrary by connecting ClientBuilder from the GraphAware. At that moment, the current case will be included in the session to be used in later stages.

  2. b.

    Constructing sequence relation and invisible task

    Table 3 The algorithm of logging stage

    Table 4 is used to discover sequence relations of the event log. GIT stores all variations of activities as nodes in the graph database. Then, GIT creates the sequence relations of those nodes based on activities of the event log (called an event). A sequence relation is discovered between an event and its next event with the same case id.

    Table 4 The algorithm of GIT for discovering sequence relations

    Cypher queries in Table 5 are a part of the GIT method. Based on Table 5, the steps of detecting skip, switch and redo invisible tasks are same. GIT algorithm chooses relations of activities based on its rules and inserts invisible tasks in the middle of those relations. The algorithm is implemented before control-flow pattern discovery.

    Table 5 The algorithm of GIT for invisible prime task discovery

    The next step is the discovery of the control-flow patterns, i.e., AND, OR, and XOR, as can be seen in Table 6. Control-flow patterns, that is described in Table 6, are used to determine types of parallel relations. Parallel relations can be divided into three types, which are XOR, AND, and OR. XOR relations only allows one activity to be executed in a process. AND relations allow all activities to be executed. Lastly, OR relations are relations between XOR and AND, and it only allows parallel activities that are not categorized into XOR and AND relations. Each relation is divided into two forms, which are split and join. A split happens when there are multiple activities to be chosen. A join happens when multiple activities that can be chosen in split relations are closing into one or more activities.

    Table 6 The algorithm of GIT for discovering control-flow patterns

    The last step is the discovery of invisible tasks in non-free choice. Non-free-choice relation connects an activity in one choice relation with another activity in the next choice relation. This relation shows that the next activity of choice relation cannot be freely chosen and is influenced by the activity of the previous choice relation. Figure 2 is an example of non-free-choice relation and discovery by using GIT. As depicted in Fig. 2, a node Print Tax Invoice and an additional node, i.e., Invisible Task, cannot be freely chosen even though the relation between Make Payment to Invisible Task and Make Payment to Print Tax Invoice are XORSPLIT; instead, the choice of the node Print Tax Invoice can only be taken if the previous activity taken is a node Payment with E-Facture and choice of the Invisible Task can only be taken if the previous activity taken is a node Direct Payment. Non-free-choice relation is represented by NONFREECHOICE relation. Using the algorithm in Table 7, GIT algorithm finds some nodes with outgoing relation of XORJOIN, some nodes with ingoing relation of XORSPLIT, detect invisible task, and lastly match them against some nodes inside activity label which has the same name.

    Table 7 The algorithm of GIT for discovering invisible task in non-free choice

Results and discussion

Results

This research evaluates GIT algorithm based on two aspects, which are the quality of obtained model and the scalability. The quality is measured based on fitness [28], precision [29], generalization, simplicity. The fitness value is obtained by counting the number of cases described in the model divided by the total cases in the event log. The precision value is obtained by counting the number of traces discovered in the model divided by the number of traces present in the event log. Generalization and simplicity are measured based on equations in [16].

The scalability is measured according to computing time and maximum number of traces algorithm could handle. This research divides computing time as computing time of storing event logs and computing time of discovering process models. A graph database as the chosen database of GIT is compared with SQL and MongoDB to measure computing time of storing logs. The discovery computing time of GIT is compared with those times of other algorithms: \(\alpha ^{\$ }\) and \(\alpha ^{{ + + }}\).

There are four event logs which are used in the evaluation. The first event log is generated from a medical store that already implemented Enterprise Resource Planning (ERP) system. The reason of choosing this event log is the processes are executed in the ERP system which is integrated with graph-based platform, i.e., Neo4j by using integration methods (explained in “The integration stage” section). Thereupon, the event log is noise-free. GIT algorithm refers to rules of \(\alpha\) algorithm, so this algorithm is suitable with noise-free logs. The other event logs are obtained from BPI Challenge [30,31,32]. Those event logs are chosen to measure scalability because they consist of high number of traces and cases.

The result of GIT algorithm to discover processes in the first event log is shown in Fig. 4. The full name of activities presented in Fig. 4 is shown in Table 8. In Fig. 4, the non-free choice relations occur between activity Direct Payment and an invisible task and between activity Payment with E-facture and Print Tax Invoice. Thus, this research is verified to be able to detect invisible task in non-free choice.

Fig. 4
figure 4

The process model by GIT algorithm

Table 8 List of activity names initialized in Fig. 4

Figure 5 shows the comparison of results from the previous methods with the method proposed in this study. All fitness and simplicity values of all algorithms are 1.000. \(\alpha ^{{ + + }}\) obtains lowest precision value, at 0.8644. On the other hand, \(\alpha ^{{ + + }}\) gains highest generalization value, at 0.9682. Both of \(\alpha ^{\$ }\) and GIT has same values for precision and generalization, accounting for 1.000 and 0.9681, respectively.

Fig. 5
figure 5

Result comparison between the previous methods and the proposed method

In addition to comparing the quality of obtained results, the time complexity of the algorithms is measured. The time complexity of \(\alpha ^{\$ }\) is \(O(n^{4} )\) [33], while the time complexity of GIT is \(O(n^{3} ),~\) with the explanation steps as listed in Table 9.

Table 9 Comparison of algorithm complexity

Tables 10 and 11 show the result of storing computing time and discover computing time based on six event logs. Those event logs are three event logs from BPI Challenge, processes of a medical store, and chosen processes of the medical store.

Table 10 Result of storing computing time
Table 11 Result of discovering computing time

This research calculates storing computing time by two types of event logs, i.e., batch event logs and streaming event logs. MySQL obtains lower computing times in batch event logs with average 1.092 s and streaming event logs with 0.0068 s. On the other hand, graph database has highest storing computing time in batch event logs, at 6071.5 s and has low storing computing time, at 0.0730 s. Contrary to graph database, MongoDB has low storing computing time of batch logs but highest time of streaming event logs.

Based on Table 11, GIT can handle more numbers of event logs than \(\alpha ^{\$ }\) and \(\alpha ^{{ + + }}\). The highest number traces which can be discovered by GIT algorithm is 981. On the other hand, \(\alpha ^{\$ }\) and \(\alpha ^{{ + + }}\) can discovery processes with maximum number 99 traces.

Discussion

Quality of obtained process model

The quality result of obtained process models by GIT, \(\alpha ^{{ + + }}\) and \(\alpha ^{\$ }\) is presented in Fig. 5. \(\alpha ^{{ + + }}\) has the lowest precision because this algorithm cannot describe non-free choice relationships, so several traces generated by its process model are not stored in the event log. Those traces decrease the precision value of \(\alpha ^{{ + + }}\). GIT and \(\alpha ^{\$ }\) has high precision because those algorithms can depict invisible tasks in non-free choice. Generalization values of GIT and \(\alpha ^{\$ }\) are smaller than \(\alpha ^{{ + + }}\) because invisible tasks.

Scalability

This research measures the scalability based on storing processes and discovery processes. The results of Table 10 shows graph database can handle huge event logs; however, the consuming time is fantastic. In batch event logs, the storing computing time by graph database is 6000 times of MySQL and 59 times of MongoDB. The long storing computing time by graph database will resulting in a spike in overall GIT runtime. Optimizing storing computing time by graph database is an important thing that must be studied further in the future.

Table 11 shows the discovery computing time and also maximum traces algorithm could discover. GIT algorithm has highest number of traces that can be handled among other algorithms; however, the limit traces of GIT is only 981 traces. The main reason of GIT inability to discover more than 900 traces because GIT determine sequence relationships based on all cases, not all traces. In the future work, clustering cases to obtain traces is needed before discovery processes are executed.

Conclusion

This research proposes an algorithm called GIT algorithm, which utilizes a graph database to model invisible tasks in non-free choice efficiently. The capability of storing relations in a graph database can simplify the rules of process discovery because several relations of a process model can be constructed based on other discovered relations. GIT algorithm improves Graph-based Invisible Task by combining non-free choice rules with invisible task rules from Graph-based Invisible Task.

This research compares GIT algorithm to \(\alpha ^{\$ }\) and \(\alpha ^{{ + + }}\) based on the qualities of their models. The chosen quality measurements are fitness, precision, generalization and simplicity. The experiments show that GIT and \(\alpha ^{\$ }\) has highest values of fitness and precision, which are 1. This research calculates computing time and time complexity of GIT and \(\alpha ^{\$ }\) and computing time of graph database, MySQL, and MongoDB. Graph database gains highest storing computing time of batch event logs; however, this database obtains low storing computing time of streaming event logs. The complexity of \(\alpha ^{\$ }\) is \(O(n^{4} )\); meanwhile, the complexity of GIT is \(O(n^{3} )\). Based on 99 traces, GIT algorithm discovers a process model 42 times faster than \(\alpha ^{{ + + }}\) and 43 times faster than \(\alpha ^{\$ }\). GIT algorithm can also handle 981 traces, while \(\alpha ^{{ + + }}\) and \(\alpha ^{\$ }\) has maximum traces at 99 traces. It can be concluded that the process discovery of GIT algorithm is more efficient than that of \(\alpha ^{\$ }\) algorithm, and the graph database is suitable on the streaming event log.

The future work of this research is optimizing storing processes in the graph database and clustering processes as the input of discovery processes. Those subjects should be analyzed further to increase the scalability of GIT algorithm.