Visualization and Clustering of Business Process Collections Based on Process Metric Values

Motivated by ideas of software measurement, the area of process measurement has attracted attention in recent years. Numerous process metrics have been proposed to measure (often structural) properties of business processes. In this paper, we propose heatmaps, a visualization technique for high-dimensional data originally used in genetics, for visualizing the process metric values of business process collections. In this way, new insights into the distribution of the metric values among the processes can be gained. Additionally, we use clustering for (1) analyzing the correlations between different process metrics and (2) finding (structurally) similar processes among business process collections. Our approach has been successfully applied to the SAP Reference Model processes.


Introduction
During the previous decades, the field of software measurement has created theoretical concepts for measuring software and making predictions on software quality attributes (see, e. g., [4] for an overview). Motivated by this research, several papers proposing process metrics have been published in recent years. These metrics measure (often structural) properties of business processes and can be used to characterize and compare processes. Integrated into valid prediction systems, they can be useful for predicting external process attributes like duration, costs, number of errors or understandability. As this area of research is quite young, not much knowledge about the behavior of these metrics (e. g., distribution of metric values and correlations between metrics) exists.

This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Germany License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/de/.
To gain new insights into these questions, visualizing and analyzing the process metric values of large business process collections would be interesting. However, the resulting process metric data is very high-dimensional, which makes visualization problematic.
In this paper, we propose heatmaps, a visualization technique for high-dimensional data originally used in genetics, for visualizing the process metric values of business process collections. In this way, new insights into the distribution of the metric values among the processes can be gained.
Additionally, we use clustering for analyzing (1) the correlations between different process metrics and (2) finding (structurally) similar processes among business process collections. The clustering does not consider behavioral similarity as, for example, in [13].
Finally, we apply our approach to the SAP Reference Model processes.
The remainder of this paper is organized as follows: In Section 2, we give a short overview of the area of process measurement. The use of heatmaps for visualizing the high-dimensional process metric data of business process collections is explained in Section 3. In Section 4, we present basics on clustering. The results of an experimental application of our approach are given in Section 5. Section 6 concludes the paper and presents possible future work.

Process Measurement
The area of process measurement is inspired by the works and results of software measurement. Several papers proposing process metrics have been published in recent years (see [7, pp. 1-2] for an overview).
According to Fenton and Pfleeger, there are two main types of measurement:

Definition 1 (Measurement systems) Measurement systems are used to assess an existing entity by numerically characterizing one or more of its attributes [4, p. 104].
Definition 2 (Prediction systems) Prediction systems are used to predict some attribute of a future entity, involving a mathematical model with associated prediction procedures [4, p. 104].
Besides the use for future entities, as stated in the definition of Fenton and Pfleeger, prediction systems can also be used to predict some attribute of an existing entity that is measurable only in a very laborious manner.
In [7], we show how the idea of prediction systems can be transferred to process measurement. A process has internal and external attributes. Internal attributes can be measured purely in terms of the process, separate from its behavior [4, p. 74]. Most proposed process metrics measure structural properties (internal attributes).
External attributes can be measured only with respect to how the process relates to its environment [4, p. 74]. Examples are costs, duration, number of errors and understandability.

Heatmaps
The process metric data of (large) business process collections is high-dimensional data with many data vectors. Thus, the question arises of how to visualize this data.

Several existing methods are available, but all of them have significant disadvantages:

• Scatter plots (see Figure 2 for an example) are good for visualizing large amounts of data vectors, but they are only applicable for 2D or at most 3D data.

• Radar charts (see Figure 3 for an example) are drawn in two dimensions and can display data with three or more dimensions. For each dimension, there exists an axis. The axes start in one single center point and are uniformly placed around the 360° of a circle. The points on these axes form a polygon representing one vector. Radar charts soon become confusing when the number of dimensions increases and many data vectors are depicted.

To overcome these problems, we propose the use of heatmaps, a visualization technique originally used in genetics for depicting microarray data. Recently, this method was adapted to visualizing the individuals (i. e., possible solutions) of population-based multi-objective algorithms (e. g., genetic algorithms) [12].
A heatmap displays the data as a matrix: one row per data vector and one column per dimension.

Heatmaps have many advantages compared to other visualization methods for high-dimensional data: large amounts of data can be clearly displayed on one page, and both the correlations between different dimensions and the distribution of the values of the different dimensions become visible. In our case, the process metric values of a process are displayed in one row, and the different process metrics form the columns of the matrix. External attributes (such as duration, costs, number of errors or understandability) can be added as additional columns of the heatmap if desired.
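This matrix layout can be illustrated with a small Python sketch (not from the paper; the metric values and the character-based rendering are invented for illustration). It builds a processes-by-metrics matrix, min-max-normalizes each column so that metrics with different domains become comparable, and renders the result as a crude text heatmap:

```python
import numpy as np

# Hypothetical metric values for 5 processes (rows) and 3 metrics (columns),
# mirroring the heatmap layout described above.
values = np.array([
    [12,  3, 0.10],
    [45,  9, 0.04],
    [ 7,  1, 0.30],
    [60, 14, 0.02],
    [25,  6, 0.08],
], dtype=float)

# Min-max normalize each column (metric) into [0, 1].
mins, maxs = values.min(axis=0), values.max(axis=0)
normalized = (values - mins) / (maxs - mins)

# Render a crude text heatmap: darker characters = larger values.
shades = " .:-=+*#%@"
for row in normalized:
    print("".join(shades[int(v * (len(shades) - 1))] for v in row))
```

A real implementation would of course use a graphics library (e. g., matplotlib's `imshow`) instead of text shading, but the data preparation is the same.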

Clustering

Basics
A good overview of clustering is given by Berkhin in [1].
The general goal of clustering is to partition a set X ⊆ R^n of data points into k subsets (clusters) C = {C_1, ..., C_k}. These clusters are disjoint, and their union is equal to the full set of data points.
Two often used methods in practice are hierarchical and partitive clustering. These are explained in more detail in the following subsections.

Hierarchical Clustering
The result of a hierarchical clustering is a so-called clustering tree (dendrogram) (see the top of Figure 5 for an example). Each node of this tree has a corresponding cluster: the cluster of a node is the union of all clusters belonging to this node's child nodes.
Hierarchical clustering can be divided into agglomerative (bottom-up) and divisive (top-down) algorithms for constructing the clustering tree.
In this paper, agglomerative hierarchical clustering is used. The approach is described in pseudo code in Algorithm 1.
Function AGGLOMERATIVE(X)
Input: set X of data vectors
Output: clustering tree (dendrogram)
 1: {initialize: assign each vector to its own cluster}
 2: for all x_i ∈ X do
 3:   C_i ← {x_i}
 4: end for
 5: numberClusters ← |X|
 6: while numberClusters > 1 do
 7:   {compute distances between all clusters}
 8:   for all C_i ∈ C do
 9:     for all C_j ∈ C with C_j ≠ C_i do
10:       compute the inter-cluster distance between C_i and C_j
11:     end for
12:   end for
13:   {merge the two clusters C_i and C_j that are closest to each other}
14:   C ← (C \ {C_i, C_j}) ∪ {C_i ∪ C_j}
15:   numberClusters ← numberClusters − 1
16:   {store information about the two sub-clusters for the clustering tree}
17: end while

Possible inter-cluster distance measures are:

• complete linkage: d_CL(C_i, C_j) = max_{x_i ∈ C_i, x_j ∈ C_j} d(x_i, x_j)   (3)

• average linkage: d_AL(C_i, C_j) = (1 / (|C_i| · |C_j|)) Σ_{x_i ∈ C_i} Σ_{x_j ∈ C_j} d(x_i, x_j)   (4)

In each of these measures, d(x_i, x_j) is a distance measure between the two vectors x_i and x_j. This could be, for example, the Euclidean distance

d(x_i, x_j) = ||x_i − x_j||_2   (5)
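The agglomerative procedure can be sketched in Python as follows (a naive, unoptimized illustration with complete linkage that stops once a desired number of clusters is reached; the function names and toy data are ours, not from the paper):

```python
import numpy as np

def complete_linkage(ci, cj, dist):
    # Inter-cluster distance: maximum pairwise vector distance.
    return max(dist[a][b] for a in ci for b in cj)

def agglomerative(points, k):
    """Start with one cluster per vector, then repeatedly merge the
    two closest clusters until only k clusters remain."""
    n = len(points)
    # Pairwise Euclidean distances between all vectors.
    dist = [[float(np.linalg.norm(points[a] - points[b])) for b in range(n)]
            for a in range(n)]
    # Initialize: each vector forms its own cluster (stored as index lists).
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Find the pair of clusters with minimal complete-linkage distance.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: complete_linkage(clusters[p[0]],
                                                  clusters[p[1]], dist))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative(data, 2))  # merges the two well-separated pairs
```

A full implementation would additionally record each merge step to build the dendrogram instead of stopping at a fixed k.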

Partitive Clustering: k-means
The k-means clustering algorithm is a randomized clustering approach that generates a disjoint, non-hierarchical partitioning consisting of k clusters. The algorithm is described in pseudo code in Algorithm 2.

Function KMEANS(X, k)
Input: set X of data vectors, number of clusters k
Output: clustering C with k clusters
 1: for i = 1 to k do
 2:   randomly initialize cluster center (centroid) c_i
 3: end for
 4: repeat
 5:   {compute partitioning for data}
 6:   for i = 1 to k do
 7:     C_i ← ∅
 8:   end for
 9:   for j = 1 to |X| do
10:     add x_j to that C_i with the shortest Euclidean distance between x_j and c_i
11:   end for
12:   {update cluster centers}
13:   for i = 1 to k do
14:     c_i ← (1 / |C_i|) Σ_{x ∈ C_i} x
15:   end for
16: until the partitioning stays unchanged
17: return C = {C_1, ..., C_k}

The algorithm minimizes the error

E(C) = Σ_{i=1..k} Σ_{x ∈ C_i} ||x − c_i||_2²   (6)

As the k-means algorithm does not depend on previously found sub-clusters, it often results in better clusterings than those gained with hierarchical approaches. Yet, as it is a randomized algorithm, its execution is non-deterministic, possibly resulting in several different clusterings for the same data set X and value k. Thus, the question arises how to choose the number k of clusters and how to choose among the different clusterings potentially found for the same number of clusters.
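A minimal Python sketch of this k-means loop (with a vectorized assignment step; the toy data and names are ours, not from the paper):

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Plain k-means: random initial centroids, then alternate
    assignment and centroid update until the partitioning is stable."""
    # Randomly pick k distinct data vectors as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignment = None
    for _ in range(max_iter):
        # Assign each vector to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(assignment, new_assignment):
            break  # partitioning unchanged -> stop
        assignment = new_assignment
        # Update each centroid to the mean of its assigned vectors.
        for i in range(k):
            members = X[assignment == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return assignment, centroids

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 20 points each.
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, 2, rng)
```

The non-determinism discussed above comes from the random centroid initialization; running with different seeds can yield different partitionings.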
One possible solution to this problem is the Davies-Bouldin index [3], defined as

DB(C) = (1/k) Σ_{i=1..k} max_{j ≠ i} (S_{C_i} + S_{C_j}) / d_ce(C_i, C_j)   (7)

Thereby, S_C is defined as

S_C = (1 / |C|) Σ_{x ∈ C} ||x − c||_2

and acts as a dispersion measure quantifying the average centroid distance of the cluster's vectors.

The measure d_ce is defined as

d_ce(C_i, C_j) = ||c_i − c_j||_2

and quantifies the distance between two clusters (centroid linkage).
An optimal clustering consists of "compact" clusters with small dispersion and large distances between the individual clusters. Looking at (7), one can easily see that such an optimal clustering minimizes the value of the Davies-Bouldin index.
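The index can be transcribed directly from the definitions above (the toy clusters are invented to show that compact, well-separated clusters score lower):

```python
import numpy as np

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (arrays of vectors):
    the average, over all clusters, of the worst-case ratio of summed
    dispersions to centroid distance."""
    centroids = [c.mean(axis=0) for c in clusters]
    # S_C: average distance of a cluster's vectors to its centroid.
    S = [np.linalg.norm(c - m, axis=1).mean()
         for c, m in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst ratio against any other cluster j (centroid linkage d_ce).
        total += max((S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
                     for j in range(k) if j != i)
    return total / k

# Compact, well-separated clusters vs. loose, closer ones.
compact = [np.array([[0.0, 0.0], [0.2, 0.0]]),
           np.array([[10.0, 10.0], [10.2, 10.0]])]
loose = [np.array([[0.0, 0.0], [4.0, 0.0]]),
         np.array([[10.0, 10.0], [6.0, 10.0]])]
print(davies_bouldin(compact), davies_bouldin(loose))
```

As expected, the compact, well-separated clustering yields the smaller (better) index value.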

Experimental Application of the Approach

Selected Process Metrics
As already stated, numerous process metrics have been proposed in the literature. Yet, they require different process representations (e. g., Petri nets, workflow nets or EPCs). In order to compare the process metrics, we had to choose metrics that are applicable to the same process representation. Looking at a recent overview of proposed process metrics [7, pp. 1-2], we chose metrics for EPCs.

A business process model (in EPC representation) is a special kind of graph G = (N, A) consisting of a set N of nodes and a set A ⊆ N × N of arcs. There are two node types: tasks T and connectors C (N = T ∪ C). Tasks can be functions F or events E (T = F ∪ E); connectors can be splits S or joins J (C = S ∪ J). Each connector has one of the labels AND, XOR or OR. Each connector c ∈ C has an in-degree d_in(c) = |{(n_1, n_2) ∈ A | n_2 = c}|, an out-degree d_out(c) = |{(n_1, n_2) ∈ A | n_1 = c}| and a degree d(c) = d_in(c) + d_out(c).
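The degree definitions can be illustrated with a minimal Python sketch (the tiny EPC fragment and node names are hypothetical, chosen only for illustration):

```python
# Hypothetical tiny EPC fragment: arcs from the set A of G = (N, A).
# An event e1 leads into an XOR split xor1 with two outgoing functions.
arcs = {("e1", "xor1"), ("xor1", "f1"), ("xor1", "f2")}

def d_in(c):
    # In-degree: number of arcs (n1, n2) with n2 = c.
    return sum(1 for (n1, n2) in arcs if n2 == c)

def d_out(c):
    # Out-degree: number of arcs (n1, n2) with n1 = c.
    return sum(1 for (n1, n2) in arcs if n1 == c)

def d(c):
    # Degree: sum of in-degree and out-degree.
    return d_in(c) + d_out(c)

print(d_in("xor1"), d_out("xor1"), d("xor1"))  # 1 2 3
```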
The 33 selected EPC process metrics are listed in Table 1.

Selected Processes
We selected the SAP Reference Model [2, 5], which was part of SAP R/3 until version 4.6, as process collection. We first validated the EPCs according to the requirements for syntactically correct EPCs [10, pp. 42-46]. Furthermore, we discarded EPCs consisting of several graph components. Out of the 604 non-trivial EPCs of the SAP Reference Model, we had to remove 89 because of invalidity or several graph components.
Finally, 515 EPCs remained for the following experiment with our approach.

Results
The 33 process metric values of the 515 selected processes are depicted in the heatmap of Figure 5.
The values of each process metric are normalized into the interval [0, 1] as their domains are too different. The metrics control-flow complexity (CFC) and join complexity (JC) are logarithmically normalized as both have some outliers with extremely high values compared to the bulk of the values.
The rows (i. e., processes) are ordered by the number of nodes metric (S_N). The columns (i. e., process metrics) are hierarchically clustered using 1 minus Spearman's rank correlation coefficient [11, pp. 42-45] as the distance between two columns (process metrics) within the complete linkage inter-cluster distance measure of equation (3).
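This column distance, 1 minus Spearman's rank correlation coefficient, can be sketched as follows (a simplified version that ignores the tie handling a full implementation would need; the metric columns are invented toy data):

```python
import numpy as np

def spearman_distance(x, y):
    """1 - Spearman's rank correlation between two metric columns.
    Simplified: ranks via double argsort, no correction for ties."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rho = np.corrcoef(rx, ry)[0, 1]  # Pearson correlation of the ranks
    return 1.0 - rho

# Toy metric columns for five processes.
nodes = np.array([3.0, 8.0, 15.0, 21.0, 40.0])
arcs_ = np.array([2.0, 9.0, 14.0, 25.0, 47.0])   # grows with nodes
density = np.array([0.9, 0.5, 0.3, 0.2, 0.1])    # shrinks with nodes

print(spearman_distance(nodes, arcs_))    # near 0: strongly correlated
print(spearman_distance(nodes, density))  # near 2: anti-correlated
```

The distance is close to 0 for strongly positively correlated metrics and close to 2 for strongly negatively correlated ones, so clustering the columns with it groups metrics that rank the processes similarly.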
The data is clearly displayed in the heatmap on one page, so the main goal of the visualization is fulfilled. Furthermore, several observations can be made:

• There is a strong positive correlation between the size metrics number of connectors (S_C), number of events (S_E), number of nodes (S_N) and number of arcs (S_A).

• There is a negative correlation between most metrics (e. g., size metrics) and the metrics separability (Π), sequentiality (Ξ), cross-connectivity (CC), density (1) (∆) and weighted coupling (CP). The negative correlation is especially strong between S_C, S_E, S_N and S_A on the one side and ∆ and CP on the other.

A clustered version of the heatmap is depicted in Figure 6. The clustering was done using the k-means clustering algorithm for three clusters. Before clustering, the input data (normalized metric values from the non-clustered heatmap) was scaled to mean 0 and variance 1 for each dimension. The selection of the optimal number of clusters and of the optimal clustering for this cluster number was done using the Davies-Bouldin index.

Conclusion and Future Work
In this paper, we proposed heatmaps as a visualization technique for the high-dimensional process metric data of business process collections to gain new insights into the distribution of metric values among processes. Additionally, we suggested clustering for analyzing the correlations between process metrics and finding (structurally) similar processes among business process collections.
We successfully applied our approach to the SAP Reference Model processes. We could demonstrate that the visualization of 33 process metric values for 515 processes using heatmaps is possible and still clear for a human observer. Furthermore, interesting insights into the correlations between process metrics and the clustering of the processes of the collection could be gained.
For future work in this area, we suggest applying the approach to other process collections as well. It would be interesting to analyze whether these processes exhibit similar correlations between the process metrics and a similar distribution of metric values as the processes examined in this paper.

Figure 6: Clustered heatmap displaying 33 process metric values for 515 processes. The rows are separated into three clusters (see bar with gray scale values at the left). The columns are hierarchically clustered using 1 minus Spearman's rank correlation coefficient as distance measure.