multiprocessmining.org is an initiative to gather use cases and develop technologies for process mining over
multiple objects and entities,
multiple actors,
multiple systems,
multiple levels,
…
multiprocessmining.org is an initiative to gather use cases and develop technologies for process mining over
multiple objects and entities,
multiple actors,
multiple systems,
multiple levels,
…
a use case, a dataset, a tool, a paper, a presentation, a case study, an opinion on Multi… Process Mining you want to share and highlight?
The process mining landscape is evolving rapidly – both in industry and in research. But our education has not kept up. The “(still) very successful “original” Process Mining MOOC Process Mining in Action has been extremely successful in teaching the foundations of Process Mining to a broad audience of practitioners, students, and researchers aiming to get familiar with the field.
Since then, many new courses covering Process Mining have become available, both from vendors and the Covid-enabled video recordings of BPM and Process Mining courses uploaded to Youtube.
What is sorely missing in this landscape however is more structured education around the most recent, fundamental trends in process mining. At the Process Mining Summer School and at the BPM 2022 conference, I gave two lectures that summarized 10 years of research on Multi-Dimensional Process Analysis. The reactions were very positive and enthusiastic. At the same time, a 1.5 hour lecture is not a proper course. It lacks the exercises, tutorials, and longer timeline for learning, which we can offer to students enrolled in Master-level courses on Advanced Process Mining. Spurred by a comment on LinkedIn I chose to do something about this.
While this is an experiment, I think it is worth a try…
Taking the material I have collected and built over the last years, I have started building an Online Course: Multi-Dimensional Process Analysis. Starting today, I am gradually releasing “chapters” covering the following 12 topics in the coming weeks and months:
Each part comes with a video, open access literature, exercises and tutorials. It is aimed at (PhD) students, researchers from other fields, and specifically process mining professionals who want to learn more about one the most important developments in the process mining field.
This course is essentially a hypermedia online textbook. It does not require to enroll on any platform – but there is also no certificate for completing it, except for the knowledge you have gained in becoming a better process mining expert. Interested? Then head over to the course: it’s up and running.
This post summarizes my talk on “Data Storage vs Data Semantics for Object-Centric Event Data” I gave at the XES symposium at the ICPM 2022 conference.
I first provide a bit of background and the problem we faced, before I turn to the issue of separating data storage from data semantics.
Process Mining analysis starts when someone made event data available in a form that can be directly loaded into a process mining solution. Providing data suitable for analysis is a very labor-intensive process in itself, often taking the lion share of time until the first insights are obtained. Event data exchange formats are supposed to lower the hurdle by providing a common set of semantics concepts that all process analysis problem share.
The existing XES standard suffers from the limitation of pre-grouping events under a single case identifier, which makes it hard to use for analysis questions on many real-life systems. Aware of this problem, the process mining community has started a working group to develop a more versatile event log standard, currently called Object-Centric Event Data (OCED). A first intermediate result of this discussion is a meta-model that illustrates which key concepts we need to consider in data exchange.
The OCED meta-model tries to strike a careful balance between having a simple standard and adding more expressivity to event data exchange formats that includes:
Increasing the expressiveness of the data exchange format with concepts to describe how events operated over an entity-relational data model has an undesired consequence: the meta-model to describe all possible ways of how data can be operated on has to suddenly incorporate a large amount of semantic concepts. At the bare minimum, this is creating, reading, updating, deleting objects, attributes, and relations individually and together. At the same time, using just the bare minimum of concepts immediately reveals use cases that cannot be naturally expressed, asking for even more semantic concepts.
How to escape this dilemma was the topic of my talk, which I developed over many discussion with Marco Montali.
The fundamental point I am making is that the meta-model for data exchange in process mining has to separate the data storage from the data semantics.
Which problem are we solving with data exchange?
We want to help an analyst solve an analysis task for a real process. The analysis fundamentally is to understand the “ground truth” of the dynamics in the real world and to map it to an analysis artifact (digital image, shadow, twin, …) that is consistent with reality for the specific analysis task. A major complication in understanding processes operating on data is that these operations come at different levels of granularity:
Because inspecting reality is costly, we pass the data recorded by the underlying IT systems to process mining analysis tools which can reconstruct this digital image of reality. Again, that image has to be consistent with the ground truth of the real process to be of value.
The data exchange format has to allow this – no more, but also no less.
Unfortunately, the data storage in the underlying systems is optimized for usage: entities and objects are distributed over multiple tables, there are no events, just time-stamped records, relations can be expressed in many ways, activities are materialized in various forms. Moreover, data recording is “lossy” from the perspective of a historic analysis: certain data is never recorded or recorded incorrectly, other data is discarded when it’s no longer needed for operations.
As a consequence, the recorded data may (in parts) be inconsistent with the ground truth:
While these inconsistencies are ideally resolved by improving the data storage of the underlying IT systems, in reality most systems will store inconsistent data. As “consistency with the ground truth” is depending on the analysis task, it is impossible to fully resolve it when exporting the data into the data exchange format.
I therefore argue that the data exchange format should facilitate fast data exchange and rather write out data as-is (as it’s issues and problems are also known). A fundamental property of this data export must be
This doesn’t tell us yet how the exported data should be structured. For this, we should look at what we need to do with the data.
Even the most basic, classical process mining analysis already separates data semantics from data storage. Classical process mining starts from building a classical event log. The de-facto standard are CSV files where one column holds a timestamp and other columns essentially hold a number of attribute-value pairs. To turn this CSV file into an event log, the analyst has to pick one attribute that as case identifier and another attribute as activity.
And the analyst has to make this choice in line with the analysis task and what makes the data (more) consistent with the ground truth. This may require to combine multiple attributes to define the right case identifier to refine activities by using multiple attributes.
The space of choices simply becomes larger when we start including dynamics over data in the question.
We now have to make choices for how to model the dynamics of various entities (data objects, actors, equipment, …), relations, and attributes over various levels of granularity (from basic data operations via transactions to actions and entire processes). Each real-world process has its own, specific data structures. The analysis fundamentally has to reconstruct relevant aspects of these data structures (consistency with domain knowledge) in line with the analysis task. This requires flexibility during the import – specifically because different analyses may focus on different aspects.
This suggests that we should generalize the choice of activity and case identifier we do in the classical case to a description of how the recorded and provisioned data maps to the domain knowledge. This would provide the analyst with the design space to build their analysis.
We can create a separate semantic header that is separate from the raw exported data exported. This header has to how the stored data maps to the domain data model of the process.
The storage format for the raw exported data has at least to be:
On top of this, a domain export provides a semantic header that consists of
This semantic header is not tied to the specific data export. It can be provided beforehand, but also be revised and updated after data export as the analysis evolves or better domain information becomes available. This allows to gain more insights on previously exported data without having to re-export the data again.
To enable process mining over multiple objects, the semantic header has to to express at least which attributes and values describe objects, identifiers, the relevant relations between them – and which attributes belong to an object (are stored somewhere) and which attributes belong to an event (are just observations of a context that is not further stored). This is essentially what the OCED meta-model tries to achieve.
How could demonstrate the feasibility of this semantic header for solving this data exchange problem? The top-part of the above figure sketches how we could specify a mapping from attribute-value pairs of a raw data export to the basic concepts of the OCED meta-model:
This mapping only grounds the objects in the recorded data. We still need to map out the semantics of the operation on the data, which is represented in the OCED model as qualifiers of the relations between events and objects, attributes, and relations. This could for example be done by rules over the attributes such as: If event has type “Create Order” and refers to an object of type Order, then this object is created by the event.
While these semantic concepts can be grounded in a single event record, other concepts may need grounding in multiple event records, such as: which events together from a batch, which events belong to the same activity (start and end events), or which events modify a relation that was created previously.
Some of these process dynamics will be common among many processes, and also be stored using very similar or identical data structures. Others will differ strongly among processes and be represent very differently. By developing the key concepts for specifying semantic headers of process event data, we give an analyst the flexibility to specify the exact semantic model required for the specific process and analysis – without polluting the data storage format with the specifics of the process.
As a community, we have the obligation to help standardize semantic concepts of process that are shared by many processes. The OCED model is arguably the shared baseline. We should start here and create a few proof-of-concept implementations of semantic headers and functions for OCED that transform basic CSV (or OCEL) files into data analysis structures, specifically for object-centric process mining and event knowledge graphs.
Also this year, we are organizing the 3rd International workshop on Event Data and Behavioral Analytics (EdbA’22) at the ICPM’22. Part of our workshop philosophy is to dedicate half of the time to open discussions on the topics of the workshop. The discussion often turns to what makes the problems we are looking at difficult – and it’s usually hidden assumptions and under-researched topics. The second and third session on “Deviation Analysis” and “…beyond Control-Flow” surfaced the following topics (for which I managed to record a few short bullet points and questions).
See also The hidden assumptions and forgotten topics of Process Mining, part 1 (Human Behavior) for my notes on “Human Behavior”.
Typical structure in deviation analysis: first detect where did something happen we didn’t expect, then explain why this happened. Do we have to separate both steps or should they be combined/integrated?
How do we pick the right approach for detecting anomalies?
Adding more features to event data and analytics (data attributes/properties) makes the analysis more complex.
The common topic of these above questions is that our standard approaches lack concepts and data structures for modeling the context of a particular behavior that is outside the specific case. The two dimensions of Multi-Dimensional Process Thinking of the inner and the outer scope of a process (analysis) may help to think about these concepts and data structures by taking a different perspective on processes.
Also this year, we are organizing the 3rd International workshop on Event Data and Behavioral Analytics (EdbA’22) at the ICPM’22. Part of our workshop philosophy is to dedicate half of the time to open discussions on the topics of the workshop. The discussion often turns to what makes the problems we are looking at difficult – and it’s usually hidden assumptions and under-researched topics. The first session on “Human Behaviors” surfaced the following topics (for which I managed to record a few short bullet points and questions).
Classical event logs have the fundamental shortcoming of describing process behavior only in isolated process executions from the viewpoint of a single case entity. Most real-life processes involve multiple entities and process executions are not isolated but tightly connected – through actors and entities involved in multiple executions. This post summarizes how event graphs are a better model to describe and query such dynamics. I explain the basic ideas and how to construct a graph from an ordinary event table. By Dirk Fahland.
Classical event logs – the basis for all process mining – use a case identifier attribute to study the behavior in a trace: group all events with the same case identifier value and order them by time. We then can study the temporal ordering of process activities along the sequence of events to get insights into the process. The most basic temporal relation is the directly-follows relation: in a trace <e1,e2,e3,e4,e5,…> event e4 directly follows event e3.
But most real-life event data has more than one case identifier attribute. This forces to either leave out events or directly-follows relations or to squeeze events under a case identifier where they don’t really belong.
Consider the event table shown below. It has the standard event log attributes: action, time(stamp), but lacks a case identifier column that is defined for every event.
Most real-life event data is like this in their source system. The event data above most likely originates from three different tables
Creating a classical event log that contains all the events is impossible – unless one is willing to accept false information such as event duplication, or false directly-follows relations. For example, if we engineered a case identifier that separates events according to the horizontal line in the above table,
All this together makes it hard to answer the following question: Which problems preceded (possibly caused) the final Update on Laptop 24?
In the following, we explain how a graph-based structure based on the concept of Labeled Property Graphs allows to naturally describe this data. With the graph-based model, we can avoid introducing false information and better answer questions like the cause for the final update on Laptop 24.
A labeled property graph has nodes and edges (called relationships).
By using relationships between nodes in a graph, each Event can be correlated to multiple Entity node and each Event nodes can have multiple preceding and succeeding Event nodes (via DF) allowing model multiple DF-paths (one per entity).
Here is how you can construct such an event graph using these concepts.
Every event record in the input event table becomes an Event node. Every attribute of the event record becomes a property. We get the following graph of just Event nodes with properties.
In classical event log construction, we now would pick a unique case identifier column.
In event graph construction, we now identify which attributes refer to entities – and there can be multiple in the data. We pick the ones we are interested in.
This results in the following graph.
Note how most events are correlated to more than one entity.
In classical event logs, all events related to the same case identifier are ordered by time into a trace.
In an event graph, all events correlated to the same Entity node (via CORR edges) are ordered by time and we create a directly-follows DF edge between any two subsequent events. We store in the DF edge properties for which entity each DF edge holds. Doing this for the Entity node for Ticket = 11 results in the following graph.
We obtain a DF-path over three events from Problem Report via Investigate to Close Ticket. This DF-path states: “Entity Ticket=11 observed the sequence <Problem Report, Investigate, Close Ticket>”.
We have to do this for each Entity node for which we want to study behavior. The image below shows all resulting DF-paths over all events for all entities, but for readability we show only the CORR edge to the first event of a path.
Done. This graph fully describes the behavioral information contained in the original event table using local DF-relations per entity.
Looking at the DF-paths, we can directly see
The striking difference in complexity of paths for the three different component types can be explained by the very different nature of the entities.
We do see the different natures back in the way the different DF-paths synchronize in the graph.
There is more behavioral information that can be inferred in this graph, for example the dependency of Close Ticket for Ticket=12 on the final Update event of Laptop 24. Doing so requires to infer relationships between two Entity nodes and to materialize a relationship as a new derived Entity node, say between Ticket 12 and Laptop 24. We then can analyze the behavioral dependencies between two entities. How to do this is explained in this paper: https://doi.org/10.1007/s13740-021-00122-1
Event graphs based on labeled property graphs can be stored in a graph database system such as neo4j, which allows to query the graph using a graph query language.
For example, if we want to answer our earlier question “Which problems preceded (possibly caused) the final Update on Laptop 24?” we can write the following Cypher query
MATCH
(e1:Event {Action:”Problem Reported”})-[:DF*]->(e2:Event {Action:”Update”}),
(e1) -[:CORR]-> (t1:Entity {type:”Ticket”}),
(e1) -[:CORR]-> (c1:Entity {type:”Component”}),
(e2) -[:CORR]-> (c2:Entity {type:”Component”})
WHERE
NOT EXISTS (e2) -[:DF]-> (:Event)
RETURN e1,e2,t1,c1,c2
This query matches a DF-path from a Problem Reported event e1 to an Update event e2 where e2 may not have any subsequent event (there is no outgoing DF relationship). It returns e1,e2 and the ticket t1 and component c1 correlated to e1 and the component c2 related to e2.
Applied on our example graph, it will return two matches as highlighted below: e1 matches once with Problem Reported for Ticket 11 and once with Problem Reported for Ticket 12.
Note that the first matching path only uses DF-edges of one entity (Laptop 24) while the second matching path uses DF-edges of two different Entities (Change C4 and Component 36).
The above query is very limited: it works only for one specific Update event, it does not return the actual paths, and it does not collect the different results. All of these limitations can be overcome by extending the Cypher query.
The following open access paper https://doi.org/10.1007/s13740-021-00122-1
A complete implementation for importing 5 public event logs into event graphs is available here: https://doi.org/10.5281/zenodo.3865221
Here are 5 fully prepared event graphs from 5 different public event logs, both as Neo4j database dump and as GraphML file to import and analyze.
This paper uses event graphs to identify task patterns in processes by analyzing how DF-paths along cases and DF-paths along resources synchronize. https://doi.org/10.1007/978-3-030-85440-9_13
This paper uses event graphs to model and analyze patient care pathways with multi-morbid patients. https://arxiv.org/abs/2110.00291
Process Mining over multiple perspectives, dimensions, objects, entities, abstraction levels, etc. is gaining a lot of attention in the process mining field in 2021. If you want to learn more about the emerging topics in Multi… Process Mining, this short guide has you covered for the workshop day (1st Nov 2021) of the 3rd International Conference on Process Mining 2021 in Eindhoven https://icpmconference.org/2021/. By Dirk Fahland
I may have missed a talk. I made this overview is to the best of my understanding of the various announcements or prior knowledge on the presenters’ works. Let me know if I missed your talk that is also related to Multi… Process Mining.
ML4PM http://ml4pm2021.di.unimi.it, 1st Nov 09:00 – 10:15
Mahsa Pourbafrani, Shreya Kar, Wil van der Aalst and Sebastian Kaiser. Remaining Time Prediction for Processes with Inter-Case Dynamics, https://twitter.com/pads_rwth/status/1447848232986480640?s=21
Andrea Chiorrini, Claudia Diamantini, Alex Mircoli and Domenico Potena. Exploiting Instance Graphs and Graph Neural Networks for next activity prediction, https://iris.univpm.it/handle/11566/292094
PODS4H, https://pods4h.com, 1st Nov, 9:00-10:15
“Verifying Guideline Compliance in Clinical Treatment using Multi-Perspective Conformance Checking: a Case Study.” By Joscha Grüger, Tobias Geyer, Martin Kuhn, Ralph Bergmann and Stephan Alexander Braun.
ML4PM http://ml4pm2021.di.unimi.it, 1st Nov 11:00 – 12:15
Mahmoud Shoush and Marlon Dumas. Prescriptive Process Monitoring Under Resource Constraints: A Causal Inference Approach, https://arxiv.org/pdf/2109.02894v1.pdf
EdbA http://edba.science/program/, 1st Nov, 11:00-12:15
Tobias Brockhoff, Merih Seran Uysal, Isabelle Terrier, Heiko Göhner and Wil van der Aalst. Analyzing Multi-level BOM-structured Event Data.
SA4PM https://sa4pm.win.tue.nl/2021/program/ 11:00-11:25
PErrCas: Process Error Cascade Mining in Trace Streams, Anna Wimbauer, Florian Richter, and Thomas Seidl, https://sa4pm.win.tue.nl/2021/wp-content/uploads/sites/3/2021/10/ICPM2021_paper_150.pdf
PODS4H, https://pods4h.com, 1st Nov, 11:00-12:15
“Discovering care pathways for multi-morbid patients using event graphs.” By Milad Naeimaei Aali, Felix Mannhardt and Pieter Jelle Toussaint. https://twitter.com/milad_n_a/status/1446112238759153664?s=21
PQMI2021, http://processquerying.com/pqmi2021/, 1st Nov, 14:00-15:00
Celonis PQL: A Query Language for Process Mining, Jessica Ambros
PQMI2021, http://processquerying.com/pqmi2021/, 1st Nov, 15:45-16:05
An Event Data Extraction Approach from SAP ERP for Process Mining, Alessandro Berti, Gyunam Park, Majid Rafiei and Wil van der Aalst, https://twitter.com/pads_rwth/status/1447552080831471626?s=21
EdbA http://edba.science/program/, 1st Nov, 15:45 – 17:15
Plenary Workshop Discussion on behavioral analytics and event data beyond classical event logs.
The only problem with this fully packed workshop program is: it’s impossible to follow all talks as a single person.
There is a lot more on Multi… Process Mining coming on the other days of the conference https://icpmconference.org/2021/events/ in research, tool demos, the industry day l, the XES 2.0 workshop, and at the vendor booths.
By Dirk Fahland.
Here is a list of activites by the Process Mining vendors at ICPM that I have seen in the program that – in my view – will have a relation or show use cases for Multi… Process Mining.
It seems that Thursday 9th October 12:30-14:00 is high-noon for Multi…Process Mining of the process mining vendors at ICPM.
I may have missed some relevant events – if I did, let me know in the comments.
By Dirk Fahland.
Analyzing processes supported by an ERP system such as Order-to-Cash or Purchase-to-Pay are one of the most frequent use cases of process mining. At the same time, they are one of the most challenging, because the processes operate on multiple related data objects such as orders, invoices, and deliveries in n:m relations.
Event log extraction for process mining alays flattens the relational structures into a sequences of events. The top part of the following poster illustrates what goes wrong during this kind of event log extraction: events are duplicated and false behavioral dependencies are introduced.
A possible way to prevent this flattening is to extract one event log per object or entity in the process: one log for all orders, one log for all invoices, one log for all deliveries. The result is a so-called artifact-centric process model that shows one “life-cycle model” describing the process activities per data object.
But analyzing the process over all objects also requires to extract event data about how objects “interact”. Technically, this can be done by extracting one event log per relation between two related data objects (or tables). From these, we can learn the flow and behavior dependencies over different data objects.
Decomposing the event data in this way into multiple event logs ensures that event sequences either follow one concrete data object or follow a concrete relation between two related related data objects. The resulting model only contains “valid flows”.
By Dirk Fahland.
In this post, I show how simple data visualizations help understanding the multi-dimensional nature of processes. Even the most simple classical business processes have important dynamics that cannot be understood by cases in isolation.
The Performance Spectrum was originally designed to deliver a useful process map for analyzing logistics processes over time. When applying the same technique to business process event data, the performance spectrum proves equally useful as it unveils process characteristics that were so far hidden by existing process mining tools:
To support and trigger further research in this area, we released a smaller-scale version of the Performance Spectrum Miner enabling the analysis of business process event logs. The platform independent (requiring JRE8) tool is available as a
Below, we show some performance spectra of publicly available event logs.
In the figure below, we see the typical process map or model of the Road Traffic Fines Management Process event log on the left. It describes the possible behavior of handling a single case (a traffic ticket). The arc width and annotations tell how long it takes a case to go from one activity to the next.
On the right we see the Performance Spectrum of this process. Each horizontal stripe describes the transition between two activities – called a segment. Each colored diagonal or vertical line is a case passing through this segment over time. The longer and more diagonal the line, the longer the case took.
We can immediately spot very different patterns in each of the segment, clearly showing that the cases are not handled in isolation, but something manages their progress. We can see
Looking at a single event log, we can see that even a classical process over a single entity (a traffic ticket) is subject to dynamics beyond the scope of a single case.
We can see
What can you find in the Performance Spectra of the other public event logs, or your own data? Get the Performance Spectrum Miner from http://www.promtools.org/ or from https://github.com/processmining-in-logistics/psm and try it out!
One of the core challenges of process analytics from event data is to enable an analyst to get a comprehensive understanding of the process and where problems reside.
In business process mining such an overview is obtained with a process map. It can be discovered from event data to visualize the flow in the process and highlight deviations and bottlenecks.
Process maps of logistics processes do not give these insights: they are too large to comprehend, the maps do not visualize how processing of materials influences each other, and – as they show an aggregate of all event data – they fail to visualize how performance and processing varies and changes over time.
In the “Process mining in Logistics” project by Eindhoven University of Technology and Vanderlande, we therefore developed a new visual analytics technique which we call the Performance Spectrum:
The image below shows the performance spectrum of a baggage handling system along a sequence of check-in lines over time. Bags are put into the system at point a1 and then are moved via conveyor belts to point a2. Each blue or orange line in the top-most segment a1:a2 in the performance spectrum shows the movement of one bag from point a1 to point a2 over time. The angle (and color) of the line indicates its speed.
As shown on the layout schema below, further bags enter the system from another check-in point a2 and are also moved to point a2, where both flows merge on the segment a2:a3, etc. All bags eventually reach the point “s” from where the bags are routed further into the baggage handling system. In the performance spectrum, we can see the movement of a bag over these segments through the consecutive lines.
As bags cannot overtake each other on a conveyor belt, we can immediately identify in the performance spectrum several behavioural patterns:
The visualization allows process managers and engineers to both quickly locate the cause of the problem to prevent it happening in the future. In particular the briefly-visible performance problem in a2:a3 prior to the halt of the conveyor belt can be identified as an early warning signal to detect possible performance problems in the future, and also to understand and improve system recovery behavior.
We realized this technique in a high-performance visualization tool which we call the Performance Spectrum Miner. It has proven itself reliable to:
We released a smaller-scale version of the tool (as a ProM plugin or as standalone tool) together with a manual on https://github.com/processmining-in-logistics/psm.
More information