Benchmarking
Test 1. Raw event log.
We generated a log in the classic format: process instance ID, stage name, completion timestamp. A total of 20,500 records covering 1,600 process instances. Hidden inside were 200 typical anomalies: loops, bottlenecks, redundant stages, and so on. The model received the table as is, with no hints or formatting. Scoring: +1 point for each correctly identified case, −0.25 for each false positive. The final result was calculated as a percentage of the maximum possible 200 points. If the model remained silent three times in a row, returned "everything looks perfect," or produced nonsense, it scored zero.
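The same scoring rule is reused in all three tests, so it can be sketched as a small helper. This is a minimal sketch: the function name is illustrative, and clamping negative totals to zero is an assumption not stated in the original description.

```python
def score_test(correct: int, false_positives: int, max_points: int) -> float:
    """+1 point per correctly identified case, -0.25 per false positive,
    reported as a percentage of the maximum possible points.

    Clamping a negative raw total to zero is an assumption.
    """
    raw = correct - 0.25 * false_positives
    return max(0.0, raw) / max_points * 100.0
```

For example, a model that finds 100 of the 200 anomalies in Test 1 while producing 8 false positives earns 100 − 2 = 98 raw points, i.e. 49% of the maximum.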

Test 2. Regulation with noise.
A text document describing a process: stages, roles, transition conditions. It also contained a lot of "filler" content only loosely related to the process itself. Hidden within this "regulation" text were 100 inefficiencies. The same scoring system applied: the share of identified cases was calculated, with a −0.25-point penalty for each hallucination.

Test 3. Visual diagram.
A PNG file with a BPMN diagram: 20 blocks and roughly a hundred transitions. The diagram contained 20 logical errors: loops without exit conditions, unused gateways, redundant routes, dangling stages, and so on. The same scoring system applied, +1 point for each identified case and −0.25 for each false positive, and the percentage of all hidden issues found was calculated.

The final score was weighted as follows: 80% for log analysis (the event log table), 10% for the "regulation" text, and 10% for the BPMN diagram image.
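The weighting above amounts to a one-line aggregation over the three per-test percentages; the function name below is illustrative, not from the original:

```python
def final_score(log_pct: float, regulation_pct: float, diagram_pct: float) -> float:
    """Weighted benchmark total: 80% event log, 10% regulation text,
    10% BPMN diagram, all inputs given as percentages (0-100)."""
    return 0.8 * log_pct + 0.1 * regulation_pct + 0.1 * diagram_pct
```

For instance, a model scoring 50% on the log test but 100% on both auxiliary tests would finish at 40 + 10 + 10 = 60%: the log test dominates by design.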
LLM in Business Process Analysis