Root Cause Analysis and Adaptive Case Management

One question I often get from people (analysts more than customers) is about the optimization of unstructured business processes. It usually comes up in the context of BPM and the optimization of structured processes or, as Keith Swenson puts it, the problem with Taylorism. Personally, I don’t think we can optimize knowledge processes in the sense that Taylor optimized manual processes – but we can provide a framework and tooling to make the process better. I was thinking about this in the context of root cause analysis for production software systems (believe it or not – for mainframes, but that is another story).

The key problem in root cause analysis is that in most cases, by the time the problem is caught, the only signs of it are symptoms (artifacts caused by the problem), not the problem itself (which may actually have disappeared).

The question I was asking myself was: could the root cause analysis process be optimized? To anyone who has witnessed what happens when there is an issue with a production system, it is clear that the process can be helped, especially given the way it is handled today. Broken down into its steps, the process is: find the subsystems that caused the problem, find the people responsible for those subsystems, have them diagnose the problem, propose a solution and then fix the problem. It seems a pretty simple process, and of course optimal. In reality, even though those steps are taken, the real difficulty is what goes on in each step.

Let’s just take the first step in the process – find the subsystems responsible. Since nobody saw exactly what caused the problem, in many cases there is no easy way to know which subsystems are the culprits. At best you can know which subsystems are failing or acting erratically – but those could very well be symptoms, not the root cause. So to solve the problem, everyone relevant huddles in a war room (real or virtual) and starts looking for the root cause (in many cases spending a lot of time trying to explain why it must be someone else’s problem). Lots of communication takes place between the parties involved, and lots of relevant logs, printouts and documents are collected. As the problem is narrowed down, fewer people are involved and the search becomes more targeted. In the best case the reason for the problem is found and a fix is decided upon. In the usual case, a set of suspects is generated, tests and probes are created, and everyone waits for the problem to happen again (hoping the probes will find the real issue) or sometimes tries to recreate the problem in the lab. More data is collected, and fixing the root cause becomes a long-running process that needs to be managed. Only after the problem is understood and diagnosed can the process become (sort of) structured – scheduling the people and time needed to fix the problem.

ACM can be very useful in streamlining and managing root cause analysis as an unstructured process, managing the communication and the documentation collected – perhaps even providing pointers based on how similar problems were solved (who got involved, which documents were collected). But optimizing the process – I guess I don’t even understand the question…


Link to original post: http://blog.actionbase.com/