Root Cause Analysis (RCA) is generally understood as a process for finding “the cause” that leads to what is generically called a “failure”.
A failure in a piece of equipment is the condition in which it cannot fulfill its purpose or task according to the originally intended requirements, whether because of lost efficiency, safety issues, or reduced quality. In extreme cases, it can simply mean that the equipment has stopped working. Equipment failure can also have serious consequences, such as damage to other assets, impact on people, or harm to the environment.
RCA has outgrown its original domain and, given its potential, has spread to fields as diverse as company culture and resources, quality management, occupational safety, project management, and medicine.
Specifically, when dealing with issues in physical assets, we enter the classic field of Fault Analysis. An undesired condition can arise as a result of a specific event or the convergence of several events.
There is no single set of definitions for the processes involved in RCA; they are named and described in different ways. In practice, these concepts are usually defined and communicated through an extensive list of examples, because systematizing fault analysis processes through a structure of formal definitions is often tedious or inefficient.
Events can be identified as relatively short-duration occurrences that negatively impact or modify the operational state of a specific system. Examples of such events include overpressure in a reactor, the breaking of a seal in a bearing, the rupture of a link in an anchoring chain, or the presence of fire in a fuel storage tank.
On the other hand, undesirable conditions typically persist over time and, in general, without significant variations: a bearing operates at an elevated temperature, a building has structural damage, a pump experiences product loss through seals, or a pipe vibrates excessively.
The last event that highlights the equipment’s incapacity and generates significant damage is usually identified as “the failure.” When the equipment failure, in turn, leads to severe damage, it is labeled as catastrophic.
Failures do not occur spontaneously, but rather result from both events and other undesirable conditions that, as mentioned, have different durations and incidences.
It is crucial to identify and understand the events and conditions that lead to failures, the failure mode, and the factors that contributed to it in order to implement preventive and corrective measures. Equipment failures can be prevented or reduced through proper management of operation, maintenance, and inspection, and through timely detection, rectification, and repair of problems. All of these measures depend on a deep understanding of the causes of failure.
It is always imperative to be able to answer questions such as: What can be done to prevent the same thing from happening again? This is related to reducing risks, improving operation, and evolving in design.
Other questions, related to contractual and economic aspects, require the same level of understanding of the problem: How long was the problem gestating? Was the event unexpected, or could it have been detected? Was everything that was possible, necessary, and required by regulation done to prevent the disaster?
Techniques have been developed to systematize and organize information, knowledge, and activities aimed at addressing issues related to “failure.” These techniques seek to establish a cause-and-effect relationship between events and conditions in a temporal sequence.
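One way to picture the temporal side of these techniques is a small consistency check: every proposed cause should begin no later than its effect. The sketch below is only an illustration. The `Occurrence` class, the `inconsistent_links` helper, and the bearing data are hypothetical and do not come from any particular RCA tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Occurrence:
    """An event (short-lived) or an undesirable condition (persistent)."""
    name: str
    kind: str        # "event" or "condition"
    start: datetime  # when it happened or was first observed


def timeline(occurrences):
    """Order occurrences by the moment they started."""
    return sorted(occurrences, key=lambda o: o.start)


def inconsistent_links(occurrences, links):
    """Return the (cause, effect) pairs in which the cause starts after the effect."""
    start = {o.name: o.start for o in occurrences}
    return [(cause, effect) for cause, effect in links if start[cause] > start[effect]]


# Hypothetical data for illustration only.
occurrences = [
    Occurrence("bearing seizure", "event", datetime(2024, 1, 15, 3, 30)),
    Occurrence("seal leak", "condition", datetime(2024, 1, 10, 8, 0)),
    Occurrence("elevated bearing temperature", "condition", datetime(2024, 1, 12, 14, 0)),
]
links = [
    ("seal leak", "elevated bearing temperature"),
    ("elevated bearing temperature", "bearing seizure"),
    ("bearing seizure", "seal leak"),  # deliberately wrong: effect precedes cause
]

for occurrence in timeline(occurrences):
    print(occurrence.start.isoformat(), occurrence.kind, occurrence.name)
print("Links that contradict the timeline:", inconsistent_links(occurrences, links))
```

A link that passes this check is not thereby proven. Temporal order is a necessary condition for a cause-and-effect relationship, not a sufficient one.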
At the start of an RCA, all known and possible events and conditions are proposed, linked by technically acceptable cause-and-effect relationships, and captured in a chart called a Root Cause Tree. This chart incorporates both facts and assumptions. The investigative and analytical tasks that follow aim to verify these assumptions and turn them into facts, and to discard the hypotheses that the results do not support.
Image: Example Root Cause Tree Structure
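As a rough sketch of how such a tree separates facts from assumptions, each node can carry a status, and the open hypotheses become the list of pending investigative tasks. The `Node` class, the status labels, and the pump example are assumptions made for illustration, not a standard notation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One event or condition in a Root Cause Tree."""
    description: str
    status: str = "assumption"  # "fact", "assumption", or "discarded"
    causes: list = field(default_factory=list)

    def add_cause(self, description, status="assumption"):
        """Attach a proposed cause below this node and return it."""
        child = Node(description, status)
        self.causes.append(child)
        return child


def open_hypotheses(node):
    """Collect assumptions still awaiting verification, skipping discarded branches."""
    if node.status == "discarded":
        return []
    found = [node.description] if node.status == "assumption" else []
    for cause in node.causes:
        found.extend(open_hypotheses(cause))
    return found


# Hypothetical tree for illustration only.
failure = Node("pump bearing seized", "fact")
overheating = failure.add_cause("bearing ran at elevated temperature", "fact")
seal = overheating.add_cause("lubricant lost through damaged seal")
overheating.add_cause("wrong lubricant grade used")
failure.add_cause("shaft misalignment", "discarded")

print(open_hypotheses(failure))

# Once an inspection confirms the seal damage, the assumption becomes a fact.
seal.status = "fact"
print(open_hypotheses(failure))
```

The analysis is complete in this simplified model when `open_hypotheses` returns an empty list, meaning every proposed cause has been either confirmed or discarded.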
In the retrospective analysis of events, it is possible to identify the origin of the problem, commonly referred to as the “root cause.” The entire process, by extension, is identified as Root Cause Analysis or RCA. This designation seems to imply the existence of a single root cause. But is there really a root cause? What generally exists are events or undesirable conditions that, sequentially or concurrently, lead to the final or most relevant event defined as failure. The need to define a root cause often arises from the legal requirement to determine the direct or proximate cause of the damage.
Root causes are often identified as the conditions that trigger a sequence of events leading to failure. Eliminating the other intervening causes would not interrupt this sequence.
These other causes, whose elimination would not have prevented the failure, are usually identified as factors or contributing causes. Their presence generally has the effect of making the action of the root cause more effective.
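The distinction can be expressed as a counterfactual test: remove one cause at a time and ask whether the failure would still have occurred. The sketch below assumes the investigation can state such a rule. The `classify_causes` function and the `bearing_fails` rule are hypothetical and deliberately simplistic.

```python
def classify_causes(present, failure_occurs):
    """Label each cause by asking whether the failure still occurs without it."""
    labels = {}
    for cause in sorted(present):
        remaining = present - {cause}
        labels[cause] = "contributing" if failure_occurs(remaining) else "root"
    return labels


def bearing_fails(causes):
    """Hypothetical rule: only lubricant loss triggers the failure sequence.

    The other conditions are assumed to accelerate the damage without being
    able to start or stop the sequence on their own.
    """
    return "lubricant loss" in causes


# Hypothetical set of causes found during an investigation.
present = {"lubricant loss", "high ambient temperature", "overdue inspection"}
labels = classify_causes(present, bearing_fails)

for cause, label in labels.items():
    print(f"{cause}: {label}")
```

Removing a single cause at a time is a simplification. If two causes are each sufficient on their own, this test labels both as contributing and finds no root cause at all, which echoes the doubt raised above about whether a single root cause really exists.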
RCA investigations use a set of specific terms and concepts. Some of the terminology belongs to the technique in general, while other concepts have specific meanings in the context of Fault Analysis. Many of them come from English and lose meaning in translation.
Perhaps the most unfortunate of these conventions is labeling the causes inherent to organizational processes as “latent.” Latent describes something that exists but has not been discovered or brought to light, and in this sense any category of cause can remain latent. It is preferable to refer to these directly as organizational causes.
Another a priori assertion is that “latent causes” promote human causes and that these, in turn, promote physical causes. This is not necessarily true, but the simplification is tempting when defining the structure of events in an RCA.
Failures that are recurrent, systemic, and critical must be addressed in depth, and an RCA provides the widest possible approach to understanding the causes and their solution. A Root Cause Analysis can take a long time; therefore, it is not recommended for every failure or incident.
When should an RCA be performed? Immediate failure control after an incident and the execution of corrective actions are different processes from RCA. An RCA should be conducted only after the situation is resolved, mitigation actions are in place, and personnel are safe. Until then, personnel not involved in those actions must ensure that evidence is preserved.
A root cause analysis should be carried out by a team that includes people experienced in leading such analyses, as well as specialists and personnel familiar with the facilities, equipment, and systems involved. As far as possible, the team should be independent. The degree of independence, size, and composition of the team are matters to be decided by the RCA leader and will depend on the complexity, type, and severity of the event being evaluated.
There are many methods for organizing and presenting information that can be adopted during an RCA; they go by different names and vary in complexity and scope.
Many of these methods have been implemented in software and applications. It should be noted that using them does not in itself generate knowledge. Their greatest potential is to reveal what is unknown, separating facts from conjectures and organizing the investigative tasks that must be carried out to validate the hypotheses proposed regarding what happened.
All relatively complex equipment and processes have, by design, barriers that prevent undesirable conditions from escalating into more critical situations. In an RCA, it is important to highlight the barriers that were breached. The analysis focuses on identifying the barriers that should have prevented the problem but did not, and on why existing controls and safeguards did not function as expected. This can reveal other risk conditions associated with the incident.
A comprehensive and well-founded RCA should result in a report with clear conclusions and no loose ends. The challenge is to move forward proactively while maintaining objectivity based on facts.
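A minimal sketch of how a barrier review might be recorded follows, assuming each barrier is listed with its intended function, whether it held, and the finding that explains a breach. The `Barrier` class and the reactor overpressure data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Barrier:
    """A control or safeguard that should stop a condition from escalating."""
    name: str
    intended_function: str
    held: bool
    finding: str = ""  # why it did not hold, once the investigation explains it


def breached(barriers):
    """Barriers that should have prevented the problem but did not."""
    return [b for b in barriers if not b.held]


def unexplained(barriers):
    """Breached barriers for which the investigation has no finding yet."""
    return [b.name for b in barriers if not b.held and not b.finding]


# Hypothetical review of a reactor overpressure incident.
barriers = [
    Barrier("high-pressure alarm", "warn operators before the limit is reached",
            held=False, finding="alarm setpoint configured above design pressure"),
    Barrier("pressure relief valve", "vent the reactor at the set pressure",
            held=False),
    Barrier("containment dike", "confine any released product", held=True),
]

for barrier in breached(barriers):
    print(f"{barrier.name}: {barrier.finding or 'cause of breach not yet established'}")
print("Still to investigate:", unexplained(barriers))
```

Each breached barrier without a finding is an open question for the investigation, and each finding may point to a risk condition that also affects other equipment.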