In automation and maintenance engineering, the concepts of reliability, availability, and recovery are used to define safety and plant operation functions.
Reliability measures the "ability" of devices or production systems to function. It is a numerical value defined as a probability, expressed on a scale between 0 and 1. Reliability analysis is particularly useful in facilities that use hazardous substances or are subject to "major accident hazards". Even in facilities that are not subject to major accident hazards, a reliability analysis can benefit safety, for example by protecting personnel who perform critical operations or by assessing the reliability of standard and emergency operating procedures. It follows from the definition of reliability that maintenance operations are performed at time intervals that do not coincide with the mission time. In fact, maintenance may make the system unavailable for the time required to repair it. For repairable systems or components, the availability function is therefore defined as well.
The main availability parameter is MTTR (Mean Time to Repair), which expresses the expected time to achieve restoration. A parameter that depends on it is MTBF (Mean Time Between Failures), which applies only to repairable components. MRT (Mean Repair Time) is linked to these indicators and expresses the expected overall repair time. In this case, "repair time" means the time strictly necessary to perform the maintenance work, excluding "work preparation" activities such as call-out time, picking up spare parts, searching for tools or equipment, and fault diagnosis. MRT and MTTR can also be seen as complementary indices that provide more complete information when used together. The difference between the two times separates intrinsic factors (especially maintenance) from operational factors (especially responsiveness, diagnostics, flexibility). MRT is an indispensable indicator for defining the technical specifications of a product at the design stage, where maintenance and design work together.
MTTR and MRT in the functional safety world
Let us take a step back and bring these reliability concepts into the world of functional safety. Before SIL was defined in 1997, with edition 1.0 of IEC 61508, safety systems and functions were expressed through qualitative measures called AK (Anforderungsklasse) or RC (Requirement Class). In addition to AK/RC requirements, availability in percentage terms was also used. The MTTR index represented the time required to repair a fault. Prior to IEC 61508 edition 2.0, the MTTR was typically estimated at eight hours or multiples of eight up to ninety-six hours, which was generally taken as the necessary and sufficient time to repair, regardless of the production cycle and the availability of spare parts and labor.
The definitions changed with the introduction of the 2.0 editions of IEC 61508 (2010) and IEC 61511 (2016-2017). MTTR now stands for Mean Time to Restoration. It is defined as the expected time to achieve restoration and is the sum of four addends: the time to detect the failure, the time spent before starting the repair, the effective time to repair, and the time before the component is put back into operation. MRT (Mean Repair Time) is then introduced as the expected overall repair time.
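The four addends can be made concrete with a small calculation. The following is a minimal sketch, not a normative implementation: the class name, field names, and hour values are hypothetical. It also assumes the reading in which MRT covers the last three addends (everything except detection time), while the text above only defines MRT as the expected overall repair time.

```python
from dataclasses import dataclass


@dataclass
class RestorationTimes:
    """Hypothetical container for the four addends of MTTR, in hours."""
    detect_h: float    # time to detect the failure
    wait_h: float      # time spent before starting the repair
    repair_h: float    # effective time to repair
    restart_h: float   # time before the component is back in operation

    @property
    def mttr_h(self) -> float:
        # Mean Time to Restoration: sum of all four addends.
        return self.detect_h + self.wait_h + self.repair_h + self.restart_h

    @property
    def mrt_h(self) -> float:
        # Assumption: MRT excludes detection time and covers the other three.
        return self.wait_h + self.repair_h + self.restart_h


# Illustrative values only, not taken from any standard or data sheet.
times = RestorationTimes(detect_h=2.0, wait_h=4.0, repair_h=1.5, restart_h=0.5)
print(f"MTTR = {times.mttr_h} h, MRT = {times.mrt_h} h")
```

Under this assumption, the gap between MTTR and MRT is exactly the detection time, which is why the two indices are informative when read together.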
To analyze faults, IEC 61511 requires documenting the failure behavior by quantifying all random failures. The calculated failure measure (PFDavg/PFH) for each safety instrumented function (SIF) shall be equal to, or better than, the target failure measure (RRF, Risk Reduction Factor) associated with the SIL, as specified in the Safety Requirement Specification (SRS). This requires information such as the failure rate data of the subsystem, the common causes of failure where redundancy is used, the proof test interval, the useful lifetime of the device, and of course the reliability and availability parameters MTTR, MTBF, MRT, and MPRT (Maximum Permitted Repair Time). To derive numerical results from this information, both IEC 61508 and IEC 61511 provide simplified formulas that can be implemented using specific software or spreadsheets.
On the operational side, the proof test interval recommended by the manufacturer and defined in the safety manual must be considered, together with the performance the safety circuit is estimated to need in order to maintain a given SIL (RRF). Unfortunately, the interval actually achieved is often driven by production demands and can amount to several years. If the interval between scheduled process downtimes is greater than the test interval, in-line test facilities integrated into the SIS design are required. IEC TR 61511-4 specifies that SIS devices that cannot be tested as often as the design requires, because of production availability requirements, are not fit for purpose. An existing system that cannot be tested to achieve the design SIL needs to be redesigned.
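As an illustration of such a simplified formula, the sketch below uses the single-channel (1oo1) low-demand approximation commonly given in IEC 61508-6: a channel-equivalent downtime is built from the proof test interval, MRT, and MTTR, then multiplied by the dangerous failure rate. The function names and all numerical inputs are hypothetical; a real verification would use the failure data and architecture of the actual SIF.

```python
from typing import Optional


def pfd_avg_1oo1(lambda_du: float, lambda_dd: float,
                 proof_test_interval_h: float,
                 mrt_h: float, mttr_h: float) -> float:
    """Approximate PFDavg of a 1oo1 channel (failure rates per hour)."""
    lambda_d = lambda_du + lambda_dd
    if lambda_d == 0:
        return 0.0
    # Channel-equivalent mean downtime: undetected failures wait on average
    # half a proof test interval plus MRT; detected failures take MTTR.
    t_ce = ((lambda_du / lambda_d) * (proof_test_interval_h / 2 + mrt_h)
            + (lambda_dd / lambda_d) * mttr_h)
    return lambda_d * t_ce


def sil_from_pfd(pfd: float) -> Optional[int]:
    """Map a low-demand PFDavg to its SIL band, or None if outside SIL 1-4."""
    bands = [(1e-5, 1e-4, 4), (1e-4, 1e-3, 3), (1e-3, 1e-2, 2), (1e-2, 1e-1, 1)]
    for low, high, sil in bands:
        if low <= pfd < high:
            return sil
    return None


# Illustrative inputs only: yearly proof test, 8 h MRT and MTTR.
pfd = pfd_avg_1oo1(lambda_du=5e-7, lambda_dd=4.5e-6,
                   proof_test_interval_h=8760.0, mrt_h=8.0, mttr_h=8.0)
rrf = 1.0 / pfd
print(f"PFDavg = {pfd:.2e}, RRF = {rrf:.0f}, SIL band = {sil_from_pfd(pfd)}")
```

The calculated PFDavg is then compared with the target in the SRS; the SIF passes only if the calculated value is equal to or lower than the target.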
Depending on the architecture used, MRT is the time needed to repair an "unavailable" safety function and make it "available" again. MRT therefore determines how long the safety function fails to comply with the requirements or to provide the estimated risk reduction. Since in many cases spare parts for safety functions are not kept on site, the estimated MRT depends entirely on the lead time of the failed component. For this reason, IEC 61511 recommends keeping identified spare parts in stock to minimize the MRT.
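A rough calculation shows how strongly the spare-part lead time can weigh on the result. This is a simplified sketch under stated assumptions: a single channel, the process stays in operation while the function is down, no compensating measures are in place, and the unavailability caused by repair is approximated as dangerous failure rate times downtime. The function name, failure rate, and lead time are hypothetical.

```python
def repair_unavailability(lambda_d_per_h: float, downtime_h: float) -> float:
    """Approximate average unavailability caused by repair downtime alone."""
    return lambda_d_per_h * downtime_h


LAMBDA_D = 5e-6          # hypothetical dangerous failure rate, per hour
REPAIR_H = 8.0           # hypothetical repair time with the spare on the shelf
LEAD_TIME_H = 6 * 7 * 24 # hypothetical six-week lead time, in hours

with_spare = repair_unavailability(LAMBDA_D, REPAIR_H)
without_spare = repair_unavailability(LAMBDA_D, REPAIR_H + LEAD_TIME_H)

print(f"spare in stock : {with_spare:.2e}")
print(f"spare on order : {without_spare:.2e}")
```

With these illustrative numbers the repair term grows by more than two orders of magnitude when the part has to be ordered, which is the effect that stocking spares is meant to avoid.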