Event-based failure prediction
Human lives and organizations increasingly depend on the correct functioning of computer systems, and system failures can cause personal as well as economic damage. There are two non-exclusive approaches to minimizing the risk of such hazards: (a) fault intolerance tries to eliminate design and manufacturing faults in hardware and software before a system is put into service; (b) fault tolerance deals with faults that occur during service and tries to prevent them from turning into failures. Since faults cannot, in most cases, be ruled out, we focus on the second approach. Traditionally, fault tolerance has followed a reactive scheme of fault detection, location, and subsequent recovery by redundancy in either space or time. In recent years, however, the focus has shifted from these reactive methods toward proactive schemes that evaluate the current state of a running system in order to act before a failure occurs. Once a failure is predicted, it may either be prevented or the outage may be shifted from unplanned to planned downtime; either outcome can significantly improve the system's reliability. The first step in this approach, online failure prediction, is the main focus of this thesis. Its objective is to predict the occurrence of failures in the near future based on the current state of the system as observed by runtime monitoring. This dissertation introduces a new failure prediction method that builds on the evaluation of error events. More specifically, it treats the occurrence of errors as an event-driven temporal sequence and applies a pattern recognition technique to predict upcoming failures. Hidden Markov models have successfully solved many pattern recognition tasks.
However, standard hidden Markov models are not well suited to processing sequences in continuous time, and existing extensions do not adequately account for the event-driven character of error sequences. Hence, an extension of hidden Markov models has been developed that uses a semi-Markov process to model state transitions, providing the flexibility to capture a wide variety of temporal characteristics of the underlying stochastic process. The proposed hidden semi-Markov model has been applied to industrial data from a commercial telecommunication platform. The case study showed significantly improved failure prediction capabilities in comparison to well-known existing approaches, and it also demonstrated that hidden semi-Markov models perform significantly better than standard hidden Markov models. In order to assess the impact of failure prediction and subsequent actions, a reliability model has been developed that makes it possible to compute steady-state system availability, reliability, and hazard rate. Based on this model, it is shown that such approaches can significantly improve system dependability.
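To make the idea of scoring a timed error sequence more concrete, the following is a minimal sketch and not the model from the thesis. It assumes a small hidden-state model whose transition probabilities depend on the delay between consecutive error events (here through an exponential kernel, which is only one possible choice), and it assumes that a failure is predicted by comparing the sequence likelihood under a "failure" model and a "normal" model. All names, parameters, and numbers (`transition_matrix`, `log_likelihood`, `predict_failure`, the two example models) are hypothetical and chosen for illustration only.

```python
import math

def transition_matrix(p, rate, delay):
    """Delay-dependent transition probabilities v[i][j] for one time gap.

    Assumption: the chance of having moved i -> j within `delay` follows an
    exponential kernel; the remaining probability mass stays in state i.
    """
    n = len(p)
    v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                v[i][j] = p[i][j] * (1.0 - math.exp(-rate[i][j] * delay))
        v[i][i] = 1.0 - sum(v[i])  # diagonal is still 0.0 here, so this sums off-diagonals
    return v

def log_likelihood(model, events):
    """Scaled forward algorithm over a list of (timestamp, symbol) pairs."""
    pi, p, rate, emit = model["pi"], model["p"], model["rate"], model["emit"]
    n = len(pi)
    t_prev, sym = events[0]
    alpha = [pi[i] * emit[i][sym] for i in range(n)]
    scale = sum(alpha)
    alpha = [a / scale for a in alpha]
    loglik = math.log(scale)
    for t, sym in events[1:]:
        v = transition_matrix(p, rate, t - t_prev)
        alpha = [emit[j][sym] * sum(alpha[i] * v[i][j] for i in range(n))
                 for j in range(n)]
        scale = sum(alpha)  # rescale each step to avoid numerical underflow
        alpha = [a / scale for a in alpha]
        loglik += math.log(scale)
        t_prev = t
    return loglik

def predict_failure(failure_model, normal_model, events, threshold=0.0):
    """Predict a failure if the log-likelihood ratio exceeds `threshold`."""
    ratio = log_likelihood(failure_model, events) - log_likelihood(normal_model, events)
    return ratio > threshold

# Hypothetical two-state models; symbol 0 = warning, symbol 1 = error.
failure_model = {
    "pi": [0.5, 0.5],
    "p": [[0.0, 0.9], [0.1, 0.0]],
    "rate": [[0.0, 2.0], [2.0, 0.0]],
    "emit": [[0.6, 0.4], [0.1, 0.9]],
}
normal_model = {
    "pi": [0.9, 0.1],
    "p": [[0.0, 0.1], [0.5, 0.0]],
    "rate": [[0.0, 0.1], [0.1, 0.0]],
    "emit": [[0.9, 0.1], [0.7, 0.3]],
}

burst = [(0.0, 1), (0.5, 1), (1.0, 1), (1.2, 1)]  # rapid run of errors
quiet = [(0.0, 0), (10.0, 0), (20.0, 0)]          # sparse warnings

print(predict_failure(failure_model, normal_model, burst))
print(predict_failure(failure_model, normal_model, quiet))
```

In practice the model parameters would be estimated from recorded error logs rather than set by hand, and the decision threshold would be tuned to trade off missed failures against false alarms.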