Online Black-box Failure Prediction for Mission

Critical Distributed Systems

MIDLAB TECHNICAL REPORT 3/12 - 2012

Roberto Baldoni, Giorgia Lodi, Luca Montanari, Guido Mariotta and Marco Rizzuto

“Sapienza” University of Rome

SELEX Sistemi Integrati Rome, Italy

Abstract—Mission critical distributed systems are typically constructed out of many different software components that strictly interact with each other. Due to the inherent complexity of those systems and to the asynchrony of the interactions, they can exhibit anomalies that can lead to failures and, in some cases, to an abrupt block of the system, even though the software has been extensively tested. To avoid these failures and blocks, it is necessary to monitor the system at run time in order to discover such anomalies, thus predicting the occurrence of failures in advance. In this way, appropriate measures can be taken by a system manager before the actual failure occurs, minimizing damage to the system.

This paper introduces a novel approach to failure prediction for mission critical distributed systems that has the distinctive features of being (i) black-box: no knowledge of the applications' internals and logic of the mission critical distributed system is required; (ii) non-intrusive: no status information of the nodes (e.g., CPU) is used; and (iii) online: the analysis is performed during the functioning of the monitored system. The proposed approach combines Complex Event Processing (CEP) and Hidden Markov Models (HMM) so as to analyze symptoms of failures that might occur in the form of anomalous conditions of performance metrics identified for this purpose.

The paper describes an architecture named CASPER, based on CEP and HMM, for such online and black-box monitoring of mission critical distributed systems. CASPER relies only on information sniffed from the communication network underlying the mission critical system to predict anomalies that can lead to software failures. An instance of CASPER has been specialized, trained and tuned so as to monitor an Air Traffic Control (ATC) system. An extensive experimental evaluation of CASPER is presented: during this evaluation, some nodes of the ATC system were subjected to memory and I/O stress until the stress led to a software failure. In every evaluation setting, CASPER exhibited (i) a very low percentage of false positives, both in normal conditions and under stress, and (ii) a sufficiently high failure prediction time, thus allowing the system to apply appropriate recovery procedures.

Index Terms—Anomaly Detection; Failure Forecasting; Black-Box Failure Prediction; Complex Event Processing; Hidden Markov Model

category: regular paper approx. word count: 8155

I. INTRODUCTION

Context and Motivation. Distributed mission critical systems such as air traffic control, battlefield or naval command and control systems consist of several applications distributed over a number of nodes connected through a LAN or WAN. The applications are constructed out of communicating software components that are deployed on those nodes and may change over time. The dynamic nature of applications is principally due to (i) the policies employed for resilience to software or hardware failures, (ii) the adopted load balancing strategies, or (iii) the management of newcomers.

In general, applications need to obtain and distribute data, as well as to access and provide a variety of services, while meeting stringent Quality of Service (QoS) requirements.

These needs are normally met using both synchronous and asynchronous communications over client-server or publish-subscribe models. The amount of information transferred in the communications is significant with regard to the number of involved endpoints and the exchanged payload; at the same time, the desired real-time performance is challenging in terms of latency and throughput.

In such complex real-time systems, failures may happen with potentially catastrophic consequences for their entire functioning. The industrial trend is to face failures by using, during operational system life, supervision services that are capable not only of detecting and certifying a failure, but also of predicting and preventing it through an analysis of the overall system behavior.

Such services shall have a minimum impact on the supervised system and possibly no interaction with the operational applications. The goal is to plug in a “ready-to-use observer” that acts at run time and is both non-intrusive and black-box, i.e., it considers nodes and applications as black boxes. In mission critical systems, a large amount of data deriving from communications among applications transits on the network; thus, the “observer” can focus on that type of data only, in order to recognize many aspects of the actual interactions among the components of the system. The motivation to adopt this non-intrusive and black-box approach is twofold.

Firstly, applications change and evolve over time: grounding failure prediction on the semantics of the applications' communications would require a deep knowledge of the specific system design, a proven field experience, and a non-negligible effort to keep the supervision service aligned with the controlled system. Secondly, interactions between the service and the system to be monitored might lead to unexpected behaviors, which are hard to manage since they are fully unknown and unpredictable.

A large body of research is devoted to the study of failure prediction mechanisms (e.g., [1]–[5]); however, to the best of our knowledge, none of them is able to satisfy together the non-intrusiveness, black-box and online properties required by mission critical contexts.

Contribution. In this paper we introduce the design, implementation and experimental evaluation of a novel online, non-intrusive and black-box failure prediction architecture named CASPER that can be used for monitoring mission critical distributed systems. CASPER is (i) online, as the failure prediction is carried out during the normal functioning of the monitored system; (ii) non-intrusive, as the failure prediction does not use any kind of information on the status of the nodes (e.g., CPU, memory) of the monitored system: only information concerning the network to which the nodes are connected is exploited, together with information regarding the specific network protocol used by the system to exchange information among the nodes (e.g., SOAP, IIOP, GIOP, etc.); and (iii) black-box, as no knowledge of the applications' internals and of the application logic of the system is used.

Our approach is based on the same hypotheses as other state-of-the-art works in this field (e.g., [3]): a fault generates increasingly unstable performance-related symptoms that may lead to software failures, while the system exhibits a steady-state performance behavior with few variations when no faulty situation is observed.

Based on these hypotheses, the aim of CASPER is to recognize any deviation from the normal behavior of the monitored system by analyzing symptoms of failures that might occur in the form of anomalous conditions of specific performance metrics. The metrics are computed starting from the extraction of the network data of interest described in the paper. CASPER can thus provide sufficient look-ahead time before a failure, during which alerts can be raised and proper, timely countermeasures can be applied by software recovery components or system administrators.

CASPER works by combining in a novel fashion Complex Event Processing (CEP) [6] and Hidden Markov Models (HMM) [7]. In particular, it uses a CEP engine to compute at run time the previously mentioned performance metrics. These are then passed to the HMM in order to recognize symptoms of an upcoming failure. Finally, such symptoms are evaluated by a failure prediction module we designed, whose aim is to filter out as many false positives as possible while at the same time providing a failure prediction as early as possible.

We deployed CASPER for monitoring a real Air Traffic Control (ATC) system. Using the network data of such a system during steady-state performance behavior, we first trained CASPER in order to stabilize the HMM and tune the failure prediction module. Then we conducted an experimental evaluation of CASPER that aimed to demonstrate its effectiveness in timely predicting failures. Specifically, we analyzed CASPER's behavior in presence of memory and I/O stress conditions we induced: CASPER was able to predict failures with a low percentage of false positives and sufficiently high prediction times in both cases. Finally, let us remark that once the system had been trained and tuned, we monitored the ATC system for 24 hours during which no failure occurred and no false positives were generated.

The paper is structured as follows. Section II provides an assessment of related work on failure prediction. Section III describes background on Hidden Markov Models and Complex Event Processing. Section IV presents the failure and prediction model we follow in the design of our architecture, which is described in Section V. Section VI introduces the CASPER implementation for an ATC mission critical system. Section VII discusses the experimental evaluation of CASPER we have conducted in ATC. Finally, Section VIII concludes the paper, highlighting our future work.

II. RELATED WORK

A large body of research is devoted to the investigation of approaches to online failure prediction. Interested readers can find a comprehensive taxonomy of them in [4]. In this section, we discuss only those works that are closest to our solution and that mainly inspired the approach we followed.

We can distinguish between approaches that make use of symptoms monitoring techniques for predicting failures, and approaches that use error monitoring mechanisms.

[4] presents an error monitoring-based failure prediction technique that uses a Hidden Semi-Markov Model (HSMM) in order to recognize error patterns that can lead to failures. The approach in [4] is event-driven, as no time intervals are defined: the errors are events that can be triggered at any time. [2] describes two non-intrusive, data-driven modeling approaches to error monitoring (using error logs and logs of the operating system): the first approach relies on a Discrete Time Markov Model and the second is based on function approximation. In general, [2] performs short-term predictions in order to anticipate failures.

The difference between our approach and these works is twofold: firstly, we propose a symptoms monitoring system, in contrast to error monitoring, for predicting software failures. In addition, our approach is not event-based as proposed in [4]; rather, our solution can be defined as period-based [8], as we use Hidden Markov Models (discrete time) to recognize, in the short term, patterns of specific performance metrics that show the evidence of symptoms of failures.

In the context of symptoms monitoring mechanisms, there exist research works that use black-box approaches, i.e., no knowledge of the applications of the system is required.

Narasimhan et al. [3] introduce Tiresias, a black-box failure prediction system that considers symptoms generated by faults in distributed systems. In order to identify symptoms, Tiresias uses a set of performance metrics comprising system-level metrics (e.g., file system proc metrics) and network traffic metrics. In addition, Tiresias uses specific heuristics in order to predict the failure.

[1] presents ALERT, an anomaly prediction system that considers the hosts of the monitored system as black boxes. Specifically, it uses sensors deployed in all the hosts of the controlled infrastructure to continuously monitor a set of metrics concerning CPU consumption, memory usage, and input/output data rate. In this sense, ALERT can be categorized as an intrusive monitoring system; it makes use of triple-state decision trees in order to classify component states of the monitored system.

The authors in [9] consider the problem of discovering performance bottlenecks in large scale distributed systems consisting of black-box software components. The system introduced in [9] solves the problem by using message-level traces related to the activity of the monitored system in a non-intrusive fashion (passively and without any knowledge of node internals or semantics of messages).

[10] analyzes the correlation in time and space of failure events and implements a long-term failure prediction framework named hPrefects for this purpose. hPrefects, based on the correlations among failures, forecasts the time-between-failures of future instances.

Like all the aforementioned works (except [9], which does not properly present a failure prediction approach but rather a performance bottleneck detection approach), our solution adopts a symptoms monitoring approach to failure prediction.

However, we differ from them in some important aspects. Firstly, our approach is both black-box and non-intrusive; in contrast, the described works do not seem to satisfy these characteristics together for failure prediction purposes. Secondly, the techniques we employ for monitoring are different from those they use. It is true that we use message-level traces as [9] does; however, we combine in a novel fashion CEP, for network performance metrics computation, and HMM, to infer the system state. Note that HMM is typically used in failure prediction to build a per-component state diagnosis [11]; in our architecture, the entire system state is modeled as a hidden state and thus inferred by HMM.

Finally, there exist other research efforts [5] that apply online failure prediction to distributed stream processing systems, mostly using decision trees and classifiers. We do not specifically target these systems, and we use different techniques (CEP and HMM) to predict failures.

III. BACKGROUND

A. Hidden Markov Model

The Hidden Markov Model (HMM) is a mathematical model largely used in several fields such as failure prediction [4], [11], speech recognition [12], [13], and molecular biology. It makes it possible to set up a probabilistic state diagnosis using a sequence of observations of symbols.

An HMM consists of a hidden stochastic process, i.e., a Markov chain, whose state is not observable (the state is “hidden”). The hidden process emits an observable symbol each time it changes state: when the hidden process is in a given state, the observed symbol is emitted according to a given probability. Interested readers can refer to [7] for further details on the model.

An HMM consists of four elements:

1) a homogeneous first-order Markov chain: a set {Q_1, Q_2, ...} assuming values in Ω = {ω_1, ..., ω_N}, where Ω is the set of the N possible states of the model. The Markov chain is characterized by an N × N matrix A, whose elements are defined as

a_{i,j} = p(Q_t = ω_j | Q_{t−1} = ω_i)

and represent the probabilities of transiting from state ω_i to state ω_j at time t. Moreover, the following property holds:

∀i = 1, ..., N:  Σ_{j=1}^{N} a_{i,j} = 1
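As a quick illustration (not taken from the paper), the row-stochastic property of A can be checked, and one step of the hidden chain simulated, with a few lines of Python; the matrix entries below are purely hypothetical:

```python
import random

# Hypothetical 3-state transition matrix A (e.g., Safe, Unsafe, Faulty);
# the probabilities are illustrative, not taken from the paper.
A = [
    [0.90, 0.09, 0.01],
    [0.30, 0.60, 0.10],
    [0.00, 0.20, 0.80],
]

def is_row_stochastic(matrix, tol=1e-9):
    """Check the property sum_j a_ij = 1 for every state i."""
    return all(abs(sum(row) - 1.0) < tol for row in matrix)

def next_state(matrix, i, rng=random):
    """Sample the state at time t given state i at time t-1."""
    r, acc = rng.random(), 0.0
    for j, p in enumerate(matrix[i]):
        acc += p
        if r < acc:
            return j
    return len(matrix[i]) - 1
```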

2) a set {X_1, X_2, ...}, that is, an identically distributed, discrete-time stochastic process assuming values in the discrete alphabet Σ = {σ_1, ..., σ_M}, where Σ is the set of the M possible observable symbols, namely the alphabet of the model.

3) a matrix B whose elements are defined as

b_k(σ_j) = p(X_t = σ_j | Q_t = ω_k)

and represent the probabilities of emitting symbol σ_j at time t, given that the state of the Markov chain is ω_k. Each symbol of the alphabet Σ can be emitted with a given probability when the Markov model is in state ω_k.

4) a vector π(1) whose generic element π_i(1) (i = 1, ..., N) is defined as

π_i(1) = p(Q_1 = ω_i)

and represents the probability that the Markov chain is in state ω_i at the initial time. The following property holds:

Σ_{i=1}^{N} π_i(1) = 1

In the literature [7], three fundamental problems on HMMs are defined:

evaluation problem: given a sequence of observations S = {σ_1, σ_2, ..., σ_L}, compute the probability that the HMM could emit S.

decoding problem: given a sequence of observations S = {σ_1, σ_2, ..., σ_L}, discover the most probable state sequence P* = {ω_1, ω_2, ..., ω_L} able to generate the observed sequence S.

training problem: given a set of sequences of observations S_i (i = 1, ..., n), adjust the model's parameters to maximize the probability of emitting each S_i.

In this paper we exploit the training problem and one of the most used features of the model, the forward probability evaluation: given a sequence of observations S and a state ω_k, we compute the probability that the HMM is in state ω_k after having generated S. The forward probability considers all the possible state sequences terminating in state ω_k that can have generated the sequence S; it is calculated using the following recursive formula:

f_k(i) = π_k(1) · b_k(σ_1)                             if i = 1
f_k(i) = b_k(σ_i) · Σ_{w=1}^{N} f_w(i−1) · a_{w,k}     if 1 < i ≤ L    (1)
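For concreteness, the forward recursion of Eq. (1) can be sketched in Python as follows; this is an illustrative implementation under our own naming, not the authors' code:

```python
def forward(A, B, pi, obs):
    """Forward probabilities for an observation sequence `obs`
    (symbols as 0-based indices). A[w][k] is the transition
    probability w -> k, B[k][s] the probability that state k emits
    symbol s, pi[k] the initial distribution."""
    N = len(pi)
    # Base case: f_k(1) = pi_k(1) * b_k(sigma_1)
    f = [pi[k] * B[k][obs[0]] for k in range(N)]
    # Recursion: f_k(i) = b_k(sigma_i) * sum_w f_w(i-1) * a_{w,k}
    for s in obs[1:]:
        f = [B[k][s] * sum(f[w] * A[w][k] for w in range(N))
             for k in range(N)]
    return f  # f[k] = P(obs emitted and last state is omega_k)

def most_likely_last_state(A, B, pi, obs):
    """State maximizing the forward probability after emitting obs."""
    f = forward(A, B, pi, obs)
    return max(range(len(f)), key=f.__getitem__)
```

This is exactly the quantity the system state inference component needs: the most probable current state given the symbols observed so far.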

(4)

[Fig. 1. Fault, Symptoms, Failure and Prediction: a timeline showing the fault, the first symptom, the prediction alert, the prediction limit, and the resulting time-to-prediction, time-to-failure and failure-avoidance intervals.]

We use HMM as a state estimator rather than other, more complex dynamic Bayesian networks [14], since HMM provides us with high accuracy, with respect to the problem we wish to address, through simple, low-complexity algorithms.

B. Complex Event Processing

Complex Event Processing (CEP) is an active research field recently used for the development of monitoring and sense-and-respond applications [15]. In CEP, three fundamental concepts are defined: event streams, correlation rules, and the event engine. An event is a representation of a set of conditions at a given time instant. Events belong to streams: events of the same type are in the same event stream [16]. Correlation rules are commonly SQL-like queries (but other types of queries are also possible [17]) used by the event engine in order to correlate events, i.e., in order to discover temporal and spatial relationships between events of possibly different streams.

There are several commercial products that implement the CEP paradigm [6], [18], [19]. Among those currently available, we have chosen the well-known open source event engine ESPER [6]. The use of ESPER is motivated by its low cost of ownership compared to other similar systems [19], its usability, and the ability to dynamically adapt the complex event processing logic by adding new queries at run time.

In ESPER, while the events pass through memory, the query engine continuously sieves for the relevant events that may satisfy one of the correlation queries. The queries are defined using an SQL-like query language named Event Processing Language (EPL). EPL supports all of SQL's conventional constructs such as Group By, Having, Order By, Sum, etc.; however, it adds further constructs (e.g., the pattern) that allow it to perform complex correlations among events.
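As a rough illustration of what such a correlation rule computes, a length-based sliding-window aggregation can be sketched in Python; the EPL query in the comment is our own hypothetical example, not one of CASPER's rules:

```python
from collections import deque

class SlidingMean:
    """Python analogue of an EPL length-window aggregation such as
    `select avg(rtt) from RttEvent.win:length(3)` (event type and
    query are hypothetical). Each push() mimics one event arriving on
    the stream and returns the current windowed average."""
    def __init__(self, size):
        self.win = deque(maxlen=size)

    def push(self, value):
        self.win.append(value)  # the oldest event falls out automatically
        return sum(self.win) / len(self.win)
```

In a real deployment the engine maintains many such windows at once, one per rule, and feeds their outputs to downstream consumers.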

IV. FAILURE AND PREDICTION MODEL

We model the distributed system to be monitored as a set of nodes that run one or more services. Nodes exchange messages over a communication network. Nodes or services can be subject to failures. A failure is an event for which the service delivered by a system deviates from its specification [20]. A failure is always preceded by a fault (e.g., an I/O error or memory misuse); however, the converse does not always hold: a fault inside a system does not necessarily lead to a failure, as the system could tolerate such a fault, for example by design.

[Fig. 2. The modules of the CASPER failure prediction architecture: the Pre-Processing module turns sniffed network packets into events; the Symptoms Detection module (Performance Metrics Computation and System State Inference, backed by a Knowledge Base) turns events into symbols and an inferred system state; the Failure Prediction module turns the inferred state into predictions and actions for the monitored hosts.]

Faults that lead to failures, independently of the fault's root cause (e.g., an application-level problem or a network-level fault), affect the system in an observable and identifiable way.

Thus, faults can generate side effects in the monitored system until the failure occurs. Our work is based on the assumptions that a fault generates increasingly unstable performance-related symptoms indicating a possible future presence of a failure, and that the system exhibits a steady-state performance behavior with few variations when no faulty situation is observed [3], [21], [22].

A failure prediction mechanism consists in monitoring the behavior of the distributed system, looking for possible symptoms generated by faults, and in raising timely alerts regarding software failures if the symptoms become severe. Proper countermeasures can then be set before the failure, to either mitigate damages or enable recovery actions.

In Figure 1 we define the time-to-failure as the distance in time between the occurrence of the prediction and the software failure event. The prediction has to be raised before a time limit, beyond which the prediction is not sufficiently in advance to take effective actions before the failure occurs. We also consider the time-to-prediction, which represents the distance between the occurrence of the first symptom of the failure and the prediction. As a matter of fact, an ideal failure prediction mechanism should produce zero false positives, i.e., zero errors in raising alerts regarding failures, and exhibit the maximum prediction time (i.e., it should be able to produce an alert indicating the prediction of a failure as soon as the first symptom is observed).

V. THE CASPER FAILURE PREDICTION ARCHITECTURE

The architecture we designed for online, black-box, non-intrusive failure prediction is named CASPER and is deployed in the same subnetwork as the distributed system to be monitored. Figure 2 shows the principal modules of CASPER; each of them is described below.

(5)

A. Pre-Processing module.

This module is mainly responsible for capturing and decoding the network data required to recognize symptoms of failures and for producing streams of events. The streams of events carry a large amount of information that is obtained from the entire set of network packets exchanged among the interconnected nodes of the monitored distributed system (see Figure 2). Considering all the packets gives CASPER a broader view of what is happening on the network, thus increasing the chance of discovering specific performance patterns that show the evidence of possible symptoms of failures.

The network data the Pre-Processing module receives as input are properly manipulated. Data manipulation consists firstly in decoding the data included in the headers of network packets. The module manages TCP/UDP headers and the headers of the specific inter-process communication protocol used in the monitored system (e.g., SOAP, GIOP, etc.) so as to extract from them only the information that is relevant to the detection of specific symptoms (e.g., the timestamps of a request and reply, or the destination and source IP addresses of two communicating nodes). Finally, the Pre-Processing module adapts the extracted network information into the form of events, producing streams for use by the second CASPER module (see below).
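A minimal sketch of the kind of header decoding involved, using only the standard library; it assumes an IPv4 header without options followed by a TCP or UDP header, and the real module of course extracts far more fields:

```python
import socket
import struct

def decode_ipv4_transport(pkt: bytes):
    """Extract (src_ip, dst_ip, src_port, dst_port) from a raw IPv4
    packet carrying TCP or UDP. Sketch only: ignores checksums,
    fragmentation and every other header field."""
    ihl = (pkt[0] & 0x0F) * 4            # IP header length in bytes
    src_ip = socket.inet_ntoa(pkt[12:16])
    dst_ip = socket.inet_ntoa(pkt[16:20])
    # Ports are the first two 16-bit fields of both TCP and UDP headers
    src_port, dst_port = struct.unpack("!HH", pkt[ihl:ihl + 4])
    return src_ip, dst_ip, src_port, dst_port
```

Pairing a request and its reply by the two (IP, port) endpoints, and timestamping each packet at capture time, is enough to produce the round-trip-time events mentioned later.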

B. Symptoms detection module.

The streams of events are taken as input by the Symptoms Detection module and used to discover specific performance patterns through complex event processing (i.e., event correlations and aggregations). The result of this processing is a system state that must be evaluated in order to detect whether it is safe or unsafe. To this end, we divided this module into two components, namely the performance metrics computation component and the system state inference component.

The performance metrics computation component uses a CEP engine for correlation and aggregation purposes. It periodically produces as output a representation of the system behavior in the form of symbols (see Figure 2). Note that CASPER requires a clock mechanism in order to carry out this activity at each CASPER clock cycle. The clock allows CASPER to model the system state using a discrete-time Markov chain, and lets the performance metrics computation component coordinate with the system state inference component (see below).

The representation of the system behavior at run time is obtained by computing performance metrics, i.e., a set of time-changing metrics whose values indicate how the system actually works (an example of a network performance metric is the round trip time).

In CASPER we denote symbols as σ_m, where m = 1, ..., M. Each symbol is built by the CEP engine starting from a vector of performance metrics: assuming P performance metrics, at the end of the time interval (i.e., the clock period) i = [t_0, t_1], t_0 < t_1, the performance metrics computation component emits one of the σ_m, which is defined as follows:

σ_m = f([p_1(i), p_2(i), ..., p_P(i)])

where the generic p_p(i) is the mean value the performance metric p assumes in the time interval i, and the function f takes as input a vector in R^P and returns an integer value σ_m belonging to the finite alphabet Σ.

The system state inference component receives a symbol from the previous component at each CASPER clock cycle and recognizes whether it corresponds to a correct or an incorrect behavior of the monitored system. To this end, the component uses a Hidden Markov Model.

We recall that an HMM consists of a hidden stochastic process, a set of symbols Σ and two probability matrices A and B, as defined in Section III. Figure 3 shows how we instantiated the HMM in our architecture.

[Fig. 3. Hidden Markov Model graph used in the system state inference component: the hidden process has states Safe, Unsafe1, Unsafe2, ..., UnsafeK and Faulty, connected by transition probabilities (matrix A); each hidden state emits the observable symbols σ1, σ2, σ3, ..., σM with the emission probabilities of matrix B.]

We model the state of the system to be monitored by means of the hidden process. We define the states of the system (see Figure 3) as:

Safe: the system behavior is correct, as no active fault [20] is present;

Unsafe: some faults, and therefore symptoms of failures, are present. There is one unsafe state for each kind of possible fault; we assume a finite number k of fault types (e.g., memory stress);

Faulty: there is a failure in the system.

The number of states N is the sum of the k unsafe states, the safe state and the faulty state: N = k + 2. We assume that initially the system is in the safe state: if we call it ω_0, then

π_0(1) = p(Q_1 = ω_0) = 1

Since the state of the system is not known a priori, we can observe it only by looking at the emitted symbols. Figure 3 represents the emitted symbols as the set {σ_1, σ_2, σ_3, ..., σ_M}. In addition, Figure 3 shows labeled edges among the vertices of the hidden process; these represent the values of the A matrix, i.e., the matrix of the transition probabilities. In contrast, the edges that connect the states of the hidden process to the symbols represent the probabilities of emitting a given symbol σ_m, that is, the B matrix of the HMM.

Let us make a few observations here concerning the cardinality of the alphabet Σ of the symbols that can be emitted by the HMM hidden process. The cardinality |Σ| = M should be reasonably small, since the matrix B has N × M elements and must reside in primary memory in order to allow the HMM algorithms to work efficiently. In particular, M is computed as follows:

M = ∏_{i=1}^{m} v_i

where m is the number of performance metrics and v_i is the number of values that the i-th performance metric can assume.

To keep the cardinality small, we assume that the possible values of the generic p_i are limited: if we allowed any p_i ∈ R, then v_i = ∞ and, by the definition of M, M would also be ∞. To keep M limited, we introduce a discretization of the values of each p_i. In order to clarify these concepts we consider the following example.

Assume that there exist two performance metrics p_1 and p_2 (i.e., m = 2). After each time interval (i.e., each CASPER clock cycle) the performance metrics computation component emits a vector with m = 2 components. Let us suppose that an emission is σ⃗_continuous = [51.34, 58.22], meaning that during the last interval the mean value of p_1 was 51.34 and that of p_2 was 58.22, and that each performance metric can assume v_1 = v_2 = 4 values. We obtain a discrete space of

M = v_1 × v_2 = 4 × 4 = 16

intervals. In order to calculate in which of the 16 intervals each pair of performance metrics ⟨p_1, p_2⟩ falls, we need to know in advance the maximum value that each p_k can assume. Let us suppose that for both p_1 and p_2 this maximum value is 100.

Normalizing the values we obtain σ⃗_continuous,norm = [(51.34/100) × 4 = 2.05, (58.22/100) × 4 = 2.33]; that is, following the interval order, the emitted symbol is σ_11. We can represent it in a Cartesian plane (see Figure 4), where the x axis carries the normalized values of p_1 and the y axis the normalized values of p_2. The circle in Figure 4 is the point σ⃗_continuous,norm; the square where the circle is located is the actual emitted symbol, i.e., σ_11.
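Under a row-major, 1-based cell numbering (our assumption; the paper does not spell out the interval order), the function f of the example can be sketched as:

```python
def to_symbol(values, maxima, v):
    """Map a vector of metric means to a symbol in 1 .. v**len(values).
    `maxima[i]` is the assumed maximum of metric i; each metric is
    normalized to [0, v) and floored to a cell index. The cell
    numbering convention is our own choice."""
    idx = 0
    for value, vmax in zip(reversed(values), reversed(maxima)):
        cell = min(int(value / vmax * v), v - 1)  # clamp values at the max
        idx = idx * v + cell
    return idx + 1
```

With this convention, `to_symbol([51.34, 58.22], [100, 100], 4)` lands in the cell the example calls σ_11.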

C. Failure Prediction module

This module is mainly responsible for correlating the information about the state received from the system state inference component of the previous CASPER module. It takes as input the inferred state of the system at each CASPER clock cycle. The inferred state can be the safe state or one of the possible unsafe states unsafe_1, ..., unsafe_k. Using the CEP engine, this module counts the number of consecutive unsafe_i states and produces a failure prediction alert when that number reaches a tunable threshold (see below). We call this threshold the window size, a parameter that is strictly related to the time-to-prediction shown in Figure 1.

[Fig. 4. Example of emission with discrete and normalized space: the normalized pair (p1, p2) is plotted on a 4 × 4 grid; the cell containing the point is the emitted symbol.]
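The counting rule of this module can be sketched as follows; the state names and the reset-on-safe behavior are our reading of the description above, not code from the paper:

```python
class FailurePredictor:
    """Raise a prediction alert after `window_size` consecutive
    inferred unsafe states; a safe state resets the counter.
    Illustrative sketch of the module's thresholding rule."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.count = 0

    def observe(self, state):
        """Feed one inferred state per clock cycle ('safe', 'unsafe1',
        'unsafe2', ...); returns True when an alert must be raised."""
        self.count = self.count + 1 if state.startswith("unsafe") else 0
        return self.count >= self.window_size
```

A larger window filters out isolated misclassifications at the cost of a longer time-to-prediction, which is exactly the tradeoff discussed below.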

D. Training of CASPER

The knowledge base (see Figure 2) concerning the possible safe and unsafe states of the monitored system is composed of the matrices A and B of the system state inference component. This knowledge is built during an initial training phase: specifically, the entries of the matrices are adjusted using the maximum likelihood state estimators of the HMM [7]. During the training, CASPER is fed concurrently with both recorded network traces and a sequence of pairs <system-state, time> that represents the state of the monitored system (i.e., safe, unsafe, faulty) at each specific time¹.

E. Tuning of CASPER parameters

The CASPER architecture has three parameters to be tuned, whose values influence the quality of the whole failure prediction mechanism in terms of false positives and time-to-prediction. These are:

the length of the CASPER clock period;

the number of symbols output by the performance metrics computation component;

the length of the failure prediction window, i.e., the window size.

The length of the clock period influences the performance metrics computation and the system state inference: the shorter the clock period, the higher the frequency of produced symbols. A longer clock period allows CASPER to minimize the effects of outliers. The number of symbols influences the system state inference: if a high number of symbols is chosen, a higher precision for each performance metric can be obtained. The failure prediction window size corresponds to the minimum number of CASPER clock cycles required for raising a prediction alert. The greater the window size, the higher the accuracy of the prediction, i.e., the probability that the prediction is actually followed by a failure (a true positive prediction). The tradeoff is that the time-to-prediction increases linearly with the window size, causing a shorter time-to-failure (see Figure 1).

During the training phase, CASPER automatically chooses the best values for both the clock period and the number of symbols, leaving to the operator the responsibility of selecting the window size according to the criticality of the system to be monitored.

¹As the training is offline, the sequence of pairs <system-state, time> can be created offline by the operator using network traces and system log files.

VI. MONITORING A CORBA-BASED ATC SYSTEM WITH CASPER

CASPER has been specialized to monitor a real Air Traffic Control system. ATC systems are composed of middleware-based applications running over a collection of nodes connected through a Local Area Network (LAN). The ATC system that we monitored is based on the CORBA middleware [23], standardized by the OMG (Object Management Group) [24].

In this section we briefly explain CORBA and its communication protocol, show the performance metrics calculated by the Complex Event Processing (CEP) engine, and finally provide details concerning the Java implementation of the CASPER modules.

A. CORBA

The Common Object Request Broker Architecture (CORBA) enables software components written in multiple computer languages, and running on multiple computers, to work together. Specifically, CORBA normalizes the method-call semantics between application objects that reside either in the same address space (application) or in a remote address space (same host, or remote host on a network). The interactions among hosts in CORBA are simple: typically there exists one server offering public functions and a number of clients invoking functions on the server. The communication protocol used in CORBA is called GIOP (General Inter-ORB Protocol), an abstract protocol that enables message exchange among CORBA components.

CASPER intercepts the GIOP messages produced by the CORBA middleware supporting the ATC system and extracts several pieces of information from them (see below) in order to build a representation of the system at run time.

B. Event representation

Each GIOP message intercepted by CASPER becomes an event feeding the CEP engine of the performance metrics computation component. Each event contains the following information extracted from the message:

- Request ID: the identifier of a request-reply interaction between two CORBA entities;
- Message Type: a field that characterizes the message; it can assume different values (e.g., Request, Reply, Close Connection);
- Reply Status: the final outcome of an interaction; it specifies whether there were exceptions during the request-reply interaction and, if so, the kind of exception (i.e., user or system exception).

In addition, we insert in the event further information related to the lower level protocols (TCP/UDP) such as source and destination IP, source and destination port, and timestamp (i.e., the time at which the message is captured). In order not to capture sensitive information of the ATC system (such as flight plans or routes), CASPER ignores the payload of GIOP messages.
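As an illustration, the event fed to the CEP engine can be pictured as a plain value object like the following (a sketch with assumed field names, not the actual CASPER class):

```java
// Illustrative sketch of the event built from each intercepted GIOP message,
// carrying only header and transport-level data. The payload is deliberately
// discarded so that no sensitive ATC data (flight plans, routes) is captured.
public class GiopEvent {
    public final long requestId;      // identifier of the request-reply interaction
    public final String messageType;  // e.g. "REQUEST", "REPLY", "CLOSE_CONNECTION"
    public final String replyStatus;  // outcome: e.g. "NO_EXCEPTION", "USER_EXCEPTION", "SYSTEM_EXCEPTION"
    public final String srcIp, dstIp; // transport-level data from the TCP/UDP headers
    public final int srcPort, dstPort;
    public final long timestamp;      // capture time, in milliseconds

    public GiopEvent(long requestId, String messageType, String replyStatus,
                     String srcIp, String dstIp, int srcPort, int dstPort, long timestamp) {
        this.requestId = requestId;
        this.messageType = messageType;
        this.replyStatus = replyStatus;
        this.srcIp = srcIp;
        this.dstIp = dstIp;
        this.srcPort = srcPort;
        this.dstPort = dstPort;
        this.timestamp = timestamp;
    }
}
```

A POJO of this shape can be handed directly to ESPER, which accepts plain Java objects as its native event representation.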

C. Performance metrics

Events sent to the CEP engine are correlated online so as to produce so-called performance metrics. After a long observation of several metrics of the ATC CORBA-based system, we identified the following small set of metrics that well characterizes the system, showing a steady behavior in absence of faults and an unstable behavior in presence of faults:

- Round Trip Time: the elapsed time between a request and the corresponding reply;
- Rate of messages carrying an exception: the number of reply messages with an exception over the number of caught messages;
- Average message size: the mean of the message sizes in a given spatial or temporal window;
- Percentage of Replies: the number of replies over the number of requests in a given spatial or temporal window;
- Number of Requests without Reply: the number of requests that, in a given temporal window, do not receive a reply;
- Message Rate: the number of messages exchanged in a given spatial or temporal window.

A few observations regarding these metrics are worth mentioning. Let us consider the Percentage of Replies: since the communication in CORBA is composed of request-reply interactions, this percentage should be near 50%; deviations from it can reveal problems in one of the hosts involved in the communication. The Round Trip Time takes into account possible variations due to the workload. A decrease of the Message Rate can reveal a particularly busy node or bottlenecks in the network. Deviations in the Number of Requests without Reply can reveal that a node is overloaded and cannot answer all the requests it receives.
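As an illustration of how such metrics can be computed from the event stream, here is a minimal sketch of two of them (our own simplified code with assumed message-type labels, not the EPL queries CASPER actually uses):

```java
import java.util.List;

// Simplified, illustrative computation of two of the listed metrics over a
// window of observed messages (not the CASPER implementation).
public class PerfMetrics {
    // Percentage of Replies: replies over requests in the window. In a healthy
    // request-reply system this ratio stays close to 1, i.e., replies account
    // for roughly 50% of the traffic.
    public static double percentageOfReplies(List<String> messageTypes) {
        long requests = messageTypes.stream().filter("REQUEST"::equals).count();
        long replies = messageTypes.stream().filter("REPLY"::equals).count();
        return requests == 0 ? 0.0 : (double) replies / requests;
    }

    // Rate of messages carrying an exception: replies whose status reports an
    // exception over the total number of caught messages in the window.
    public static double exceptionRate(List<String> replyStatuses, int totalMessages) {
        long withException = replyStatuses.stream()
                .filter(s -> s != null && s.endsWith("EXCEPTION") && !s.equals("NO_EXCEPTION"))
                .count();
        return totalMessages == 0 ? 0.0 : (double) withException / totalMessages;
    }
}
```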

To compute the performance metrics we use the ESPER CEP engine [6], written in Java. ESPER uses EPL queries to specify the correlations to check against the events it receives.

Below we report an example of the EPL query we use to compute the Round Trip Time (RTT) of a request-reply interaction:

SELECT rep.requestID,
       (rep.timestamp - req.timestamp) AS RTT
FROM pattern [
    every req=Message(messageType = 'REQUEST') ->
    rep=Message(messageType = 'REPLY')
        AND (rep.requestID = req.requestID)
]

In order to calculate the RTT we use the requestID as the identifier of a request-reply interaction (note that in CORBA the reply always carries the same requestID as the corresponding request). The difference between the timestamp of the request and that of the reply is the RTT of each reply.
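The pairing logic that the EPL query expresses can be sketched in plain Java as follows (an illustration under assumed names; in CASPER this matching is performed inside the CEP engine):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the request-reply matching expressed by the EPL query: pending
// requests are kept by requestID, and each matching reply yields an RTT.
public class RttMatcher {
    private final Map<Long, Long> pendingRequests = new HashMap<>(); // requestID -> request timestamp

    // Called for each captured message; returns the RTT (reply timestamp minus
    // request timestamp) when the event completes a pair, or -1 otherwise.
    public long onMessage(String messageType, long requestId, long timestamp) {
        if ("REQUEST".equals(messageType)) {
            pendingRequests.put(requestId, timestamp);
        } else if ("REPLY".equals(messageType)) {
            Long reqTs = pendingRequests.remove(requestId);
            if (reqTs != null) return timestamp - reqTs;
        }
        return -1; // no complete request-reply pair yet
    }
}
```

Requests that are never removed from the map are exactly those counted by the Number of Requests without Reply metric.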

D. Java Implementation of CASPER

In order to implement the entire architecture of Figure 2 we developed and integrated the following software modules in Java:

- a Network Sniffer that captures packets from the network and keeps only GIOP packets (implemented as part of the pre-processing module);
- a Dissector that, given the GIOP packets, extracts all the information except the payload and creates the events (implemented as part of the pre-processing module);
- a System State Inference module that infers the system state starting from the symbols received from the performance metrics computation component and from its knowledge base;
- a Failure Prediction module that correlates the alerts received from the system state inference module.

The chosen network sniffer is Jpcap [25], a Java port of the C library pcap. Jpcap automatically creates a Plain Old Java Object (POJO) from a network packet (e.g., a TCP segment), representing each field of the packet as a private variable. ESPER natively supports the POJO format as input event representation, sparing us the need for specific input adapters. Since the CORBA specification is standard, we were able, given the POJOs coming from the sniffer, to create a Dissector that extracts the information (e.g., request id, message type) belonging to the GIOP headers. The EPL queries correlate these events, computing the performance metrics described earlier.

The forward algorithm and all the other elements required by the HMM of the system state inference component have been implemented using ad-hoc Java classes. We also used the CEP engine to implement the Failure Prediction module of the architecture, since its functionality can easily be designed and coded using EPL queries.
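For reference, the following is a minimal sketch of the forward algorithm with per-step rescaling (our own illustration; the paper does not show the actual ad-hoc classes):

```java
// Minimal sketch of the HMM forward algorithm used for state inference: given
// the transition matrix A, the emission matrix B, the initial distribution pi
// and an observed symbol sequence, it returns the filtered probability of each
// hidden state after the last observation.
public class ForwardAlgorithm {
    public static double[] filter(double[][] A, double[][] B, double[] pi, int[] obs) {
        int n = pi.length;
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) alpha[i] = pi[i] * B[i][obs[0]];
        normalize(alpha); // rescaling at each step avoids numerical underflow
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int i = 0; i < n; i++) s += alpha[i] * A[i][j];
                next[j] = s * B[j][obs[t]];
            }
            normalize(next);
            alpha = next;
        }
        return alpha; // alpha[i] = P(state_T = i | obs_1..T)
    }

    private static void normalize(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        if (sum > 0) for (int i = 0; i < v.length; i++) v[i] /= sum;
    }
}
```

The most probable current state (e.g., safe vs. unsafe) is then simply the index of the largest entry of the returned vector.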

VII. CASPER EXPERIMENTAL EVALUATION

We deployed CASPER so as to monitor an Air Traffic Control system of one of the major players of the ATC market². The first part of the work on the field was to collect a large amount of network traces from the underlying communication network of the ATC system in operation; these traces represent steady-state performance behaviors. Additionally, on the testing environment of the ATC system, we stressed some of the nodes until achieving software failure conditions, and we collected the corresponding traces. In our test field, we consider one of the nodes of the ATC system to be affected by either memory or I/O stress.

2For double-blind review reasons we do not disclose the name of the company. More details on the ATC system will be given in the final version of the paper if accepted.

After the collection of all these traces, we trained CASPER and once the training phase was over we deployed CASPER again on the testing environment of the ATC system in order to conduct experiments in operative conditions.

Our evaluation assesses the accuracy of the system state inference component and of the failure prediction module. In particular, we evaluate the former in terms of:

- Ntp (number of true positives): the system state is unsafe and the inferred state is "system unsafe";
- Ntn (number of true negatives): the system state is safe and the inferred state is "system safe";
- Nfp (number of false positives): the system state is safe but the inferred state is "system unsafe";
- Nfn (number of false negatives): the system state is unsafe but the inferred state is "system safe".

Using these parameters, we compute the following metrics that define the accuracy of CASPER:

- Precision: p = Ntp / (Ntp + Nfp)
- Recall (or true positive rate): r = Ntp / (Ntp + Nfn)
- F-measure: F = (2 × p × r) / (p + r)
- False Positive Rate: f.p.r. = Nfp / (Nfp + Ntn)
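These four definitions translate directly into code; the following sketch (class and method names are ours) makes them concrete:

```java
// Direct transcription of the four accuracy metrics used to evaluate the
// system state inference component.
public class Accuracy {
    public static double precision(int ntp, int nfp) {
        return (double) ntp / (ntp + nfp);
    }

    public static double recall(int ntp, int nfn) {
        return (double) ntp / (ntp + nfn);
    }

    // Harmonic mean of precision and recall.
    public static double fMeasure(int ntp, int nfp, int nfn) {
        double p = precision(ntp, nfp), r = recall(ntp, nfn);
        return 2 * p * r / (p + r);
    }

    public static double falsePositiveRate(int nfp, int ntn) {
        return (double) nfp / (nfp + ntn);
    }
}
```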

We evaluate the latter module in terms of:

- Nfp (number of false positives): the module predicts a failure that is not actually coming;
- Nfn (number of false negatives): the module does not predict a failure that is coming;
- Time-to-failure: the time interval between the prediction and the actual failure (see Section IV).

A. Testbed

We deployed CASPER in a dedicated host located in the same LAN as the ATC system to be monitored (see Figure 2). This environment is actually the testing environment of the ATC system, where new solutions are tested before getting into the operational ATC system. The testing environment is composed of 8 machines, each with 16 cores at 2.5 GHz and 16 GB of RAM. It is important to remark that CASPER knows nothing about the environment in which it has been placed: neither the application, nor the service logic, nor the testbed details are available to CASPER.

B. Faults and Failures

The ATC testbed includes two critical servers: one server is responsible for disk operations (I/O) and the other is the manager of all the services. In order to induce software failures in the ATC system, we applied the following actions on these critical servers:

1) Memory stress: we start a memory-bound component, co-located with the manager of all ATC services, that grabs a constantly increasing amount of memory;
2) I/O stress: we start an I/O-bound component, co-located with the server responsible for disk operations, that grabs a constantly increasing amount of disk resources.

In both cases we brought the system to the failure of critical services. During the experimental campaign we also considered CPU stress; however, we discovered that, due to the high computational power of the ATC nodes, CPU stress never causes failures. For this reason we decided not to show the results of those tests.

C. Training data

We trained CASPER (see Section V-D) using the following recorded traces:

1) five traces, between 10 and 13 minutes long, in which the ATC system behaves in steady state; these traces are taken from the ATC system in operation;
2) traces between 10 and 11 minutes long (4 per kind of injected stress, i.e., memory and I/O stress) in which one of the services of the ATC system fails; these traces are taken from the testing environment of the ATC system.

D. Results

The results are divided into three parts: the training of CASPER, the tuning of its parameters (both before deployment), and the failure prediction evaluation (using network traces and deploying the system). We considered three performance metrics: Number of Requests without Reply, Round Trip Time, and Message Rate.

Training of CASPER. During the training phase, the performance metrics computation component produces a symbol at each CASPER clock cycle. Thanks to the set of pairs <system-state,time> we are able to represent the emitted symbols in case of safe and unsafe system states. Figure 5 illustrates these symbols. Each symbol is calculated starting from a combination of three values. In this case, we have 6 possible values per performance metric; the number of different symbols is therefore 6 × 6 × 6 = 216. Observing Figure 5 we can notice that the majority of the emissions belong to the interval [0, 2] for the Round Trip Time, and [0, 1] for the Number of Requests without Reply and the Message Rate. Starting from the symbols represented in Figure 5, the HMM-based component builds the matrices A and B. After that, CASPER can be considered trained.

Tuning of CASPER parameters: clock period and number of symbols. After the training of the HMM, CASPER requires a tuning phase to set the clock period and the number of symbols so as to maximize the accuracy (F-measure, precision, recall and false positive rate) of the output of the symptoms detection module. This tuning phase is done by feeding the system with a recorded network trace (different from the ones used during training). Figure 6 plots the F-Measure values obtained when varying the clock period and the number of symbols.

We can see that the best choice of the clock period is 800 milliseconds: this period yields a higher F-Measure value than the other clock values for most of the numbers of symbols considered in the plot (note that this clock value also resulted the best choice in case of I/O stress). Figure 6 also shows that the accuracy of the symptoms detection module is highly influenced by the number of symbols and less influenced by the clock period (e.g., considering more than 216 symbols, the F-measure varies by less than 10% when increasing the clock period from 100ms to 1000ms). Thus, CASPER sets the clock period to 800 milliseconds, the value with the maximum F-measure. Once this clock period is fixed, the second parameter to define is the number of symbols. Figure 7 shows the precision, recall, F-measure and false positive rate of the symptoms detection module when varying the number of symbols.

Fig. 5. Symbols emitted by the performance metrics computation component in case of a recorded trace that exhibits memory stress (axes: Number of Requests Without Reply, Round Trip Time, Message Rate; safe vs. unsafe system emissions).

Fig. 6. Symptoms detection module: F-Measure varying the number of symbols (64 to 1000) and the clock period (100ms to 1000ms), in case of a recorded trace subject to memory stress.

CASPER considers the maximum difference between the F-measure and the false positive rate in order to choose the ideal number of symbols (ideally, the F-measure equals 1 and the f.p.r. equals 0). As shown in Figure 7, considering 216 symbols (6 values per performance metric) we obtain F = 0.82 and f.p.r. = 0.12, which is actually the best situation in case of memory stress.

Fig. 7. Performance of the symptoms detection module varying the number of possible symbols, in case of a recorded trace subject to memory stress. CASPER clock period 800ms.

Figure 8 shows the accuracy of the symptoms detection module in case of I/O stress. The best values of F-Measure and false positive rate (F = 0.86 and f.p.r. = 0) are obtained with 512 symbols (meaning 8 values per performance metric).

Fig. 8. Performance of the symptoms detection module varying the number of possible symbols, in case of a recorded trace subject to I/O stress. CASPER clock period 800ms.

Tuning of CASPER parameters: window size. The window size is the only parameter that has to be tuned by the operator according to the tradeoff discussed in Section V-E.

We experimentally noticed that during fault-free executions the system state inference still produced some false positives. However, the probability that a long sequence of false positives occurs in steady state is very low. Thus, we designed the failure prediction module to recognize sequences of consecutive clock cycles whose inferred state is not safe: only if such a sequence grows longer than a certain threshold does CASPER raise a prediction. The length of these sequences multiplied by the clock period (set to 800ms) is the window size. The problem is then to set a reasonable threshold in order to avoid false positive predictions during steady state. Figure 9 illustrates the percentage of false positives when varying the window size.

Fig. 9. Percentage of false positives varying the window size. CASPER is fed with a recorded trace behaving in steady state; its clock period is 800ms.

From this figure it can be noted that the window size has to be set to at least 16 seconds in order not to incur any false positives. Let us remark that the window size also corresponds to the minimum time-to-prediction. All the results presented below are thus obtained using a window size of 16 seconds.
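The alert logic just described can be sketched as follows (a minimal illustration under assumed names, not the EPL-based CASPER module):

```java
// Sketch of the failure prediction rule described in the text: raise an alert
// only after a run of consecutive "unsafe" clock cycles long enough to cover
// the window size, filtering out the short false-positive bursts of steady state.
public class FailurePredictor {
    private final int threshold; // consecutive unsafe cycles covering the window
    private int consecutiveUnsafe = 0;

    public FailurePredictor(int windowSizeMs, int clockPeriodMs) {
        // ceiling division: e.g. a 16000ms window with an 800ms clock -> 20 cycles
        this.threshold = (windowSizeMs + clockPeriodMs - 1) / clockPeriodMs;
    }

    // Called once per clock cycle with the inferred state; returns true when a
    // failure prediction alert must be raised. Any safe cycle resets the run.
    public boolean onClockCycle(boolean unsafe) {
        consecutiveUnsafe = unsafe ? consecutiveUnsafe + 1 : 0;
        return consecutiveUnsafe >= threshold;
    }
}
```

With the values chosen in the paper (16s window, 800ms clock), an alert requires 20 consecutive unsafe inferences, which is why isolated bursts of false positives do not trigger predictions.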

Results of CASPER failure prediction. We ran two types of experiments once CASPER was trained and tuned. In the first type, we injected the faults described in Section VII-B in the ATC testing environment and carried out 10 tests for each type of fault³. In the second type, we observed the accuracy of CASPER when monitoring the ATC system in operation for 24 hours. These experiments and their results are discussed in order below.

As a first test, we injected memory stress in one of the nodes of the ATC system until a service failure occurred. Figure 10 shows the anatomy of this failure in one test. The ATC system runs with some false positives until the memory stress starts at second 105; in particular, the sequence of false positives starting at second 37 is not long enough to create a false prediction. After the memory stress starts, the failure prediction module outputs a prediction at second 128; thus, the time-to-prediction is 23s. The failure occurs at second 335, so the time-to-failure is 207s, which is satisfactory with respect to ATC system recovery requirements. We can also see a little burst of false negatives of the system state inference component starting at second 128, successfully ignored by the failure prediction module.

³The number of tests has been limited by the physical access to the ATC testing environment: every experiment of 10 minutes actually takes 2 hours to complete, due to the storage of the data and the stabilization and reboot of the ATC system after the failure.

Fig. 10. Failure prediction in case of memory stress starting at second 105. Window size 16s, clock period 800ms, time-to-prediction 23s, time-to-failure 207s.

Fig. 11. Failure prediction in case of I/O stress starting at second 190. Window size 16s, clock period 800ms, time-to-prediction 21s, time-to-failure 376s.

Figure 11 shows the anatomy of the failure in one test under I/O stress. The stress starts at second 190 and the failure occurs 408 seconds later, at second 598; the prediction is raised at second 222, i.e., after 32 seconds of stress, which yields a time-to-failure of 376 seconds. There is a delay due to the false negatives (from second 190 to 205) that the system state inference component produced at the start of the stress period; the time-to-prediction is 21s. In general, in the 10 tests we carried out, the time-to-failure in case of memory stress varied in the range [183s, 216s] and the time-to-prediction in the range [20.8s, 27s].

In case of I/O stress, in the 10 tests, the time-to-failure varied in the range [353s, 402s], whereas the time-to-prediction varied in the range [19.2s, 24.9s].

Finally, we performed a 24-hour test deploying CASPER on the network of the ATC system in operation. In these 24 hours the system exhibited steady-state performance behavior, and CASPER did not produce any false positives during the whole day. Figure 12 depicts a 400-second portion of this run.

Fig. 12. 400 seconds of a steady-state run of the ATC system in operation.

VIII. CONCLUSIONS AND FUTURE WORK

We presented an architecture to predict online failures of mission critical distributed systems. The failure prediction architecture, namely CASPER, provides accurate predictions of failures by exploiting only the network traffic of the monitored system. In this way, it is non-intrusive with respect to the nodes hosting the mission critical system, and it executes a black-box failure prediction, as no knowledge concerning the layout and the logic of the mission critical distributed system is used. To the best of our knowledge, this is the first failure prediction system exhibiting all these features together.

Let us remark that the black-box characteristic has a strategic value for a company developing such systems. Indeed, from a company perspective, the approach succeeds as long as the failure prediction architecture is loosely coupled to the application logic, since this logic evolves continuously over time.

Additionally, there should be no direct or indirect influence between the monitored and the monitoring system, in order not to falsify the measures or create hidden workload transfers. The non-intrusiveness of our approach has the advantage that no additional load is introduced on the monitored system, and the approach can be applied to all existing middleware-based systems without modifications of their architecture.

CASPER uses a CEP engine to aggregate and correlate data coming from the monitored system, and Hidden Markov Models (HMM) to detect unstable performance behavior of the monitored system. We studied a set of metrics representing the state of the monitored system. These metrics are influenced by changes in the conditions of the system or in the monitored workload. They are derived only from network traffic and computed at run time by the CEP engine. The engine passes the results of the computation to a CASPER module that uses HMM. This module, having been trained beforehand, can recognize dangerous conditions that can lead to failures in the monitored system. Results showed that CASPER, after a careful training phase, achieves good accuracy in terms of false positives when both failure conditions and steady-state behaviors are considered. Additionally, the time-to-failure exhibited when injecting faults into the monitored system is quite reasonable. Finally, it is worth noticing that, once trained in the testing environment of the ATC system, CASPER was attached to the ATC system in operation and produced no false positives over 24 hours.

As future work, we are developing a version of CASPER that is able to execute online training, i.e., training performed by just connecting to the monitored system, without any human intervention. This will make CASPER a complete "plug-and-play" failure prediction system. The advantage of the online training solution is that CASPER can analyze a huge amount of network data, improving its knowledge base. The disadvantage is that the training phase can last a long time, as CASPER does not have any external clue concerning the safe or faulty system state, in contrast to the case of offline training.

REFERENCES

[1] Yongmin Tan, Xiaohui Gu, and Haixun Wang. Adaptive system anomaly prediction for large-scale hosting infrastructures. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC '10), pages 173–182, New York, NY, USA, 2010. ACM.
[2] Günther A. Hoffmann, Felix Salfner, and Miroslaw Malek. Advanced failure prediction in complex software systems. Technical Report 172, Berlin, Germany, 2004.
[3] Andrew W. Williams, Soila M. Pertet, and Priya Narasimhan. Tiresias: Black-box failure prediction in distributed systems. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), Los Alamitos, CA, USA, 2007.
[4] F. Salfner. Event-based Failure Prediction: An Extended Hidden Markov Model Approach. PhD thesis, Department of Computer Science, Humboldt-Universität zu Berlin, Germany, 2008. Available at http://www.rok.informatik.hu-berlin.de/Members/salfner/publications/salfner08event-based.pdf.
[5] Xiaohui Gu, Spiros Papadimitriou, Philip S. Yu, and Shu-Ping Chang. Online failure forecast for fault-tolerant data stream processing. In Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE 2008), pages 1388–1390, 2008.
[6] Esper project web page, 2011. http://esper.codehaus.org/.
[7] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, January 1986.
[8] Li Yu, Ziming Zheng, Zhiling Lan, and Susan Coghlan. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven. In Proceedings of the 41st IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W 2011), pages 259–264, 2011.
[9] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance debugging for distributed systems of black boxes. SIGOPS Oper. Syst. Rev., 37:74–89, October 2003.
[10] Song Fu and Cheng-Zhong Xu. Exploring event correlation for failure prediction in coalitions of clusters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC07), 2007.
[11] A. Daidone, F. Di Giandomenico, A. Bondavalli, and S. Chiaradonna. Hidden Markov models as a support for diagnosis: Formalization of the problem and synthesis of the solution. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (SRDS 2006), pages 245–256, Leeds, UK, October 2006.
[12] Jeff Bilmes. What HMMs can do. IEICE Transactions on Information and Systems, E89-D(3):869–891, 2006.
[13] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[14] Kevin Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, 2002.
[15] David C. Luckham. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.
[16] Opher Etzion and Peter Niblett. Event Processing in Action. Manning Publications Co., August 2010.
[17] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, and MyungCheol Doo. SPADE: The System S declarative stream processing engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, June 2008.
[18] Mark Proctor, Michael Neale, Bob McWhirter, Kris Verlaenen, Edson Tirelli, Alexander Bagerman, Michael Frandsen, Fernando Meyer, Geoffrey De Smet, Toni Rikkola, Steven Williams, and Ben Truit. JBoss Drools Fusion. http://www.jboss.org/drools/drools-fusion.html, 2010.
[19] IBM's System S web site, 2011. http://domino.research.ibm.com/comm/research_projects.nsf/pages/esps.index.html.
[20] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl E. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11–33, 2004.
[21] C. S. Hood and C. Ji. Proactive network-fault detection [telecommunications]. IEEE Transactions on Reliability, 46(3):333–341, September 1997.
[22] Marina Thottan and Chuanyi Ji. Properties of network faults. In Proceedings of the IEEE/IFIP Network Operations and Management Symposium (NOMS 2000), pages 941–942, 2000.
[23] Object Management Group. Common Object Request Broker Architecture (CORBA). Specification, Object Management Group, 2011.
[24] Object Management Group web page, 2011. http://www.omg.org/.
[25] Jpcap project web page, 2011. http://netresearch.ics.uci.edu/kfujii/Jpcap/doc/.
