
M.S. Degree in Computer and Communication Networks Engineering

Master’s Thesis

Analysis of the cardiopulmonary response in incremental exercise tests using data mining techniques

Supervisors
prof. Silvia Chiusano
prof. Tania Cerquitelli

Candidate
Silvia Ferrara

March 2015


Acknowledgements

I would like to acknowledge the support of all those people who have directly or indirectly been an integral part of my work.

I would like to express my special thanks to my guides prof. Silvia Chiusano and prof. Tania Cerquitelli for their supervision, technical guidance and constant support throughout this work.

Very special thanks go to my family, who have always been of great support and encouragement to me.


Abstract

Over the past few decades, there has been a dramatic increase in the amount of available data in electronic form, which has resulted in data mining techniques rapidly gathering strength in many fields from medicine to e-business. There has been growing interest in data mining techniques applied to the clinical domain as well, with the aim of aiding diagnosis and enhancing medical procedures.

Incremental exercise tests have become a remarkable and versatile tool, which is widely used in the routine clinical evaluation of patients' health status. Cardiopulmonary exercise tests (CPETs) are maximal incremental tests that can reveal abnormal physiological functioning, which may occur when the subject undergoes intense physical stress. Incremental tests are commonly performed using either a cycle ergometer or a treadmill, and the physical activity demanded is progressively increased until the patient experiences volitional fatigue.

The purpose of this thesis is to develop an accurate model to predict future values of both the heart rate (HR) and oxygen consumption (VO2) of cardiac patients who perform an incremental exercise test. Our approach analyzes the cardiopulmonary response to the exercise and makes an early prediction of the highest HR and VO2 values and also a prediction of future values of such quantities. Since CPETs are physically very demanding, early prediction may be adopted to reduce the duration of execution of the test, thereby lowering the body stress, reducing the associated cost, and improving the efficiency of the equipment. The test can be stopped when the accuracy of the prediction attains satisfactory values.

Two major prediction techniques have been exploited throughout the thesis work: Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). The implementation of the predictive models has been carried out within the RapidMiner environment.

The experimental results highlight the effectiveness of the proposed approach, which is able to predict future values of the HR and VO2 signals with a limited error, and demonstrate its usefulness in clinical applications.


Table of contents

Table of contents
List of figures
List of tables
Acronyms and Abbreviations

1 Introduction

2 Data Mining
  2.1 Knowledge Discovery and Data Mining
  2.2 The KDD Process
  2.3 Data Mining Tasks
  2.4 Predictive Data Mining

3 Predictive Models
  3.1 Artificial Neural Networks
    3.1.1 Fundamentals of Biological Neural Networks
    3.1.2 Historical Background
    3.1.3 Perceptron
    3.1.4 Network Topology
    3.1.5 Learning Algorithms
    3.1.6 Feedforward Neural Networks
    3.1.7 The Backpropagation Algorithm
    3.1.8 Strengths and Weaknesses
  3.2 Support Vector Machines
    3.2.1 Linear SVMs
    3.2.2 Nonlinear SVMs
    3.2.3 Strengths and Weaknesses

4 Introduction to RapidMiner

5 Characteristics of the Dataset
  5.1 Cardiopulmonary Exercise Testing
  5.2 Exercise Modality and Protocol
  5.3 Description of the Signals
    5.3.1 Calculation of VO2
  5.4 Description of the Quantities of Interest
  5.5 Descriptive Statistics

6 Description of the adopted approach
  6.1 Data preparation
    6.1.1 Segmentation
    6.1.2 Temporal aggregation
    6.1.3 Sampling
    6.1.4 Normalization
    6.1.5 Windowing
  6.2 Implementation of the ANN- and SVM-based models with RapidMiner
    6.2.1 How to deploy a predictive model in RapidMiner
    6.2.2 Community- and Individual-based approach
    6.2.3 Analysis of the RapidMiner Building Operators
  6.3 Model Validation
  6.4 Performance measures: Mean Absolute Error

7 Experimental Results and Discussion
  7.1 HRmax and VO2max
    7.1.1 HRmax prediction
    7.1.2 VO2max prediction
  7.2 HRmax and VO2max varying the window size parameter
    7.2.1 HRmax prediction
    7.2.2 VO2max prediction
  7.3 HRmax and VO2max at intermediate steps
    7.3.1 HRmax prediction
    7.3.2 VO2max prediction
  7.4 HRnext and VO2next
    7.4.1 HRnext prediction
    7.4.2 VO2next prediction
  7.5 Effect of the SVM parameters setting on the prediction error

8 Conclusions

References


List of figures

2.1 Overview of the steps composing the KDD process [4]
2.2 Classification task [13]
3.1 A biological neuron [17]
3.2 The MP's neuron with a threshold logic unit
3.3 The Rosenblatt's Perceptron model of a neuron
3.4 A feedforward ANN with one hidden layer
3.5 Linearly separable classes
3.6 One-dimensional linear regression
3.7 Nonlinearly separable classes
5.1 Fraction of inspired oxygen
5.2 Fraction of expired oxygen
5.3 Fraction of inspired carbon dioxide
5.4 Fraction of expired carbon dioxide
5.5 Fraction of end-tidal oxygen
5.6 Fraction of end-tidal carbon dioxide
5.7 Pulmonary ventilation
5.8 Interval between an R wave and the next R wave
5.9 Inspiratory time
5.10 Expiratory time
5.11 Heart rate
5.12 Oxygen consumption
6.1 A depiction of the ANN model used for the VO2max prediction
6.2 The RapidMiner process using the CBA
6.3 The RapidMiner process using the IBA: the outer level
6.4 The RapidMiner process using the IBA: the inner level
6.5 The RapidMiner process using the IBA: the Validation operator
6.6 A feedforward ANN with two hidden layers and a single output
6.7 k-fold cross-validation [11]
6.8 Leave-One-Out Cross-Validation [11]
7.1 HRmax, high segment, ANNs vs. SVMs
7.2 HRmax, average trend, high segment
7.3 HRmax, average trend, high segment
7.4 HRmax, medium segment, ANNs vs. SVMs
7.5 HRmax, average trend, medium segment
7.6 HRmax, average trend, medium segment
7.7 HRmax, low segment, ANNs vs. SVMs
7.8 HRmax, average trend, low segment
7.9 HRmax, average trend, low segment
7.10 VO2max, high segment, ANNs vs. SVMs
7.11 VO2max, average trend, high segment
7.12 VO2max, average trend, high segment
7.13 VO2max, medium segment, ANNs vs. SVMs
7.14 VO2max, average trend, medium segment
7.15 VO2max, average trend, medium segment
7.16 VO2max, low segment, ANNs vs. SVMs
7.17 VO2max, average trend, low segment
7.18 VO2max, average trend, low segment
7.19 HRmax varying the window size, high segment
7.20 HRmax varying the window size, medium segment
7.21 HRmax varying the window size, low segment
7.22 VO2max varying the window size, high segment
7.23 VO2max varying the window size, medium segment
7.24 VO2max varying the window size, low segment
7.25 HRmax at intermediate steps, ANNs, high segment
7.26 HRmax at intermediate steps, SVMs, high segment
7.27 HRmax at intermediate steps, ANNs, medium segment
7.28 HRmax at intermediate steps, SVMs, medium segment
7.29 HRmax at intermediate steps, ANNs, low segment
7.30 HRmax at intermediate steps, SVMs, low segment
7.31 VO2max at intermediate steps, ANNs, high segment
7.32 VO2max at intermediate steps, SVMs, high segment
7.33 VO2max at intermediate steps, ANNs, medium segment
7.34 VO2max at intermediate steps, SVMs, medium segment
7.35 VO2max at intermediate steps, ANNs, low segment
7.36 VO2max at intermediate steps, SVMs, low segment
7.37 HRnext, ANNs, CBA vs. IBA, high segment
7.38 HRnext, SVMs, CBA vs. IBA, high segment
7.39 HRnext, ANNs, CBA vs. IBA, medium segment
7.40 HRnext, SVMs, CBA vs. IBA, medium segment
7.41 HRnext, ANNs, CBA vs. IBA, low segment
7.42 HRnext, SVMs, CBA vs. IBA, low segment
7.43 VO2next, ANNs, CBA vs. IBA, high segment
7.44 VO2next, SVMs, CBA vs. IBA, high segment
7.45 VO2next, ANNs, CBA vs. IBA, medium segment
7.46 VO2next, SVMs, CBA vs. IBA, medium segment
7.47 VO2next, ANNs, CBA vs. IBA, low segment
7.48 VO2next, SVMs, CBA vs. IBA, low segment
7.49 Effect of the C and Gamma parameters on the MAE
7.50 Effect of the Epsilon parameter on the MAE


List of tables

5.1 Description of the recorded signals
5.2 Descriptive statistics for the whole segment: monitored signals
5.3 Descriptive statistics for the low segment: monitored signals
5.4 Descriptive statistics for the medium segment: monitored signals
5.5 Descriptive statistics for the high segment: monitored signals
5.6 Descriptive statistics for the whole segment: HRmax and VO2max
5.7 Descriptive statistics for the low segment: HRmax and VO2max
5.8 Descriptive statistics for the medium segment: HRmax and VO2max
5.9 Descriptive statistics for the high segment: HRmax and VO2max
6.1 Segmentation criterion
6.2 An extract of the original dataset
6.3 An extract of the dataset after applying the temporal aggregation
6.4 An extract of the dataset after applying sampling
6.5 An extract of the dataset after applying windowing
6.6 An extract of the dataset resulting from the prediction of VO2max
6.7 Example of the output returned by the script for the calculation of the MAE
7.1 Comparison of the SVMs and ANNs in terms of execution time, HRmax
7.2 Comparison of the SVMs and ANNs in terms of execution time, VO2max
7.3 Comparison of the various window size settings in terms of execution time, HRmax


Acronyms and Abbreviations

ANN Artificial Neural Network
CBA Community Based Approach
CPET Cardiopulmonary Exercise Testing
CSV Comma Separated Values
ECG Electrocardiography
GUI Graphical User Interface
HR Heart Rate
IBA Individual Based Approach
KDD Knowledge Discovery from Data
MAE Mean Absolute Error
MP McCulloch-Pitts
RBF Radial Basis Function
SVM Support Vector Machine
TLU Threshold Logic Unit
VO2 Oxygen Consumption
WTA Winner Take All


Chapter 1 Introduction

Recent years have witnessed a tremendous increase in the amount of available data in a huge number of application domains ranging from medicine to e-business. The abundance of data stored in electronic form has made data mining techniques quickly gather strength in many fields. Data mining has gained great interest in the clinical domain as well, being an enabling resource for aiding diagnosis or improving medical procedures.

Incremental tests have become a remarkable and versatile tool used in the clinical evaluation of patients' health status. A cardiopulmonary exercise test (CPET) is a maximal incremental test capable of providing global information on the nature of exercise limitation and its underlying causes. It can aid in revealing abnormal physiological functioning that may occur only when the subject under test undergoes intense physical stress. During the test, both cardiovascular and respiratory responses are taken into account; indeed, in addition to the electrocardiogram (ECG), measures of pulmonary ventilation, oxygen consumption and carbon dioxide production are gathered throughout the test. Being a symptom-limited incremental test, it is commonly performed using a stationary cycle ergometer, and the physical activity demanded is progressively increased in terms of the external workload applied to the ergometer flywheel. The cycle's resistance is gradually increased until the patient reaches his or her limit of tolerance.

The purpose of this thesis is to analyze the individual body response to the exercise and predict future values of the heart rate (HR) and oxygen consumption (VO2) during the test execution. Data mining techniques are used for this purpose as they help to analyze large collections of data and extract useful information for further use. The HR value, defined as the number of times per minute the heart beats, will increase gradually as workload increases, up to the maximum possible, typically in the last step of the test. Hence, the maximal HR value (HRmax) is the highest number of heartbeats per minute achieved by an individual during the test. The maximal VO2 (VO2max) is the maximum rate of oxygen taken up and utilized by the body per minute. Similarly, VO2 generally takes its highest value in the last step of the test.

Maximal incremental tests are physically very demanding, because they require individuals to exercise until volitional fatigue. Hence, early prediction of the highest HR and VO2 values or prediction of future values of such quantities may be adopted to reduce the execution duration of the test, thereby reducing the associated cost and improving equipment efficiency. Further, stopping the exercise test in advance can lower the body stress and prevent exhaustion that may occur in prolonged tests. The test will be stopped when the accuracy of the prediction reaches satisfactory values. The thesis aims at the development of an accurate model to predict future values of both HR and VO2 signals of cardiac patients performing an incremental test.

The proposed approach involves the use of two major learning algorithms: Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). A sample of the physiological data collected during the exercise test has been used as input for the predictive models to predict future values of HR and VO2 signals. RapidMiner, one of the most popular and powerful software suites for machine learning and data mining, was used throughout this thesis work.

The experimental evaluation was performed by using various approaches for the design of the predictive models. The accuracy of the proposed models was assessed by means of the Leave-One-Out procedure and by calculating the Mean Absolute Error (MAE). The original dataset, comprising 320 subjects, was partitioned into three groups of similarly performing patients in accordance with the highest value of the workload reached throughout the exercise test. This segmentation phase was used to prevent comparing results between very different patients. Furthermore, as far as the one-step-ahead predictions are concerned, two approaches were compared: the Community-based approach (CBA) and the Individual-based approach (IBA).

Briefly stated, in the CBA, predictions are carried out by comparing one patient at a time with the entire community of patients. In contrast, in the IBA, only what the patient has done up to that moment is considered to predict the next time step value of the target attribute. In this thesis, the impact of the windowing operator on the prediction error was also evaluated. This operator changes the characteristics of a dataset passed as input, by specifying the window size, step size, and horizon. The window size indicates the size of the window to be used, the step size dictates how much the window has to advance, and the horizon indicates the prediction horizon used for forecasting.
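As an illustration of how these three parameters interact, the windowing transformation can be sketched in plain Python (an assumed equivalent of the RapidMiner operator, not its actual implementation):

```python
def windowing(series, window_size, step_size, horizon):
    """Turn a series into (window, label) pairs: each example takes
    `window_size` consecutive values as inputs and the value located
    `horizon` steps after the window as the prediction target."""
    examples = []
    last_start = len(series) - window_size - horizon
    for start in range(0, last_start + 1, step_size):
        window = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        examples.append((window, label))
    return examples

# Toy HR samples; window of 3, step of 1, one-step-ahead horizon
hr = [80, 82, 85, 88, 91, 95]
for window, label in windowing(hr, window_size=3, step_size=1, horizon=1):
    print(window, "->", label)
```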

Besides this first introductory chapter, the thesis is organized as follows. Chapter 2 gives a brief account of the basic concepts of data mining. Chapter 3 discusses two widely used techniques for predictive analysis: ANNs and SVMs. Chapter 4 provides information about RapidMiner, an open-source software for data mining. The intention of chapter 5 is to introduce the principles of a CPET and describe the dataset used for the experiments. Chapter 6 presents the proposed approach. It includes the actual data preparation part and details regarding the models' configuration and design. The experimental results are highlighted in chapter 7. Chapter 8 draws conclusions and discusses directions for future work.


Chapter 2 Data Mining

The past few decades have witnessed dramatic growth in the amount of available data, collected and stored by companies in large data repositories (e.g. databases, data warehouses). This sharp rise in digital data availability has created the need for powerful data mining tools capable of extracting interesting information. Information extracted from the gathered data may be further used for various applications, ranging from fraud detection to science exploration to market analysis.

This chapter is organized as follows. Section 2.1 introduces the notion of knowledge discovery in databases and explains its relation to data mining. Section 2.2 illustrates the essential steps involved in the knowledge discovery process. Section 2.3 gives a rough outline of the main techniques of data mining; some of them are discussed in subsequent sections.

2.1 Knowledge Discovery and Data Mining

Knowledge discovery from data (KDD) was initially defined as "the non-trivial extraction of implicit, previously unknown, and potentially useful information from data" by Frawley et al. [5]. In the intervening years, Usama Fayyad proposed a more accurate definition, referring to KDD as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [4]. According to this definition, data mining is an integral part of the KDD process concerned with finding patterns in data. The other steps involved in the KDD process are discussed in the next section. However, the term data mining is often used in a broader sense to indicate the whole process of discovering knowledge from data.


Fig. 2.1 Overview of the steps composing the KDD process [4]

2.2 The KDD Process

The KDD process, illustrated in Figure 2.1, is an iterative and interactive process consisting of a sequence of steps. For ease of representation, Figure 2.1 does not show all the possible loops that can occur between any two steps in the process. The core steps are summarized below [4].

• Understanding of the application domain in which KDD will take place. It includes the identification of the goal of the KDD process from the customer’s perspective.

• Selection: a target dataset is created by extracting a relevant subset of data from the whole data collection according to some criteria.

• Preprocessing and cleaning: it concerns removing noise and outliers, and handling in- consistent data.

• Transformation: it includes dimensionality reduction and attribute transformation.

• Data mining: it is a crucial step of the entire KDD process, involving the use of data mining techniques in order to discover hidden patterns.

• Pattern evaluation: at this stage, mined patterns representing knowledge are evaluated based on some measures.

• Knowledge: the discovered knowledge is finally presented to the user by means of maps, charts, and other graphical representations.
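Purely to make the sequence concrete, the selection, cleaning, and transformation steps can be sketched in Python with pandas (the values below are hypothetical, not taken from any real repository):

```python
import pandas as pd

# Toy data standing in for a raw repository (hypothetical values)
raw = pd.DataFrame({
    "patient": [1, 2, 3, 4],
    "hr": [88.0, 250.0, 91.0, None],   # 250 is an outlier, None is missing
    "vo2": [0.7, 0.9, 0.8, 0.6],
})

# Selection: extract the subset of attributes relevant to the goal
target = raw[["hr", "vo2"]]

# Preprocessing and cleaning: drop missing values and implausible outliers
clean = target.dropna()
clean = clean[clean["hr"].between(30, 220)]

# Transformation: rescale the attributes to a common [0, 1] range
normalized = (clean - clean.min()) / (clean.max() - clean.min())

# The data mining and pattern evaluation steps would follow on `normalized`
print(normalized)
```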

2.3 Data Mining Tasks

Data mining covers a wide range of tasks, generally grouped into two major categories: predictive and descriptive tasks [13].

Predictive tasks. The objective of this task is to predict unknown or future values of a target attribute based on one or more predictor attributes. Two types of predictive modeling tasks can be identified: classification and regression. Classification is used for categorical (discrete) variables, while regression is used for numeric (continuous) variables. Predictive models are built using a set of previously known data, called the training dataset, for which the values of the target variable are assumed to be known a priori. The type of learning involved in the training phase is referred to as supervised learning.

Descriptive tasks. The objective of this task is to explore data in order to discover characteristic features in a given set of data. The main descriptive data mining techniques include clustering and association analysis. Clustering is the process of identifying classes of closely related observations, whereas association analysis is used to find strong relationships among values in a dataset. A well-known application of association analysis is market-basket analysis. The type of learning used for descriptive purposes is referred to as unsupervised learning.

2.4 Predictive Data Mining: Classification and Regression

Classification and regression are predictive tasks consisting in predicting a target attribute based on one or more predictor attributes. The outcome of a regression model is a continuous value, while the output value of a classification model is discrete.

Classification problems aim to develop a model able to predict a categorical (or binary) target variable when a set of input variables is presented. Such a model is derived through a learning algorithm based on the analysis of a set of already classified data, known as the training set. Figure 2.2 gives a schematic representation of a classification task. A training dataset consisting of data, whose class label is known a priori, is used to build the model.

Many methods for constructing classification models can be applied, such as ANNs, SVMs, and k-nearest neighbor classification. Once the model has been built, a test set, consisting of unclassified data, is used to assess its accuracy. Classification models are evaluated based on the numbers of test records correctly and incorrectly predicted. Examples of real-world classification tasks are spam detection and credit card fraud detection.

On the other hand, regression problems aim to develop a model able to predict a continuous target variable rather than class labels. Similarly to classification, the model learns the relationship between the target attribute and the input attributes by means of a training set. Here, the model accuracy is assessed by computing an error function as the difference between the observed value and the predicted one. Intuitively, the goal of a regression model is to minimize the error function. A common real-world example of regression is the prediction of the price of a stock over time.
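As a minimal illustration of the difference between the two task types, here is a sketch using scikit-learn as a stand-in (the thesis performs these tasks in RapidMiner, and the values below are toy data):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X_train = [[55, 0.6], [60, 0.7], [150, 1.8], [160, 2.0]]  # toy [HR, VO2] pairs

# Classification: the target is a discrete class label (e.g. 0/1)
y_class = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_class)
print(clf.predict([[58, 0.65]]))   # -> a class label

# Regression: the target is a continuous value
y_value = [0.9, 1.0, 1.9, 2.1]
reg = LinearRegression().fit(X_train, y_value)
print(reg.predict([[58, 0.65]]))   # -> a continuous prediction
```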


Fig. 2.2 Classification task [13]


Chapter 3

Predictive Models

This chapter presents the fundamentals of the two predictive models used throughout the current work: ANNs and SVMs.

3.1 Artificial Neural Networks

ANNs represent a powerful tool for solving highly complex classification and regression tasks.

Roughly speaking, ANNs are computational systems consisting of a set of processing units, also called neurons, and a set of weighted connections linking units. The connections’ weights between units are adjusted through a learning phase which is required to train the network and build a model.

Since ANNs attempt to replicate the behavior of biological neural networks, a brief introduction to the latter is given in section 3.1.1. The basic concepts regarding ANNs and the pros and cons of applying this technique are also presented in this section.

3.1.1 Fundamentals of Biological Neural Networks

Biological neural networks are composed of nerve cells, known as neurons [17]. Figure 3.1 depicts the structure of a neuron. A neuron consists of a cell body (or soma) and its branches.

At the input end of the cell, a set of dendrites is used to convey information from other neurons.

Similarly, information generated by the neuron travels along a narrow fiber, known as the axon, down to the synapses, whose outputs provide the input to the dendrites of the next neuron.

In the human brain, chemical synapses contain thousands of molecules of neurotransmitters, which raise and lower the electrical potential inside the cell body of the receiving neuron.

When the neuron's electrical potential reaches an established threshold, the neuron is activated and a signal (consisting of electrical or nerve impulses) is transmitted along the axon to the next neuron. Thus, connections between neurons let the signal flow from one neuron to another, letting the behavior of the receiving neuron be influenced by the sending one.

Fig. 3.1 A biological neuron [17]

3.1.2 Historical Background

Warren McCulloch and Walter Pitts took the first key step on the development path of neural network principles in 1943 [15]. They proposed a simple mathematical model of biological neurons to explain their functioning. The McCulloch and Pitts (MP) neuron is conceived as a threshold logic unit (TLU), which only deals with binary values. A schematic diagram of a TLU is shown in Figure 3.2. The computing unit takes two inputs with weights assumed to be 1 and generates a binary output. The neuron remains inactive, that is, the output remains zero, until the weighted sum of the inputs meets a certain threshold θ (usually set equal to 1). Although their study was mainly focused on the understanding of the behavior of the human brain, many scientists interested in the ANN field took advantage of it. Thus, the end of the 1950s witnessed the implementation of the Perceptron model based on the MP neuron principles [18]. A more detailed look into the Perceptron is given in section 3.1.3. Single layer perceptrons could solve pattern recognition tasks and make associations but were not able to solve nonlinear problems.

The computational limitation of the Perceptron, pointed out by Marvin Minsky in 1968 [16], led to a severe slowdown in neural network research for many years. One of the most revolutionary outcomes, which led to the reemergence of interest in neural networks, was the publication in 1974 of the backpropagation learning algorithm attributed to Werbos [21].

In the past 40 years, many advances have occurred with the introduction of the concepts of competitive learning, self-organizing feature maps, adaptive resonance theory, and simulated annealing.


Fig. 3.2 The MP’s neuron with a threshold logic unit

3.1.3 Perceptron

In 1958, an American psychologist named Frank Rosenblatt proposed the Perceptron as a computational model of the retina of the eye [18]. Its basic network structure comprises three layers: an input layer of sensory units S, often called a 'retina', a middle layer of association units A, and an output layer of response units R (see Figure 3.3). The S layer is partly connected to the A layer through a set of unchanging randomized connections. Instead, the connections between the association and output layers have variable weights, which can be adjusted during a training process via an error-propagation based procedure. The association unit layer performs predetermined computations on the binary values transmitted by the S layer. The desired output is compared with the calculated response and the error is used to update the weights.

More specifically, a Perceptron with N inputs returns a binary output signal y, which can be written as:

\[ y = f\left(\sum_{i=1}^{N} x_i w_i - \theta\right) \]

First, a weighted sum of the input values is computed, then its value is compared to a threshold θ. The output produced by an R unit is either 0 or 1, according to the following rule:

\[ y = \begin{cases} 0, & \text{if } \sum_{i=1}^{N} x_i w_i < \theta \\ 1, & \text{otherwise} \end{cases} \tag{3.1} \]

Only one R unit can be activated at a time due to the winner-take-all (WTA) behavior of the network. Once the output signal is generated, the error between the actual value and the desired one is computed. It is used in the learning phase to adjust the weights of the network.
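A minimal sketch of this scheme in Python (a plain perceptron learning rule on a linearly separable toy task; folding the threshold θ into a bias weight is an implementation convention, not something spelled out above):

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """data: list of (inputs, target) pairs with binary targets 0/1."""
    n = len(data[0][0])
    w = [0.0] * n
    bias = 0.0                      # plays the role of -theta
    for _ in range(epochs):
        for x, target in data:
            s = sum(xi * wi for xi, wi in zip(x, w)) + bias
            y = 1 if s >= 0 else 0  # threshold rule (3.1)
            err = target - y        # desired output minus actual output
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            bias += lr * err
    return w, bias

# Learn the logical AND, which is linearly separable
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_perceptron(data))
```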

The Rosenblatt's Perceptron was originally developed to function as a pattern classifier by distinguishing between two given pattern classes. In fact, the Perceptron is guaranteed to converge only if it handles linearly separable tasks. Its theoretical weakness was shown by Marvin Minsky by illustrating several examples. In spite of that, the Perceptron had a huge impact on successive ANN implementations due to the introduction of numerical weighted connections in the network. More generalized models and variants, including the multi-layer feedforward perceptron, were designed.

Fig. 3.3 The Rosenblatt's Perceptron model of a neuron

3.1.4 Network Topology

Topology refers to the network architecture, defined as the collection of processing units, connections, and input/output patterns [19]. ANNs are organized into layers of processing units. Connections between units of different layers make information flow through the network. Depending on the nature of the connections, either feedforward or feedback networks are possible. There are different ways in which information can be processed by a neuron, and different ways in which neurons can be interconnected. Several neural network structures can be built by using different elements and specifying the form in which they are connected.

An ANN consists of at least two layers of processing units: an input layer and an output layer. One or more hidden layers may also be added. In a feedforward neural network, information flows in the forward direction from the input layer down to the output layer. In contrast, in a feedback neural network, information flows in a bidirectional way, therefore a network unit can be visited more than once. A brief description of a feedforward ANN is given in section 3.1.6.

3.1.5 Learning Algorithms

ANN learning algorithms can be broadly classified as supervised or unsupervised.

Supervised learning. The weight adjustment is determined based on the difference between the desired output and the actual output. A set of examples, called a training set, is used to develop the model. Each instance of the training set is associated with the corresponding target output, which is assumed to be known a priori. The training process is performed until the network learns to infer the relationship between the input values and the output values associated with them. Hence, supervised learning algorithms are error-based learning algorithms, since the error between the desired output and the calculated one is computed with the goal of changing the network's weights and other parameters, thereby improving its performance.

This type of learning is commonly used to solve regression or classification tasks. The most frequently used supervised learning method is the backpropagation algorithm.

Unsupervised learning. In contrast, in an unsupervised approach the ANN does not use any information regarding the target outputs of the training data. Indeed, no desired outputs are supplied to the ANN, which updates the weights based on local information. In this learning method, the network is intended to be self-organizing because it learns on its own by discovering characteristic features in a given set of input data. This type of learning is commonly used for clustering purposes.

3.1.6 Feedforward Neural Networks

As mentioned earlier, a feedforward neural network consists of a set of processing units organized in two or more layers, so that information flows from one layer to the next in a forward manner. A feedforward network comprises an input layer, N hidden layers, and an output layer. The number of hidden layers N can also be equal to zero. In common usage, one or two hidden layers are selected. A portrayal of a typical feedforward neural network with a single hidden layer is given in Figure 3.4.

Input signals from the input layer are propagated towards the next layer in the network up to the output layer. A processing phase is required. First, each input is multiplied by an appropriate value, known as a weight, and then the summation of the multiplications is passed through an activation function to produce the output signal. A widely deployed activation function is the sigmoid function. The weights are associated with every connection and node in the network, and their size is used to indicate the importance of the inputs in the model. Final weight values are iteratively determined by a learning procedure, such as backpropagation.

Fig. 3.4 A feedforward ANN with one hidden layer (single output)

Usually, three layers can be identified. They are [8]:

• An input layer. Inputs in the input layer are weighted and passed to a second layer of neurons.

• A hidden layer (one or more). Each unit in the hidden layer receives as input the weighted sum of the outputs of the previous layer and uses an activation function to produce the output. The hidden layer can be directly connected to the final output layer or to another hidden layer.

• An output layer. Each output unit receives as input the weighted sum of the outputs of the hidden layer. An activation function is then applied to obtain the network’s output.

To build a feedforward neural network, some parameters need to be set. In particular, the number of hidden layers and the number of neurons in each hidden layer must be defined.

Several experiments have revealed that choosing a number of hidden layers equal to 1 or 2 is sufficient for achieving the majority of tasks.
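A sketch of the forward pass through such a network, assuming one hidden layer and sigmoid activations throughout (the weights below are arbitrary toy values):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weights, biases):
    """Each unit: weighted sum of its inputs, then a sigmoid activation."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

# Toy network: 2 inputs -> 2 hidden units -> 1 output unit
w_hidden = [[0.5, -0.3], [0.8, 0.2]]
b_hidden = [0.1, -0.1]
w_out = [[1.2, -0.7]]
b_out = [0.05]

x = [0.9, 0.4]
hidden = layer(x, w_hidden, b_hidden)
output = layer(hidden, w_out, b_out)
print(output)
```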

3.1.7 The Backpropagation Algorithm

Backpropagation is the most widely employed algorithm in supervised learning. It trains the ANN with a gradient descent procedure in order to minimize the error function. The algorithm computes the gradient of the error function at each iteration step by comparing the desired output with the actual one. Throughout the procedure, the weights are adjusted over the network according to the descending gradient direction. When a set of weights that minimizes the error sum of squares is found, the learning process stops.

Since the backpropagation algorithm takes the derivative of the error function, it needs to deal with a feedforward neural network (a network which does not contain any cycles) to ensure the differentiability of the error function.

Two steps are encompassed within the procedure [3]:

• Feedforward: during the forward propagation the input signal passed to the network is propagated up to the output layer and the error function is then estimated.

• Backpropagation: the deviation of the actual output from the desired output defines the error signal. This error is propagated backward from the output layer down to the input layer through each hidden layer. The network's weights are continuously updated on the basis of the error value, so as to minimize it.

The logical steps for training an ANN with the backpropagation algorithm are described below [6]:

0. initialize the network’s weights to random small values;

1. pick a training pattern at random;

2. calculate the output response of the network;

3. compute the error at the outputs;

4. adjust the weights using the error signal;

5. if the error is too large return to step 1. Usually, the procedure stops when the error is lower than a certain acceptable value or when a given number of iterations is reached.
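A condensed sketch of these steps for a network with one hidden layer and a single sigmoid output, minimizing the squared error (plain Python; this mirrors the procedure above, not the RapidMiner implementation, and the biases are folded into the weight vectors):

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train(patterns, n_hidden=3, epochs=5000, lr=0.5):
    """patterns: list of (inputs, target) pairs with targets in [0, 1]."""
    n_in = len(patterns[0][0])
    random.seed(0)
    # Step 0: initialize the weights to small random values
    w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        x, t = random.choice(patterns)        # step 1: pick a random pattern
        xb = x + [1.0]                        # constant bias input
        h = [sigmoid(sum(xi * wi for xi, wi in zip(xb, ws))) for ws in w_h]
        hb = h + [1.0]
        y = sigmoid(sum(hi * wi for hi, wi in zip(hb, w_o)))  # step 2: output
        delta_o = (t - y) * y * (1 - y)       # step 3: error at the output
        # Step 4: propagate the error backward and adjust the weights
        for j in range(n_hidden):
            delta_h = delta_o * w_o[j] * h[j] * (1 - h[j])
            for i in range(n_in + 1):
                w_h[j][i] += lr * delta_h * xb[i]
        for j in range(n_hidden + 1):
            w_o[j] += lr * delta_o * hb[j]
    return w_h, w_o

# Example: the XOR patterns, which a single perceptron cannot learn
xor = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w_h, w_o = train(xor)
```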

3.1.8 Strengths and Weaknesses

ANNs offer a number of advantages, including:

• Possibility of dealing with a large number of variables and parameters,

• Capability of handling noisy or missing data,

• Robustness against faults,


• Finding a solution for nonlinear systems.

However, if an ANN approach is adopted some weaknesses are also present. The main ones are:

• Non-trivial interpretability of the model (often referred to as a “black box”),

• Overfitting tendency,

• Huge amount of time required in the learning process.

3.2 Support Vector Machines

The intention of this section is to outline the underlying concepts of SVMs, a set of supervised learning methods developed by Vladimir Vapnik and his co-workers in the 1990s. SVMs are used to solve both classification and regression problems. The basic idea of SVMs is to map data into a feature space where an optimal hyperplane is constructed.

Firstly, the principles of linear SVMs are presented in section 3.2.1. Further, a nonlinear approach, achieved through the kernel trick, is illustrated in section 3.2.2. The strengths and weaknesses of the SVM methodology conclude this brief overview.

3.2.1 Linear SVMs

Let's focus first on the easiest classification problem, concerning the binary classification of linearly separable data. Figure 3.5 shows the case of 2-dimensional space with two linearly separable classes.

An SVM classifier attempts to separate the two classes by a function that is induced from available data, known as the training data. The classifier will be able to generalize on unseen data once the learning phase ends. In simple terms, given a set of training data belonging to the two classes, the goal is to find an optimal hyperplane that separates the classes while maximizing the margin. The margin is defined as the distance between the closest data and the hyperplane.

Let's consider Figure 3.5. There is an infinite number of possible lines that can be drawn to separate all of the training data vectors of class 1 (in gray) from the data of class 2 (in black). As a result, infinite hyperplanes can be constructed to correctly classify the data, but the one with the maximum distance between the nearest training data vector and the hyperplane is chosen to represent the classifier decision boundary [13]. Consequently, the hyperplane with the largest separation between the classes (also known as the maximal marginal hyperplane [8]) yields the best generalization error. Intuitively, the maximal marginal hyperplane tends to be the most accurate at classifying future unseen data, as it is not affected by slight perturbations to the decision boundary [13].

Fig. 3.5 Linearly separable classes

In the case of SVM regression, the purpose is to construct a hyperplane with the biggest number of points that lie on it. Consider Figure 3.6: the training data vectors are required to be positioned within the ε-tube around the hyperplane. The ε-tube is used to define the loss function, thereby penalizing prediction errors. In fact, errors situated within a certain distance ε (in gray) from the hyperplane are ignored, whereas errors greater than the threshold ε (in black) are penalized and incur a loss [7]. The loss function increases proportionally to the distance of the input vector from the ε-tube.
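The ε-insensitive loss just described can be written down directly; a minimal sketch with illustrative values:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon-tube; grows linearly with the distance
    from the tube outside of it."""
    deviation = abs(y_true - y_pred)
    return max(0.0, deviation - epsilon)

print(epsilon_insensitive_loss(1.00, 1.05))  # inside the tube -> 0.0
print(epsilon_insensitive_loss(1.00, 1.30))  # outside the tube -> 0.2
```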

3.2.2 Nonlinear SVMs

Let's take the case of 2-dimensional space with two classes that are nonlinearly separable. An example is illustrated in Figure 3.7.

In the example, no straight separating line can be found between the two classes; however, it is easy to see that a circular decision boundary can correctly separate them. In such a case, the linear approach described so far can be extended to nonlinearly separable data by means of nonlinear transformations. Indeed, the original data can be mapped into a higher dimensional space where a linear separating hyperplane is to be found [8]. The optimal hyperplane found in the higher dimensional space is equivalent to a nonlinear separating hypersurface in the original space.

Fig. 3.6 One-dimensional linear regression

Fig. 3.7 Nonlinearly separable classes

A set of mathematical functions, known as kernel functions, is used to map data into a new feature space (higher dimensional) where a linear classification is performed.

This method is usually known as the kernel trick. The linear classification performed in the new space corresponds to the nonlinear classification in the original space.

As in the SVM classifiers, the kernel trick is used for nonlinear mapping in solving regression problems. Similarly, the linear regression performed in the new space corresponds to the nonlinear regression in the original space.

Several kernel functions can be employed to transform the original data, including:

• Linear function,

• Polynomial function,

• Radial Basis Function (RBF),

• Sigmoid function.
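For instance, a sketch of nonlinear SVM regression with an RBF kernel, using scikit-learn as an assumed stand-in for the RapidMiner SVM operator employed in the thesis:

```python
import numpy as np
from sklearn.svm import SVR

# Toy nonlinear relationship: y = x^2 with a little noise
rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 40).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.05, 40)

# The RBF kernel implicitly maps the data into a higher dimensional
# space, where a linear regression hyperplane is found
model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma="scale").fit(X, y)
print(model.predict([[1.5]]))  # expected to be close to 1.5**2 = 2.25
```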

3.2.3 Strengths and Weaknesses

The strengths of SVMs can be summarized as [12]:

• Flexibility in application: application areas range from pattern recognition to fraud detection to bioinformatics analysis,

• Robustness,

• Scalability,

• Overfitting resistance, unlike the ANNs.

The major weakness of this technique concerns the choice of a suitable kernel function to successfully implement the SVM. Moreover, a high computational cost may be required when dealing with high dimensional datasets.


Chapter 4

Introduction to RapidMiner

Almost all of the data mining tasks explained in this thesis have been fulfilled through the Community Edition of RapidMiner (version 5.3). For this reason, this chapter describes some basic concepts of RapidMiner.

RapidMiner, formerly known as YALE, is one of the most popular and powerful software suites for machine learning and data mining. It is an open-source tool developed by Rapid-I and distributed under the AGPL license. RapidMiner provides an easy-to-use graphical user interface (GUI) environment for the design of analysis processes. It offers functionalities like data integration, analytical ETL, data transformation, data analysis and visualization in a single application. It is written in the Java programming language and can be executed on any platform.

RapidMiner provides more than 500 operators for building models and manipulating data.

It allows users to visually represent a data mining process by creating a tree of operators.

Operators need to be selected and properly configured by the user by specifying the values of their parameters. An operator can be included in the process by simply selecting it from the Operator View and dragging it into the working area, known as the Process View. Operators are presented in groups according to their functionality. The main groups are listed hereunder:

• Process control: contains operators that control the process flow (e.g. Loop);

• Utility: contains auxiliary operators (e.g. Log);

• Repository access: includes at least two operators for read and write access to repositories;

• Import: comprises operators to read data from external formats (e.g. Read CSV);

• Export: comprises operators to write data into external formats (e.g. Write CSV);


• Data transformation: consists of those operators used for data transformation, such as type conversion and data cleaning;

• Modeling: includes, among the several operators available, those for classification and regression tasks;

• Evaluation: contains operators used to compute the quality of a model (e.g. X-Validation).

More details of the operators used throughout the thesis are given in chapter 6.

Following is a summary of the general characteristics of RapidMiner:

• Powerful and intuitive GUI for the design of analytical processes;

• Compatibility with all operating systems;

• Data integration, analytical ETL, data transformation, and data analysis functionality;

• Additional functionalities provided by Extensions (e.g. Text Mining Extension);

• Possibility of defining reusable building blocks;

• On-the-fly error detection and quick suggested fixes.


Chapter 5

Characteristics of the Dataset

This chapter describes the basic concepts related to the exercise methodology, protocols and equipment for CPET. The dataset used for the experiments and the physiological signals collected during the incremental cardiopulmonary test are also discussed.

5.1 Cardiopulmonary Exercise Testing

CPET has become an outstanding tool for the routine clinical evaluation of patients' health status. A CPET is a maximal incremental test suited to supply comprehensive information on the nature of exercise limitation and its underlying causes. Thus, a CPET is a non-invasive procedure that can aid in revealing abnormal physiological functioning that may occur only when the subject under test undergoes intense physical stress. During the test, both the cardiovascular and respiratory responses to incremental exercise are taken into account; indeed, in addition to the ECG, measures of pulmonary ventilation, oxygen consumption, and carbon dioxide production are gathered throughout the test.

5.2 Exercise Modality and Protocol

To properly assess the functional capacity of a patient, the exercise testing should involve large muscle groups, such as the lower extremity muscles in pedaling a cycle or running on a treadmill [10]. Therefore, exercise testing is commonly performed using either a stationary cycle ergometer or a motorized treadmill. Since CPET is a symptom-limited incremental test, the demanded physical activity, quantifiable in terms of the external workload, should be progressively increased to a symptom-limited maximum; both stepwise and continuous workload increments can be employed. The rate of the workload progression can be selected arbitrarily, even though it is recommended to choose the workload so that it yields a fatigue-limited exercise duration of about 10 minutes [2].

All CPETs should begin with a warm-up phase, followed by incremental exercise and a subsequent recovery period [20]. The maximal test includes a warm-up phase lasting a few minutes, during which the subject is asked to walk slowly on a treadmill or pedal a cycle ergometer with no additional resistance applied to it. Afterward, the incremental exercise starts: the treadmill's speed or the cycle's resistance is progressively increased until the patient reaches volitional exhaustion. Finally, an adequate recovery period is recommended; the subject keeps walking on the treadmill as it slows down or keeps pedaling the cycle ergometer as the external resistance is taken off the cycle. Early exercise termination may occur if the patient reports adverse symptoms, such as chest pain and breathlessness, or if the ECG reveals severely abnormal heart activity.

In our study, the maximal test has been performed using a stepwise incremental protocol on a cycle ergometer. After 2 minutes of unloaded pedaling, a workload increment of 5 W has been applied approximately every 30 seconds. The test has been stopped due to volitional fatigue, reached when the patient has no longer been able to pedal at the current rate. The CPET has been used for the functional evaluation of cardiac patients.
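For concreteness, the stepwise profile just described can be sketched as a function of elapsed time (assuming, for illustration, that the first increment falls exactly at the end of the warm-up and every 30 s thereafter):

```python
def workload_at(t_seconds):
    """Stepwise incremental protocol: 120 s of unloaded warm-up,
    then the external workload rises by 5 W every 30 s."""
    if t_seconds < 120:
        return 0
    steps_completed = (t_seconds - 120) // 30 + 1
    return 5 * steps_completed

for t in (0, 119, 120, 149, 150, 300):
    print(t, "s ->", workload_at(t), "W")
```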

The equipment set comprises a number of modules, including:

1. A cycle ergometer,

2. An electrocardiograph (12-lead ECG),

3. A pneumotachograph,

4. A gas analyzer,

5. A cuff sphygmomanometer.

An ECG system is essential for continuous heart activity monitoring, accomplished by means of electrodes placed on the limbs and chest wall. A pneumotachograph, connected to the airways through a mouthpiece, is designed to measure the pulmonary ventilation. Moreover, a gas analyzer is necessary to provide the concentration of oxygen and carbon dioxide in expired and inspired air. Finally, a cuff sphygmomanometer is used to measure blood pressure during the test. The subject breathes through a special mouthpiece connected with the equipment set that allows breath-by-breath measurements of physiological signals.

CPETs are physically very demanding because they require individuals to exercise until volitional fatigue. Hence, early prediction of the HRmax and VO2max values can be adopted to reduce the duration of execution and prevent patients' exhaustion; the test will be stopped when the accuracy of the prediction attains satisfactory values.


5.3 Description of the Signals

Starting from a raw dataset consisting of 481 tests, 320 were selected through a data preprocessing step in a previous work. All along the incremental test, a set of sensors has been used to record physiological signals, as reported in Table 5.1. During the test, both cardiovascular and respiratory responses are taken into account; indeed, in addition to the ECG, some signals such as the pulmonary ventilation, oxygen consumption and carbon dioxide production are measured breath-by-breath throughout the test.

Table 5.1 Description of the recorded signals

Signal Abbreviation Measurement unit

Fraction of inspired oxygen FIO2 %

Fraction of expired oxygen FEO2 %

Fraction of inspired carbon dioxide FICO2 %

Fraction of expired carbon dioxide FECO2 %

Fraction of end-tidal oxygen FETO2 %

Fraction of end-tidal carbon dioxide FETCO2 %

Pulmonary ventilation VE L·min−1

Interval between an R wave and the next R wave RR s

Inspiratory time TI s

Expiratory time TE s

Heart rate HR bpm

Oxygen consumption VO2 L·min−1

One patient representative of the whole dataset has been picked out in order to provide an easily visible example of the trends of the signals (Figures 5.1 to 5.12). The patient taken into account has reached step no. 14 of the test.

Fig. 5.1 Fraction of inspired oxygen
Fig. 5.2 Fraction of expired oxygen
Fig. 5.3 Fraction of inspired carbon dioxide
Fig. 5.4 Fraction of expired carbon dioxide
Fig. 5.5 Fraction of end-tidal oxygen
Fig. 5.6 Fraction of end-tidal carbon dioxide
Fig. 5.7 Pulmonary ventilation
Fig. 5.8 Interval between an R wave and the next R wave
Fig. 5.9 Inspiratory time
Fig. 5.10 Expiratory time
Fig. 5.11 Heart rate
Fig. 5.12 Oxygen consumption

5.3.1 Calculation of VO2

The VO2 has not been directly monitored during the test; rather, it has been estimated by taking the difference between the volumes of inhaled and exhaled oxygen [14].

\[ \mathrm{VO_2} = \mathrm{VO_{2I}} - \mathrm{VO_{2E}} \]

The volume of O2 in the inspired and expired air can be determined as follows:

\[ \mathrm{VO_{2I}} = f_{\mathrm{STPD}} \cdot V_I \cdot \mathrm{FIO_2} \]

and

\[ \mathrm{VO_{2E}} = f_{\mathrm{STPD}} \cdot V_E \cdot \mathrm{FEO_2} \]

fSTPD is the Standard Temperature and Pressure Dry air factor, and it allows comparison between values regardless of the temperature and pressure conditions at which they are collected. It can be expressed as:

\[ f_{\mathrm{STPD}} = \left( \frac{273\,\mathrm{K}}{273\,\mathrm{K} + T_A} \right) \cdot \left( \frac{P_{\mathrm{BAR}} - P_{\mathrm{H_2O}}}{760\,\mathrm{mmHg}} \right) \]

where PBAR is the ambient barometric pressure and PH2O is the water vapor pressure at a particular temperature TA. In our study, we assume TA = 36 °C and, consequently, PH2O = 44.6 mmHg. Furthermore, VI can be derived from VE through the following equation, known as the Haldane transformation:

\[ V_I = V_E \cdot \frac{1 - \mathrm{FEO_2} - \mathrm{FECO_2}}{1 - \mathrm{FIO_2} - \mathrm{FICO_2}} \]

In the following, the final form of the equation used for the computation of VO2 is provided:

\[ \mathrm{VO_2} = f_{\mathrm{STPD}} \cdot V_E \cdot \left[ \frac{1 - \mathrm{FEO_2} - \mathrm{FECO_2}}{1 - \mathrm{FIO_2} - \mathrm{FICO_2}} \cdot \mathrm{FIO_2} - \mathrm{FEO_2} \right] \]
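A direct transcription of these formulas into Python (a sketch; the input values in the example are illustrative, chosen near the whole-segment signal means reported in Table 5.2):

```python
def f_stpd(t_amb_c, p_bar_mmhg, p_h2o_mmhg=44.6):
    """Standard Temperature and Pressure Dry air factor."""
    return (273.0 / (273.0 + t_amb_c)) * ((p_bar_mmhg - p_h2o_mmhg) / 760.0)

def vo2(ve, fio2, feo2, fico2, feco2, t_amb_c=36.0, p_bar_mmhg=760.0):
    """Breath-by-breath VO2 from the expired ventilation and the gas
    fractions, deriving VI from VE via the Haldane transformation."""
    haldane = (1 - feo2 - feco2) / (1 - fio2 - fico2)
    return f_stpd(t_amb_c, p_bar_mmhg) * ve * (haldane * fio2 - feo2)

# Illustrative values close to the whole-segment means of Table 5.2
print(vo2(ve=27.2, fio2=0.205, feo2=0.171, fico2=0.001, feco2=0.034))
# -> roughly 0.78 L/min, in line with the mean VO2 of Table 5.2
```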


5.4 Description of the Quantities of Interest

Our study aims at analyzing the individual body response to the exercise in order to predict future values of both the HR and VO2 during the test execution. The HR value, defined as the number of times per minute the heart beats, will increase gradually as workload increases, up to the maximum possible, typically in the last step of the test. The highest number of heartbeats per minute achieved by an individual during the test is referred to as the maximal HR. Instead, the maximal VO2 corresponds to the maximum rate of oxygen taken up and utilized by the body per minute. Similarly, VO2 generally takes its highest value in the last step of the test.

Early prediction of the HRmax and VO2max values and one-step-ahead prediction of such quantities (HRnext, VO2next) are performed in the current work. The objective is to reduce the duration of execution of the test, stopping it when the accuracy of the prediction attains satisfactory values.

5.5 Descriptive Statistics

In order to give an overall picture of the dataset, Table 5.2 reports the main characteristics of our data, including the mean, standard deviation, minimum, and maximum. These descriptive statistics are of great help in identifying the distribution of the monitored signals that are used as input attributes for the predictive models. In a like manner, descriptive data summaries are provided for the target attributes HRmax and VO2max (Table 5.6). The considered dataset contains all the 320 patients executing the CPET.

In accordance with a pre-defined set of workload ranges, the entire dataset has been partitioned into three groups of similarly performing patients. A deeper discussion of the segmentation procedure is available in section 6.1.1. The obtained segments are denoted as low, medium, and high and contain 150, 125 and 40 tests respectively. Basic descriptive statistics are also computed for each one of the three segments to understand the data behavior in terms of central tendency and dispersion. They are given in the following tables.


Table 5.2 Descriptive statistics for the whole segment: monitored signals

Signal Mean Standard deviation Minimum Maximum

FIO2 0.2050 0.0022 0.2006 0.2104

FEO2 0.1706 0.0069 0.1409 0.1953

FICO2 0.0010 0.0003 0.0004 0.0030

FECO2 0.0344 0.0061 0.0145 0.0524

FETO2 0.1522 0.0096 0.1178 0.1907

FETCO2 0.0501 0.0065 0.0247 0.0736

VE 27.1860 12.0506 2.6665 79.6342

RR 23.7081 5.9326 5.8730 52.4769

TI 1.2161 0.3456 0.5433 4.3920

TE 1.5467 0.5109 0.5289 6.5760

HR 99.3269 20.3648 57.4286 173.500

VO2 0.7687 0.3192 0.0418 2.3061

Table 5.3 Descriptive statistics for the low segment: monitored signals

Signal   Mean      Standard deviation   Minimum   Maximum
FIO2     0.2048    0.0020               0.2011    0.2098
FEO2     0.1743    0.0058               0.1517    0.1953
FICO2    0.0010    0.0003               0.0005    0.0029
FECO2    0.0310    0.0053               0.0145    0.0493
FETO2    0.1554    0.0090               0.1322    0.1907
FETCO2   0.0473    0.0063               0.0247    0.0664
VE       24.2127   9.2206               2.6665    61.8198
RR       25.1361   5.9063               8.6645    52.4769
TI       1.1428    0.3016               0.5433    3.7840
TE       1.4342    0.4126               0.5289    4.0144
HR       98.30     19.0                 57.5      173.5
VO2      0.6041    0.1956               0.0418    1.2202


Table 5.4 Descriptive statistics for the medium segment: monitored signals

Signal   Mean      Standard deviation   Minimum   Maximum
FIO2     0.2049    0.0021               0.2006    0.2104
FEO2     0.1699    0.0064               0.1470    0.1907
FICO2    0.0010    0.0002               0.0004    0.0027
FECO2    0.0349    0.0053               0.0166    0.0524
FETO2    0.1520    0.0092               0.1214    0.1836
FETCO2   0.0502    0.0058               0.0325    0.0661
VE       27.9656   12.2017              6.5190    76.3127
RR       23.0695   5.7140               6.0560    46.7554
TI       1.2419    0.3485               0.6023    3.4832
TE       1.5948    0.5236               0.6863    6.5760
HR       99.6137   20.0409              58.1667   163.1667
VO2      0.7948    0.2826               0.1745    1.7605

Table 5.5 Descriptive statistics for the high segment: monitored signals

Signal   Mean      Standard deviation   Minimum   Maximum
FIO2     0.2057    0.0023               0.2014    0.2100
FEO2     0.1657    0.0059               0.1409    0.1855
FICO2    0.0010    0.0003               0.0005    0.0030
FECO2    0.0392    0.0052               0.0186    0.0515
FETO2    0.1469    0.0087               0.1178    0.1711
FETCO2   0.0548    0.0053               0.0411    0.0736
VE       30.7575   14.6228              8.5430    79.6342
RR       22.5543   5.9501               5.8730    43.5506
TI       1.2901    0.3858               0.6745    4.3920
TE       1.6421    0.5974               0.6997    6.1973
HR       100.5126  23.0793              57.4286   172.8182
VO2      1.0030    0.3960               0.2725    2.3061


Table 5.6 Descriptive statistics for the whole segment: HRmax and VO2max

Quantity of interest   Mean       Standard deviation   Minimum    Maximum
HRmax                  125.0903   18.3993              72.6250    173.5000
VO2max                 1.1123     0.3176               0.4843     2.3196

Table 5.7 Descriptive statistics for the low segment: HRmax and VO2max

Quantity of interest   Mean       Standard deviation   Minimum    Maximum
HRmax                  118.9      18.9                 72.6       173.5
VO2max                 0.8663     0.1512               0.4843     1.2202

Table 5.8 Descriptive statistics for the medium segment: HRmax and VO2max

Quantity of interest   Mean       Standard deviation   Minimum    Maximum
HRmax                  128.9256   15.2356              79.3333    163.1667
VO2max                 1.2216     0.1824               0.7233     1.7605

Table 5.9 Descriptive statistics for the high segment: HRmax and VO2max

Quantity of interest   Mean       Standard deviation   Minimum    Maximum
HRmax                  135.1177   17.5662              101.3750   172.8182
VO2max                 1.6286     0.2177               1.1592     2.3196


Chapter 6

Description of the adopted approach

The chapter presents the proposed approach, including the actual data preparation phase (section 6.1) and a step-by-step description of the implementation of the ANN- and SVM-based models with RapidMiner. Lastly, the leave-one-out procedure and the calculation of the MAE, used for the evaluation of the predictive models, are discussed in sections 6.3 and 6.4.

6.1 Data preparation

Starting from a raw collection of data, a first data preprocessing step was carried out in a previous thesis work. As a result, a collection of data free of inconsistencies, noise, and outliers was obtained.

Further preprocessing steps have been performed to make the data more suitable for data mining modeling.

6.1.1 Segmentation

The entire dataset, consisting of 320 individuals, has been partitioned into three groups according to the highest workload value, known as Wpeak, reached throughout the exercise test. Each group contains similarly performing patients, since data segmentation has been achieved by taking into consideration the Wpeak value, which provides a reliable indicator of the strength and endurance of the subject. Thus, segmentation has been exploited to avoid comparing results between very different patients, thereby enhancing the performance of the prediction models. Based on a predefined set of ranges, three segments, denoted by low, medium, and high, have been considered during the segmentation phase, as shown in Table 6.1.

Table 6.1 also reports the number of tests for each segment.


Table 6.1 Segmentation criterion

RANGE    FROM [W]   TO [W]   NO. TESTS
low      50         80       150
medium   85         115      125
high     120        140      45
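Assuming one Wpeak value per test is available, the segmentation rule of Table 6.1 can be sketched in Python as follows (data and column names are illustrative, not the thesis' actual code):

```python
import pandas as pd

# One row per test with its peak workload; test_id and Wpeak are assumed names.
tests = pd.DataFrame({"test_id": [1, 2, 3], "Wpeak": [75, 100, 130]})

# Bin edges follow Table 6.1 (low: 50-80 W, medium: 85-115 W, high: 120-140 W);
# each edge is placed at the start of the next range, intervals closed on the left.
bins = [50, 85, 120, 141]
tests["segment"] = pd.cut(tests["Wpeak"], bins=bins,
                          labels=["low", "medium", "high"], right=False)
print(tests)
```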

6.1.2 Temporal aggregation

The initial dataset includes:

• the patient identifier;

• the instants of time at which the patient breathes during the test, denoted by TIME;

• a set of physiological signals collected breath-by-breath (e.g., FIO2, FICO2);

• the external workload applied (WL).

Table 6.2 contains an extract of the original dataset; each column represents an attribute and each record an observation. Each record in the dataset comprises a collection of quantitative metrics, which represent the physiological state of a particular patient undergoing the CPET.

Firstly, the attribute values have been averaged with a sliding window approach in order to summarize such a large dataset and to reduce variability. Using a Python script, a sliding time window of fixed length has been moved across the rows, and the average values of the original time series have been calculated. In particular, the deployed algorithm lets the window progress over the data, computes a moving average for each attribute, and produces one output row per window, as sketched below.
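The following minimal sketch illustrates the idea; the window length, the column subset, and all names are assumptions, since the actual script is not reproduced here.

```python
import pandas as pd

# Toy breath-by-breath data in the layout of Table 6.2 (subset of columns).
raw = pd.DataFrame({
    "ID":   [1, 1, 1, 1],
    "TIME": [54986, 54989, 54992, 54995],   # breath timestamps [s]
    "HR":   [79.0, 79.0, 79.0, 80.0],
    "VE":   [13.973, 11.425, 14.968, 12.724],
})

def window_average(test: pd.DataFrame, window_s: int = 6) -> pd.DataFrame:
    """Average all signals of one test over consecutive windows of
    window_s seconds, producing one output row per window."""
    win = (test["TIME"] - test["TIME"].iloc[0]) // window_s
    return test.groupby(win).mean(numeric_only=True).reset_index(drop=True)

aggregated = raw.groupby("ID", group_keys=False).apply(window_average)
print(aggregated)
```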


Table 6.2 An extract of the original dataset

ID  TIME   FIO2     FEO2     FICO2     FECO2    FETO2    FETCO2   VE      RR      TI     TE     HR  WL
1   54986  0.20485  0.16569  7.30E-04  0.03467  0.14355  0.05388  13.973  16.63   1.272  2.328  79  0
1   54989  0.20473  0.17724  8.00E-04  0.02435  0.14467  0.05236  11.425  21.614  1.088  1.688  79  0
1   54992  0.20439  0.16526  7.60E-04  0.03461  0.14288  0.05379  14.968  17.241  1.096  2.376  79  5
1   54995  0.2042   0.17821  0.00133   0.02366  0.13892  0.05531  12.724  26.316  0.832  1.44   80  5
1   54997  0.2042   0.16894  8.60E-04  0.03079  0.14052  0.05487  13.132  19.28   1.008  2.088  80  5
1   55002  0.20467  0.16575  7.30E-04  0.03408  0.14221  0.05363  13.734  15.337  1.192  2.704  80  5
1   55006  0.20506  0.16374  6.40E-04  0.03619  0.14394  0.05294  14.878  13.274  1.432  3.08   79  5
1   55010  0.20478  0.17148  7.40E-04  0.02947  0.14578  0.05175  13.416  18.116  1.28   2.024  79  5
1   55013  0.2046   0.1687   7.40E-04  0.03195  0.14436  0.05275  13.708  16.892  1.264  2.288  79  5
1   55017  0.20488  0.16749  7.30E-04  0.03323  0.1452   0.05245  13.141  15.06   1.32   2.648  79  5
1   55021  0.20475  0.16589  7.10E-04  0.03464  0.14633  0.05168  16.52   16.556  1.312  2.312  79  5
...
