POLITECNICO DI MILANO
SCHOOL OF INDUSTRIAL AND INFORMATION ENGINEERING
Department of Mathematics
Master of Science in Mathematical Engineering
Count processes approach
to recurrent event data:
a Bayesian model for blood donations
ENRICO SPINELLI
MATRICOLA: 875462
SUPERVISOR: PROF. ALESSANDRA GUGLIELMI
COADVISOR: PROF. ETTORE LANZARONE
Abstract
This work addresses an important practical problem: predicting the number of donations at a specific blood centre, in order to plan the collection phase of the blood supply chain efficiently. First, statistical models for estimating the rate of blood donations are considered. Models of this kind make it possible to predict the return time to donation for an individual. The real data analyzed come from the databases of the Milan section of the Associazione Volontari Italiani Sangue (AVIS). The models and methods used belong to Bayesian statistics, and blood donations are modelled as recurrent events. Specifically, the focus is on the rate function, which is the instantaneous probability of the event occurrence. The object of inference in this approach is the counting process $\{N_i(t) : t \ge 0\}$ for each donor $i$, where $N_i(t)$ represents the number of donations made up to time $t$ by the $i$-th donor.
Usually the waiting times between donations are considered; modelling the counts instead allows the process to retain memory and to evolve with an occurrence rate that depends on the time of the event.
The analysis highlights a decreasing trend of the rate function and identifies some significant covariates. Moreover, with the use of random effects in the model, heterogeneity among individuals is captured, and for each donor the posterior density of one parameter (called frailty) summarises his or her personal propensity to donate.
The behaviour of existing donors has been modelled within the context of recurrent events. Since part of the blood supply also comes from occasional or new donors, a Bayesian time series model has been proposed to make predictions in this context.
Estratto in lingua italiana
This work seeks a solution to a very important practical problem: forecasting the number of donations at a specific blood collection centre, in order to plan the collection phase of the blood supply chain efficiently. First, statistical models for estimating the rate of blood donations were considered. Models of this kind make it possible to predict the return time to donation for an individual. The real data analyzed come from the databases of the Milan section of the Associazione Volontari Italiani Sangue (AVIS). The models and methods used belong to Bayesian statistics, and blood donations were modelled as recurrent events. Specifically, attention focused on the rate function, which is the instantaneous probability of the event occurring. The object of inference in this approach is the counting process $\{N_i(t) : t \ge 0\}$ for each donor $i$, where $N_i(t)$ represents the number of donations made up to time $t$ by the $i$-th donor.
Waiting times between donations are usually considered, but modelling the counts allows the process to retain memory and to unfold with an occurrence rate that changes as time passes.
The analysis shows a decreasing trend in the rate function and identifies some covariates as significant. Moreover, by including random effects in the model, heterogeneity among individuals is explained, and for each donor the posterior distribution of one parameter (called frailty) summarises his or her personal propensity to donate.
The behaviour of existing donors was modelled in the context of recurrent events. Since part of the blood supply also comes from occasional or new donors, a Bayesian time series model was proposed to forecast this phenomenon.
Table of Contents
Abstract
Estratto in lingua italiana
Table of Contents
List of Figures
List of Tables
Introduction
1 Theoretical background on modelling recurrent events
1.1 Framework and notation
1.2 Recurrent events as gap times
1.3 Recurrent events as event counts
1.4 Heterogeneity between individuals
1.4.1 Covariates in the Poisson process
1.4.2 Random effects
1.5 Extensions to renewal and Poisson processes
1.5.1 "At risk" indicator function
1.5.2 General intensity-based model
1.5.3 Multi-state Markov models
1.5.4 Modelling the baseline intensity function
1.6 The Bayesian approach
1.6.1 Bayesian Statistics
1.6.2 Monte Carlo Markov Chains
1.6.4 Autocorrelated prior for the baseline intensity function
1.7 Model evaluation in terms of predictive performances
1.7.1 Log posterior predictive density
1.7.2 Computation of WAIC
1.7.3 Evaluating predictive accuracy in the case of recurrent events
2 Data source
2.1 The AVIS association
2.1.1 Brief history of AVIS
2.1.2 Italian donation rules
2.2 Data sources
2.2.1 The EMONET database
2.2.2 The AVIS database
2.2.3 Data selection
2.2.4 Suspensions
2.3 Features selection and data transformation
2.4 Descriptive analysis
2.4.1 Rate of donations and gap times
2.4.2 Covariates
3 Modelling blood donations as recurrent events
3.1 Recurrent event models for blood donations
3.2 Modelling choices
3.2.1 Baseline intensity function
3.2.2 Frailty parameters
3.2.3 Covariates
3.2.4 At risk indicator function, censoring and suspensions
3.3 The Bayesian model for recurrent data of M donors
3.3.1 The likelihood
3.3.2 Prior elicitation
3.4 The predictive distribution of the counting process of a new incoming donor
4 Posterior inference on AVIS data
4.1 Posterior inference
4.2 Inference on parameters
4.2.2 Covariates coefficients
4.2.3 Random effects
4.2.4 Predictive density for the count process of a new incoming donor
4.3 Point predictions
5 Forecasting new donors
5.1 State Space Models
5.2 Descriptive analysis
5.3 A Bayesian model for the new donors
5.4 Posterior inference
Conclusions and further developments
List of Figures
2.1 Histogram of the empirical rates of donation (number of donations divided by the years of observation)
2.2 Boxplot of the number of days passed from the last observed donation of every donor to their censoring time
2.3 Trend of gap times with the number of donations
2.4 Trend of the gap times with the years passed since entrance
2.5 Histogram of the logarithm of the gap times
2.6 Boxplots of the BMI according to the values of the categorical covariates
2.7 Boxplots of the first donation age according to the values of the categorical covariates
2.8 Boxplots of the donation rate grouped with the categorical variable
2.9 Scatterplot of the donation rates against the continuous variable (AGE and BMI)
3.1 Histogram of gap times of female donors, the red line corresponds to 180 days
3.2 Histogram of gap times of female donors, the red line corresponds to 90 days
3.3 Percentage of earlier-than-allowed donations as a function of the threshold age for menopause
4.1 95 % credibility intervals for the baseline intensity function
4.2 Estimated log posterior predictive density
4.3 95 % credibility intervals for the $\beta_i$ parameters
4.4 Summaries of $w_i$
4.6 95 % posterior predictive credibility intervals of $w^{new}_j$, $j = 1, \dots, J$, the frailty of a new donor from zone $j$. In grey the estimate obtained with the model with no areal dependence.
4.7 Pointwise predictive 95 % credibility intervals for $N^{new}(t) \mid x^{new}$, where $x^{new}$ is set to the mean (or to the mode) of the features used as covariates
4.8 Mean functions for $N^{new}(t) \mid x^{new}$, data. Unless stated otherwise, the covariates are set to the mean (or to the mode)
5.1 Weekly arrivals of new donors grouped by years
5.2 Weekly arrivals of new donors grouped by months
5.3 Time series of the weekly arrivals
5.4 Traceplots variance parameters Model 1
5.5 Model 1: decomposition of the time series
5.6 Model 2: decomposition of the time series
5.7 Prediction of new weekly arrivals: 95 % credibility intervals
List of Tables
2.1 Variables from table PRESENTAZIONI in EMONET database that are included in our dataset
2.2 Variables from table DONAZIONI in EMONET database that are included in our dataset
2.3 Variables from table ANAGRAFICHE in EMONET database that are included in our dataset
2.4 Variables from table EMC_DONABILI in EMONET database that are included in our dataset
2.5 Variables from table TIPIZZAZIONE in EMONET database that are included in our dataset
2.6 Variables from table STILIVITA in AVIS database that are included in our dataset
2.7 Variables from table SOSPENSIONI in AVIS database that are included in our dataset
2.8 Frequency table that relates the type of suspensions to the respect of the suspensions
2.9 Description of the features
2.10 Number of donors that did exactly n total donations (after the first one)
2.11 Table of the sample frequencies of the categorical variable
2.12 Mean and standard deviation of the continuous variable
4.1 Bayesian p-values and hazard ratios
4.2 Predictive performances evaluation of models with different sets of covariates using 10 fold cross validation.
4.3 Point prediction errors
5.1 Summaries of the empirical distribution of the time series of the weekly arrivals
Introduction
Human blood is a natural product that cannot be reproduced artificially, so the only way of guaranteeing its availability for health purposes is through donations from living individuals. Blood is needed to save lives, to improve their quality and to extend their length. It is essential in first aid, emergency services, surgery, organ and bone marrow transplants, and the treatment of oncological and haematological diseases. Blood is essential not only in exceptional cases such as natural disasters, accidents or serious pathological conditions, but it is also a unique source of survival in chronic diseases such as anaemia, liver dysfunction, lack of coagulation factors and disorders of the immune system.
The blood donation supply chain can be divided into four phases: collection, transportation, storage and utilization. In the collection phase the donor's eligibility to donate is checked; then, if the donation takes place, the blood is screened in a laboratory to prevent infectious diseases and is possibly fractionated into subcomponents. Afterwards it is transported to and stored in hospitals or transfusion centres, and finally it is used for a transfusion.
Baş Güre et al. (2018) discuss how the management of blood collection from donors has not been adequately considered so far. Indeed, most efforts in the scientific literature are aimed at demand prediction or at the efficient management of storage and distribution. Despite this lack of attention, collection is one of the most important phases of the blood donation supply chain. Blood has a limited shelf life, so the demand of hospitals and transfusion centres has to be covered as precisely as possible to avoid wasting this resource. When neither the demand nor an estimate of it is available, storage should be planned to keep the number of blood units of each type constant across days in every centre. Moreover, knowing the number of incoming donors in advance can lead to optimal planning of the appointment scheduling system, with the purpose of merging the production balancing requirements and the service planning requirements. In this way the quality of the service from donors’
on voluntary donations. The major organization that brings together volunteer blood donors is the Associazione Volontari Italiani Sangue (AVIS). Clearly, a precise forecast of arrivals is necessary for an efficient management of blood collection. Modelling and understanding the behaviour of donors is a way to obtain one. Some statistical models have been proposed in the scientific literature. Previous works rely on logistic regression or on modelling the gap times between donations in the framework of recurrent events. Apart from Gianoli (2016), all publications use frequentist methods, while the Bayesian approach is largely unexplored.
The class of methods used in this thesis belongs to Bayesian statistics and a recurrent event approach is adopted but, unlike in Gianoli (2016), event counts over time are modelled, not the waiting times between two successive blood donations.
In recent decades, thanks to improvements in the performance of computing systems and to the spread of MCMC methods, the Bayesian approach has been spreading in the scientific world, since it is able to give a richer inference than classical statistics. Indeed, probabilistic estimates are exact because they do not rely on large-sample theory, and instruments such as interval estimates have a clearer meaning. Moreover, with predictive distributions, the Bayesian paradigm offers a natural way to do forecasting.
This thesis deals with the analysis of a dataset built from real data provided by the AVIS section of Milan.
Suitable data have been downloaded, using SQL queries, from two databases on the AVIS server. A pre-processing stage followed in order to make the raw data usable for statistical analysis. As a result, a dataset of M individuals has been built. Times of donations, personal features and the total time of observation (namely, the censoring time) were available for each individual.
Subsequently, a proper model for treating blood donations as recurrent events has been formulated. At first, statistical modelling of recurrent event processes has been studied. A brief survey of the state of the art of statistical methods in the field of recurrent blood donations has been carried out, in both Bayesian and frequentist statistics. Then, a suitable class of models has been identified. However, some modifications were made to adapt the class of models to the real phenomenon. For instance, the model can handle some typical features of the blood donation cycle, such as the mandatory deferral time after each donation or the suspensions from donor activity.
2016), an open-source C++ software package that performs MCMC sampling.
Finally, posterior inference in the form of MCMC output has been analyzed and interpreted; moreover, a way to sample from a recurrent event process has been proposed. Appropriate instruments for assessing goodness of fit and predictive accuracy have been discussed and used to compare different models (for instance, different parametrizations or different subsets of covariates).
The result of the work summarized above is a mathematical model that can explain the behaviour of a blood donor starting from the moment of his or her first donation. Individual features are present in the model as covariates. Some of them have been identified as statistically significant and correlated with a higher or lower number of donations per unit of time. The model can also be used for individual-specific prediction of new donations.
Finally, to have a complete modelling of the number of donations in a specific blood collection centre, the time series of the weekly number of new donors has been briefly analyzed as a State Space Model. Summing up, the original contributions of this work are:
• composition of the dataset;
• the study of models for the rate function of recurrent events, particularly using the Bayesian approach;
• application of the class of models to the dataset;
• predictive accuracy comparison of different models;
• a State Space model to predict the number of new donors;
• Stan implementation of the models.
The thesis is organized as follows.
In the first chapter an overview on recurrent event processes and on the various modeling techniques will be given, both in frequentist and Bayesian frameworks.
The second chapter is dedicated to the description of the data sources.
The particular modeling choices regarding the examined dataset will be explained in detail in the third chapter.
The fourth chapter is dedicated to the presentation of the results of the analysis. The inference a posteriori about the parameters of the model will be shown and commented.
The last chapter is devoted to the time series modelling of new donors’ weekly number.
1 Theoretical background on modelling recurrent events
This chapter gives a brief review of the statistical models used in the analysis of recurrent events. Recurrent event processes are processes in which events are generated repeatedly over time. A brief recall of Bayesian statistics and MCMC methods follows. Model evaluation in terms of predictive performance concludes the chapter. Almost all the material in this chapter is from Cook and Lawless (2007).
1.1 Framework and notation
A single recurrent event process starting at time $t = 0$ is characterized by an increasing sequence of event times $\{T_k, k \in \mathbb{N}\}$, where each element of the sequence denotes the time of the corresponding event. The counting process $\{N(t), t \ge 0\}$ associated with this sequence is defined as
\[
N(t) = \sum_{k=0}^{\infty} I(T_k \le t), \tag{1.1}
\]
where $I(T_k \le t)$ equals 1 when $T_k \le t$ and 0 otherwise. The counting process evaluated at time $t$ records the cumulative number of events that occurred in the interval $[0, t]$. Moreover, the number of events that occurred in the interval $(s, t]$ can be expressed as $N(t) - N(s)$.
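As a small illustration of definition (1.1), the sketch below evaluates the counting process on a sorted list of event times. The function names and the example donation times are hypothetical and only serve to show the bookkeeping.

```python
from bisect import bisect_right


def count_events(event_times, t):
    """N(t): number of events with T_k <= t (event_times must be sorted)."""
    return bisect_right(event_times, t)


def count_in_interval(event_times, s, t):
    """Number of events in (s, t], computed as N(t) - N(s)."""
    return count_events(event_times, t) - count_events(event_times, s)


# Hypothetical donation times, in days since the start of observation.
times = [95.0, 200.0, 310.0, 480.0]
print(count_events(times, 310.0))              # events in [0, 310]
print(count_in_interval(times, 100.0, 480.0))  # events in (100, 480]
```

Because the event times are sorted, a binary search is enough; no explicit sum over indicators is needed.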
Let $H(t) = \{N(s), 0 < s \le t\}$ be the history of the process. A recurrent event process can be defined by specifying the instantaneous probability that an event occurs given the previous history, under the hypothesis that two events cannot occur simultaneously. Considering the probability that an event occurs in the interval $(t, t + \Delta t]$, one can define the intensity function:
\[
\lambda(t \mid H(t)) = \lim_{\Delta t \to 0} \frac{P(N(t + \Delta t) - N(t) = 1 \mid H(t))}{\Delta t}. \tag{1.2}
\]
Once the intensity function is known, it is possible to write the probability of a specified event history and conditional probabilities for inter-event times through the following results (see Cook and Lawless, 2007).
Conditionally on $H(\tau_0)$, the probability density of the outcome "$n$ events, at times $t_1 < \dots < t_n$, where $n > 0$, over the specified interval $[\tau_0, \tau]$", for a process with an integrable intensity $\lambda(t \mid H(t))$, is:
\[
\exp\left(-\int_{\tau_0}^{\tau} \lambda(u \mid H(u))\,du\right) \prod_{j=1}^{n} \lambda(t_j \mid H(t_j)). \tag{1.3}
\]
For a process with integrable intensity $\lambda(t \mid H(t))$,
\[
P\big(N(t) - N(s) = 0 \mid H(s)\big) = \exp\left(-\int_{s}^{t} \lambda(u \mid H(u))\,du\right). \tag{1.4}
\]
Let $W_j = T_j - T_{j-1}$ be the waiting time between events $j-1$ and $j$; then:
\[
P\big(W_j > w \mid T_{j-1} = t_{j-1}, H(t_{j-1})\big) = \exp\left(-\int_{t_{j-1}}^{t_{j-1}+w} \lambda(u \mid H(u))\,du\right). \tag{1.5}
\]
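The results (1.3) and (1.5) translate almost directly into code. The following is a minimal sketch, assuming the intensity is supplied as a user-defined function `intensity(t, history)` of the current time and of the list of earlier event times; the integrals are approximated with a midpoint rule, and all names are illustrative.

```python
import math


def integrate(f, a, b, n=2000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


def log_density(intensity, event_times, tau0, tau):
    """Logarithm of (1.3) for events at event_times observed over [tau0, tau]."""
    def lam(t):
        # the history at time t contains the events strictly before t
        return intensity(t, [s for s in event_times if s < t])

    # integrate piecewise between events, where the intensity may jump
    knots = [tau0] + list(event_times) + [tau]
    integral = sum(integrate(lam, a, b) for a, b in zip(knots, knots[1:]))
    return -integral + sum(math.log(lam(t)) for t in event_times)


def gap_survival(intensity, history, w):
    """P(W_j > w | history) as in (1.5); the last event is history[-1]."""
    start = history[-1]
    return math.exp(-integrate(lambda u: intensity(u, history), start, start + w))
```

With a constant intensity the two functions reduce to the familiar homogeneous Poisson expressions, which gives a quick sanity check.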
It is clear from the formulas above that the amount of information contained in the intensity function leads it to play a crucial role in modelling a recurrent event process.
According to the goal of the analysis, it is possible to model event occurrences through two main ways: event count and gap times. In the first scenario the focus is on the counting process N(t), while in the second case the waiting times between two consecutive events are modelled. In the next sections a brief summary of the two approaches will be given.
1.2 Recurrent events as gap times
The analysis of recurrent events as gap times is common when the events are relatively infrequent or when the system returns to the initial state after every occurrence. In this case the process is called a renewal process, and it is a useful framework for system failures or cyclical phenomena. In a renewal process the gap times $W_j = T_j - T_{j-1}$ between events $j-1$ and $j$ are independent and identically distributed conditionally on the parameters. This condition is equivalent to:
\[
\lambda(t \mid H(t)) = h(t - T_{N(t^-)}), \tag{1.6}
\]
\[
N(t^-) := \lim_{s \to t^-} N(s), \tag{1.7}
\]
where $h(t)$ is the hazard function, defined as follows:
\[
h(t) = \lim_{\Delta t \to 0} \frac{P(W < t + \Delta t \mid W \ge t)}{\Delta t} = \frac{f(t)}{S(t)}, \tag{1.8}
\]
where $f(t)$ is the density function of the waiting times and $S(t) = P(W \ge t)$ is their survival function. The hazard function is the main focus of a branch of statistics called survival analysis, in which times to events such as failures or deaths are analyzed. Processes of this kind are often called time-to-event processes. The hazard function and the intensity function are clearly similar: both represent the instantaneous probability that an event occurs at time $t$. Hence, the same modelling approach can be followed for both functions.
A renewal process is equivalent to many time-to-event processes occurring one after the other since, as can be seen in equation (1.6), the intensity function takes the same values after every event, losing memory of the past. A renewal process can be generalized by inducing dependence between gap times through linear models. Thus it is possible to have a trend in the waiting times.
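To make the renewal structure concrete, here is a short simulation sketch. It assumes Weibull gap times with hypothetical `shape` and `scale` parameters (the scale parametrization of Python's `random.weibullvariate`, not necessarily the one used later in the text), and evaluates the intensity (1.6) as the hazard at the time since the last event, taking $T_0 = 0$.

```python
import random


def weibull_hazard(w, shape, scale):
    """Hazard h(w) = f(w) / S(w) of a Weibull gap time (scale parametrization)."""
    return (shape / scale) * (w / scale) ** (shape - 1)


def simulate_renewal(shape, scale, tau, rng):
    """Event times in (0, tau] of a renewal process with i.i.d. Weibull gaps."""
    times, t = [], 0.0
    while True:
        t += rng.weibullvariate(scale, shape)  # next gap time W_j
        if t > tau:
            return times
        times.append(t)


def renewal_intensity(t, times, shape, scale):
    """Intensity (1.6): the hazard evaluated at the time since the last event."""
    last = max((s for s in times if s < t), default=0.0)
    return weibull_hazard(t - last, shape, scale)


events = simulate_renewal(1.5, 2.0, 50.0, random.Random(0))
print(len(events), renewal_intensity(50.0, events, 1.5, 2.0))
```

After every event the intensity restarts from `weibull_hazard(0, ...)`, which is the loss of memory described above.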
1.3 Recurrent events as event counts
The main way of representing a recurrent event process through event counts is to model it as a Poisson process. In this framework the events occur randomly in such a way that their numbers in disjoint time intervals are statistically independent. Equivalently, the intensity function at time $t$ does not depend on the history $H(t)$ of the process, and it can be expressed in the following form:
\[
\lambda(t \mid H(t)) = \rho(t), \quad t > 0, \tag{1.9}
\]
where $\rho(t)$ is a non-negative integrable function called the rate function. If $\rho(t) = \rho$ for each $t$, which means that the intensity is constant over time, the Poisson process is called homogeneous; otherwise it is called non-homogeneous. Since $\rho(t)\,dt$ approximates the probability that an event occurs in the interval $[t, t + dt]$, it is also the mean number of events in an infinitesimal time interval. Hence
\[
\mu(t) = \int_{0}^{t} \rho(u)\,du \tag{1.10}
\]
is the mean number of events in the interval $[0, t]$, and it is called the cumulative rate function. The definition of Poisson processes implies the following properties:
• if $t \ge s \ge 0$, then $N(s, t) = N(t) - N(s)$ has a Poisson distribution with mean $\mu(t) - \mu(s)$;
• if $(s_1, t_1]$ and $(s_2, t_2]$ are non-overlapping intervals, then $N(s_1, t_1)$ and $N(s_2, t_2)$ are independent random variables;
• in the case of a homogeneous Poisson process with intensity $\rho$, the gap times $W_j = T_j - T_{j-1}$ are independent and identically distributed Exponential random variables with survivor function
\[
P(W_j > w) = \exp(-\rho w), \quad w \ge 0; \tag{1.11}
\]
• if the Poisson process is non-homogeneous with mean function $\mu(t)$, the process defined on the new time scale $s = \mu(t)$ as
\[
M(s) = N(\mu^{-1}(s)), \quad s > 0, \tag{1.12}
\]
is a homogeneous Poisson process with unit intensity.
Hence the intensity function $\rho(t)$ can be used to model a time trend in the events.
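The time-change property (1.12) also gives a simple way to simulate a non-homogeneous Poisson process: draw a unit-rate homogeneous process on the $s = \mu(t)$ scale and map its points back through $\mu^{-1}$. The sketch below assumes, purely for illustration, a power-law rate $\rho(t) = a b t^{b-1}$, so that $\mu(t) = a t^b$ can be inverted in closed form; the parameters `a` and `b` are hypothetical.

```python
import random


def simulate_nhpp_power_law(a, b, tau, rng):
    """Event times on (0, tau] for rho(t) = a*b*t**(b-1), i.e. mu(t) = a*t**b.

    A unit-rate homogeneous Poisson process is generated on the s = mu(t)
    scale and each point is mapped back with mu^{-1}(s) = (s / a) ** (1 / b).
    """
    s_max = a * tau ** b          # mu(tau): end of the transformed time axis
    times, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)  # unit-rate exponential gap on the s scale
        if s > s_max:
            return times
        times.append((s / a) ** (1.0 / b))


rng = random.Random(0)
counts = [len(simulate_nhpp_power_law(2.0, 0.5, 4.0, rng)) for _ in range(2000)]
print(sum(counts) / len(counts))  # should be close to mu(4) = 2 * 4**0.5 = 4
```

With `b < 1` the rate decreases over time, so the simulated events cluster near the origin; the average count over many replications approximates $\mu(\tau)$.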
1.4 Heterogeneity between individuals
In some contexts the event-generating process may differ among individuals; such heterogeneity can be modelled by including covariates and random effects in the model.
1.4.1 Covariates in the Poisson process
The most common way of including a vector of time-varying covariates $x(t)$ in an intensity-based recurrent event process is to consider first a baseline intensity function $\lambda_0(t)$, which corresponds to the intensity function of a particular individual (for example an individual with $x(t) = 0$).
The next step is to consider intensities of the form:
\[
\lambda(t \mid x(t)) = \lambda_0(t)\, g(x(t); \beta), \tag{1.13}
\]
where $g(x(t); \beta)$ is a non-negative integrable function and $\beta$ is a vector of regression parameters. Typically $g(x(t); \beta) = \exp(x(t)^\top \beta)$. This is called the multiplicative model or log-linear model.
When the covariates are time-invariant, their effect on a Poisson process has a simple interpretation. Indeed, conditionally on the covariates, the corresponding Poisson process is characterized by intensity $\lambda_0(t)\, g(x; \beta)$ and mean function $g(x; \beta) \int_0^t \lambda_0(u)\,du$. As a consequence, the mean and the rate functions for two individuals with covariates $x_1$ and $x_2$ are proportional, and $g(x_1; \beta) / g(x_2; \beta)$ is the constant of proportionality (in the multiplicative model the constant is $\exp((x_1 - x_2)^\top \beta)$). This property does not hold in general when the covariates are time-dependent.
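A small numerical illustration of the proportionality property under the log-linear choice of $g$ is sketched below. The covariate vectors, coefficients and baseline cumulative value are made up for the example and are not estimates from the thesis.

```python
import math


def relative_rate(x1, x2, beta):
    """exp((x1 - x2)' beta): ratio of the rate (and mean) functions under (1.13)."""
    return math.exp(sum((a - b) * c for a, b, c in zip(x1, x2, beta)))


def mean_function(mu0_t, x, beta):
    """Mean number of events g(x; beta) * mu0(t) for time-invariant covariates."""
    return mu0_t * math.exp(sum(a * c for a, c in zip(x, beta)))


# Hypothetical coefficients: a binary indicator and a standardized covariate.
beta = [math.log(2.0), -0.1]
x1, x2 = [1.0, 0.5], [0.0, 0.5]
print(relative_rate(x1, x2, beta))
print(mean_function(3.0, x1, beta) / mean_function(3.0, x2, beta))
```

Both printed values coincide: the ratio of the mean functions does not depend on $t$ or on the baseline, only on the covariate difference.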
Moreover, some generalizations of the multiplicative model (1.13) can be considered. A possible extension is to include, as covariates, components based on the prior events history (e.g. the number of events experienced before t or the time since the last event). Because of history-dependence, in this case the process is not a Poisson process anymore and it is called modulated Poisson process.
Another possible extension is to consider intensity functions of the form
\[
\lambda(t \mid x(t)) = \lambda_0(t) + g(x(t); \beta), \tag{1.14}
\]
where $g(x(t); \beta)$ has to be chosen such that $\lambda(t \mid x(t)) \ge 0$. This model is called additive. The last possible extension presented here is the time transform model, analogous to the accelerated failure time model in survival analysis:
\[
\lambda(t \mid x(t)) = \lambda_0\left(\int_{0}^{t} \exp(x(u)^\top \beta)\,du\right) \exp(x(t)^\top \beta). \tag{1.15}
\]
In this case $s = \exp(x(t)^\top \beta)$ can be considered as a transformed time scale.
1.4.2 Random effects
In some situations unobservable factors may create heterogeneity across different individuals that experience the same recurrent event process. In this case it is useful to introduce random effects in order to capture this feature in the model. Thus, the subject-specific intensity function for the $i$-th individual can be written as:
\[
\lambda_i(t \mid H(t), u_i, x, \beta) = u_i \lambda_0(t), \tag{1.16}
\]
where $u_i$ is called frailty and represents the unobservable individual-specific random effect.
Typically, for inference purposes, the random effects $u_i$ are modelled as independent and identically distributed random variables with Gamma density with mean equal to 1 and variance equal to $\phi$, with $\phi \ge 0$. This model is equivalent to stating that, conditionally on $u_i$, the stochastic process $\{N_i(t) : 0 \le t\}$, which represents the number of events
However, marginalizing the process over the random effects makes the process no longer Poisson. Indeed:
\[
E[N_i(t) \mid \lambda_0(t)] = \mu_0(t); \tag{1.17}
\]
\[
\mathrm{var}[N_i(t) \mid \lambda_0(t)] = \mu_0(t) + \mu_0(t)^2 \phi; \tag{1.18}
\]
\[
\mathrm{cov}[N_i(s_1, t_1), N_i(s_2, t_2) \mid \lambda_0(t)] = \phi\, \mu_0(s_1, t_1)\, \mu_0(s_2, t_2); \tag{1.19}
\]
where $\mu_0(s, t) = \int_s^t \lambda_0(u)\,du$, $\mu_0(t) = \mu_0(0, t)$ and $s_1 < t_1 < s_2 < t_2$.
Some properties of the Poisson process are therefore violated; for instance, the mean and the variance functions are not equal. Moreover, the counts in disjoint intervals are not statistically independent, since their covariance is different from zero. From equations (1.18) and (1.19) it is clear that the variance $\phi$ of the random effects quantifies both the heterogeneity across individuals (since the variance is an increasing function of it) and the dependence between counts in disjoint intervals.
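The overdispersion expressed by (1.17) and (1.18) can be checked by simulation. The sketch below is a Monte Carlo illustration under assumed values of $\mu_0(t)$ and $\phi$: it draws a Gamma frailty with mean 1 and variance $\phi$, then a Poisson count with mean $u_i \mu_0(t)$, and compares the sample moments with the formulas. The function names and the numerical values are hypothetical.

```python
import random
import statistics


def sample_count(mu0, phi, rng):
    """Draw N_i(t) given mu_0(t) = mu0 under a Gamma frailty with variance phi."""
    # gammavariate(shape, scale): shape 1/phi and scale phi give mean 1, variance phi
    u = rng.gammavariate(1.0 / phi, phi)
    # Poisson(u * mu0) draw: count unit-rate exponential arrivals in [0, u * mu0]
    n, s = 0, rng.expovariate(1.0)
    while s <= u * mu0:
        n += 1
        s += rng.expovariate(1.0)
    return n


def simulate_moments(mu0, phi, n_donors, rng):
    """Sample mean and variance of the counts over n_donors simulated individuals."""
    counts = [sample_count(mu0, phi, rng) for _ in range(n_donors)]
    return statistics.mean(counts), statistics.pvariance(counts)


mean, var = simulate_moments(3.0, 0.5, 20000, random.Random(42))
print(mean, var)  # theory: mean 3, variance 3 + 3**2 * 0.5 = 7.5
```

The sample variance is well above the sample mean, which is the departure from the Poisson case discussed above.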
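Marginalizing equation (1.16) over the random effect $u_i$ leads to:
\[
\lambda_i(t \mid H(t)) = \lambda_0(t)\, \frac{1 + \phi N_i(t^-)}{1 + \phi \mu_0(t)}, \tag{1.20}
\]
where $N_i(t^-) = \lim_{s \to t^-} N_i(s)$.
This can be done by writing
\[
P\big(N_i(t + \Delta t) - N_i(t) = 1 \mid H_i(t)\big) = \int_0^\infty P\big(N_i(t + \Delta t) - N_i(t) = 1 \mid H_i(t), u_i\big)\, \frac{P\big(H_i(t) \mid u_i\big)\, g(u_i \mid \phi)}{\int_0^\infty P\big(H_i(t) \mid u_i\big)\, g(u_i \mid \phi)\,du_i}\,du_i, \tag{1.21}
\]
then, for small $\Delta t$,
\[
P\big(N_i(t + \Delta t) - N_i(t) = 1 \mid H_i(t), u_i\big) = \lambda_0(t)\, u_i\, \Delta t, \tag{1.22}
\]
remembering that the density $g(u_i \mid \phi)$ is a Gamma with shape and rate parameters equal to $\phi^{-1}$ and that
\[
P\big(H_i(t) \mid u_i\big) = \left\{ \prod_{j=1}^{N_i(t^-)} u_i \lambda_0(t_{i,j}) \right\} \exp\left(-\int_0^t u_i \lambda_0(x)\,dx\right). \tag{1.23}
\]
Substituting (1.22) and (1.23) in (1.21) and simplifying,
\[
\frac{P\big(N_i(t + \Delta t) - N_i(t) = 1 \mid H_i(t)\big)}{\Delta t} = \lambda_0(t)\, \frac{\int_0^\infty u_i^{N_i(t^-) + \phi^{-1}} \exp\big(-u_i \big(\int_0^t \lambda_0(x)\,dx + \phi^{-1}\big)\big)\,du_i}{\int_0^\infty u_i^{N_i(t^-) + \phi^{-1} - 1} \exp\big(-u_i \big(\int_0^t \lambda_0(x)\,dx + \phi^{-1}\big)\big)\,du_i} \tag{1.24}
\]
\[
= \lambda_0(t)\, \frac{\Gamma(N_i(t^-) + \phi^{-1} + 1)}{\Gamma(N_i(t^-) + \phi^{-1})}\, \frac{\big(\phi^{-1} + \int_0^t \lambda_0(x)\,dx\big)^{N_i(t^-) + \phi^{-1}}}{\big(\phi^{-1} + \int_0^t \lambda_0(x)\,dx\big)^{N_i(t^-) + \phi^{-1} + 1}} = \lambda_0(t)\, \frac{1 + \phi N_i(t^-)}{1 + \phi \mu_0(t)}, \tag{1.25}
\]
which gives (1.20).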
Hence, if random effects are present in the model, the intensity depends on the number of events experienced by the individual.
The random effects approach and the multiplicative model including covariates can be combined.
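Equation (1.20) is easy to evaluate numerically. The sketch below, with hypothetical argument names, shows how the marginal intensity is scaled up for individuals who have experienced more events than the baseline mean $\mu_0(t)$ and scaled down for those who have experienced fewer.

```python
def marginal_intensity(lam0_t, mu0_t, n_past, phi):
    """Marginal intensity (1.20) given N_i(t-) = n_past past events.

    lam0_t and mu0_t are the baseline intensity and its cumulative value at t;
    phi is the variance of the Gamma frailty.
    """
    return lam0_t * (1.0 + phi * n_past) / (1.0 + phi * mu0_t)


# Illustrative values: baseline intensity 2 and cumulative baseline 3 at time t.
for n_past in (0, 3, 6):
    print(n_past, marginal_intensity(2.0, 3.0, n_past, 0.5))
```

When `n_past` equals `mu0_t` the baseline intensity is recovered, and with `phi = 0` the history has no effect, consistent with the Poisson case.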
1.5 Extensions to renewal and Poisson processes
1.5.1 "At risk" indicator function
Another feature that can be included in the model is the heterogeneity of the observation time of each individual. To do so we introduce the at-risk indicator function $Y_i(t)$, which is equal to 1 when the $i$-th individual is observed (and he or she is "at risk" of experiencing the event), and equal to 0 otherwise. For example, if an individual is observed in the interval $[\tau_{0i}, \tau_i]$, then $Y_i(t) = I(\tau_{0i} \le t \le \tau_i)$. The notation can also accommodate settings where individuals are observed over disjoint time intervals, for example if an individual is lost to follow-up for a certain period of time.
The right end $\tau_i$ of the observation window is typically called the censoring time, and it represents the termination of the study for the $i$-th individual.
It is now possible to define, respectively, the observed part of the counting process, the history and the intensity of the observable process:
\[
\bar{N}_i(t) := \int_0^t Y_i(u)\,dN_i(u); \tag{1.26}
\]
\[
\bar{H}_i(t) := \{\bar{N}_i(s), Y_i(s), 0 \le s < t\}; \tag{1.27}
\]
\[
\bar{\lambda}_i(t \mid \bar{H}_i(t)) := \lim_{\Delta t \to 0} \frac{P(\bar{N}_i(t + \Delta t) - \bar{N}_i(t) = 1 \mid \bar{H}_i(t))}{\Delta t}. \tag{1.28}
\]
In some cases information is incorporated from the history of the process into the intensity function. As a consequence, $\Delta N_i(t) := \lim_{\Delta t \to 0} N_i(t + \Delta t) - N_i(t)$ and $Y_i(t)$ are conditionally independent given the history, and so the intensity of the observable process is such that
\[
\bar{\lambda}_i(t \mid \bar{H}_i(t)) = \lambda_i(t \mid H_i(t))\, Y_i(t). \tag{1.29}
\]
Basically, the observable process has intensity 0 outside of the observation scheme. The likelihood (1.3) can now be expressed in terms of the observable process as:
\[
\exp\left(-\int_0^\infty \lambda(u \mid H(u))\, Y(u)\,du\right) \times \prod_{j=1}^{n} \lambda(t_j \mid H(t_j)), \tag{1.30}
\]
and it can be used to estimate $\lambda(t \mid H(t))$.
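The effect of the at-risk indicator on the likelihood (1.30) can be sketched as follows. For simplicity the example assumes a history-free rate function and observation windows given as a list of `(start, end)` pairs; both the helper names and the numbers are illustrative only.

```python
import math


def at_risk(t, windows):
    """Y(t): 1 if t lies in one of the observation windows, 0 otherwise."""
    return int(any(a <= t <= b for a, b in windows))


def observed_log_likelihood(rate, event_times, windows, n=2000):
    """Logarithm of (1.30) for a history-free rate function.

    The rate is integrated only where Y(u) = 1 (midpoint rule on each window),
    and only events falling inside an observation window contribute a factor.
    """
    integral = 0.0
    for a, b in windows:
        h = (b - a) / n
        integral += h * sum(rate(a + (k + 0.5) * h) for k in range(n))
    observed = [t for t in event_times if at_risk(t, windows)]
    return -integral + sum(math.log(rate(t)) for t in observed)


# Hypothetical donor observed on [0, 100] and [150, 300] days.
windows = [(0.0, 100.0), (150.0, 300.0)]
loglik = observed_log_likelihood(lambda t: 0.01, [30.0, 120.0, 200.0], windows)
print(loglik)
```

The event at day 120 falls in the unobserved gap, so it contributes neither to the product nor to the integral.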
1.5.2 General intensity-based model
In the previous sections the intensity functions of renewal processes and of counting processes have been analyzed. For renewal processes the intensity is a function of the time since the last event; this function is called the hazard function, in analogy with survival analysis. For counting processes the intensity is called the rate function. Both models can be extended with covariates and random effects, by multiplying the baseline intensity function by a function of a linear combination of the covariates and/or by a parameter called frailty, which represents the variability between individuals that is not captured by the observed features.
The two models can be combined in order to have dependence both on the recurrent event count and on the gap times. In this case the intensity can be written as:
\[
\lambda(t \mid H(t)) = \exp\Big(\alpha + \beta g_1(t) + \gamma\, I(N(t^-) > 0)\, g_2(t - T_{N(t^-)})\Big). \tag{1.31}
\]
The functions $g_1(t)$ and $g_2(t)$ express the dependence on calendar time and on the time since the last event, respectively. When the parameter $\gamma$ is equal to 0 the recurrent event process is a Poisson process, and when the parameter $\beta$ is equal to 0 the recurrent event process is a renewal process, since the intensity depends only on the waiting times. The intensity depends on the process itself, and so it is not always possible to have a well-defined analytical framework as in the Poisson process model. However, thanks to (1.5), it is possible to simulate the gap times and hence to have a Monte Carlo estimate of the law of $N(t)$, for any $t$.
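The simulation idea mentioned above can be sketched as follows. By (1.5), a gap ends when the cumulative intensity accumulated since the last event first exceeds a unit-rate exponential draw. The code below is a rough grid-based approximation of this inversion, not an exact sampler: the step `dt`, the functions `g1` and `g2` and all parameter values are assumptions made for illustration.

```python
import math
import random


def simulate_general(alpha, beta, gamma, g1, g2, tau, rng, dt=0.05):
    """Approximate event times on (0, tau] for the intensity (1.31).

    The cumulative intensity is accumulated on a grid of step dt (midpoint
    rule), so event times are resolved only up to dt.
    """
    times, last = [], None
    target, acc = rng.expovariate(1.0), 0.0   # exponential threshold, running integral
    for k in range(int(round(tau / dt))):
        mid = (k + 0.5) * dt
        # the gap-time term is active only after the first event, I(N(t-) > 0)
        gap_term = gamma * g2(mid - last) if last is not None else 0.0
        acc += math.exp(alpha + beta * g1(mid) + gap_term) * dt
        if acc >= target:                      # threshold crossed: record an event
            last = (k + 1) * dt
            times.append(last)
            target, acc = rng.expovariate(1.0), 0.0
    return times


rng = random.Random(1)
# With beta = gamma = 0 the process is homogeneous Poisson with rate exp(alpha).
counts = [len(simulate_general(math.log(0.5), 0.0, 0.0, lambda t: t, lambda w: w, 10.0, rng))
          for _ in range(1000)]
print(sum(counts) / len(counts))  # roughly 0.5 * 10 = 5, up to grid error
```

Repeating the simulation many times and tabulating `len(times)` gives a Monte Carlo estimate of the distribution of $N(\tau)$; a smaller `dt` reduces the discretization bias at the cost of more computation.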
1.5.3
Multi-state Markov models
There are at least two possible approaches to introduce the dependence of the recurrent event process on the number of events experienced until time $t$.
The first is to introduce a function of $N(t^-)$ in the covariates, while the second is to model the process as a multi-state Markov model. In this framework every individual is, at every time, in a particular state, which corresponds to the cumulative number of events experienced until that moment. The only possible transitions are from state $k$ to state $k + 1$, and every transition is associated with an intensity $\alpha_k(t)$, where
\[
\alpha_k(t) = \lim_{\Delta t \to 0} \frac{P\big( N(t) - N(t - \Delta t) = 1 \mid N(t - \Delta t) = k, H(t) \big)}{\Delta t}. \tag{1.32}
\]
Hence, the intensity of the process can be written as:
\[
\lambda(t \mid H(t)) = \sum_{k=0}^{\infty} \alpha_k(t)\, I(N(t) = k). \tag{1.33}
\]
In the case $\alpha_k(t) = \alpha(t)$ for every $k$ the model is the canonical Poisson process with $\alpha(t)$ as a rate function.
1.5.4
Modelling the baseline intensity function
Once covariates and frailties have been introduced in the model, an important issue is the choice of the baseline intensity function. This choice can be either parametric or non-parametric. The simplest parametric choice for the baseline intensity function is the constant intensity. This choice implies a homogeneous Poisson process, where gap times are distributed as Exponential random variables and the mean function is linear in time.
In some contexts the intensity function cannot be constant over time. This is the case either of diseases in which there is a significant infant mortality (decreasing intensity function) or of aging processes in which the events are more likely to happen once some time has passed (increasing intensity function). Then a possible extension is the Weibull model, where the gap times are independent random variables distributed with density:
\[
f(x \mid \lambda, \alpha) = \lambda \alpha x^{\alpha - 1} e^{-\lambda x^{\alpha}}\, I_{\{x \ge 0\}}(x). \tag{1.34}
\]
If $\alpha > 1$ then the intensity is increasing, if $\alpha = 1$ the intensity is constant, otherwise it is decreasing. Under this assumption:
and {N(t) : 0 ≤ t} is a Poisson process.
The baseline intensity function can assume a non-parametric form in the following way. Let us divide the observation time in $K$ disjoint intervals taking $a_0 = 0 < a_1 < \dots < a_K$ as cut-points. For each of the resulting sub-intervals $(a_{k-1}, a_k]$ let us assume that the intensity is constant and equal to $\lambda_k > 0$. Now the baseline intensity function is characterized by the vector of parameters $(\lambda_1, \dots, \lambda_K)$. This kind of model can approximate the shape of every type of intensity function, and the approximation improves as $K$ grows. However, a larger $K$ implies more parameters to estimate and thus a greater computational effort. In this case, including time-varying covariates $\{x_i(t) : 0 \le t\}$, the likelihood of the model can be expressed as the product of the contributions that every individual gives on each interval:
\[
\prod_{k=1}^{K} \left\{ \lambda_k^{n_{\cdot k}} \prod_{i=1}^{M} \left\{ \exp\left( \sum_{j=1}^{n_i} x_i(t_{ij})' \beta\, I_{(a_{k-1}, a_k]}(t_{ij}) - \lambda_k \int_{a_{k-1}}^{a_k} Y_i(s) \exp\big(x_i(s)' \beta\big)\, ds \right) \right\} \right\}, \tag{1.36}
\]
where $t_{ij}$ is the time of the $j$-th event experienced by the $i$-th individual, $M$ is the total number of individuals, and $n_{\cdot k} = \sum_{i=1}^{M} n_{ik}$, where $n_{ik}$ is the total number of events between $a_{k-1}$ and $a_k$ experienced by the $i$-th individual.
The cut-points can be chosen in different ways. In order to guarantee an estimate of every $\lambda_k$, the observation of at least one individual must fall into the corresponding interval; for this reason one possible choice is to set $a_k$ as the $k/K$ empirical quantile of the distribution of the event times. Another possible choice, simpler and independent of the observations, is to divide the observation time in $K$ equispaced intervals. Of course this modelling choice must be the object of a sensitivity analysis, both on the number of cut-points $K$ and on their position on the time domain. In the literature of survival analysis Gustafson et al. (2003) suggest the use of the quantiles, while Yin et al. (2006) and Sahu et al. (1997) propose the use of equispaced grids. In the field of recurrent event processes in a Bayesian setting, Pennell and Dunson (2006) use a tightly spaced grid and an auto-correlated prior in order to borrow strength between intervals.
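The quantile-based choice of the cut-points can be sketched as follows. The second function is only an illustration of the piecewise-constant model in the simplest case without covariates ($\beta = 0$), where maximising (1.36) with respect to $\lambda_k$ gives the number of events in the interval divided by the total time at risk in it. Function names, the assumption that every individual is observed from time 0 up to a known follow-up time, and the assumption of untied event times are choices made for this example.

```python
def quantile_cut_points(event_times, n_intervals):
    """Cut-points a_0 = 0 < a_1 < ... < a_K, where a_k is the k/K empirical
    quantile of the pooled event times (assumes no ties at the quantiles)."""
    ordered = sorted(event_times)
    cuts = [0.0]
    for k in range(1, n_intervals + 1):
        # smallest order statistic whose empirical CDF is at least k/K
        index = -(-k * len(ordered) // n_intervals) - 1
        cuts.append(ordered[index])
    return cuts

def piecewise_rates(events_by_subject, follow_up, cuts):
    """Without covariates, (1.36) is maximised by
    lambda_k = n_.k / (total time at risk in (a_{k-1}, a_k])."""
    rates = []
    for lower, upper in zip(cuts[:-1], cuts[1:]):
        n_events = sum(
            lower < t <= upper for times in events_by_subject for t in times
        )
        exposure = sum(max(0.0, min(end, upper) - lower) for end in follow_up)
        rates.append(n_events / exposure)
    return rates

# Two individuals followed for 4 time units, K = 2 intervals.
events = [[1.0, 3.0], [2.0, 4.0]]
cuts = quantile_cut_points([t for times in events for t in times], 2)
rates = piecewise_rates(events, follow_up=[4.0, 4.0], cuts=cuts)
```

By construction the last cut-point coincides with the largest observed event time, so every interval contains at least one event and every $\lambda_k$ can be estimated.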
If one imposes that $\lambda_1, \dots, \lambda_K$ are independent random variables distributed with proper Gamma distributions, the resulting cumulative intensity function $\mu(t) = \int_0^t \lambda(u)\, du$ is a realization of a Gamma process (see Kalbfleisch, 1978), which is a particular stochastic process built such that the increments are independent Gamma-distributed random variables, namely:
where $\phi(t)$ is an increasing function, and $c$ is a positive-valued parameter.
1.6
The Bayesian approach
1.6.1
Bayesian Statistics
In a statistical model, once a dataset $y = (y_1, \dots, y_n)$ is observed it is possible to associate a measure of beliefs through $p(y \mid \theta)$, which depends on a vector of parameters $\theta$. $p(y \mid \theta)$ is called likelihood, and it is a probability measure. The vector $\theta$ typically summarises the characteristics of the population from which the dataset $y$ is sampled. While in the frequentist framework $\theta$ is a fixed number, in Bayesian statistics it is a random variable, and a probability measure $\pi(\theta)$ is associated with each of its possible values. $\pi(\theta)$ is called prior probability. Hence the likelihood function $p(y \mid \theta)$ has to be interpreted as the probability associated with $y$ once $\theta$ is the true parameter vector. Summarising:
• $\pi(\theta)$ is a measure of beliefs that $\theta$ represents the true characteristics of the population;
• $p(y \mid \theta)$ is a measure of beliefs that $y$ would be sampled from the population if $\theta$ is the true parameter.
The Bayesian approach offers a way to update the prior beliefs about $\theta$ with the computation of the posterior distribution $\pi(\theta \mid y)$, which is a function that summarises the beliefs about $\theta$ once $y$ is observed. This is done by using Bayes' Theorem:
\[
\pi(\theta \mid y) = \frac{p(y \mid \theta)\, \pi(\theta)}{\int p(y \mid \theta)\, \pi(\theta)\, d\theta}, \tag{1.38}
\]
where the integral is over the whole support of $\theta$.
Once this function is known it is possible to compute all the summaries of the posterior distribution, such as the posterior mean $E[\theta \mid y]$ and the posterior variance $Var[\theta \mid y]$, or to build an interval estimate $C$ such that $P(\theta \in C \mid y) = 1 - \alpha$.
The Bayesian method offers a typical scientific approach where some hypotheses on a phenomenon (summarised in $\pi(\theta)$) are validated by the collection of data $y$, yielding a new point of view, namely the posterior distribution $\pi(\theta \mid y)$.
In this thesis the Bayesian approach will be followed. In Chapter 3 the statistical model is set up with the likelihood of the data and the prior elicitation, while in Chapter 4 the posterior inference is shown and commented.
1.6.2
Monte Carlo Markov Chains
Equation (1.38) is usually an intractable expression, hence the inference is carried out by simulating a sample from the posterior distribution. Monte Carlo Markov Chains (MCMC) methods offer a way to do so.
MCMC is a class of algorithms in which a Markov chain whose stationary distribution is the posterior distribution is simulated. This means that, if the simulation runs for long enough, every step of the Markov chain can be considered as a draw from the posterior distribution. An MCMC algorithm generates a Markov chain $\theta^{(1)}, \dots, \theta^{(T)}$, where $\theta^{(t)}$ is independent of $\theta^{(1)}, \dots, \theta^{(t-2)}$ conditionally on $\theta^{(t-1)}$. Then, under general conditions, if $T \to \infty$ and if $h(\theta)$ is a measurable function:
\[
\frac{1}{T} \sum_{t=1}^{T} h(\theta^{(t)}) \to \int h(\theta)\, \pi(\theta \mid y)\, d\theta = E[h(\theta) \mid y]. \tag{1.39}
\]
Hence all the summaries of the posterior distribution can be approximated by averaging over the MCMC sample.
The MCMC algorithm used in this thesis is the Hamiltonian Monte Carlo (HMC), which is efficiently implemented in a software called Stan (see Stan Development Team and others, 2016). Stan is an open source software written in C++ that can be integrated with the software R with the package rstan.
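To make the ergodic average (1.39) concrete, the sketch below runs a random-walk Metropolis sampler, which is a much simpler MCMC algorithm than the HMC implemented in Stan and is used here only for illustration. The toy model (Poisson counts with a Gamma prior on the rate), the data, the tuning constants and the function names are all assumptions made for this example; the model is chosen because its posterior is known in closed form, so the MCMC average can be checked.

```python
import math
import random

def log_posterior(theta, counts, shape, rate):
    """Unnormalised log posterior for Poisson counts with a
    Gamma(shape, rate) prior on the Poisson rate theta."""
    if theta <= 0:
        return -math.inf
    return (shape - 1 + sum(counts)) * math.log(theta) - (rate + len(counts)) * theta

def metropolis(counts, shape, rate, n_draws=20000, step=0.5, seed=0):
    """Random-walk Metropolis chain whose stationary law is the posterior."""
    rng = random.Random(seed)
    theta = 1.0
    current = log_posterior(theta, counts, shape, rate)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, counts, shape, rate)
        # accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws

counts = [2, 3, 1, 4, 2]
draws = metropolis(counts, shape=2.0, rate=1.0)
kept = draws[2000:]                      # discard an initial burn-in
posterior_mean = sum(kept) / len(kept)   # ergodic average, h(theta) = theta
```

Here the exact posterior is a Gamma distribution with shape $2 + 12 = 14$ and rate $1 + 5 = 6$, so `posterior_mean` should be close to $14/6$.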
1.6.3
Discretization of the Gamma process prior
A possible implementation of a non-parametric intensity model is the Gamma process. In Johnson et al. (2010), Section 13.2.5, a discretization of the Gamma process prior in the survival setting is given, and this can be extended to the framework of recurrent events. The model is the following.
First of all, a partition of the time domain must be given. Let us call it $a_0 := 0, \dots, a_K$. Then the idea is to center the intensity function on a certain value $\lambda^*$, which corresponds to the intensity function of an Exponential random variable of parameter $\lambda^*$. As a consequence, all the pieces of the intensity function must satisfy the equation:
\[
E[\lambda_k \mid \lambda^*] = \lambda^*. \tag{1.40}
\]
A further requirement is that the prior variance of each $\lambda_k$ is inversely proportional to the length of the corresponding interval $a_k - a_{k-1}$ and to another parameter called $w$,
defined, the last condition to impose is that the increments are Gamma-distributed, and this is equivalent to:
\[
\lambda_k \mid w, \lambda^* \overset{ind}{\sim} \mathrm{Gamma}\big( \lambda^* w (a_k - a_{k-1}),\; w (a_k - a_{k-1}) \big), \quad k = 1, \dots, K. \tag{1.41}
\]
The parameters $\lambda^*$ and $w$ can be fixed or they can be modelled with a prior distribution. Another important reference for Bayesian modelling of recurrent events is Ouyang et al. (2013), which also discusses the case where the termination of the observation of the recurrent event process depends on the process itself. In their work Ouyang et al. (2013) propose to model the steps of the intensity function as a priori independent and identically distributed, which is the approach used in this thesis.
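The two requirements on the prior (mean equal to $\lambda^*$ and variance inversely proportional to $w$ and to the interval length) can be checked by simulation. The sketch below draws the steps of the baseline intensity from Gamma distributions with shape $\lambda^* w\, \ell_k$ and rate $w\, \ell_k$, where $\ell_k$ is the length of the $k$-th interval of the partition; the function name and the numerical values are assumptions made for this example.

```python
import random
import statistics

def sample_baseline(cuts, lambda_star, w, rng):
    """One prior draw of (lambda_1, ..., lambda_K). Each step is Gamma with
    shape lambda* * w * length and rate w * length, so that its mean is
    lambda* and its variance is lambda* / (w * length).
    random.gammavariate takes the shape and the SCALE (= 1 / rate)."""
    draws = []
    for lower, upper in zip(cuts[:-1], cuts[1:]):
        rate = w * (upper - lower)
        draws.append(rng.gammavariate(lambda_star * rate, 1.0 / rate))
    return draws

rng = random.Random(0)
cuts = [0.0, 0.5, 2.0]           # interval lengths 0.5 and 1.5
samples = [sample_baseline(cuts, lambda_star=2.0, w=1.0, rng=rng) for _ in range(20000)]
first = [s[0] for s in samples]
second = [s[1] for s in samples]
mean_first = statistics.fmean(first)        # close to lambda* = 2
var_first = statistics.variance(first)      # close to 2 / 0.5 = 4
var_second = statistics.variance(second)    # close to 2 / 1.5
```

The shorter interval has the larger prior variance, which is the behaviour described above.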
1.6.4
Autocorrelated prior for the baseline intensity function
In Pennell and Dunson (2006) the prior structure of $\lambda_1, \dots, \lambda_K$ is built to have correlations among the parameters. Every step of the intensity function is written as
\[
\lambda_k = \hat{\lambda}_k \Delta_k, \tag{1.42}
\]
where $\hat{\lambda}_k$ is the initial guess on the baseline intensity in that interval and $\Delta_k$ is a multiplicative effect. The multiplicative effects are modelled in the following way:
\[
\Delta_j = \nu_0 \prod_{h=1}^{j} \nu_h, \quad j = 1, \dots, K \tag{1.43}
\]
\[
\nu_0 \sim \mathrm{Gamma}(\phi, \phi) \tag{1.44}
\]
\[
\nu_j \overset{i.i.d.}{\sim} \mathrm{Gamma}(\psi, \psi), \quad j = 1, \dots, K. \tag{1.45}
\]
It can be noticed that $\Delta_j = \Delta_{j-1} \nu_j$, and so a covariance structure is induced in the multiplicative effects. Moreover $\phi$ controls the degree of shrinkage of the posterior towards the initial guess on the baseline, and $\psi$ regulates the smoothness in the deviations from the prior estimate.
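A prior draw of the multiplicative effects can be sketched as follows, assuming that $\mathrm{Gamma}(a, b)$ denotes shape $a$ and rate $b$ (so that $\mathrm{Gamma}(a, a)$ has mean 1 and variance $1/a$), consistently with the moments reported in this section. The function name and the numerical values are assumptions made for this example.

```python
import random

def sample_multiplicative_effects(n_intervals, phi, psi, rng):
    """One prior draw of (Delta_1, ..., Delta_K) built as a cumulative
    product: Delta_j = nu_0 * nu_1 * ... * nu_j."""
    delta = rng.gammavariate(phi, 1.0 / phi)       # nu_0, mean 1
    effects = []
    for _ in range(n_intervals):
        delta *= rng.gammavariate(psi, 1.0 / psi)  # multiply by nu_j, mean 1
        effects.append(delta)
    return effects

rng = random.Random(0)
paths = [sample_multiplicative_effects(3, phi=20.0, psi=20.0, rng=rng) for _ in range(20000)]
mean_last = sum(path[-1] for path in paths) / len(paths)   # close to 1
```

Since all the factors have prior mean 1, every $\Delta_j$ has prior mean 1; larger values of $\psi$ make consecutive effects closer to each other.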
Another autocorrelated prior is proposed in Arjas and Gasbarra (1994). In this case
\[
\lambda_1 \sim \mathrm{Gamma}(\alpha_1, \beta_1) \tag{1.46}
\]
\[
\lambda_k \mid \lambda_{k-1}, \dots, \lambda_1 \sim \mathrm{Gamma}\left( \alpha, \frac{\alpha}{\lambda_{k-1}} \right), \quad k = 2, \dots, K, \tag{1.47}
\]
with $\alpha_1$ and $\beta_1$ that have to be chosen in order to model the value at time $t = 0$ of variance of the parameters $\lambda_k$. In fact, from the following equations
\[
E[\lambda_k \mid \lambda_{k-1}, \dots, \lambda_1] = \lambda_{k-1} \tag{1.48}
\]
\[
\sqrt{Var[\lambda_k \mid \lambda_{k-1}, \dots, \lambda_1]} = \frac{\lambda_{k-1}}{\sqrt{\alpha}}, \tag{1.49}
\]
it can be noticed that, if $\alpha$ is very small, high deviations from the mean are allowed. In the limiting case of $\alpha \to \infty$ the baseline intensity function is a priori constant. Equation (1.48) is equivalent to assuming that the baseline intensity function has a martingale structure with respect to the prior distribution and the internal filtration.
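The martingale property (1.48) implies that the prior mean of every $\lambda_k$ equals the prior mean of $\lambda_1$, namely $\alpha_1 / \beta_1$. This can be verified with a short simulation; the function name and the numerical values are assumptions made for this example, and the Gamma distribution is parametrised by shape and rate as in (1.47).

```python
import random

def sample_martingale_baseline(n_intervals, alpha_1, beta_1, alpha, rng):
    """One prior draw of (lambda_1, ..., lambda_K) from (1.46)-(1.47).
    random.gammavariate takes the shape and the SCALE (= 1 / rate)."""
    current = rng.gammavariate(alpha_1, 1.0 / beta_1)
    path = [current]
    for _ in range(n_intervals - 1):
        # rate alpha / lambda_{k-1}  <=>  scale lambda_{k-1} / alpha
        current = rng.gammavariate(alpha, current / alpha)
        path.append(current)
    return path

rng = random.Random(0)
paths = [sample_martingale_baseline(5, alpha_1=4.0, beta_1=2.0, alpha=50.0, rng=rng)
         for _ in range(20000)]
mean_first = sum(path[0] for path in paths) / len(paths)   # close to 4 / 2 = 2
mean_last = sum(path[-1] for path in paths) / len(paths)   # also close to 2
```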
1.7
Model evaluation in terms of predictive performances
The fitting of a statistical model is often followed by its evaluation in terms of predictive accuracy. The idea is to obtain an unbiased and accurate measure of the out-of-sample predictive error. This issue has been tackled also in Bayesian statistics, for example in Gelman et al. (2014) and Vehtari et al. (2017).
The most natural way to estimate the predictive error is through cross-validation; however, it requires multiple fits of the model and, especially in the Bayesian setting, this can be a problem because of the computational burden of MCMC methods. Alternative methods aim to estimate the out-of-sample predictive error from the observed data, using a correction for the bias that arises from evaluating the model's predictions on the data used to fit it. Some of these measures are the Akaike Information Criterion (AIC), the Deviance Information Criterion (DIC), and the Watanabe–Akaike information criterion (WAIC), which is a fully Bayesian method.
1.7.1
Log posterior predictive density
Consider data $y_1, \dots, y_n$ modeled as observations of independent random variables given parameter $\theta$. The contribution of the single data point $y_i$ to the likelihood of the model is $p(y_i \mid \theta)$, while the total likelihood is $p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$. The notation can be generalized to the case with covariates by substituting $p(y_i \mid \theta)$ with $p(y_i \mid \theta, x_i)$. If a new data point $\tilde{y}$ is produced by the true data generating process, the out-of-sample predictive fit for this datum can be computed as:
\[
\log p(\tilde{y} \mid y_1, \dots, y_n) = \log E\big[ p(\tilde{y} \mid \theta) \mid y_1, \dots, y_n \big] = \log \int p(\tilde{y} \mid \theta)\, p(\theta \mid y_1, \dots, y_n)\, d\theta. \tag{1.50}
\]
This quantity can be estimated by:
\[
\mathrm{lppd} = \log \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_n) = \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p(\theta \mid y_1, \dots, y_n)\, d\theta, \tag{1.51}
\]
where lppd stands for log pointwise predictive density. Equation (1.51) is a biased estimate of (1.50), since the out-of-sample predictive fit is evaluated at the observed data points themselves: the observation $y_i$ appears both in the likelihood $p(y_i \mid \theta)$ and in $p(\theta \mid y_1, \dots, y_n)$, which is the posterior distribution of $\theta$.
To compute (1.51), it is possible to evaluate the expectation using draws from the posterior distribution of the parameters $p(\theta \mid y_1, \dots, y_n)$, which are indicated as $\theta^{(s)}$, $s = 1, \dots, S$:
\[
\text{computed lppd} = \sum_{i=1}^{n} \log\left( \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta^{(s)}) \right). \tag{1.52}
\]
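Expression (1.52) is usually evaluated from the pointwise log-likelihood values $\log p(y_i \mid \theta^{(s)})$, since the likelihood values themselves can underflow. The sketch below assumes that these values are stored in a list of lists `log_lik`, with one row per posterior draw and one column per data point; this layout and the function names are assumptions made for this example.

```python
import math

def log_mean_exp(values):
    """Logarithm of the average of exp(values), computed after subtracting
    the maximum to avoid numerical underflow."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def computed_lppd(log_lik):
    """Expression (1.52), with log_lik[s][i] = log p(y_i | theta^(s))."""
    n_points = len(log_lik[0])
    return sum(log_mean_exp([row[i] for row in log_lik]) for i in range(n_points))

# Two posterior draws, two data points: likelihood values (0.2, 0.5) and (0.4, 0.5).
log_lik = [
    [math.log(0.2), math.log(0.5)],
    [math.log(0.4), math.log(0.5)],
]
lppd = computed_lppd(log_lik)   # log(0.3) + log(0.5)
```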
1.7.2
Computation of WAIC
WAIC (introduced by Watanabe in 2010) estimates the out-of-sample predictive measure by computing expression (1.52) and then adding a bias correction. The expected log pointwise predictive density is then computed as:
\[
\mathrm{elppd}_{WAIC} = \mathrm{lppd} - p_{WAIC}, \tag{1.53}
\]
where $p_{WAIC}$ is the adjustment, which can be computed in two ways:
• $p_{WAIC1} = 2 \sum_{i=1}^{n} \big( \log E[p(y_i \mid \theta) \mid y_1, \dots, y_n] - E[\log p(y_i \mid \theta) \mid y_1, \dots, y_n] \big)$;
• $p_{WAIC2} = \sum_{i=1}^{n} Var[\log p(y_i \mid \theta) \mid y_1, \dots, y_n]$.
Both measures can be approximated once an MCMC sample is available.
Gelman et al. (2014) recommend $p_{WAIC2}$ because, in its series expansion, equation (1.53) resembles leave-one-out cross validation.
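Both adjustments can be approximated from the same matrix of pointwise log-likelihood values used for (1.52), replacing posterior expectations and variances with their sample counterparts over the MCMC draws. In the sketch below the input layout (`log_lik[s][i]`, one row per draw), the function name and the keys of the returned dictionary are assumptions made for this example; at least two draws are needed for the sample variance.

```python
import math
import statistics

def waic(log_lik):
    """lppd, both p_WAIC adjustments and elppd (1.53) based on p_WAIC2,
    with log_lik[s][i] = log p(y_i | theta^(s))."""
    n_points = len(log_lik[0])
    lppd = p_waic1 = p_waic2 = 0.0
    for i in range(n_points):
        column = [row[i] for row in log_lik]
        top = max(column)
        # log of the posterior mean of p(y_i | theta), computed stably
        log_mean = top + math.log(sum(math.exp(v - top) for v in column) / len(column))
        lppd += log_mean
        p_waic1 += 2.0 * (log_mean - statistics.fmean(column))
        p_waic2 += statistics.variance(column)   # sample variance over draws
    return {
        "lppd": lppd,
        "p_waic1": p_waic1,
        "p_waic2": p_waic2,
        "elppd_waic": lppd - p_waic2,
    }

result = waic([[math.log(0.2)], [math.log(0.4)]])
```

In this toy input with two draws and one data point, `p_waic2` is the sample variance of $\log 0.2$ and $\log 0.4$, that is $(\log 2)^2 / 2$.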
1.7.3
Evaluating predictive accuracy in the case of recurrent events
All the formulas in the previous section rely on the division of the data in some partition (the yi’s with which it is possible to compute the probability p( yi|θ)).
In the case of recurrent event process one possibility is to consider the whole process of events for every individual i in the study. Hence p( yi|θ) is:
\[
\exp\left( -\int_0^{\infty} \lambda_i(u \mid H_i(u), \theta)\, Y_i(u)\, du \right) \times \prod_{j=1}^{n_i} \lambda_i(t_{ij} \mid H_i(t_{ij}), \theta). \tag{1.54}
\]
In the case of a multiplicative model, with random effects and with the presence of covariates,
\[
\lambda_i(t \mid H_i(t), \theta) = w_{new}\, \lambda_0(t \mid H(t), \theta) \exp(x_i' \beta). \tag{1.55}
\]
Since the main interest lies in predictive accuracy, $w_{new}$ is not the random effect of the individual $i$ (which is estimated in the model), but the frailty of a new incoming individual given the observations.
Data source
2
In this chapter details on the dataset that has been analyzed are given. The first section presents the AVIS association, from its history to the rules that regulate blood donations. All the information given is taken from the websites of AVIS and AVIS Milan.
A thorough description of the AVIS and EMONET databases (the data sources) follows.
2.1
The AVIS association
2.1.1
Brief history of AVIS
The Associazione Volontari Italiani Sangue (AVIS) was born in Milan in 1927 thanks to the physician Vittorio Formentano, who made an appeal on a daily newspaper of the time to form a group of donor volunteers. Seventeen persons answered the call and formed the first AVIS group of the history.
However, the official formation of the association dates to 1929; transfusion therapies started to be accessible to everybody, and not only to wealthy people. At the same time the memorandum of the association was approved. A passage of the memorandum can be translated as follows: "The finality of the Association is to promote, especially in the working class, the humanitarian, social and patriotic concept of the voluntary offering of their own blood." In this period blood donor associations were born in other cities such as Ancona, Bergamo, Brescia, Torino, Napoli, Cagliari and Cremona.
With the purpose to coordinate the local groups spread in Italy, in 1946 the Association assumed a national form, with Milan as headquarter.
In 1950 the Republic of Italy gave legal recognition to AVIS with Law n. 49; in 1967 Law n. 592 recognized the civic and social role of AVIS in the organization and promotion in matter of transfusion, while in 1990 another law established the principle
of the gratuity of blood donation. Furthermore, it is stated that the voluntary blood donor associations and the related federations contribute to the institutional aims of the National Health Service concerning the promotion and development of blood donations and the protection of donors.
The activity of the association became more and more popular: in 2005 AVIS reached the goal of one million donors, and in 2009, for the first time since the foundation, more than two million donations took place in Italy.
In 2017 AVIS had its 90th birthday; through its long life it has become one of the most important voluntary associations in Italy.
2.1.2
Italian donation rules
Because of the importance of blood in healthcare, there are some rules that regulate the mechanism of blood donations. These rules are meant to protect both the health of the patient who will receive the blood and the health of the donor himself/herself. A legislative act called "Disposizioni relative ai requisiti di qualità e sicurezza del sangue e degli emocomponenti" (see Ministero Della Salute, 2015) collects all these rules.
Any candidate donor must be between 18 and 60 years old. However, the responsible physician can allow a candidate donor older than 60 to donate for the first time. The age limit is raised to 65 for periodic donors; even in this case the physician can allow a person to donate until the age of 70 after a clinical evaluation of the age-related risks. Every donor must weigh more than 50 kg, and blood pressure, heart rate and hemoglobin level must lie within certain ranges. The yearly maximum number of donations is 4 for men and for women in menopause, and 2 for the other women. By law, the minimum gap time between two consecutive donations is 90 days. In order to respect the restriction on the yearly maximum number of donations, for women the minimum gap time is set to 180 days, but this is an internal rule of the association, not a legal limit. However, the responsible physician can bring the donation forward if he or she thinks that the health and wellness of the donor are not in danger. The donor can be suspended from the activity, for a certain time or forever, if the donation can in some way compromise his/her own health status or the quality of the donated component. Suspensions are not exceptional events; for example, journeys to exotic countries, dental care, a change of partner or a recent flu are some causes of temporary suspension. Of course, the length of the suspension is related to the severity of the cause.
2.2
Data sources
The data of Milan's AVIS section come from two databases: the EMONET database and the AVIS database. Data used in this work have been collected from multiple tables of the two databases. The EMONET database is made of tables concerning donations or personal data of the donors; the AVIS database contains information about suspensions and donors' habits. All the data have been extracted using SQL queries on the AVIS servers, and have been joined on the unique ID of the donor and/or on the unique ID of the blood donation. The dataset has been built only with the tables going from 2.1 to 2.7. In the next subsections some tables describe the two databases.
2.2.1
The EMONET database
We have considered five tables in the EMONET database:
• tables PRESENTAZIONI and DONAZIONI contain some information about the donations (see Tables 2.1 and 2.2);
• tables TIPIZZAZIONE and ANAGRAFICHE contain information about the donors (see Tables 2.5 and 2.3);
• table EMC_DONABILI records the blood components that could be donated (see Table 2.4).
Column Type Description
CAI numerical donor unique id
DTPRES date-time date and time
IDPRES numerical donation unique id
TIPO_ATTIVITA categorical (D for donation, C for control)
ID_PUNTPREL numerical AVIS location unique id
Table 2.1: Variables from table PRESENTAZIONI in EMONET database that are included in our dataset
Column Type Description
CAI numerical donor unique id
DTPRES date-time date and time
IDPRES numerical donation unique id
ID_EMCDON categorical blood component unique id: 1 for whole blood, 2 for plasma, ...
VALIDITA categorical V if the donation was effective, N otherwise
Table 2.2: Variables from table DONAZIONI in EMONET database that are included in our dataset
Column Type Description
CAI numerical donor unique id
SESSO numerical donor gender (1 for man, 2 for woman)
DATANASCITA date donor’s birthday
CAP_DOMIC categorical donor's domicile postal code
CAP_RESID categorical donor's residence postal code
Table 2.3: Variables from table ANAGRAFICHE in EMONET database that are included in our dataset
Column Type Description
ID_EMCDON categorical blood component unique id: 1 for whole blood, 2 for plasma, ...
INTERVALLO numerical minimum gap time between two donations of the component
DESCR character description of the blood component
NDONMAXMAS numerical maximum number of donations in a year for men
NDONMAXFEM numerical maximum number of donations in a year for women
Table 2.4: Variables from table EMC_DONABILI in EMONET database that are included in our dataset
Column Type Description
CAI numerical donor unique id
AB0 numerical blood type (A, A1, A2, A3, B, AB, A1B, A2B, 0)
TIPO_RH categorical Rhesus factor (POS or NEG)
Table 2.5: Variables from table TIPIZZAZIONE in EMONET database that are included in our dataset
2.2.2
The AVIS database
In AVIS database two tables have been considered. Table STILIVITA registers some information about the lifestyle of the donors (Table 2.6), while all the suspensions have
been recorded in table SOSPENSIONI (Table 2.7).
Column Type Description
CAI numerical donor unique id
FUMO categorical smoking habits
ALCOOL categorical drinking habits
THE categorical tea consumption
CAFFE categorical coffee consumption
DIETA categorical diet type
STRESS categorical stress level
ATTIVITAFISICA categorical physical activity habits
CIRCONFERENZAVITA numerical abdominal circumference
ALTEZZA numerical height
PESO numerical weight
BMI numerical Body Mass Index
Table 2.6: Variables from table STILIVITA in AVIS database that are included in our dataset
Column Type Description
CAI numerical donor unique id
TIPO_SOSP categorical T for temporary, D for definitive
DATAINSERIMENTO date-time suspension starting date
DATARIAMMISSIONE date-time suspension ending date
Table 2.7: Variables from table SOSPENSIONI in AVIS database that are included in our dataset
2.2.3
Data selection
For this work the whole period going from the 1st of January 2010 to the 30th of June 2018 has been considered as observation time. The focus of the analysis is on donations of whole blood performed in the main building of AVIS Milano, which is located in the district of Lambrate. We have considered only "new" donors, namely people who have become donors in this period, discarding all the others. For every donor there is an observation interval that has its origin in his/her first whole blood donation and its end on the 30th of June 2018, which is considered as a censoring time. According to these selection criteria the dataset is composed of 9175 donors; the observation intervals generally differ in length from donor to donor, and so does the number of donations.
2.2.4
Suspensions
The donor could be suspended from his/her activity for a certain period of time if his/her wellness or the quality of the blood component are in danger. These facts are registered and the suspensions are collected in the databases of the Association (see Table 2.7).
In this period 805 suspensions related to 618 donors are registered. However, many of these suspensions overlap; this may happen when, after a further control, the suspension is extended because the reasons that prevent the person from donating remain. For each suspension the beginning and the end times are recorded, and a categorical variable named TIPO_SOSP indicates whether it is a definitive or a temporary suspension. Among these, there are 421 temporary suspensions for 348 donors without an end date, hence it is difficult to relate the suspension to its effect on the individuals' donations. The remaining 384 suspensions are not respected in 92 cases, which is about 24% of the times; in these cases a blood donation is performed during the suspension.
                Definitive   Temporary
NOT RESPECTED   5            87
RESPECTED       42           250
Table 2.8: Frequency table that relates the type of suspensions to the respect of the suspensions
The fact that not all the suspensions are respected does not mean that there is a lack of control of the Association on this issue: the responsible physician can decide to bring forward the end of the suspension, and this is probably what happened in these cases. A possible solution to this issue could be to take as the real end of the suspension the minimum between the time of the next donation and the registered end of the suspension. However, the temporary suspensions without an end time remain an issue, because with the above solution an individual who chooses not to donate for a long time could be confused with an individual for whom the donation is precluded.
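The rule proposed above for the real end of a suspension can be sketched as follows. The function name and the representation of a missing end date as `None` are assumptions made for this example; when the end date is missing the rule returns the date of the next donation, which is exactly the case that remains ambiguous.

```python
from datetime import date

def effective_suspension_end(start, registered_end, donation_dates):
    """Minimum between the registered end of the suspension and the first
    donation after its start; registered_end is None when no end date is
    recorded, and the result is None when neither date is available."""
    later = [d for d in donation_dates if d > start]
    next_donation = min(later) if later else None
    candidates = [d for d in (registered_end, next_donation) if d is not None]
    return min(candidates) if candidates else None

donations = [date(2014, 12, 1), date(2015, 2, 20), date(2015, 6, 1)]
end = effective_suspension_end(date(2015, 1, 10), date(2015, 3, 10), donations)
```

In this example the donor gives blood on 20 February 2015, before the registered end of the suspension, so the suspension is considered to end on that date.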
Other available data concern donations of other blood components. The period of rest after each of these donations can be regarded as a suspension from the donations of whole blood, and this information can be included in the analysis. From Tables 2.4, 2.1, 2.2 it is possible to obtain the starting and the end date of the period of inactivity due to donations of blood components different from whole blood. Hence the data about suspensions are completed with 727 observations related to 267 donors. Only 5 of these are not respected.
2.3
Features selection and data transformation
Feature          Levels   Description                              Missing
SESSO            2        Gender of the donor (1 Male, 0 Female)   8
FUMO             15       Daily number of cigarettes               315
ALCOOL           6        Daily weight of alcohol consumed         315
THE              7        Daily number of cups                     2843
CAFFE            7        Daily number of cups                     2232
DIETA            7        Kind of diet                             315
ATTIVITAFISICA   15       Sport level                              315
PESO             -        Weight in kg                             315
ALTEZZA          -        Height in cm                             315
STRESS           5        From absent to stressed                  315
AB0              9        Blood type                               0
RH               2        Positive or negative                     0
CAP_DOMIC        1482     Postal code                              51
Table 2.9: Description of the features
There are many features in the databases that can be used as covariates in a statistical model (see Sections 2.2.1 and 2.2.2). Let us focus on the ones described in Table 2.9. There are missing values for some donors. When the number of missing values was notable (as in the variables THE and CAFFE, namely the daily number of cups of tea and coffee, see Table 2.9) the whole feature has been discarded (column-wise deletion), while for all the other features just the corresponding individual has been discarded (row-wise deletion). Most of the features are categorical variables with many levels. In order to make them suitable for a statistical model they have been transformed into binary dummy variables.
• The variable FUMO takes the value 1 if the donor is a smoker, 0 if he or she is not;
• the variable ALCOOL takes value 0 if the donor declares not to consume alcoholic beverages, 1 otherwise;
• the variable ATTIVITAFISICA takes value 0 if the donor declares to have a sedentary lifestyle, or if he/she considers his/her level of physical activity low or irregular;
• the blood type is transformed into a 4-level dummy variable (A, B, 0, AB). For
The variables DIETA and STRESS do not seem to be useful for the analysis, since almost all the donors declare to have a balanced diet and an absent level of stress.
Numerical features are also present:
• AGE is the age of the donor when he/she donates for the first time in his/her life;
• with the variables PESO (weight) and ALTEZZA (height) the Body Mass Index (BMI) has been computed as:
\[
BMI = \frac{Weight}{Height^2}
\]
where the weight is expressed in Kg and the height in meters.
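Two of the transformations described above can be sketched as follows. The conversion from centimetres to meters reflects the units listed in Table 2.9; the mapping of the blood subtypes onto the four groups (A1, A2, A3 to A and A1B, A2B to AB) is an assumption made for this example, as are the function names and the output encoding.

```python
def body_mass_index(weight_kg, height_cm):
    """BMI = weight / height^2, with the height converted from cm to meters."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

# Assumed mapping from the 9 recorded levels of AB0 to the 4 main groups.
BLOOD_GROUP = {
    "A": "A", "A1": "A", "A2": "A", "A3": "A",
    "B": "B",
    "AB": "AB", "A1B": "AB", "A2B": "AB",
    "0": "0",
}

def blood_group_dummies(ab0):
    """Indicator variables for the four groups A, B, 0, AB."""
    group = BLOOD_GROUP[ab0]
    return {level: int(group == level) for level in ("A", "B", "0", "AB")}

bmi = body_mass_index(70.0, 175.0)
dummies = blood_group_dummies("A2B")
```

In a regression model one of the four indicators would normally be dropped and used as the reference level.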
2.4
Descriptive analysis
2.4.1
Rate of donations and gap times
At the end of the data selection procedure 9175 donors were registered in the dataset. Together, they made 34864 donations of whole blood in the period that goes from the 1st of January 2010 to the 30th of June 2018.
As can be noticed in Table 2.10, about 35% of them entered the study without making any further donation. Since the goal of the proposed models is to describe donations as recurrent events, these individuals are excluded from the analysis. The others will be called "recurrent donors".
The total number of donations of a donor does not give all the information about how much a person donates in a certain time period. This number must be related to the time in which each individual is observed, for example by dividing it by the years of observation. The empirical rates of donation have been computed (only for recurrent donors) and are shown in Figure 2.1. Notice that the empirical distribution of the yearly rate of donation is right-skewed: most of the donors made fewer than two donations per year.
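The empirical yearly rate of a single donor can be sketched as follows. The censoring date is the one stated in Section 2.2.3; the function name, the use of 365.25 days per year and the convention of counting only the donations after the first one (as in Table 2.10) are assumptions made for this example.

```python
from datetime import date

CENSORING_DATE = date(2018, 6, 30)

def empirical_yearly_rate(first_donation, n_later_donations, censoring=CENSORING_DATE):
    """Number of donations after the first one divided by the years of
    observation, from the first donation to the censoring time."""
    years_observed = (censoring - first_donation).days / 365.25
    return n_later_donations / years_observed

rate = empirical_yearly_rate(date(2016, 6, 30), 4)
```

A donor whose first donation falls on the censoring date has no observation time, so the rate is undefined for him/her; such donors have no later donations and are in any case excluded as non-recurrent.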
Total donations (n)   Donors   Sample frequency
0                     3238     0.3529
1                     1608     0.1753
2                     1101     0.1200
3                     723      0.0788
4                     555      0.0605
5                     417      0.0454
6                     292      0.0318
7                     262      0.0286
8                     178      0.0194
9                     160      0.0174
10                    119      0.0130
11                    101      0.0110
12                    75       0.0082
13                    64       0.0070
>13                   282      0.0307
Table 2.10: Number of donors that did exactly n total donations (after the first one)
Figure 2.1: Histogram of the empirical rates of donation (number of donations divided by the years of observation). Plot title: Empirical yearly rate of donation; x-axis: RATE; y-axis: Density.
It can be noticed that there could be a problem of loss to follow-up. This can be seen by computing, for each donor, the number of days passed from the last donation to the censoring time (namely, the last day of observation). See Figure 2.2 for the boxplot of this quantity.
Loss to follow-up happens when an individual voluntarily abandons the study, and so he/she does not show up for a long period of time. However, blood donations are on a voluntary basis, hence we do not know whether a donor has actually decided to stop his/her activity or is only postponing the next donation.
If one believes that the history of the process influences the fact that some individuals do not show up for a while, then some choices about the censoring time $C_i$ have to be made.
Then the dependence between the process and $C_i$ must be modelled (see Ouyang et al. (2013) for event-dependent censoring times and Chapter 7 of Cook and Lawless (2007) for more details about loss to follow-up).
Figure 2.2: Boxplot of the number of days passed from the observed last donation of every donor to their censoring time. Groups: non-recurrent donors, recurrent donors.