
POLITECNICO DI MILANO

School of Industrial and Information Engineering
Department of Electronics, Information and Bioengineering
Master of Science Degree in Computer Science and Engineering

Adaptive-Size Online Dictionary Learning

for Nonstationary Environments

Advisor: Prof. Giacomo Boracchi

Co-Advisor: Dott. Diego Carrera

Master Thesis by:

Andrea Guglielmetti, 900297


I would like to thank Prof. Giacomo Boracchi for the great help he provided and the patience he showed during the writing of this thesis. I thank my friends for the smiles and laughter of our Saturday evenings. A special thanks goes to Jessica and to my parents, for supporting (and putting up with) me throughout my studies.


Abstract

Dictionary learning is a machine learning method for building a model that represents a signal using a limited number of atoms. This technique has found many applications in different fields, ranging from image reconstruction to classification, anomaly detection, and representation learning. However, a common assumption of these applications is the stationarity of the signals, meaning that the generative process behind them is assumed not to be drifting, evolving, or changing.

In this thesis, we explore an extension of the online dictionary learning framework capable of quickly adapting to incoming signals drawn from a new distribution. Indicators based on the dictionary are collected for each signal during the learning phase and used to determine whether adapting the dictionary to new data is necessary. The adaptation is performed by adding new atoms that are learned directly on the most recent incoming signals. Our procedure also removes atoms that are no longer necessary for representing the incoming signals.

We tested the algorithm on real and synthetic datasets. The results collected on highly structured signals show that our solution reaches a lower reconstruction error and sparser representations than alternative solutions. On the other hand, in the presence of less structured signals, such as patches coming from real-life images, our solution tends to adapt more frequently, resulting in an increase of the reconstruction error compared to other solutions.


Sommario

Dictionary learning is a machine learning technique that aims to build a model capable of representing a signal using a limited number of atoms. This technique is used in several fields, ranging from image reconstruction to classification problems, anomaly detection, and representation learning. It is commonly assumed that the signals used for learning are generated by a stationary process, i.e., a process that is assumed not to evolve or change over time.

In this thesis, we explore an extension of the Online Dictionary Learning framework that allows the dictionary to adapt more quickly to a change in the generative process of the signals, starting from the dictionary learned on the current process. To determine whether the structure of the dictionary needs to be modified, we use indicators obtained from the signals during the learning phase. The adaptation to the new process is performed by inserting atoms learned directly from the most recent signals; moreover, the atoms that are not necessary to represent the new process are removed.

The proposed solution has been tested both on synthetic datasets and on datasets derived from natural images. The results obtained with highly structured signals show that our solution reaches lower reconstruction errors and sparser representations than alternative solutions. On the other hand, in the presence of less structured signals, such as patches from everyday images, our solution tends to adapt the dictionary more frequently, resulting in an increase of the reconstruction error compared to the other solutions.


Contents

1 Introduction
1.1 Structure of the thesis

2 Theoretical Background
2.1 The Sparse Model
2.1.1 The $P_0^\epsilon$ Problem
2.1.2 The $P_1^\epsilon$ Problem
2.2 Dictionary Learning
2.2.1 Online Dictionary Learning
2.2.2 Adaptive-Size Dictionary
2.3 Change Detection
2.3.1 The RS-Forest Density Estimation Method
2.4 Dictionary Learning in Nonstationary Environments: Related Works

3 Problem Formulation
3.1 Signal model
3.2 Definition of nonstationary environment
3.3 Goals of this work

4 Proposed Solution
4.1 Algorithm Outline
4.2 Change Detection
4.2.1 Data-Driven Indicators
4.2.2 Probability Distribution Estimation and Update
4.2.3 Batch Change Detection
4.3 Dictionary Adaptation
4.3.1 Dictionary Pruning
4.3.2 Learning the New Atoms

5 Experiments and Discussion
5.1 Figures of Merit
5.2 Sensitivity Study
5.2.1 Dataset
5.2.2 Experimental Setup
5.2.3 Results
5.3 Empirical Comparison on Texture-Images Dataset
5.3.1 Dataset
5.3.2 Experimental Setup
5.3.3 Results
5.4 Empirical Comparison on Real-Life Images Dataset
5.4.1 Dataset
5.4.2 Experimental Setup
5.4.3 Results

6 Conclusions

List of Figures

2.1 Example of sparse representation.
2.2 Comparison between different $\ell_p$-norms. From left to right: p = 0, p = 0.5, p = 1, p = 2.
2.3 Graphical representation of the $P_1^\epsilon$ problem. The red ellipses represent the contours of the residual least-squares term, while the blue diamond is the $\ell_1$-norm ball.
2.4 Example of the LARS algorithm. The solution starts at $x_0$, then evolves along the atom $d_1$, which is the most correlated with the residual. Then, at $x_1$, the atom $d_2$ becomes as correlated as $d_1$, so the solution evolves along the direction that preserves equiangularity.
4.1 The different states of the algorithm and their transitions.
4.2 Log-likelihood of the indicators when there are no changes in the generative process (left) and when the process changes (right). The blue circles represent the current anomaly indicators $G_t$; the contour plot shows the estimated probability distribution $\hat\Phi_{t-1}$.
5.1 Dictionary used for generating the synthetic signals.
5.2 Results with different values of $\lambda_g$.
5.3 Results with different values of $\delta$.
5.4 Comparison of RMSE and dictionary size during learning, varying $C_k$ (rows) and $\rho$ (columns).
5.5 Comparison of the processes' RMSE during learning, varying $C_k$ (rows) and $\rho$ (columns).
5.6 Comparison of the processes' $\ell_0$-norm during learning, varying $C_k$ (rows) and $\rho$ (columns).
5.7 Images used from the Brodatz dataset. The alphabetical order denotes the order of appearance in the dataset.
5.8 Comparison of the dictionary size between the different methods using texture images.
5.9 Comparison of the dictionary size and the RMSE for each process between the different methods using texture images. The measures are obtained at the change point, before the adaptation process.
5.10 Comparison of the dictionary size and the sparsity for each process between the different methods using texture images. The measures are obtained at the change point, before the adaptation process.
5.11 Comparison of the dictionary size between the different methods learning from real-life images.
5.12 Dictionary size vs. RMSE (above) and dictionary size vs. average $\ell_0$-norm (below) for the different methods. The measures are obtained at the change point, before the adaptation process.

List of Algorithms

1 Matching Pursuit Algorithm
2 LASSO-LARS
3 Online Dictionary Learning
4 Build RS-Tree
5 Update RS-Tree
6 Neurogenetic Online Dictionary Learning
7 Online Adaptive-Size Dictionary Learning for Nonstationary Environments
8 Determine if the adaptation phase needs to continue
9 Dictionary Update
10 Remove atoms after detecting a change
11 Expand dictionary and memory matrices
12 Atom generation from the current batch


Chapter 1

Introduction

Dictionary learning is a machine learning technique to learn a model composed of primitive components directly from a dataset of signals of interest. The resulting model, called a dictionary, can represent other signals of the same family as a linear combination of a few atoms. For example, consider a digit image (the signal); this image can be constructed using only a small number of ink strokes among all possible ink strokes (the dictionary). Dictionary learning has been widely used in various kinds of applications; for example, in image processing it is used for denoising, for restoring missing pixels (inpainting), or for image classification. Other possible applications can be found in anomaly detection or representation learning. Traditional dictionary learning algorithms are iterative offline methods that perform learning by alternating the computation of the sparse representation of the entire dataset with an update of the dictionary. When dealing with applications that require large training datasets, such as image or video processing, this approach presents some critical drawbacks, such as large physical-memory requirements and long computation times. Another problem is the necessity of completely retraining the model to include new training data. While the online dictionary learning framework presented in [1] solves these issues, it assumes that signals do not change or evolve over time, updating the dictionary without altering the number of atoms.

In a nonstationary environment, the process that generates the signals can change over time. This effect can be due, for example, to seasonality, periodic or aging effects, faults in the sensors monitoring the environment, or any other cause resulting in a change of the generative process. In these conditions, a model trained under the assumption of data stationarity can become obsolete over time and yield erroneous results. In the dictionary learning setting, the dictionary could produce an inadequate approximation in terms of error, or a non-sparse representation that loses the interpretability of the model. Hence, we propose to include within the learning process a mechanism for detecting changes and another mechanism for allowing the dictionary to adapt to the new environment.

As noted in [2], learning a dictionary in a nonstationary environment can be regarded as an online model selection problem. In contrast to traditional offline model selection, this approach consists of gradually growing the dictionary according to how well its representation ability fits the current process. In other words, when a dictionary is no longer able to represent the current process within an acceptable error bound, its complexity (in terms of the number of atoms) is increased to account for the presence of new signals. However, when deciding how to modify the dictionary size, the resulting effect on the sparse representation must be considered. Whereas a dictionary that is too small cannot approximate signals within acceptable error bounds, an overly large one can prevent the computation of an optimal sparse representation. Hence, while increasing the number of atoms allows us to adapt the model to a new process, we also implement a forgetting mechanism that removes atoms related to old processes and limits the growth of the dictionary.

We propose an extension to the Online Dictionary Learning framework able to detect changes in the environment and to modify the number of atoms in the dictionary accordingly. Our approach proposes a solution to the following three tasks:

• Change detection: the process of determining whether the generative process behind incoming batches of signals is the same as the one previously encountered or has undergone a change.

• Pruning of the dictionary: identify and remove the atoms that are not useful for representing the new process.

• Learning new atoms: the extraction, from incoming batches of signals, of new atoms that can improve the quality of the representation of new processes in terms of reconstruction error and sparsity.

We expect that actively monitoring the presence of changes in the stream of signals can be a better indicator of the necessity of adapting the dictionary than the passive approach used in [2]. Hence, we extend the solution presented in [3], initially designed for anomaly detection in offline settings, to operate in online settings for change detection. Each signal is mapped to a lower-dimensional space by means of indicators, i.e., the reconstruction error and the sparsity of the representation, in order to estimate a probability distribution function (p.d.f.). Given the online setting and the fact that the indicators strongly depend on the dictionary we are learning, the estimated p.d.f. needs to be updated to include the latest batches. Hence, we considered the work presented in [4], which models the distribution of the indicators through an ensemble of Randomized-Space Trees and allows updating the density model using only the most recent indicators. Assuming that the current p.d.f. does not fit the indicators obtained from signals coming from a different generative process, we propose to detect changes by comparing the average log-likelihood between two consecutive batches through a statistical hypothesis test. When the difference is larger than a given threshold, our algorithm triggers the adaptation mechanism. We then determine the number of atoms added to and removed from the dictionary based on the magnitude of the detected change: the more significant the change, the more impactful the adaptation phase should be, resulting in a larger number of atoms pruned and inserted.

For dictionary pruning, we propose to assign a score to each atom. This score represents the error introduced in the reconstruction of the incoming process if that atom were removed. Thus, we expect that atoms with a lower score can be discarded, since they do not contribute significantly to representing the new process. We then define a threshold based on the change magnitude and the measured scores, and we remove the atoms whose score falls below that value.

Finally, for learning the new atoms to insert into the dictionary, we start from a random dictionary and alternate several times the computation of the sparse representation with a dictionary update phase. The idea is that inserting atoms learned directly from batches of signals can achieve a lower reconstruction error and a sparser representation than expanding the dictionary with randomly generated atoms, as happens in [2]. Moreover, we suppose that signals coming from the same process share a common sparse representation. Hence, we impose on the Basis Pursuit Denoising formulation an additional regularizer based on joint sparsity, in order to achieve a better generalization of the incoming process.

We tested our algorithm on three datasets: a synthetic one, one composed of patches coming from texture images, and one composed of patches coming from real-life images. The first dataset is used to study how our algorithm performs while varying the parameters controlling the sensitivity to changes and the number of atoms added and removed. The last two datasets have been used to compare our algorithm with the fixed-size, non-adaptive Online Dictionary Learning [1] and with the Neurogenetic-ODL [2], which also implements a form of adaptation. The obtained results indicate that in the presence of highly structured signals, for example patches of texture, our solution achieves a lower reconstruction error and sparser representations with smaller dictionaries compared to the other solutions. However, the indicators used for change detection do not seem appropriate for detecting a change of process given by switching the class of images, for example passing from images of buildings to animals. This difficulty in detecting changes results in an excessive number of adaptation processes and an increased representation error compared to the other solutions.

1.1 Structure of the thesis

This thesis is structured as follows:

• In Chapter 2, we review some concepts of sparse representations, dictionary learning, and novelty detection; moreover, we investigate other works related to ours.

• In Chapter 3, we provide a formal statement of the addressed problem.

• In Chapter 4, we present our dictionary learning algorithm for learning in nonstationary environments.

• In Chapter 5, we test the performance of the algorithm, introducing the datasets used and the methodology; we also discuss the obtained results.


• In Chapter 6, we recap the contributions of our work, pointing out possible improvements and presenting our conclusions.

• In Appendix A, we provide the mathematical derivation of the Online Dictionary Learning formulation.


Chapter 2

Theoretical Background

2.1 The Sparse Model

The Sparse Model, also called the Sparse-Land model in [5], assumes that a signal $s \in \mathbb{R}^m$ can be represented as a linear combination of basis vectors contained in a matrix $D \in \mathbb{R}^{m \times n}$, with, in general, $m < n$. This representation $x \in \mathbb{R}^n$ is called sparse representation or code. Figure 2.1 provides a graphical toy example of this concept.

$$ s = Dx \qquad (2.1) $$

However, (2.1) refers to an underdetermined linear system of equations, which has an infinite number of solutions as long as the vector $s$ belongs to the span of the columns of $D$. To obtain a unique solution, a regularization function representing prior knowledge of the signal is imposed; thus (2.1) becomes a minimization problem $P_J$:

$$ (P_J):\ \operatorname*{argmin}_x\ J(x) \quad \text{s.t.}\quad Dx = s \qquad (2.2) $$

It is often realistic to assume that the signal $s$ derives from the measurement of a sensor; then (2.1) is extended by including a term $\eta \sim WN(0, \sigma^2)$ representing white noise:

$$ s = Dx + \eta \qquad (2.3) $$

In this case, the equality constraint is relaxed, and the solution will be an approximation of the signal bounded by the norm of the error $\epsilon$:

$$ (P_J^\epsilon):\ \operatorname*{argmin}_x\ J(x) \quad \text{s.t.}\quad \|s - Dx\|_2^2 \leq \epsilon \qquad (2.4) $$

Finally, an equivalent formulation can be obtained by exploiting the Lagrange multiplier theorem, substituting the explicit constraint with a constant $\lambda$:

$$ (P_J^\epsilon):\ \operatorname*{argmin}_x\ \frac{1}{2}\|s - Dx\|_2^2 + \lambda J(x) \qquad (2.5) $$


Figure 2.1: Example of sparse representation.

2.1.1 The $P_0^\epsilon$ Problem

The most straightforward approach to obtaining a sparse solution is to impose a regularization that penalizes vectors with a large number of non-zero coefficients. In this case, $J(x)$ is replaced by the $\ell_0$ pseudo-norm¹ of the representation $x$, defined as

$$ \|x\|_0 := \#\{i \mid x_i \neq 0\} \qquad (2.6) $$

obtaining the $P_0$ problem

$$ (P_0^\epsilon):\ \operatorname*{argmin}_x\ \frac{1}{2}\|s - Dx\|_2^2 + \lambda\|x\|_0 \qquad (2.7) $$

Unfortunately, the problem is NP-hard, since to recover a solution that is $s$-sparse we must evaluate, in principle, all the possible combinations of $s$ columns of $D$. This issue is addressed by greedy algorithms, which build the support of the solution by selecting one atom at a time.

One of the first algorithms proposed is the Matching Pursuit (MP) [6], summarized in Algorithm 1. Starting from a null solution, the algorithm selects the atom $d_{j^*}$ that minimizes the distance between the atom and the residual (lines 3 and 4). Then, its coefficient in the code is updated by adding the projection of the residual onto $d_{j^*}$ (line 6). The algorithm ends when the residual falls below a given threshold $\epsilon$ or when the cardinality of the solution reaches the given maximum sparsity level $L$. An improvement of MP is the Orthogonal Matching Pursuit (OMP) [7]. Whereas the former updates one coefficient at a time, the latter updates all the coefficients of the solution through least squares. This modification prevents an atom already present in the support from being selected more than once.

2.1.2 The $P_1^\epsilon$ Problem

In order to make the problem $P_0^\epsilon$ more tractable, a possible solution is to replace the discontinuous and discrete $\ell_0$ pseudo-norm with a smooth or convex approximation. Among the different $\ell_p$-norms, the best choice is $p = 1$, since it is the closest convex approximation. Choices with $p < 1$ lead to non-convex problems, whereas $p > 1$ yields non-sparse solutions. A comparison between different norms is shown in Figure 2.2.

¹ It does not satisfy the homogeneity requirement.

Algorithm 1: Matching Pursuit Algorithm
Input: the signal $s$; the dictionary $D$; the error threshold $\epsilon$; the maximum $\ell_0$-norm of the solution $L$
Output: the sparse solution $x^k$
1: $x^0 \leftarrow 0$, $r^0 \leftarrow s$, $\omega^0 = \mathrm{Support}\{x^0\} \leftarrow \emptyset$   // initialization
2: repeat
3:   compute $\delta(j) = \|r^k\|_2^2 - \frac{(r^{kT} d_j)^2}{\|d_j\|_2^2}$   // sweep
4:   select $j^* = \operatorname{argmin}_j \delta(j)$
5:   $\omega^k \leftarrow \omega^{k-1} \cup \{j^*\}$   // update
6:   $x^k_{j^*} \leftarrow x^k_{j^*} + \frac{r^{kT} d_{j^*}}{\|d_{j^*}\|_2^2}$
7:   $r^k \leftarrow r^{k-1} - D x^k$
8: until $\|x^k\|_0 \geq L$ or $\|r^k\|_2 \leq \epsilon$
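For concreteness, the following NumPy sketch mirrors the greedy selection and update steps of Algorithm 1; the function and variable names (D, s, eps, L) are illustrative and not part of the thesis code.

```python
import numpy as np

def matching_pursuit(D, s, eps=1e-3, L=10, max_iter=200):
    """Greedy sparse coding of signal s over dictionary D (columns = atoms)."""
    m, n = D.shape
    x = np.zeros(n)
    r = s.copy()                                  # residual r^0 = s
    for _ in range(max_iter):
        if np.count_nonzero(x) >= L or np.linalg.norm(r) <= eps:
            break
        corr = D.T @ r                            # sweep: residual/atom correlations
        gains = corr**2 / np.sum(D**2, axis=0)    # decrease of ||r||^2 per atom
        j = int(np.argmax(gains))                 # best atom d_j*
        x[j] += corr[j] / np.dot(D[:, j], D[:, j])  # update its coefficient
        r = s - D @ x                             # recompute the residual
    return x

# toy usage on a random dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)
s = D[:, [3, 7]] @ np.array([1.5, -0.8])
x_hat = matching_pursuit(D, s, eps=1e-6, L=5)
```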

Figure 2.2: Comparison between different $\ell_p$-norms. From left to right: p = 0, p = 0.5, p = 1, p = 2.

$$ (P_1^\epsilon):\ \operatorname*{argmin}_x\ \|s - Dx\|_2^2 + \lambda\|x\|_1 \qquad (2.8) $$

Figure 2.3 provides a geometric interpretation of how $P_1^\epsilon$ recovers a sparse solution with a toy dictionary composed of two atoms. Let us consider the point denoted by $\hat{s}$ as the noisy observation of a signal, with the red ellipses denoting the contours of the least-squares error function. The blue diamond indicates the $\ell_1$ ball, which represents the $\ell_1$-norm regularizer, weighted by the coefficient $\lambda$ that trades off representation error against sparsity of the representation. When the minimum of (2.8) falls in a corner of the $\ell_1$ ball, either the coefficient of $d_1$ or that of $d_2$ is zero, resulting in a sparse representation.

The problem $P_1^\epsilon$ in (2.8) is also known as Basis Pursuit Denoising (BPDN) [8] or Least Absolute Shrinkage and Selection Operator (LASSO) [9]. The problem is convex but not strictly convex, so its minimizer is not necessarily unique; moreover, it does not admit a closed-form solution, because the $\ell_1$-norm is not differentiable at 0. However, under certain conditions on the dictionary, the global minimum is unique and equal to the one obtained by solving the $P_0$ problem.
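As an illustration of the trade-off controlled by $\lambda$ in (2.8), the snippet below (a sketch assuming scikit-learn is available; the dictionary and penalty values are ours) encodes a signal against a fixed random dictionary with the LASSO estimator and reports sparsity and residual for a few penalty values.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
m, n = 32, 64
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms
x_true = np.zeros(n)
x_true[[2, 10, 40]] = [1.0, -2.0, 0.5]
s = D @ x_true + 0.01 * rng.standard_normal(m)

for lam in (0.001, 0.01, 0.1):
    # scikit-learn's Lasso minimizes (1/(2*m))||s - Dx||^2 + alpha*||x||_1
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    model.fit(D, s)
    x = model.coef_
    print(lam, np.count_nonzero(np.abs(x) > 1e-8), np.linalg.norm(s - D @ x))
```

Larger penalties yield sparser codes at the price of a larger residual, which is exactly the geometric picture of Figure 2.3.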

Figure 2.3: Graphical representation of the $P_1^\epsilon$ problem. The red ellipses represent the contours of the residual least-squares term, while the blue diamond is the $\ell_1$-norm ball.

Figure 2.4: Example of the LARS algorithm. The solution starts at $x_0$, then evolves along the atom $d_1$, which is the most correlated with the residual. Then, at $x_1$, the atom $d_2$ becomes as correlated as $d_1$, so the solution evolves along the direction that preserves equiangularity.

Several algorithms have been proposed to solve $P_1^\epsilon$. One of the oldest is the Iterative Reweighted Least Squares (IRLS) [10], which solves $P_1^\epsilon$ by iteratively replacing the $\ell_1$-norm with a weighted $\ell_2$ approximation; its first application to the computation of sparse representations was in the FOCUSS algorithm [11]. Other approaches exploit proximal gradient methods, such as the Iterative Shrinkage-Thresholding Algorithm (ISTA) [12] or the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [13]. The Alternating Direction Method of Multipliers (ADMM) [14] deals with the non-differentiability of $P_1^\epsilon$ by introducing an auxiliary variable and an equality constraint, splitting the original problem into several easier subproblems.
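To make the proximal-gradient idea concrete, here is a minimal ISTA sketch for (2.8) (our own illustrative code, not the FISTA or ADMM variants cited above): a gradient step on the quadratic term followed by soft-thresholding, with step size $1/L_f$, where $L_f$ is the Lipschitz constant of the gradient.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, s, lam, n_iter=500):
    """Minimize ||s - Dx||_2^2 + lam*||x||_1 by proximal gradient descent."""
    L_f = 2.0 * np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ x - s)         # gradient of the quadratic term
        x = soft_threshold(x - grad / L_f, lam / L_f)
    return x
```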

To solve the $P_1^\epsilon$ problem in this thesis, we resort to Least Angle Regression (LARS), in particular the LASSO-LARS version presented in Algorithm 2. LASSO-LARS is a forward selection algorithm similar to OMP. The algorithm starts with an empty solution; then, at each iteration, it selects the atom $d_{j^*}$ most correlated with the residual $r^k$ (line 5). The chosen atom influences the direction $\delta^k$ along which the solution evolves. Thus, the atom is inserted into the active set $A^k$ and the direction $\delta^k$ is computed (lines 6 and 7). When there is a single atom in the active set, $\delta^k$ points in the direction defined by $d_{j^*}$. Otherwise, it points along the joint least-squares direction between the residual and the atoms in the active set. This direction can also be seen as the direction that makes the least angle with the residual, hence the name. The coefficients of the solution then evolve along $\delta^k$ until another atom $d_l$, not present in $A^k$, becomes the most correlated with the current residual $r^k$ (lines 8–14). Lines 10–12 are specific to LASSO-LARS, causing the removal of an atom from $A^k$ if its coefficient goes to zero; the direction $\delta^k$ is then updated accordingly. The algorithm stops when all the atoms have entered the active set at least once. A toy example of the algorithm is shown in Figure 2.4.

Algorithm 2: LASSO-LARS
Input: signal $s$; dictionary $D = [d_1, \dots, d_n]$; step size $\lambda$
Output: sparse solution $x$
1: $k \leftarrow 0$   // initialization
2: $r^k \leftarrow s$, $x^k \leftarrow 0$, $A^k \leftarrow \emptyset$
3: repeat   // main loop
4:   $k \leftarrow k + 1$
5:   $j^* \leftarrow \operatorname{argmax}_j\ d_j^T r^k$
6:   $A^k \leftarrow A^{k-1} \cup \{j^*\}$
7:   $\delta^k \leftarrow (D_{A^k}^T D_{A^k})^{-1} D_{A^k}^T r^k$
8:   repeat   // move along the direction
9:     $x^k_{A^k} \leftarrow x^k_{A^k} + \lambda \delta^k$
10:    if $\exists\, x^k_{A^k, j} = 0$ then
11:      $A^k \leftarrow A^k \setminus \{j\}$
12:      $\delta^k \leftarrow (D_{A^k}^T D_{A^k})^{-1} D_{A^k}^T r^k$
13:    $r^k \leftarrow r^k - D_{A^k} x_{A^k}$
14:  until $l = \operatorname{argmax}_j\ d_j^T r^k \notin A^k$
15: until all $n$ atoms have entered the active set at least once

2.2 Dictionary Learning

A dictionary $D$ is an $m \times n$ matrix whose columns, called atoms, are designed to represent other vectors of the same space in a sparse manner. In general, the number of columns $n$ is greater than the dimensionality $m$ of the signal, resulting in a set of basis vectors that is neither orthogonal nor linearly independent.

There are two possible methods for designing a dictionary:

• Analytically: the dictionary is built using a predefined set of bases, such as DCT, wavelets, curvelets, or many others. The advantage of these dictionaries is the computational efficiency of their construction. In contrast, they are typically limited in their ability to sparsely represent a signal; in addition, most of these dictionaries are restricted to signals of a specific type.

• Learned from data: starting from a training set of signals of interest, the dictionary atoms form an empirical model that describes the internal structure of the training signals. As opposed to analytically designed dictionaries, they have a significant computational cost. However, the learning approach can provide dictionaries for any class of signals that admits a sparse representation. We focus on this latter kind of dictionary.

While the concept of transforming signals into their sparse representation can be traced back to the 1960s [15], the idea of dictionary learning dates back to the 1990s with [16], where a dictionary is learned from patches of natural images. Subsequently, many authors have proposed different solutions to the dictionary learning problem, two of the most famous being the Method of Optimal Directions (MOD) [17] and K-SVD [18].

The general process for finding a dictionary $D$ requires a training set $S \in \mathbb{R}^{m \times N}$ composed of the signals of interest $s$. The learning then amounts to solving the following optimization problem:

$$ \operatorname*{argmin}_{D, X}\ \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\|s_i - Dx_i\|_2^2 + \lambda\|x_i\|_p, \quad \text{where } p \in \{0, 1\} \qquad (2.9) $$

Due to the interdependence between $D$ and $X$, this problem is not jointly convex; the great majority of algorithms adopt a block coordinate descent approach, alternating a sparse coding phase with $D$ fixed and a dictionary update phase with $X$ fixed.
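A minimal batch sketch of this alternating scheme, assuming scikit-learn's sparse_encode is available for the coding step and using an MOD-like least-squares dictionary update (our simplification; the thesis relies on ODL-style updates elsewhere):

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def batch_dictionary_learning(S, n_atoms, lam=0.1, n_iter=20, seed=0):
    """S: (m, N) training signals as columns. Returns a learned D (m, n_atoms)."""
    rng = np.random.default_rng(seed)
    m, N = S.shape
    D = rng.standard_normal((m, n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # sparse coding step (D fixed); sparse_encode expects row-wise signals
        X = sparse_encode(S.T, D.T, algorithm="lasso_lars", alpha=lam).T  # (n_atoms, N)
        # dictionary update step (X fixed): least-squares fit, i.e. MOD
        G = X @ X.T + 1e-8 * np.eye(n_atoms)
        D = S @ X.T @ np.linalg.inv(G)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # renormalize atoms
    return D
```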

2.2.1 Online Dictionary Learning

In some applications, the collection of signals $S$ is not completely available at the beginning but arrives in small batches or one signal at a time; other applications rely on signals that evolve over time; finally, the training dataset may not fit in physical memory.

The Online Dictionary Learning (ODL) method [1] solves these problems. This method trains the dictionary with the usual alternation of sparse representation and dictionary update. However, it considers a single signal $s_t$, or more generally a mini-batch, at a time. Also, it uses two matrices $A$ and $B$ for compactly representing the previously learned signals and their encoding history.

The ODL method is presented in Algorithm 3. It starts with an initial dictionary $D_0$, e.g., a randomly initialized one, and resets the matrices $A$ and $B$ (line 1). At each iteration $t$, the algorithm computes the sparse representation $x_t$ of $s_t$ w.r.t. the previous dictionary $D_{t-1}$; the sparse representation is obtained by solving (2.8) through LARS (line 4). In lines 5 and 6, the memory of the algorithm is updated to include the current signal: $A_t$ accumulates the outer product $x_t x_t^T$ of the sparse representation, whereas $B_t$ accumulates the cross-correlation $s_t x_t^T$ between the signal and its code. To conclude the iteration, the dictionary $D_{t-1}$ is updated (lines 7–11) by minimizing the surrogate function $\hat{f}_t$ in (2.10). The update occurs through block coordinate descent, optimizing each atom while keeping the others fixed, until convergence. To prevent arbitrarily large values, each atom is normalized to lie inside the unit hypersphere. A mathematical derivation of (2.11) and of the optimization process is presented in Appendix A.

$$ \hat{f}_t(D) = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|s_i - D x_i\|_2^2 + \lambda\|x_i\|_1\right) \qquad (2.10) $$

which, up to terms that do not depend on $D$, is proportional to

$$ \frac{1}{2}\operatorname{Tr}\!\left(D^T D A_t\right) - \operatorname{Tr}\!\left(D^T B_t\right) \qquad (2.11) $$

Algorithm 3: Online Dictionary Learning
Input: signal stream $s_1, \dots, s_N \in \mathbb{R}^m$; initial dictionary $D_0 \in \mathbb{R}^{m \times n}$; $\ell_1$-norm penalty $\lambda$
Output: dictionary $D$
1: $A_0 \in \mathbb{R}^{n \times n} \leftarrow 0$, $B_0 \in \mathbb{R}^{m \times n} \leftarrow 0$
2: for $t = 1$ to $N$ do
3:   input $s_t$
4:   $x_t \leftarrow \operatorname{argmin}_x \|s_t - D_{t-1}x\|_2^2 + \lambda\|x\|_1$   // sparse representation
5:   $A_t \leftarrow A_{t-1} + x_t x_t^T$   // update memories
6:   $B_t \leftarrow B_{t-1} + s_t x_t^T$
7:   repeat   // dictionary update
8:     for $j = 1$ to $n$ do
9:       $u_j \leftarrow \frac{1}{a_{jj}}(b_j - D a_j) + d_j$
10:      $d_j \leftarrow \frac{1}{\max(\|u_j\|_2, 1)} u_j$
11:  until convergence
12: return $D_t$
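A compact NumPy rendering of Algorithm 3, assuming scikit-learn for the LARS-based coding step (variable names are ours); note that only the $n \times n$ and $m \times n$ memories $A$ and $B$ are kept, not the past signals.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def odl(signal_stream, D0, lam=0.1, n_inner=10):
    """Online dictionary learning over an iterable of signals s_t in R^m."""
    D = D0.copy()
    m, n = D.shape
    A = np.zeros((n, n))
    B = np.zeros((m, n))
    for s in signal_stream:
        # sparse coding of the current signal w.r.t. the current dictionary
        x = sparse_encode(s[None, :], D.T, algorithm="lasso_lars", alpha=lam)[0]
        # memory update (lines 5-6 of Algorithm 3)
        A += np.outer(x, x)
        B += np.outer(s, x)
        # block coordinate descent on the surrogate (a few fixed sweeps)
        for _ in range(n_inner):
            for j in range(n):
                if A[j, j] < 1e-10:
                    continue                      # unused atom: leave it unchanged
                u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D
```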

2.2.2 Adaptive-Size Dictionary

Given a dictionary $D \in \mathbb{R}^{m \times n}$ learned from a training set $S \in \mathbb{R}^{m \times N}$, we denote with $n$ the number of atoms. In many applications based on sparse representation, the number of atoms in a dictionary has a significant impact on both representation performance and computational efficiency [19]. A dictionary with an excessively small number of elements will hardly yield an accurate representation, underfitting the training set. On the other hand, a large dictionary reaches a lower reconstruction error but has several drawbacks: the cost of OMP, LARS, and the other algorithms used for computing the sparse representation grows with the number of atoms; moreover, the presence of atoms similar to each other makes finding the optimal sparse representation more difficult; finally, there is the issue of overfitting the training set.

In general, the optimal value of $n$ is chosen through a validation process, selecting the best trade-off between representation error and computational performance. However, only a few studies address the question of how to learn a dictionary with the optimal number of atoms [19, 20, 21].

In mathematical terms, the problem can be formulated as follows:

$$ \min_{D \in \mathbb{R}^{m \times n},\ X \in \mathbb{R}^{n \times N},\ n \in \mathbb{N}}\ \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\|s_i - D x_i\|_2^2 + \lambda\|x_i\|_p + \mu n, \quad \text{where } p \in \{0, 1\} \qquad (2.12) $$

Solving (2.12) is more complicated than (2.9) due to the discrete constraint imposed by $n$. As attempts to solve (2.12), many different heuristics have been proposed [22], and in the following subsections we describe some of them. Although the majority have been designed for offline dictionary learning, they offer useful insights both on how to generate atoms from signals and on how to reduce the dictionary size by removing the less useful atoms. As noted in [2], dynamically modifying the dictionary size can be a feasible solution for learning in a nonstationary environment.

Atom Generation

A preliminary distinction between the algorithms is the stage at which they decide to expand the dictionary. The increment may happen periodically, as in Stagewise K-SVD [23] or Incrementally Built Dictionary Learning [24], or after a specific condition is met. For example, in the Neurogenesis-Inspired DL [2] this condition is met when the Pearson correlation $r$ between the signal $s$ and its reconstructed version $\hat{s} = Dx$ falls below a threshold. Instead, [19] measures the trend of the average number of non-zero coefficients in the sparse representation $X$ relative to a training set $S$: if the trend is stable or positive, the algorithm proceeds to insert atoms.

Once it is decided when to insert atoms, it is necessary to establish a procedure to generate them. The most straightforward choice is to insert randomly generated atoms [2, 21]. In general, new atoms derive directly from signals, possibly after applying a transformation to them. Stagewise K-SVD [23], which operates in offline settings, chooses at each iteration a percentage of the worst-represented signals and applies the Singular Value Decomposition to them; the atoms inserted are the first $k$ left singular vectors, with $k$ given as a hyper-parameter. IK-SVD [25], when a new batch of signals $S_i = [s_1, \dots, s_r]$ is received, adds a fixed number of those signals as-is to the dictionary. The signals are selected in order to maximize the information entropy with respect to the current dictionary. The entropy is computed for each sparse representation in $X_i = [x_1, \dots, x_r]$ as

$$ H(x_i) = -\sum_{j=1}^{n} \frac{x_{ij}}{\sum_{k=1}^{n} x_{ik}} \log\left(\frac{x_{ij}}{\sum_{k=1}^{n} x_{ik}}\right) \qquad (2.13) $$

denoting with $x_{ij}$ the $j$-th coefficient of the $i$-th sparse representation. Other atom generation approaches consist of applying clustering techniques to the signals and choosing cluster centroids as atoms [26, 27].
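To illustrate the IK-SVD-style selection rule, the sketch below (our own code; the use of coefficient magnitudes, k_new, and the normalization are illustrative assumptions) computes the entropy (2.13) of each new signal's code and picks the highest-entropy signals as candidate atoms.

```python
import numpy as np

def code_entropy(X, eps=1e-12):
    """Entropy (2.13) for each column of the code matrix X (n_atoms, r)."""
    P = np.abs(X)                                  # use coefficient magnitudes
    P = P / (P.sum(axis=0, keepdims=True) + eps)
    return -(P * np.log(P + eps)).sum(axis=0)

def select_atoms_by_entropy(S_batch, X_batch, k_new=2):
    """Return the k_new batch signals whose codes have the highest entropy."""
    H = code_entropy(X_batch)
    idx = np.argsort(H)[::-1][:k_new]
    new_atoms = S_batch[:, idx].astype(float)
    new_atoms /= np.linalg.norm(new_atoms, axis=0)  # unit-norm atoms
    return new_atoms
```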


A completely different method for generating atoms is to select an atom and replace it with two new atoms generated from it, following a specific criterion. Examples of this idea are DL with Efficient Number of Elements [19] and Incrementally Built DL [24]. In the first, the atom $d_j$ is split into $d_j^+$ and $d_j^-$, which differ by a single coefficient $d_{ji}$. The new atoms are computed in order to minimize the representation error $e_j$ corresponding to $d_j$:

$$ e_j = S - \sum_{k \neq j} d_k x_T^k \qquad (2.14) $$

In [24], the number of times $u$ that an atom is selected for the sparse representation is recorded. Periodically, two indices $i$ and $j$ are drawn from a generalized Bernoulli distribution parameterized by the usage statistics. The least-used atom is resampled from a Gaussian distribution $\mathcal{N}(d_i, \sigma_1)$, and a new atom generated from $\mathcal{N}(d_j, \sigma_1)$ is inserted.

Deletion Phase

During learning it can happen that an atom is rarely or never used. There can be multiple causes; for example, the atom has learned a particular structure that appears rarely, or the dictionary update phase has set it to zero, preventing it from learning new structures.

The importance of the atoms can be described in at least two ways. Given the sparse representation $X \in \mathbb{R}^{n \times N}$ relative to a training set $S \in \mathbb{R}^{m \times N}$, we denote with $x_T^j$ the $j$-th row of $X$. The representation power $p_j$ of an atom $d_j$ can be defined as

$$ p_{j,2} = \|x_T^j\|_2^2 \qquad (2.15) $$

which gives more emphasis to atoms present with large coefficients in $X$, as used in [23, 21], or as

$$ p_{j,0} = \|x_T^j\|_0 \qquad (2.16) $$

which promotes atoms that appear often with non-zero coefficients in $X$, as used in [19]. Atoms having lower representation power can be removed from the dictionary without increasing the approximation error significantly.

Another consideration regards the similarity between two atoms. It is unlikely that two similar atoms are chosen together in the same sparse representation of a signal; thus, the atom with the lower representation power can be removed, or the two atoms can be merged, e.g., by averaging them coefficient-wise. The similarity between two atoms can be computed directly from the coherence measure, defined as

$$ \mu_{ij} = \frac{|d_i^T d_j|}{\|d_i\|\,\|d_j\|}, \qquad 0 \leq \mu_{ij} \leq 1 \qquad (2.17) $$

with $\mu_{ij} \to 1$ denoting two very similar atoms, which can be merged as happens in [28]; alternatively, Mean-Shift clustering can be applied to the dictionary atoms, replacing clusters of atoms with their centroids [20].
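The following sketch (illustrative code; the quantile and coherence thresholds are our assumptions, not values from the cited works) computes the two representation-power scores and the mutual coherence, and drops atoms that are both weakly used and nearly duplicated.

```python
import numpy as np

def atom_scores(D, X):
    """Representation powers (2.15)-(2.16) and pairwise coherence (2.17)."""
    p2 = np.sum(X**2, axis=1)                      # ||x_T^j||_2^2 per atom
    p0 = np.count_nonzero(X, axis=1)               # ||x_T^j||_0 per atom
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    mu = np.abs(Dn.T @ Dn)                         # coherence matrix
    np.fill_diagonal(mu, 0.0)
    return p2, p0, mu

def prune_dictionary(D, X, power_quantile=0.1, coh_thresh=0.95):
    p2, _, mu = atom_scores(D, X)
    weak = p2 <= np.quantile(p2, power_quantile)   # rarely/weakly used atoms
    duplicated = mu.max(axis=1) >= coh_thresh      # near-duplicate of another atom
    keep = ~(weak & duplicated)
    return D[:, keep], keep
```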

2.3 Change Detection

Given a stream $G = \{g_t,\ t = 1, \dots\}$, $g_t \in \mathbb{R}^d$, of realizations of a random variable, the change detection problem refers to detecting the presence of a change point $\tau$ such that

$$ g_t \sim \begin{cases} \Phi_0, & t < \tau \\ \Phi_1, & t \geq \tau \end{cases} \qquad (2.18) $$

where $\{g_t,\ t < \tau\}$ are i.i.d. and $\Phi_0 \neq \Phi_1$. The change is denoted as $\Phi_0 \to \Phi_1$. Change detection usually operates on low-dimensional signals or on features extracted from the incoming data stream. As noted in [29], signals living in high-dimensional spaces contain deterministic structures that cannot be represented by a single model.

According to [30], most change detection approaches can be grouped into four classes: Hypothesis Tests, Change-Point Methods, Sequential Hypothesis Tests, and Change Detection Tests. Hypothesis Tests (HT) assess the validity of a hypothesis (2.19) at a given confidence level. They operate on fixed sequences of signals, i.e., by partitioning the available dataset into two sets and performing a statistical test. Commonly used tests are the Z-test for normal distributions of known variance or the t-test for normal distributions of unknown variance. In case of unknown distributions, non-parametric tests such as the Mann-Whitney U-test or the Kolmogorov-Smirnov test can be used.

$$ H_0: g_t \sim \Phi_0 \quad \text{vs} \quad H_1: g_t \sim \begin{cases} \Phi_0, & \text{if } t < \tau \\ \Phi_1, & \text{if } t \geq \tau \end{cases} \qquad (2.19) $$

Change-Point Methods (CPM), like HT, operate on fixed sequences of signals. Since CPMs detect the presence of a change by analyzing all the possible partitions of a sequence, they can also provide the time instant $\tau$ at which the change occurred. The main drawback of these methods is the high computational demand required for analyzing all the partitions. Differently from HT and CPM, Sequential Hypothesis Tests analyze one incoming sample at a time until they have enough statistical confidence to accept or reject the hypothesis $H_0$; once the decision is taken, they stop analyzing the data stream. Examples of this technique are the Sequential Probability Ratio Test and the Repeated Significance Test. Change Detection Tests (CDT) perform a fully sequential analysis and keep operating also after a change affects the generative process. Since they continuously monitor a data stream, they are characterized by a low computational demand, and detection is typically performed by checking whether a feature value exceeds a threshold. Examples of this category are the CUSUM [31] and the Exponentially Weighted Moving Average (EWMA) [32].
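As a concrete example of the fixed-window, hypothesis-test scheme described above (a sketch of the general idea, not of the detector used later in this thesis), the code below compares a reference window with the most recent window of a scalar feature through a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_change(feature_stream, win=200, alpha=0.01):
    """Return the index of the first window whose distribution differs from the reference."""
    g = np.asarray(feature_stream)
    reference = g[:win]                            # assumed generated by Phi_0
    for start in range(win, len(g) - win + 1, win):
        recent = g[start:start + win]              # candidate window, possibly from Phi_1
        res = ks_2samp(reference, recent)
        if res.pvalue < alpha:
            return start                           # estimated change location (window start)
    return None
```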

2.3.1 The RS-Forest Density Estimation Method

As noted in [33], given a multivariate data stream $G = \{g_t,\ t = 1, \dots\}$, $g_t \in \mathbb{R}^d$, $d > 1$, the change $\Phi_0 \to \Phi_1$ can be detected by comparing a reference window of past data with a window $W_r$ of recent data. The former represents past data, assumed to be generated by $\Phi_0$, whereas the latter contains the most recent samples, which could be generated from $\Phi_1$. This comparison can be performed through a statistical test, such as the t-test or the Kolmogorov-Smirnov test. Traditional methods for this task are the Gaussian Mixture Model (GMM) or Kernel Density Estimation (KDE). However, in an online stream of data, it is not unusual to observe a continual evolution of normal and abnormal events with drifting concepts, requiring a continual update of the estimated distribution.

To address these problems, we considered the solution presented in [4]. Despite being designed for supervised anomaly detection, it proposes a non-parametric model for density estimation through an ensemble of RS-Trees called a Random-Space Forest (RS-Forest). The model operates on one window of fixed size, called the latest window. Each time a sample arrives, the model computes the related anomaly score and stores the sample in the window; once the window is full, the model is updated. For efficiency purposes, the RS-Forest employs a dual node profile technique, one profile for scoring and the other for speeding up the update: while scoring a sample, the RS-Forest records the number of samples that ended in the same termination node.

An RS-Tree is a full binary tree that partitions the feature space $\mathcal{X}$ by randomly selecting an attribute and a cut point for the selected attribute. Each node is composed of:

• three pointers, one pointing to the parent node and two pointing to the left and right children;

• a variable $v$ storing the log-scaled ratio of the node volume to the volume of the entire feature space;

• two variables $m_l$ and $m_r$, called the node profile, used alternately for scoring incoming samples and for updating the number of samples in the latest window that have traversed the node.

The construction of an RS-Tree does not rely on a training dataset; it only needs the lower and upper boundaries $L \in \mathbb{R}^d$ and $U \in \mathbb{R}^d$ of $\mathcal{X}$, which can be estimated directly from the first incoming samples. Algorithm 4 builds the tree recursively, stopping when the predefined maximum depth $H$ is reached (line 3). In lines 6–8, the algorithm randomly selects an attribute $q \in \{1, \dots, d\}$ of the feature space, then a random value $r \in [0, 1]$ is drawn for determining the cut point $p$ according to (2.20):

$$ p = L_q + r\,(U_q - L_q) \qquad (2.20) $$

The algorithm then proceeds by generating the two children: the left child gets the subspace delimited by the interval $[L_q, p]$, whereas the right child gets the interval $[p, U_q]$ (lines 10 and 13). This procedure is repeated until an RS-Forest with a predetermined number of trees $M$ is obtained.

For each incoming instance $g_t$, the density estimated by the RS-Tree $T$ is

$$ \hat{f}(g_t, T) = \frac{|\ell(g_t, T)|}{N\, V(\ell(g_t, T))} \qquad (2.21) $$

Algorithm 4: Build RS-Tree
Input: lower bound values of the attributes $L \in \mathbb{R}^d$; upper bound values of the attributes $U \in \mathbb{R}^d$; current tree depth $e$; tree depth limit $H$
Output: RS-Tree $T$
1: Function BuildTreeStructure($L$, $U$, $e$, $H$):
2:   if $e \geq H$ then
3:     return leaf($m_l \leftarrow 0$, $m_r \leftarrow 0$)
4:   else
5:     initialize a non-leaf node $root$
6:     randomly select an attribute $q \in \{1, \dots, d\}$
7:     randomly select a number $r \in [0, 1]$
8:     cut point $p \leftarrow L_q + r\,(U_q - L_q)$
9:     $t \leftarrow U_q$, $U_q \leftarrow p$
10:    $left \leftarrow$ BuildTreeStructure($L$, $U$, $e + 1$, $H$)
11:    $left.v \leftarrow r$, $left.parent \leftarrow root$
12:    $U_q \leftarrow t$, $L_q \leftarrow p$
13:    $right \leftarrow$ BuildTreeStructure($L$, $U$, $e + 1$, $H$)
14:    $right.v \leftarrow 1 - r$, $right.parent \leftarrow root$
15:    return $root$($leftChild \leftarrow left$, $rightChild \leftarrow right$, $splitAtt \leftarrow q$, $cutPoint \leftarrow p$, $m_l \leftarrow 0$, $m_r \leftarrow 0$)

where $\ell(g_t, T)$ is the termination node, $|\cdot|$ represents the number of points contained in the hyper-rectangle associated with the termination node, and $V(\cdot)$ is the volume of that hyper-rectangle. The termination node $\ell(g_t, T)$ can be either a leaf node or the first node having $|\ell(g_t, T)| \leq \zeta$, with $\zeta$ denoting a predefined node size limit. The density estimated by the RS-Forest $F$ is then:

$$ \hat{f}(g_t, F) = \frac{1}{M}\sum_{i=1}^{M} \hat{f}(g_t, T_i) \qquad (2.22) $$

Algorithm 5 describes the update of the RS-Forest. The update is performed once the window containing the incoming samples is full and each sample has received its true label, i.e., normal or anomalous. Lines 3–7 refer to anomalous samples: in this case, the algorithm traverses the tree towards the root node, decreasing the node profile of each node encountered. In contrast, lines 8–14 handle normal samples: if the termination node is not a leaf and its profile exceeds the node size limit, the node profiles are incremented from the termination node downwards along the path of the sample.


Algorithm 5: Update RS-Tree
Input: RS-Tree $T$; instances for updating the tree $G \in \mathbb{R}^{d \times \eta}$; indicator $LR$ of which profile will be updated; node size limit $\zeta$
Output: None
1: Function UpdateTree($T$, $G$, $LR$, $\zeta$):
2:   split $G$ into lists $N$ and $A$ based on the label information; $N$ ($A$) contains the indices of the normal (anomalous) instances in the latest window
3:   foreach $g$, $nodeList \in A$ do
4:     $ancestor \leftarrow nodeList[i]$
5:     while $ancestor$ exists do
6:       $LR\ ?\ ancestor.m_l{-}{-}\ :\ ancestor.m_r{-}{-}$
7:       $ancestor \leftarrow ancestor.parent$
8:   foreach $g$, $nodeList \in N$ do
9:     $LR\ ?\ n = nodeList[i].m_l\ :\ n = nodeList[i].m_r$
10:    if $n > \zeta$ and $nodeList[i]$ is not a leaf then
11:      $child \leftarrow$ NextChild($nodeList[i]$, $g$)
12:      while $child$ is not NULL do
13:        $LR\ ?\ child.m_l{+}{+}\ :\ child.m_r{+}{+}$
14:        $child \leftarrow$ NextChild($child$, $g$)
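A stripped-down sketch of the RS-Tree idea (our own simplified code: a single tree, no dual profiles or windowed updates, node counts filled from one reference batch), showing how the density estimate in (2.21) is obtained:

```python
import numpy as np

def build_rs_tree(L, U, depth, H, rng):
    """Recursively partition the box [L, U] with random axis-aligned cuts."""
    if depth >= H:
        return {"leaf": True, "count": 0, "volume": float(np.prod(U - L))}
    q = int(rng.integers(len(L)))
    p = L[q] + rng.random() * (U[q] - L[q])
    Ll, Ul = L.copy(), U.copy(); Ul[q] = p           # left child box
    Lr, Ur = L.copy(), U.copy(); Lr[q] = p           # right child box
    return {"leaf": False, "q": q, "p": p, "count": 0,
            "volume": float(np.prod(U - L)),
            "left": build_rs_tree(Ll, Ul, depth + 1, H, rng),
            "right": build_rs_tree(Lr, Ur, depth + 1, H, rng)}

def insert(node, g):
    node["count"] += 1
    if not node["leaf"]:
        insert(node["left"] if g[node["q"]] < node["p"] else node["right"], g)

def density(node, g, n_total, zeta=5):
    # terminate at a leaf or at the first node with few enough samples
    while not node["leaf"] and node["count"] > zeta:
        node = node["left"] if g[node["q"]] < node["p"] else node["right"]
    return node["count"] / (n_total * node["volume"])

rng = np.random.default_rng(0)
G_ref = rng.normal(size=(500, 2))                    # reference batch of indicators
L, U = G_ref.min(axis=0) - 0.5, G_ref.max(axis=0) + 0.5
tree = build_rs_tree(L, U, 0, H=8, rng=rng)
for g in G_ref:
    insert(tree, g)
print(density(tree, np.array([0.0, 0.0]), len(G_ref)))
```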

2.4 Dictionary Learning in Nonstationary Environments: Related Works

Over the years, many authors have proposed applications of dictionary learning techniques in different fields. In this thesis, we focus on dictionary learning in nonstationary environments and, in particular, on how such techniques react to modifications of the environment. Another point of contact with our work is the application of sparse representation techniques for detecting changes or anomalies in the environment. Works related to our algorithm can be found in the application of dictionary learning to representation learning, i.e., the field of machine learning that studies how to extract useful features from input data in order to improve tasks like classification or prediction [34]. In nonstationary environments, the probability distribution of the explanatory features for a given signal can be subject to drifts. The dictionary, then, must be able to adapt its size to account for the presence of new features explaining the incoming signals.

The Neurogenetic Online Dictionary Learning (NODL) [2] proposes a dictionary learning algorithm able to adapt to sudden changes in the environment while maintaining atoms related to the signals observed before the changes. As can be seen in Algorithm 6, NODL starts from the ODL framework and extends it with additional steps. The first addition (lines 5–10) takes care of the neurogenesis process. It computes the Pearson correlation $r(s_t, D_{t-1}x_t)$ between an incoming sample $s_t$ and its approximation $\hat{s}_t = D_{t-1}x_t$ based on the current dictionary $D_{t-1}$. If $r(\cdot)$ drops below a threshold $\gamma$, the neurogenesis is triggered, and a number of atoms proportional to the error $1 - r(\cdot)$ is added. The new atoms are randomly generated from a normal distribution, without further consideration of the incoming signals. The dictionary update phase is obtained by minimizing (2.23), which adds to (2.11) an $\ell_1/\ell_2$-regularization term and an $\ell_1$-regularization term; lines 17–19 provide the operations accounting for these additional regularizers.

$$ D_t = \operatorname*{argmin}_D\ \frac{1}{2}\operatorname{Tr}\!\left(D^T D A_t\right) - \operatorname{Tr}\!\left(D^T B_t\right) + \lambda_g \sum_{j=1}^{n}\|d_j\|_2 + \sum_{j=1}^{n}\lambda_j\|d_j\|_1 \qquad (2.23) $$

The $\ell_1/\ell_2$-regularization, proposed in [35], sets to zero atoms having norm below a threshold $\lambda_g$, given as input; this operation is necessary for removing a possible surplus of atoms produced by the neurogenesis. The $\ell_1$-regularization imposes sparsity on the atoms themselves, accounting for a plausibly sparse architecture in biological networks [2].
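To make the two NODL-specific steps concrete, here is an illustrative sketch (our code; c_k and the memory expansion shapes are assumptions) of the conditional-neurogenesis trigger and of the $\ell_1$ and group-$\ell_2$ proximal operations corresponding to lines 17–19 of Algorithm 6.

```python
import numpy as np

def neurogenesis_trigger(s, D, x, gamma=0.8, c_k=10):
    """Return the number of random atoms to add (0 if no trigger)."""
    r = np.corrcoef(s, D @ x)[0, 1]           # correlation signal vs. reconstruction
    if np.isnan(r) or r > gamma:
        return 0
    return max(1, int(round((1.0 - r) * c_k)))

def add_random_atoms(D, A, B, k_n, rng):
    """Append k_n random unit-norm atoms and zero-pad the memories A and B."""
    m, n = D.shape
    Dn = rng.standard_normal((m, k_n))
    Dn /= np.linalg.norm(Dn, axis=0)
    D = np.hstack([D, Dn])
    A = np.pad(A, ((0, k_n), (0, k_n)))
    B = np.pad(B, ((0, 0), (0, k_n)))
    return D, A, B

def sparsify_atom(u, lam_j, lam_g):
    """l1 prox (line 18) followed by the group-l2 shrinkage (line 19)."""
    v = np.sign(u) * np.maximum(np.abs(u) - lam_j, 0.0)
    norm_v = np.linalg.norm(v)
    w = v * max(1.0 - lam_g / norm_v, 0.0) if norm_v > 0 else v
    return w / max(np.linalg.norm(w), 1.0)    # project inside the unit ball
```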

The Efficient Lifelong Learning Algorithm (ELLA) [36] is an online supervised multi-task learning method that aims to learn a shared basis $L \in \mathbb{R}^{d \times k}$ for a series of tasks $\mathcal{Z}^{(1)}, \dots, \mathcal{Z}^{(T_{max})}$. This shared basis allows obtaining a sparse representation $x^{(t)} \in \mathbb{R}^k$ of an incoming instance $s^{(t)} \in \mathbb{R}^d$ of a task $\mathcal{Z}^{(t)}$; the knowledge of previously learned tasks is used to aid the learning of new ones. In this work, we consider the special case of ELLA for the ODL problem. The main differences from [1] are the supervised setting and the usage of the memory matrices $A$ and $B$: each incoming sample has a label denoting its task, and this information is used in the update of $A$ and $B$.

Algorithm 6: Neurogenetic Online Dictionary Learning
Input: signal stream $s_1, \dots, s_N \in \mathbb{R}^m$; initial dictionary $D_0 \in \mathbb{R}^{m \times n}$; $\ell_1$-norm penalty $\lambda$; group sparsity regularization $\lambda_g$; conditional neurogenesis threshold $\gamma$
Output: dictionary $D$
1: $A_0 \in \mathbb{R}^{n \times n} \leftarrow 0$, $B_0 \in \mathbb{R}^{m \times n} \leftarrow 0$
2: for $t = 1$ to $N$ do
3:   input $s_t$
4:   $x_t \leftarrow \operatorname{argmin}_x \|s_t - D_{t-1}x\|_2^2 + \lambda\|x\|_1$   // sparse representation
5:   // conditional neurogenesis
6:   if $p_c(s_t, D_{t-1}, x_t) \leq \gamma$ then
7:     $k_n = (1 - p_c(s_t, D_{t-1}, x_t))\, c_k$
8:     $D_n \leftarrow$ initializeRand($k_n$)
9:     $D_{t-1} \leftarrow [D_{t-1}\ \ D_n]$
10:    $A_{t-1} \leftarrow \begin{bmatrix} A_{t-1} & 0 \\ 0 & 0 \end{bmatrix}$, $B_{t-1} \leftarrow [B_{t-1}\ \ 0]$, $n \leftarrow n + k_n$
11:  // update memories
12:  $A_t \leftarrow A_{t-1} + x_t x_t^T$
13:  $B_t \leftarrow B_{t-1} + s_t x_t^T$
14:  repeat   // dictionary update
15:    for $j = 1$ to $n$ do
16:      $u_j \leftarrow \frac{1}{a_{jj}}(b_j - D a_j) + d_j$
17:      // atom sparsification and useless atom removal
18:      $v_j \leftarrow \mathrm{Prox}_{\lambda_j\|\cdot\|_1}(u_j) = \operatorname{sgn}(u_j)(|u_j| - \lambda_j)_+$
19:      $w_j \leftarrow v_j\left(1 - \frac{\lambda_g}{\|v_j\|_2}\right)_+$
20:      $d_j \leftarrow \frac{1}{\max(\|w_j\|_2, 1)} w_j$
21:  until convergence
22: return $D_t$


Denoting with $S^{(t)}_{n-1} \in \mathbb{R}^{d \times (n-1)}$ the set of samples related to a task $t$ encountered so far and with $X^{(t)}_{n-1} \in \mathbb{R}^{k \times (n-1)}$ the corresponding sparse representations, when a new sample $s^{(t)}_n$ arrives the algorithm performs the following operations:

1. Remove from $A$ and $B$ the information related to $S^{(t)}_{n-1}$: $A = A - X^{(t)}_{n-1}X^{(t)T}_{n-1}$, $B = B - S^{(t)}_{n-1}X^{(t)T}_{n-1}$.

2. Insert the new sample: $S^{(t)}_n = \left[S^{(t)}_{n-1}\ \ s^{(t)}_n\right]$.

3. Compute the sparse representation $X^{(t)}_n$ of $S^{(t)}_n$ w.r.t. $L$ by solving:

$$ X^{(t)}_n = \operatorname*{argmin}_X\ \|S^{(t)}_n - LX\|_2^2 + \lambda\|X\|_1 \qquad (2.24) $$

4. Update $A$ and $B$ with the new sparse representations of the samples of task $t$: $A = A + X^{(t)}_n X^{(t)T}_n$, $B = B + S^{(t)}_n X^{(t)T}_n$.

The necessity of keeping in memory the samples and their sparse representations leads to losing the efficient use of physical memory introduced by [1].

A dictionary $D$ can be used to model, in a sparse manner, the family of signals used for training. Such a model is useful for extracting lower-dimensional features from test signals, which in turn can be used to detect anomalies or changes in the generative process. For this reason, we investigate other algorithms that detect anomalies through sparse representation techniques.

In general, anomaly detection is performed by associating a cost to each test signal and marking it as anomalous if such cost is greater than a user-defined threshold $\tau$. In [37], the dictionary is obtained through dictionary selection. It starts from a given candidate feature pool $B \in \mathbb{R}^{m \times k}$ specific to video, and the selection is performed by minimizing:

$$ \operatorname*{argmin}_X\ \frac{1}{2}\|B - BX\|_F^2 + \lambda\|X\|_{2,1} \qquad (2.25) $$

The regularizer $\|\cdot\|_{2,1}$ has a grouping effect causing some rows of $X$ to be set to zero, thus yielding an initial dictionary $B_0 \in \mathbb{R}^{m \times n}$ composed of the features having a non-zero row. The dictionary is then expanded into $D = [B_0\ \ I_{m \times m}]$ to account for deformations or other alterations that may happen in a video. For performing anomaly detection, the Sparse Reconstruction Cost is proposed, formulated as:

$$ \frac{1}{2}\|s - Dx\|_2^2 + \lambda\|Wx\|_1 \qquad (2.26) $$

where $x \in \mathbb{R}^n$ is the sparse representation of the signal $s \in \mathbb{R}^m$ and $W \in \mathbb{R}^{n \times n}$ is a diagonal matrix that assigns to each atom $d_j$ a weight $w_j \in [0, 1]$, with smaller values denoting atoms that are very likely to appear in normal samples. The idea behind (2.26) is that normal samples have a lower cost than anomalous ones, due to both a lower reconstruction error and a lower weighted $\ell_1$-norm. Similar detection techniques are also present in [38, 39, 40], which additionally consider a dictionary update phase for improving the representation ability of the dictionary as new signals arrive.
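A small sketch of a detection rule built on (2.26) (illustrative code; the weight estimate from training codes and the threshold are our assumptions, and the code is computed with an unweighted LASSO before evaluating the weighted cost, a simplification of the original formulation):

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def atom_weights(X_train, eps=1e-6):
    """Atoms frequently used in normal training codes get small weights w_j in [0, 1]."""
    freq = np.count_nonzero(X_train, axis=1) / X_train.shape[1]
    return 1.0 - freq / (freq.max() + eps)

def sparse_reconstruction_cost(s, D, w, lam=0.1):
    x = sparse_encode(s[None, :], D.T, algorithm="lasso_lars", alpha=lam)[0]
    return 0.5 * np.linalg.norm(s - D @ x) ** 2 + lam * np.sum(w * np.abs(x))

def is_anomalous(s, D, w, tau, lam=0.1):
    return sparse_reconstruction_cost(s, D, w, lam) > tau
```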


[41] performs anomaly detection by proposing a different model for representing a signal $s \in \mathbb{R}^m$:

$$ s = Dx + e + v \qquad (2.27) $$

where $D \in \mathbb{R}^{m \times n}$ is trained on normal data, $x \in \mathbb{R}^n$ is the sparse representation, and $v \in \mathbb{R}^m$ is the noise component. The purpose of $e \in \mathbb{R}^m$ is to collect the signal components that cannot be sparsely approximated; in this way, anomalous signals have a higher value of $\|e\|_2$ than normal signals. For each signal, the algorithm minimizes (2.28), obtaining $x$ and $e$. The signal is detected as anomalous if $\|e\|_2$ is greater than a user-defined value $\tau$.

$$ x, e = \operatorname*{argmin}_{x,e}\ \|s - Dx - e\|_2^2 + \cdots \qquad (2.28) $$


Chapter 3

Problem Formulation

3.1 Signal model

We consider a signal $s \in \mathbb{R}^m$ generated by a stationary stochastic process $\mathcal{P}$:

$$ s \sim \mathcal{P} \qquad (3.1) $$

We assume that the signal can be modelled as a linear combination of atoms extracted from a dictionary $D$:

$$ s = Dx \qquad (3.2) $$

with $D \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$. The representation $x$ is sparse, i.e., $\|x\|_0 = l \ll n$.

3.2 Definition of nonstationary environment

We define a nonstationary environment as a stream $S$ of signals $s_i$ (3.1) that arrive over time. We assume that a process $\mathcal{P}_i$ generates $N_i$ consecutive signals, with $N_i$ unknown. Then, the generative process undergoes a change $\mathcal{P}_i \to \mathcal{P}_{i+1}$, $\mathcal{P}_i \neq \mathcal{P}_{i+1}$, which, in turn, generates an unknown number $N_{i+1}$ of signals. We can collect the signals generated by the same process in a matrix $S_i \in \mathbb{R}^{m \times N_i}$, obtaining the following definition for the stream $S$:

$$ S = \left\{ \{S_i\}_{i=1,\dots} \mid S_i \sim \mathcal{P}_i,\ S_i \in \mathbb{R}^{m \times N_i} \right\} \qquad (3.3) $$

We assume that a shift of process $\mathcal{P}_i \to \mathcal{P}_{i+1}$ implies a change of the dictionary $D$ that models the signal $s$:

$$ s = \begin{cases} D_i x, & s \sim \mathcal{P}_i \\ D_{i+1} x, & s \sim \mathcal{P}_{i+1} \end{cases} \qquad (3.4) $$

where $D_{i+1}$ consists of a subset of the atoms of $D_i \in \mathbb{R}^{m \times n}$ together with a set of new atoms $\{d_{n+1}, \dots, d_{n+k}\}$ that cannot be sparsely represented by $D_i$.
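The following sketch shows how a synthetic stream satisfying (3.3)-(3.4) can be generated for experiments (our illustrative generator, not the thesis datasets): each process uses a dictionary that shares part of its atoms with the previous one and replaces the rest.

```python
import numpy as np

def make_process_dictionary(D_prev, n_new, rng):
    """Keep a random subset of the previous atoms and append n_new fresh ones."""
    m, n = D_prev.shape
    keep = rng.choice(n, size=n - n_new, replace=False)
    D = np.hstack([D_prev[:, keep], rng.standard_normal((m, n_new))])
    return D / np.linalg.norm(D, axis=0)

def generate_stream(m=25, n=40, sparsity=3, lengths=(3000, 3000), n_new=15, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((m, n))
    D /= np.linalg.norm(D, axis=0)
    processes = []
    for N_i in lengths:                               # one entry per process P_i
        S = np.zeros((m, N_i))
        for t in range(N_i):
            idx = rng.choice(n, size=sparsity, replace=False)
            S[:, t] = D[:, idx] @ rng.standard_normal(sparsity)
        S += 0.01 * rng.standard_normal(S.shape)      # additive noise
        processes.append(S)
        D = make_process_dictionary(D, n_new, rng)    # change P_i -> P_{i+1}
    return processes
```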


3.3 Goals of this work

Our goal is to learn a dictionary $D$ from the stream of signals $S$ defined in (3.3). The learning process must be able to automatically detect changes of the generative process $\mathcal{P}_i \to \mathcal{P}_{i+1}$ while operating in a batchwise manner. To maintain the online requirement, we define a batch of signals as $S_i \in \mathbb{R}^{m \times \eta}$ with $\eta \ll N_i$, i.e., the number of signals considered at once should not exceed the window of stationarity of the process. The definition of the optimal value of $\eta$ is outside the scope of this work. As an additional constraint, the learning process must be able to decide whether $S_i$ belongs to the current process $\mathcal{P}_i$ or to a new process $\mathcal{P}_{i+1}$ based only on the previously encountered signals.

After detecting a change, the learning process must modify the size of the current dictionary $D_i$. This modification allows inserting into $D_i$ new atoms $\{d_{n+1}, \dots, d_{n+k}\}$ to sparsely represent signals coming from $\mathcal{P}_{i+1}$. Given the online nature of the algorithm, we are interested in an atom generation process capable of extracting relevant features from incoming batches and providing a rapid improvement both in approximation error and in sparsity. In order to prevent the dictionary size from steadily increasing, which might cause the issues described in Section 2.2.2, we want to keep only the atoms of $D_i$ that are useful for representing $\mathcal{P}_{i+1}$.


Chapter 4

Proposed Solution

Considering the problem presented in Section 3.3, we adopt a top-down approach, splitting the general problem into subproblems and iterating the process until reaching problems that are easier to solve. The main subproblems identified are:

• Change detection: determine a strategy for detecting changes of the generative process in the form $\mathcal{P}_i \to \mathcal{P}_{i+1}$.

• Dictionary adaptation: determine how to modify the dictionary size to represent the new process $\mathcal{P}_{i+1}$, and the strategy for generating atoms that enables a fast adaptation.

We start by describing the overall algorithm in Section 4.1; then we describe in detail how we tackle the change detection subproblem in Section 4.2 and the dictionary adaptation subproblem in Section 4.3.

4.1 Algorithm Outline

Our goal is to take the ODL framework, proven to converge under a stationary input distribution, as a baseline, and extend it to handle a nonstationary environment like the one presented in Section 3.2. Considering the sequence of batches $\{S_{t-1}, \dots, S_{t-t_0}\}$ generated by the same process $\mathcal{P}_i$, we can distinguish two cases for an incoming batch $S_t$:

1. $S_t$ is generated from the same process $\mathcal{P}_i$ as the previous batches, or

2. $S_t$ is generated from a different process $\mathcal{P}_{i+1}$, $\mathcal{P}_{i+1} \neq \mathcal{P}_i$.

In the former case, we consider the algorithm as operating in the learning state, where the dictionary learns from batches of signals generated from the same process, i.e., in stationary conditions. In this state, the algorithm models the current process $\mathcal{P}_i$ by estimating a probability density function $\hat\Phi_t$ from data-driven indicators obtained from the signals. Modeling the process with a probability density function is necessary to determine the presence of a change, by comparing the average log-likelihood between two consecutive batches through a hypothesis test.


When the algorithm detects a change, we compute the magnitude of such a change, denoted as γt, and we use it to determine how many atoms to insert and to remove. A larger magnitude is measured when a substantial portion of the atoms in the dictionary cannot represent the incoming process; hence the adaptation process will replace a larger part of the current dictionary. We divide the adaptation process into two states: pruning and adaptation.
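The exact definition of γt is given by (4.9) in Section 4.3. Purely for illustration, and assuming γt is normalized in [0, 1], the number of atoms to insert and to remove could be tied to the magnitude as in the following sketch; the proportionality rule used here is an assumption, not the formula of the thesis.

```python
import math

def atoms_to_modify(gamma, n_atoms, C, rho):
    """Illustrative mapping from the change magnitude to atom counts.

    gamma:   change magnitude, assumed here to lie in [0, 1]
    n_atoms: current dictionary size
    C:       maximum number of atoms added per iteration
    rho:     proportion of the magnitude used for removal (0 < rho < 1)
    """
    n_add = min(C, math.ceil(gamma * n_atoms))
    n_remove = math.floor(rho * gamma * n_atoms)
    return n_add, n_remove
```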

In the pruning state, we perform a selection among the current atoms, removing the ones that are not considered useful for representing the new process. The selection is achieved in the following way:

1. We assign to each atom dj a score rj that represents the error introduced in the reconstruction if that atom were discarded. A lower score indicates a less useful atom.

2. We determine a threshold τ based on the computed scores and on the magnitude of the change.

3. Atoms having rj < τ are pruned.

Then the algorithm switches to the adaptation state.
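A minimal sketch of this pruning step is given below. The score used here (the increase in batch reconstruction error when an atom's contribution is zeroed in the current codes) and the threshold rule (a quantile driven by the scaled change magnitude) are assumptions consistent with the description above; the exact definitions are given in Section 4.3.1, and the function name prune_atoms is hypothetical.

```python
import numpy as np

def prune_atoms(S, X, D, gamma, rho):
    """Remove atoms whose absence barely affects the reconstruction of S.

    S: batch of signals, shape (m, eta)
    X: sparse codes of S w.r.t. D, shape (n, eta)
    D: dictionary, shape (m, n)
    gamma, rho: change magnitude and its scaling factor
    """
    base_error = np.linalg.norm(S - D @ X)
    scores = np.empty(D.shape[1])
    for j in range(D.shape[1]):
        # error introduced if atom j were discarded (its coefficients zeroed)
        X_j = X.copy()
        X_j[j, :] = 0.0
        scores[j] = np.linalg.norm(S - D @ X_j) - base_error
    # threshold: the fraction of atoms removed grows with the scaled magnitude
    tau = np.quantile(scores, min(rho * gamma, 1.0))
    keep = scores >= tau
    return D[:, keep], X[keep, :], keep
```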

While the algorithm is in the adaptation state, we compute the magnitude of the change between the previous batch and the current one to estimate the number of atoms to insert. The algorithm then uses the incoming batches of signals to define new atoms to insert into the dictionary. During this state, we keep track of the slope of the approximation error computed over k > 1 iterations. If this slope is larger than zero, the current dictionary is still not able to provide an accurate reconstruction of the signals of the new process; hence more atoms are needed. On the other hand, when the slope is stable or negative, adding more atoms does not significantly increase the approximation ability of the dictionary, and the algorithm returns to the learning state. Figure 4.1 summarizes the different states and their transitions.
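The control flow among the three states can be rendered compactly as in the following sketch, an illustrative counterpart of Figure 4.1; DetectChange and NeedAdaptation correspond to the tests described later in this chapter, and the function name next_state is hypothetical.

```python
def next_state(state, change_detected, keep_adapting):
    """Illustrative transition function for the states of Figure 4.1."""
    if state == "learning":
        # a detected change P_i -> P_{i+1} triggers pruning
        return "pruning" if change_detected else "learning"
    if state == "pruning":
        # pruning is performed once, then adaptation starts
        return "adapting"
    if state == "adapting":
        # keep adding atoms while the MSE slope is positive
        return "adapting" if keep_adapting else "learning"
    raise ValueError(state)
```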

In the following, we provide a more detailed description of the proposed solution, summarized in Algorithm 7. It starts by resetting the memory matrices A0 and B0 and the history of the approximation error collected during the adaptation, MSE; moreover, it sets the current state to learning (lines 1–3). At each iteration t, a batch of signals St is received. The algorithm computes its sparse representation Xt and the data-driven indicators Gt considering the current dictionary Dt−1 (lines 5–7). The sparse representation Xt is computed by solving (2.8) with the LARS algorithm. In lines 8 and 26, we wait until the dictionary has processed a number of signals tη larger than the dimension m of the signals, i.e., tη > m, before estimating the distribution Φ̂t. This waiting allows the atoms of the initial dictionary D0, usually initialized randomly, to adapt their structure to the signals of the initial process P0, yielding more reliable indicators for estimating Φ̂t.

At this point (lines 9–15), the algorithm determines whether it has to change the current state, according to the following rules:

• when the current state is learning, it determines if the generative process has changed from the current process Pi to a new process Pi+1. The change detection process is explained in Section 4.2. If a change is detected, the current state goes from learning to pruning.



Algorithm 7: Online Adaptive-Size Dictionary Learning for Nonstationary Environments

Input: signal stream s1, . . . , sN ∈ Rm; initial dictionary D0 ∈ Rm×n; max number of atoms added per iteration C; ℓ1-norm penalty λ; ℓ1/ℓ2 group-sparsity penalty λg; proportion of the change magnitude for atom removal 0 < ρ < 1
Output: dictionary D

 1  A0 ← 0, B0 ← 0                                        // reset memories
 2  learning ← True, pruning ← False, adapting ← False    // control variables
 3  MSE ← 0                                               // reset the error history
 4  for t = 1 to N do
        // Collect data-driven indicators
 5      input St                                          // the current batch
 6      Xt ← argmin_X ½ ‖St − Dt−1 X‖²_2 + λ ‖X‖_1
 7      Gt ← { ( ‖s^i_t − Dt−1 x^i_t‖_2 , ‖x^i_t‖_1 ),  i ∈ {1, . . . , η} }
        // Determine the state
 8      if tη > m then
 9          if learning then
10              pruning ← DetectChange(Gt, Φ̂t−1)
11          else
12              adapting ← NeedAdaptation(St, Xt, Dt−1, MSE)
13              if not adapting then
14                  EstimatePhi(Gt)
15              learning ← not adapting
        // Dictionary adaptation
16      if pruning or adapting then
17          determine the magnitude of the change γt through (4.9)
18          if pruning then
19              Dt−1, At−1, Bt−1 ← RemoveAtoms(St, Xt, Dt−1, At−1, Bt−1, γt, ρ)
20              pruning ← False
21              adapting ← True
22          Dt−1, At−1, Bt−1 ← ExpandDictionary(St, γt, C, Dt−1, At−1, Bt−1)
23          Xt ← argmin_X ½ ‖St − Dt−1 X‖²_2 + λ ‖X‖_1
24          Gt ← { ( ‖s^i_t − Dt−1 x^i_t‖_2 , ‖x^i_t‖_1 ),  i ∈ {1, . . . , η} }
        // Model update
25      At ← At−1 + Xt Xt^T,  Bt ← Bt−1 + St Xt^T
26      if tη > m then
27          Φ̂t ← EstimatePhi(Gt)
28      else
29          Φ̂t ← UpdatePhi(Gt)
30      Dt ← UpdateDictionary(Dt−1, At, Bt, λg)
31  return Dt
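For concreteness, a possible implementation of the sparse coding and indicator-extraction steps (lines 6–7 of Algorithm 7) is sketched below. It relies on scikit-learn's sparse_encode with a LARS-based Lasso solver, which is one way of solving (2.8); the function name code_and_indicators is hypothetical, and the exact scaling of the penalty in scikit-learn may differ from the formulation in the thesis.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def code_and_indicators(S, D, lam):
    """Sparse coding of a batch and per-signal indicators (lines 6-7).

    S: batch of signals, shape (m, eta)
    D: current dictionary, shape (m, n)
    lam: l1-norm penalty lambda
    """
    # sparse_encode expects samples on rows: signals (eta, m), dictionary (n, m)
    X = sparse_encode(S.T, D.T, algorithm="lasso_lars", alpha=lam).T   # (n, eta)
    residual_norms = np.linalg.norm(S - D @ X, axis=0)    # ||s_i - D x_i||_2
    code_norms = np.abs(X).sum(axis=0)                    # ||x_i||_1
    G = np.stack([residual_norms, code_norms], axis=1)    # (eta, 2) indicators
    return X, G
```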


[Figure 4.1 about here: a state diagram with the three states learning, pruning, and adaptation; the transitions are labeled Pi (learning self-loop), Pi → Pi+1 (learning → pruning), ∆MSE/∆k > 0 (adaptation self-loop), and ∆MSE/∆k ≤ 0 (adaptation → learning).]

Figure 4.1: The different states of the algorithm and their transitions.


• when the current state is adapting, it determines if Dt−1 needs additional atoms to represent the new process Pi+1. The procedure is presented in Algorithm 8. When the slope computed from MSE is smaller than or equal to zero, we assume that adding more atoms does not contribute significantly to reducing the approximation error. Thus, the current state goes from adapting to learning, and we re-estimate the distribution Φ̂t according to Section 4.2.2.

When the current state is pruning or adapting, we perform the dictionary adaptation phase, described in Section 4.3. This phase starts by computing the change magnitude γt according to (4.9) in Section 4.3. The change magnitude γt, scaled by the input parameter ρ, is used in the pruning state to operate a selection among the atoms of the current dictionary (line 19), as described in Section 4.3.1. The removal of some atoms implies the elimination of the corresponding entries from the memory matrices At−1 and Bt−1. In line 22, we generate a number of atoms proportional to the change magnitude γt. These new atoms are inserted in Dt−1; moreover, we expand the memory matrices At−1 and Bt−1 to account for the presence of the new atoms. Then, in lines 23–24, we recompute the sparse representation Xt and the indicators relative to the modified dictionary Dt−1. Lines 25–30 are shared among all states. Here, we update A and B to include the information of the current batch, and the probability distribution Φ̂ to incorporate the current indicators Gt. The final step is the dictionary update (line 30), which we perform by optimizing the following objective function:

$$
D_t = \operatorname*{argmin}_{D} \; \frac{1}{2}\,\mathrm{Tr}\!\left(D^{\top} D A_t\right) - \mathrm{Tr}\!\left(D^{\top} B_t\right) + \lambda_g \sum_{j=1}^{n} \lVert d_j \rVert_2
\tag{4.1}
$$



Algorithm 8: Determine if the adaptation phase needs to continue

Input: current batch S ∈ Rm×η; current dictionary D ∈ Rm×n; current sparse representation X ∈ Rn×η; history of the mean squared errors collected during the adaptation phase MSE ∈ Rk
Output: True if the adaptation has to continue, False otherwise

 1  Function NeedAdaptation(S, D, X, MSE):
 2      mse ← (1/η) Σ_{i=1}^{η} ‖s_i − D x_i‖²_2
 3      MSE ← [MSE, mse],  k ← k + 1        // append the current error to the history
 4      if k > 1 then
            // compute the slope through linear regression
 5          θ ← slope(MSE)
 6          if θ ≤ 0 then
 7              MSE ← 0
 8              return False
 9          else
10              return True
11      return True
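Translated into Python, the stopping rule of Algorithm 8 could look as follows; the slope is obtained by fitting a line to the error history with np.polyfit, which is one way of realizing the linear-regression estimate required above. The function name need_adaptation and the list-based history are illustrative choices.

```python
import numpy as np

def need_adaptation(S, D, X, mse_history):
    """Return (continue_adapting, updated_history), mirroring Algorithm 8."""
    mse = np.mean(np.linalg.norm(S - D @ X, axis=0) ** 2)   # batch MSE
    mse_history = mse_history + [mse]
    if len(mse_history) > 1:
        k = np.arange(len(mse_history))
        slope, _ = np.polyfit(k, mse_history, deg=1)        # linear-regression slope
        if slope <= 0:
            # error no longer increasing: stop adapting and reset the history
            return False, []
        return True, mse_history
    return True, mse_history
```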

The optimization is performed by applying the block-coordinate descent approach summarized in Algorithm 9. Equation (4.1) is similar to (2.11), with the addition of the ℓ1/ℓ2 regularizer proposed in [35]. The purpose of this regularizer is to remove the atoms that, before normalization, have the ℓ2-norm of the residual, defined in line 5, lower than the hyper-parameter λg.
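Algorithm 9 is reported later in the thesis; as a sketch of the intended behavior only, one block-coordinate pass over the atoms could be implemented as below, following the standard ODL column update and dropping the atoms whose unnormalized update has ℓ2-norm below λg. This is a reconstruction of the behavior described above under stated assumptions, not the exact pseudocode of Algorithm 9, and the function name update_dictionary is hypothetical.

```python
import numpy as np

def update_dictionary(D, A, B, lam_g):
    """One block-coordinate descent pass on (4.1) with the l1/l2 regularizer.

    D: dictionary (m, n); A accumulates X X^T (n, n); B accumulates S X^T (m, n).
    Atoms whose unnormalized update falls below lam_g are removed.
    """
    keep = np.ones(D.shape[1], dtype=bool)
    for j in range(D.shape[1]):
        if A[j, j] == 0:
            continue   # atom never used so far: leave it untouched here
        # residual direction for atom j (standard ODL column update)
        u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
        if np.linalg.norm(u) < lam_g:
            keep[j] = False                                 # the group-sparsity penalty kills this atom
        else:
            D[:, j] = u / max(np.linalg.norm(u), 1.0)       # project onto the unit l2-ball
    return D[:, keep], A[np.ix_(keep, keep)], B[:, keep]
```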

4.2 Change Detection

A dictionary D0 ∈ Rm×n, trained on signals coming from a generative process P0, can sparsely represent, with a low approximation error, other signals coming from the same process. A change of process P0 → P1 can lead to a worsening of the dictionary performance, due to the presence of features in P1 that are not sparsely representable by D0.

Since we operate in an unsupervised setting, we need to automatically determine whether the generative process behind an incoming batch S of signals is the same as that of the previous batches. Thus, we extend the method proposed in [3], originally developed for anomaly detection purposes, to address the change detection problem. The idea is to map each signal of an incoming batch to a lower-dimensional space by means of indicators. Then, to determine whether a change has happened, we monitor the average likelihood of the indicators with respect to a probability distribution estimated from the indicators of the previous batches. These indicators depend on the dictionary that we are learning; thus, we update the distribution to include the most recent indicators.
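The exact hypothesis test is detailed in the remainder of this section; as a hedged sketch of the monitoring idea, a change could be declared when the average log-likelihood of the current batch's indicators under Φ̂ drops anomalously far below the level observed on past batches. The one-sided z-score rule and its threshold below are illustrative assumptions, and detect_change is a hypothetical name.

```python
import numpy as np

def detect_change(phi_hat, G, past_avg_ll, z_threshold=3.0):
    """Flag a change when the batch's average log-likelihood is anomalously low.

    phi_hat:     density estimated on indicators of past batches (e.g. estimate_phi)
    G:           indicators of the current batch, shape (eta, 2)
    past_avg_ll: list of average log-likelihoods of the previous batches
    """
    avg_ll = phi_hat.logpdf(G).mean()
    mu, sigma = np.mean(past_avg_ll), np.std(past_avg_ll) + 1e-12
    changed = (mu - avg_ll) / sigma > z_threshold   # one-sided test on the drop
    return changed, avg_ll
```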

