IEEE
Proof
Event Oriented Dictionary Learning
for Complex Event Detection
Yan Yan, Yi Yang, Deyu Meng, Member, IEEE, Gaowen Liu, Wei Tong,
Alexander G. Hauptmann, and Nicu Sebe, Senior Member, IEEE
Abstract— Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained Internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select semantically meaningful high-level concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose a novel strategy to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event oriented dictionary representation based on the selected semantic concepts. Toward this goal, we leverage training images (frames) of selected concepts from the Semantic Indexing dataset, with a pool of 346 concepts, in a novel supervised multi-task $\ell_p$-norm dictionary learning framework. Extensive experimental results on the TRECVID multimedia event detection dataset demonstrate the efficacy of our proposed method.

Index Terms— Complex event detection, concept selection, event oriented dictionary learning, supervised multi-task dictionary learning.
I. INTRODUCTION
Complex event detection in unconstrained videos has received much attention in the research community recently [1]–[3]. It is a retrieval task with the goal of detecting videos of a particular event in a large-scale Internet video archive, given an event-kit. An event-kit consists of example videos and text descriptions of the event. Unlike traditional
Manuscript received July 7, 2014; revised October 17, 2014 and February 11, 2015; accepted March 4, 2015. This work was supported in part by the MIUR Cluster Project Active Ageing at Home, in part by the European Commission Project xLiMe, in part by the Australian Research Council Discovery Projects, in part by the U.S. Army Research Office under Grant W911NF-13-1-0277, and in part by the National Science Foundation under Grant IIS-1251187. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gang Hua.
Y. Yan, G. Liu, and N. Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Trento 38123, Italy (e-mail: yan@disi.unitn.it; gaowen.liu@unitn.it; sebe@disi.unitn.it).
Y. Yang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology at Sydney, Sydney, NSW 2007, Australia (e-mail: yee.i.yang@gmail.com).
D. Meng is with the Institute for Information and System Sciences, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: dymeng@mail. xjtu.edu.cn).
W. Tong is with the General Motors Research and Development Center, Warren, MI 48090 USA (e-mail: tongweig@gmail.com).
A. G. Hauptmann is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: alex@cs.cmu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2015.2413294
action recognition of atomic actions from videos, such as ‘walking’ or ‘jumping’, complex event detection aims to detect more complex events such as ‘Birthday party’, ‘Changing a vehicle tire’, etc.

An event is a higher level semantic abstraction of video sequences than a concept and consists of many concepts. For example, a ‘Birthday party’ event can be described by multiple concepts, such as objects (e.g., boy, cake), actions (e.g., talking, walking) and scenes (e.g., at home, in a restaurant). A concept can be detected in a shorter video sequence or even in a single frame, but an event is usually contained in a longer video clip.

Traditional approaches for complex event detection rely on fusing the classification outputs of multiple low-level features [1], e.g., SIFT, STIP, MoSIFT [4]. Recently, representing videos using high-level features, such as concept detectors [5], appears promising for the complex event detection task. However, state-of-the-art concept detector based approaches for complex event detection have not considered which concepts should be included in the training concept list. This induces redundancy of concepts [2], [6] in the concept list used for vocabulary construction. For example, some concepts are highly unlikely to help detect a certain event; e.g., ‘cows’ or ‘football’ are not helpful to detect events like ‘Landing a fish’ or ‘Working on a sewing project’. Therefore, removing the uncorrelated concepts from the vocabulary construction tends to eliminate such redundancy and potentially boosts the complex event detection performance.

Intuitively, complex event detection is expected to be more accurate and faster when we build a specific dictionary representation for each event. In this paper, we investigate how to learn a concept-driven event oriented representation for complex event detection. There are mainly two important issues to be considered for accomplishing this goal. The first issue is which concepts should be included in the vocabulary construction of the learning framework. Since we want to learn an event oriented dictionary representation, how to properly select qualified concepts for each event in the learning framework is the key issue. This raises the problem of how to optimally select the necessary and meaningful concepts from a large pool of concepts for each event. The second issue is how to design an effective dictionary learning framework to seamlessly learn the common knowledge from both the low-level features and the high-level concept features.

To facilitate reading, we first introduce the abbreviations used in the paper. SIN stands for the Semantic Indexing dataset [7] containing 346 different categories (concepts)
Fig. 1. The event oriented dictionary learning framework.

of images, such as car, adult, etc. SIN-MED stands for
the high-level concept features using the SIN concept list, representing each MED video by a 346D feature (each dimension represents a concept).
The overview of our framework is shown in Fig. 1. Firstly, we design a novel method to automatically select semantically meaningful concepts for each MED event based on both the MED event-kit text descriptions and the SIN-MED high-level concept feature representations. Then, we leverage training samples of the selected concepts from the SIN dataset in a jointly supervised multi-task dictionary learning framework. An event specific, semantically meaningful dictionary is learned through embedding the feature representations of the original datasets (both the MED dataset and the SIN dataset) into a hidden shared subspace. We add label information to the learning framework to facilitate the event oriented dictionary learning process. Therefore, the learned sparse codes carry intrinsic discriminative information and naturally lead to effective complex event detection. Moreover, a novel $\ell_p$-norm multi-task dictionary learning is proposed to strengthen the flexibility of the traditional $\ell_1$-norm dictionary learning problem.
To summarize, the contributions of this paper are as follows:
• We propose a novel approach for concept selection and present one of the first works making a comprehensive evaluation of automatic concept selection strategies for event detection;
• We propose the event oriented dictionary learning for event detection;
• We construct a supervised multi-task dictionary learning framework which is capable of learning an event oriented dictionary via leveraging information from selected semantic concepts;
• We propose a novel $\ell_p$-norm multi-task dictionary learning framework which is more flexible than the traditional $\ell_1$-norm dictionary learning problem.
II. RELATED WORK

To highlight our research contributions, we now review the related work on (a) Event Detection, (b) Dictionary Learning and (c) Multi-task Learning.
A. Event Detection
With the success of event detection in structured videos, complex event detection from general unconstrained videos, such as those obtained from Internet video sharing web sites like YouTube, has received increasing attention in recent years. Unlike traditional action recognition from videos of atomic actions, such as ‘walking’ or ‘jumping’, complex event detection aims to detect more complex events such as ‘Birthday party’, ‘Attempting a board trick’, ‘Changing a vehicle tire’, etc. Tamrakar et al. [1] and Lan et al. [8] evaluated different low-level appearance as well as spatio-temporal features, appropriately quantized and aggregated into Bag-of-Words (BoW) descriptors, for NIST TRECVID Multimedia Event Detection. Jiang et al. [9] proposed a method for high-level and low-level feature fusion based on collective classification in three steps: training a classifier from low-level features, encoding high-level features into graphs, and diffusing the scores on the established graph to obtain the final prediction. Natarajan et al. [10] evaluated a large set of low-level audio and visual features as well as high-level information from object detection, speech and video text OCR for event detection. They combined multiple features using a multi-stage feature fusion strategy, with feature level early fusion using multiple kernel learning (MKL), score level fusion using Bayesian model combination (BayCom), and weighted average fusion using video specific weights. Tang et al. [11] tackled the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. A conditional model was trained in a max-margin framework, able to automatically discover discriminative and interesting segments of video while simultaneously achieving competitive accuracies on difficult detection and recognition tasks.

Recently, representing video in terms of multi-modal low-level features, e.g., SIFT, STIP, Dense Trajectories, Mel-Frequency Cepstral Coefficients (MFCC), Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR), combined with early or late fusion schemes, is the state of the art [12] for event detection. Despite their good performance, low-level features are incapable of capturing the inherent semantic information in an event. Comparatively, high-level concept features were shown to be promising for event detection [5]. High-level concept representation approaches have become practical nowadays due to the availability of large labeled training collections such as ImageNet and TRECVID. However, there are still few research works on how to automatically select useful concepts for event detection. Oh et al. [13] used Latent SVMs for concept weighting and Brown et al. [14] ordered concepts by their discriminative power for each event-kit query. Different from [13] and [14], we focus on learning an event oriented dictionary from the perspective of adaptation of the selected concepts.
B. Dictionary Learning
Dictionary learning (also called sparse coding) has been shown to be able to find succinct representations of stimuli and to model data vectors as a linear combination of a few elements from a dictionary. Dictionary learning has recently been applied successfully to a variety of problems in computer vision. Yang et al. [15] proposed a spatial pyramid matching approach based on SIFT sparse codes for image classification. The method used selective sparse coding instead of traditional vector quantization to extract salient properties of appearance descriptors of local image patches. Elad and Aharon [16] addressed the image denoising problem, where zero-mean white homogeneous Gaussian additive noise was to be removed from a given image. The approach taken was based on sparse and redundant representations over trained dictionaries. Using the K-SVD algorithm, the authors obtained a dictionary that described the image content effectively. For the image segmentation problem, Mairal et al. [17] proposed an energy formulation with both sparse reconstruction and class discrimination components, jointly optimized during dictionary learning. The approach improved over the state of the art in image segmentation experiments.

Different optimization algorithms have also been proposed to solve dictionary learning problems. Aharon et al. [18] proposed the K-SVD algorithm for adapting dictionaries in order to achieve sparse signal representations. K-SVD is an iterative method that alternates between sparse coding of the examples based on the current dictionary and updating the dictionary atoms to better fit the data. The update of the dictionary columns was combined with an update of the sparse representations, thereby accelerating convergence. Lee et al. [19] presented efficient sparse coding algorithms based on iteratively solving two convex optimization problems: an $\ell_1$-regularized least squares problem and an $\ell_2$-constrained least squares problem. To learn a discriminative dictionary for sparse coding, a label consistent K-SVD (LC-KSVD) algorithm was proposed in [20]. In addition to using class labels of training data, the authors also associated label information with each dictionary item (the columns of the dictionary matrix) to enforce discriminability in the sparse codes during the dictionary learning process. More specifically, a new label consistent constraint was introduced and combined with the reconstruction error and the classification error to form a unified objective function. To effectively handle very large training sets and dynamic training data changing over time, Mairal et al. [21] proposed an online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up to large datasets with millions of training samples.

However, as far as we know, there is no research work on how to learn a dictionary representation at the event level for event detection, nor on how to simultaneously leverage semantic information to learn an event oriented dictionary.
C. Multi-Task Learning
Multi-task learning [22] methods aim to simultaneously learn classification/regression models for a set of related tasks. This typically leads to better models as compared to a learner that does not account for task relationships. One way to capture the task relatedness from multiple related tasks is to constrain all models to share a common set of features. This motivates group sparsity, i.e., $\ell_{2,p}$-norm regularized learning [23]. Joint feature learning using $\ell_{2,p}$-norm regularization performs well in ideal cases. In practical applications, however, simply using $\ell_{2,p}$-norm regularization may not be effective for dealing with dirty data which may not fall into a single structure. To this end, the dirty model for multi-task learning was proposed in [24]. Another way to capture the task relationship is to constrain the models from different tasks to share a low-dimensional subspace via the trace norm [25]. The assumption that all models share a common low-dimensional subspace is too restrictive in some applications. To this end, an extension that learns incoherent sparse and low-rank patterns simultaneously was proposed in [26].

Many multi-task learning algorithms assume that all learning tasks are related. In practical applications, however, the tasks may exhibit a more sophisticated group structure, where the models of tasks from the same group are closer to each other than those from a different group. There have been many works along this line of research [27], [28], known as clustered multi-task learning (CMTL). Moreover, most multi-task learning formulations assume that all tasks are relevant, which is however not the case in many real-world applications. Robust multi-task learning (RMTL) aims at identifying irrelevant (outlier) tasks when learning from multiple tasks [29].

However, there is little work on multi-task learning applied to the dictionary learning problem. The only related theoretical work is [30], where theoretical bounds are provided on the generalization error of dictionary learning for multi-task learning and transfer learning. Multi-task learning has received considerable attention in the computer vision community and has been successfully applied to many computer vision problems, such as image classification [31], head-pose estimation [32], visual tracking [33], multi-view action recognition [34] and egocentric activity recognition [35]. However, to our knowledge, no previous multi-task learning work has considered the problem of complex event detection.

III. BUILDING AN EVENT SPECIFIC CONCEPT POOL

The concepts, which are related to objects, actions, scenes, attributes, etc., are usually the basic elements for the description of an event. Given the availability of large labeled training collections such as ImageNet and TRECVID, there exists a large pool of concept detectors for event descriptions. However, selecting important concepts is the key issue for concept vocabulary construction. For example, the event ‘Landing a fish’ is composed of concepts such as ‘adult’, ‘waterscape’, ‘outdoor’ and ‘fish’. If the concepts related to the event can be accurately selected, redundancy and unrelated information will be suppressed, potentially improving the event recognition performance. In order to select useful concepts for a specific event, we propose a novel concept selection strategy based on the combination of text and visual information provided by the event-kit descriptions, namely: (i) Text-based semantic relatedness from linguistic knowledge
Fig. 2. Linguistic-based concept selection strategy with an example of ‘E007: Changing a vehicle tire’ in the MED event-kit text description and a corresponding example video provided by NIST.
of the MED event-kit text description and (ii) Elastic-Net feature selection from the visual high-level representation.
A. Linguistic: Text-Based Semantic Relatedness
The most widely used resources in Natural Language Processing (NLP) to calculate the semantic relatedness of concepts are WordNet [36] and Wikipedia [37]. Detailed event-kit text descriptions for each MED event are provided by NIST [38]. In this paper, we explore the semantic similarity between each term in the event-kit text description and the 346 SIN visual concept names based on WordNet. Fig. 2 shows an example of the event-kit text description for ‘Changing a vehicle tire’.

Intuitively, the more information two concepts share in common, the more similar they are, and the information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. As illustrated in Fig. 2, we calculate the similarity between each term in the event-kit text descriptions and the 346 SIN visual concept names based on the similarity measurement proposed in [39]. This measurement defines the similarity of two words $w_{1i}$ and $w_{2j}$ as:

$$sim(w_{1i}, w_{2j}) = \frac{2\,\pi(lcs)}{\pi(w_{1i}) + \pi(w_{2j})}$$
where $w_{1i} \in \{$event-kit text descriptions$\}$, $i = 1, \ldots, N_{event\_kit}$, and $w_{2j} \in \{$SIN visual concept names$\}$, $j = 1, \ldots, 346$. $lcs$ denotes the lowest common subsumer of the two words in the WordNet hierarchy. $\pi$ denotes the information content of a word and is computed as $\pi(w) = -\log p(w)$, where $p(w)$ is the probability of encountering an instance of $w$ in the union of WordNet and the event-kit text descriptions. The probability $p(w) = freq(w)/N$ can be estimated from the relative frequency of $w$ and the total number of words $N$ in the union of WordNet and the event-kit text descriptions [40]. In this way, we expect to properly capture the semantic similarity between subjects (e.g., human, crowd) and objects (e.g., animal, vehicle) based on the WordNet hierarchy. Finally, we construct a 346D event-level feature vector representation for each event (each dimension corresponds to a SIN visual concept name) using the MED event-kit text description. A threshold is set (thr = 0.5 in our experiments) to select useful concepts into our final semantic concept list.

Fig. 3. Visual high-level semantic representation with Elastic-Net concept selection.
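The similarity measure above can be sketched on a toy taxonomy. This is a minimal, self-contained illustration: the hypernym links and frequency counts below are made up for the example and stand in for WordNet and the event-kit corpus, which the paper actually uses.

```python
import math

# Toy taxonomy + corpus frequencies standing in for WordNet and the event-kit
# text; all names and counts here are illustrative, not from the SIN list.
parents = {"car": "vehicle", "truck": "vehicle", "vehicle": "entity",
           "dog": "animal", "animal": "entity", "entity": None}
freq = {"car": 30, "truck": 20, "vehicle": 60, "dog": 25, "animal": 40, "entity": 200}
N = sum(freq.values())

def ic(w):
    """Information content pi(w) = -log p(w) with p(w) = freq(w) / N."""
    return -math.log(freq[w] / N)

def ancestors(w):
    chain = []
    while w is not None:
        chain.append(w)
        w = parents[w]
    return chain

def lcs(w1, w2):
    """Lowest common subsumer: lowest shared ancestor in the toy hierarchy."""
    a1 = set(ancestors(w1))
    for a in ancestors(w2):
        if a in a1:
            return a

def sim(w1, w2):
    """Lin similarity: 2 * pi(lcs) / (pi(w1) + pi(w2))."""
    return 2.0 * ic(lcs(w1, w2)) / (ic(w1) + ic(w2))

# Concepts passing the paper's threshold (thr = 0.5) would be kept.
kept = [c for c in ("truck", "dog") if sim("car", c) >= 0.5]   # -> ["truck"]
```

‘car’ and ‘truck’ share the nearby subsumer ‘vehicle’ and score high; ‘car’ and ‘dog’ only share the root ‘entity’, whose low information content pushes the score below the threshold.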
B. Visual High-Level Representation: Elastic-Net Concept Selection
Concept detectors provide a high-level semantic representation for videos with complex contents, which is beneficial for developing powerful retrieval or filtering systems for consumer media [5]. In our case, we first use the SIN dataset to train 346 semantic concept models. We adopt the approach of [41] to extract keyframes from the MED dataset. The trained 346 semantic concept models are used to predict the 346 semantic concepts present in the keyframes of MED videos. Once we have the prediction score of each concept on each keyframe, the keyframe can be represented as a 346D SIN-MED feature indicating the predicted concept probabilities. Finally, the video-level SIN-MED feature is computed as the average of the keyframe-level SIN-MED features.
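The keyframe-to-video aggregation is a simple mean over keyframe score vectors; a minimal sketch (toy shapes, 3 concepts instead of the paper's 346):

```python
import numpy as np

# Sketch: turn per-keyframe concept scores into a video-level SIN-MED feature.
# Shapes are illustrative; in the paper each keyframe carries 346 scores.
def video_level_feature(keyframe_scores: np.ndarray) -> np.ndarray:
    """keyframe_scores: (num_keyframes, num_concepts) prediction scores.
    Returns the video-level feature as the mean over keyframes."""
    return keyframe_scores.mean(axis=0)

scores = np.array([[0.9, 0.1, 0.0],
                   [0.7, 0.3, 0.2]])   # 2 keyframes, 3 concepts (toy)
feat = video_level_feature(scores)     # -> [0.8, 0.2, 0.1]
```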
To select the useful concepts for each specific event, we adopt Elastic-Net [42] concept selection, as illustrated in Fig. 3, given the intuition that the learner generally would like to choose the most representative SIN-MED feature dimensions (concepts) to differentiate events. Elastic-Net is formulated as follows:

$$\min_{u}\; \|l - Fu\|_2^2 + \alpha_1 \|u\|_1 + \alpha_2 \|u\|_2^2$$

where $l \in \{0,1\}^n$ indicates the event labels, $F \in \mathbb{R}^{n \times b}$ is the SIN-MED feature matrix ($n$ is the number of samples
Fig. 4. The correlation demonstration between SIN visual concepts. High correlations between contextually related clusters ‘G4: car’, ‘G7: nature’, ‘G11: urban-scene’ (in red) and negative correlations between contextually unrelated clusters ‘G7: nature’, ‘G8: indoor’ (in blue) can be easily observed. (Figure is best viewed in color and under zoom.)
and $b$ is the SIN-MED feature dimension) and $u \in \mathbb{R}^b$ is the parameter to be optimized. Each dimension of $u$ corresponds to one semantic concept when $F$ is the high-level SIN-MED feature. $\alpha_1$ and $\alpha_2$ are the regularization parameters. We use Elastic-Net instead of LASSO due to the high correlation between concepts in the SIN concept list [7] (see Fig. 4). While LASSO (when $\alpha_2 = 0$) tends to select only a small number of variables from a group and ignore the others, Elastic-Net is capable of automatically taking such correlation information into account by adding the quadratic term $\|u\|_2^2$ to the penalty. We can adjust the value of $\alpha_1$ to control the sparsity degree, i.e., how many semantic concepts are selected in our problem. The selected concepts are the dimensions of $u$ with non-zero values.
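The Elastic-Net objective above can be minimized by coordinate descent. The sketch below is a minimal hand-rolled solver for illustration (in practice one would use an off-the-shelf Elastic-Net implementation); the data, regularization values and iteration count are all assumptions, not the paper's settings.

```python
import numpy as np

def soft(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(F, l, a1, a2, iters=200):
    """Coordinate descent for min_u ||l - F u||^2 + a1*||u||_1 + a2*||u||_2^2."""
    n, b = F.shape
    u = np.zeros(b)
    col_sq = (F ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(b):
            r = l - F @ u + F[:, j] * u[j]           # residual excluding dim j
            u[j] = soft(F[:, j] @ r, a1 / 2) / (col_sq[j] + a2)
    return u

rng = np.random.default_rng(0)
labels = np.array([1., 1., 0., 0., 1., 0.])          # toy event labels l
F = 0.01 * rng.standard_normal((6, 5))               # toy SIN-MED features
F[:, 0] += labels                                    # concept 0 tracks the event
u = elastic_net(F, labels, a1=0.5, a2=0.1)
selected = np.flatnonzero(u)                         # non-zero dims = concepts kept
```

With the informative dimension dominating, the $\ell_1$ term zeroes out the noise dimensions and only concept 0 survives, mirroring how $\alpha_1$ controls how many concepts are selected.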
C. Union of Selected Concepts
To sum up, we form the union of the semantic concepts selected by the text-based semantic relatedness described in Section III-A and by the visual high-level semantic representation described in Section III-B as the final list of selected concepts for each MED event. All the confidence values used are normalized to 0-1 using Z-score normalization and are used for ranking the concepts. In our paper, we give equal weights to the textual and visual selection methods for the final late fusion of confidence scores. The top 10 ranked concepts are finally selected to adapt semantic information. Since we select the most important concepts from the pool, the top 10 ranked concepts are usually already enough to describe the event (see Fig. 8). To evaluate the effectiveness of our proposed strategy, we compare the selected concepts with the groundtruth (we use a human labeled concept list as the groundtruth for each MED event).
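The fusion step above can be sketched as follows. The concept names and confidence values are invented for illustration, and the snippet keeps the top 3 rather than the paper's top 10:

```python
import numpy as np

# Late fusion sketch: z-score normalize each modality's confidences, average
# with equal weights, then keep the top-ranked concepts. Toy names/scores.
def zscore(x):
    return (x - x.mean()) / x.std()

concepts = np.array(["adult", "fish", "waterscape", "kitchen", "car"])
text_conf = np.array([0.9, 0.8, 0.7, 0.1, 0.2])    # text-based relatedness
visual_conf = np.array([0.6, 0.9, 0.8, 0.2, 0.1])  # Elastic-Net scores
fused = 0.5 * zscore(text_conf) + 0.5 * zscore(visual_conf)
top = concepts[np.argsort(-fused)][:3]             # top-k (paper uses top 10)
```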
IV. EVENT ORIENTED DICTIONARY LEARNING

After we select semantically meaningful concepts for each event, we can leverage training samples of the selected concepts from the SIN dataset in a supervised multi-task dictionary learning framework. In this section, we investigate how to learn an event oriented dictionary representation. To accomplish this goal, we first propose our multi-task dictionary learning framework and then introduce its supervised setting.
A. Multi-Task Dictionary Learning
Given $K$ tasks (e.g., $K = 2$ in our case: one task is the MED dataset and the other is the subset of the SIN dataset whose samples are collected from the selected concepts for each event), each task consists of data samples denoted by $X_k = \{x_k^1, x_k^2, \ldots, x_k^{n_k}\} \in \mathbb{R}^{n_k \times d}$ $(k = 1, \ldots, K)$, where $x_k^i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $n_k$ is the number of samples in the $k$-th task. We are going to learn a shared subspace across all tasks, obtained by an orthonormal projection $W \in \mathbb{R}^{d \times s}$, where $s$ is the dimensionality of the subspace. In this learned subspace, the data distributions of all tasks should be similar to each other. Therefore, we can code all tasks together in the shared subspace and achieve better coding quality. The benefits of this strategy are: (i) we can improve each individual coding quality by transferring knowledge across all tasks; (ii) we can discover the relationship among different datasets via coding analysis. This purpose can be realized through the following optimization problem:

$$\min_{T_k, C_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2$$
$$\text{s.t.} \quad W^T W = I, \quad (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l, \quad D_{j\cdot} D_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l \qquad (1)$$
where $T_k \in \mathbb{R}^{l \times d}$ is an overcomplete dictionary ($l > d$) with $l$ prototypes for the $k$-th task, $(T_k)_{j\cdot}$ in the constraints denotes the $j$-th row of $T_k$, and $C_k \in \mathbb{R}^{n_k \times l}$ corresponds to the sparse representation coefficients of $X_k$. In the third term of Eqn. (1), $X_k$ is projected by $W$ into the subspace to explore the relationship among different tasks. $D \in \mathbb{R}^{l \times s}$ is the dictionary learned in the datasets' shared subspace. $D_{j\cdot}$ in the constraints denotes the $j$-th row of $D$. $I$ is the identity matrix. $(\cdot)^T$ denotes the transpose operator. $\lambda_1$ and $\lambda_2$ are the regularization parameters. The first constraint guarantees the learned $W$ to be orthonormal, and the second and third constraints prevent the learned dictionaries from becoming arbitrarily large. In our objective function, we learn a dictionary $T_k$ for each task $k$ and one shared dictionary $D$ among the $K$ tasks. Since one task in our model uses samples from the SIN dataset of selected semantically meaningful concepts, the shared learned dictionary $D$ is the event oriented dictionary. When $\lambda_2 = 0$, Eqn. (1) reduces to the traditional dictionary learning problem for each individual task.
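To make the roles of the variables in Eqn. (1) concrete, the sketch below evaluates the objective for $K = 2$ toy tasks. All shapes and values are illustrative assumptions; in practice $T_k$, $C_k$, $W$ and $D$ are produced by the alternating solver, not drawn at random.

```python
import numpy as np

# Toy evaluation of the Eqn (1) objective for K = 2 tasks (MED + SIN subset).
rng = np.random.default_rng(0)
d, s, l_atoms = 8, 3, 12                 # feature dim, subspace dim, #atoms
n = {0: 5, 1: 6}                         # samples per task
X = {k: rng.standard_normal((n[k], d)) for k in n}
T = {k: rng.standard_normal((l_atoms, d)) for k in n}   # per-task dictionaries T_k
C = {k: rng.standard_normal((n[k], l_atoms)) for k in n}  # sparse codes C_k
W = np.linalg.qr(rng.standard_normal((d, s)))[0]        # orthonormal W (W^T W = I)
D = rng.standard_normal((l_atoms, s))                   # shared dictionary D
lam1, lam2 = 0.1, 1.0

def objective():
    val = 0.0
    for k in n:
        val += np.linalg.norm(X[k] - C[k] @ T[k], "fro") ** 2          # reconstruction
        val += lam1 * np.abs(C[k]).sum()                               # l1 sparsity
        val += lam2 * np.linalg.norm(X[k] @ W - C[k] @ D, "fro") ** 2  # shared subspace
    return val
```

Taking $W$ from a QR factorization enforces the first constraint $W^T W = I$ by construction.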
B. Supervised Multi-Task Dictionary Learning
It is well known that the traditional dictionary learning framework is not directly applicable to classification: the learned dictionary has merely been used for signal reconstruction [17]. To circumvent this problem, researchers have developed several algorithms to learn a classification-oriented dictionary in a supervised fashion by exploiting the label information. In this subsection, we extend our proposed multi-task dictionary learning of Eqn. (1) to be suitable for event detection.
Assuming that the $k$-th task has $m_k$ classes, the label information of the $k$-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ $(k = 1, \ldots, K)$, with $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the $k$-th task classifier. Inspired by [43], we consider the following optimization problem:

$$\min_{T_k, C_k, \Theta_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$
$$\text{s.t.} \quad W^T W = I, \quad (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l, \quad D_{j\cdot} D_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l \qquad (2)$$
Compared with Eqn. (1), we add the last term in Eqn. (2) to enforce that the model involves discriminative information for classification. This objective function can simultaneously achieve a desired dictionary with good representation power and support optimal discrimination of the classes in the multi-task setting.
1) Optimization: To solve the proposed objective of Eqn. (2), we adopt an alternating minimization algorithm, optimizing with respect to $D$, $T_k$, $C_k$, $\Theta_k$ and $W$ respectively in five steps as follows:
Step 1 (Fixing $T_k$, $C_k$, $W$, $\Theta_k$, Optimize $D$): If we stack $X = [X_1^T, \ldots, X_K^T]^T$ and $C = [C_1^T, \ldots, C_K^T]^T$, Eqn. (2) is equivalent to:

$$\min_{D} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{D} \|XW - CD\|_F^2 \quad \text{s.t.}\; D_{j\cdot} D_{j\cdot}^T \le 1,\; \forall j = 1, \ldots, l$$

This is equivalent to the dictionary update stage in a traditional dictionary learning algorithm. We adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.
Step 2 (Fixing $D$, $C_k$, $W$, $\Theta_k$, Optimize $T_k$): Eqn. (2) is equivalent to:

$$\min_{T_k} \|X_k - C_k T_k\|_F^2 \quad \text{s.t.}\; (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1,\; \forall j = 1, \ldots, l$$

This is also equivalent to the dictionary update stage in traditional dictionary learning, solved for each of the $K$ tasks. We again adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.
Step 3 (Fixing $T_k$, $W$, $D$, $\Theta_k$, Optimize $C_k$): Eqn. (2) is equivalent to:

$$\min_{C_k} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$

This formulation can be decoupled into $(n_1 + n_2 + \cdots + n_K)$ distinct problems:

$$\min_{c_k^i} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \left( \|x_k^i - c_k^i T_k\|_2^2 + \lambda_1 \|c_k^i\|_1 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2 \right)$$

We adopt the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [44] to solve this problem. FISTA solves optimization problems of the form $\min_{\mu} f(\mu) + r(\mu)$, where $f(\mu)$ is convex and smooth and $r(\mu)$ is convex but non-smooth. We adopt FISTA since it is a popular tool for solving many convex smooth/non-smooth problems and its effectiveness has been verified in many applications. In our setting, the smooth part is $f(c_k^i) = \|x_k^i - c_k^i T_k\|_2^2 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2$ and the non-smooth part is $r(c_k^i) = \lambda_1 \|c_k^i\|_1$.
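The FISTA step for a single code vector can be sketched as follows. Everything here is a toy instance of the Step 3 split (shapes, random data, regularization values and the iteration count are illustrative assumptions); the Lipschitz constant is taken from the quadratic's Hessian rather than estimated by line search.

```python
import numpy as np

# FISTA sketch for one sparse code c (a row vector), mirroring Step 3:
# smooth f(c) = ||x - cT||^2 + lam2*||xW - cD||^2 + lam3*||y - c Th||^2,
# non-smooth r(c) = lam1*||c||_1.
rng = np.random.default_rng(1)
d, s, m, l_atoms = 6, 2, 2, 10
x = rng.standard_normal((1, d))                 # one sample's feature vector
y = np.array([[1.0, 0.0]])                      # its one-hot label
T = rng.standard_normal((l_atoms, d))           # task dictionary T_k
D = rng.standard_normal((l_atoms, s))           # shared dictionary
Th = rng.standard_normal((l_atoms, m))          # task classifier Theta_k
W = np.linalg.qr(rng.standard_normal((d, s)))[0]
lam1, lam2, lam3 = 0.2, 0.5, 0.5

def f(c):                                       # smooth part
    return (np.sum((x - c @ T) ** 2)
            + lam2 * np.sum((x @ W - c @ D) ** 2)
            + lam3 * np.sum((y - c @ Th) ** 2))

def grad_f(c):
    return (2 * (c @ T - x) @ T.T
            + 2 * lam2 * (c @ D - x @ W) @ D.T
            + 2 * lam3 * (c @ Th - y) @ Th.T)

# Lipschitz constant of grad_f: largest eigenvalue of the quadratic's Hessian.
L = np.linalg.eigvalsh(2 * (T @ T.T + lam2 * D @ D.T + lam3 * Th @ Th.T)).max()

def soft(z, t):                                 # prox of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

c = np.zeros((1, l_atoms)); z = c.copy(); t = 1.0
start_obj = f(c) + lam1 * np.abs(c).sum()
for _ in range(200):
    c_new = soft(z - grad_f(z) / L, lam1 / L)   # proximal gradient step
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    z = c_new + ((t - 1) / t_new) * (c_new - c) # momentum extrapolation
    c, t = c_new, t_new
final_obj = f(c) + lam1 * np.abs(c).sum()
```

The soft-thresholding step is exactly the proximal operator of the $\ell_1$ term, which is why FISTA fits the smooth/non-smooth split described above.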
Step 4 (Fixing $D$, $C_k$, $W$, $T_k$, Optimize $\Theta_k$): Eqn. (2) is equivalent to:

$$\min_{\Theta_k} \|Y_k - C_k \Theta_k\|_F^2$$

Setting $\partial/\partial \Theta_k = 0$, we obtain $\Theta_k = (C_k^T C_k)^{-1} C_k^T Y_k$.
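The closed-form update in Step 4 is an ordinary least-squares solve. A minimal sketch (toy shapes; it assumes $C_k^T C_k$ is invertible, i.e., $n_k \ge l$ with full column rank):

```python
import numpy as np

# Closed-form Theta_k update: Theta = (C^T C)^{-1} C^T Y, done with a linear
# solve instead of an explicit matrix inverse. Shapes are illustrative.
rng = np.random.default_rng(2)
n_k, l_atoms, m_k = 20, 5, 3
C = rng.standard_normal((n_k, l_atoms))          # sparse codes C_k
Y = rng.standard_normal((n_k, m_k))              # label matrix Y_k
Theta = np.linalg.solve(C.T @ C, C.T @ Y)

# Theta is the least-squares minimizer, so the gradient C^T (C Theta - Y) = 0:
grad = C.T @ (C @ Theta - Y)
```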
Step 5 (Fixing T_k, C_k, D, Θ_k, optimize W): If we stack X = [X_1^T, ..., X_K^T]^T and C = [C_1^T, ..., C_K^T]^T, Eqn. (2) is equivalent to:

$$\min_{W} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{W} \|XW - CD\|_F^2 \quad \text{s.t. } W^T W = I$$

Substituting D = (C^T C)^{-1} C^T X W back into the above function, we obtain

$$\min_{W} \|(I - C(C^T C)^{-1} C^T) X W\|_F^2 = \min_{W} \operatorname{tr}\!\big(W^T X^T (I - C(C^T C)^{-1} C^T) X W\big) \quad \text{s.t. } W^T W = I$$

The optimal W is composed of the eigenvectors of the matrix X^T (I - C(C^T C)^{-1} C^T) X corresponding to the s smallest eigenvalues.
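A sketch of this W update (illustrative names; assumes C^T C is invertible):

```python
import numpy as np

def update_W(X, C, s):
    """W = eigenvectors of M = X^T (I - C (C^T C)^{-1} C^T) X for the s
    smallest eigenvalues. The projector removes the part of X that is
    already explained by the codes C."""
    n = X.shape[0]
    P = np.eye(n) - C @ np.linalg.solve(C.T @ C, C.T)  # residual projector
    M = X.T @ P @ X
    M = (M + M.T) / 2.0              # symmetrize against round-off for eigh
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, :s]                  # s smallest -> orthonormal W
```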
We summarize our algorithm for solving Eqn. (2) as Algorithm 1.
Algorithm 1 Supervised Multi-Task Dictionary Learning
C. Supervised Multi-Task ℓp-Norm Dictionary Learning
In the literature it has been shown that non-convex ℓp-norm minimization (0 ≤ p < 1) can often yield better results than convex ℓ1-norm minimization. Inspired by this, we extend our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model.
Assuming that the k-th task has $m_k$ classes, the label information of the k-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ ($k = 1, \ldots, K$), where $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the k-th task classifier. We formulate our supervised multi-task ℓp-norm dictionary learning problem as follows:

$$\begin{aligned} \min_{T_k, C_k, \Theta_k, W, D} \ & \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_p^p + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2 \\ \text{s.t.} \ & W^T W = I, \quad (T_k)_{j\cdot}(T_k)_{j\cdot}^T \le 1, \quad D_{j\cdot} D_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \end{aligned} \tag{3}$$
Compared with Eqn. (2), we replace the traditional sparse coding ℓ1-norm term $\|C_k\|_1$ with the more flexible ℓp-norm term $\|C_k\|_p^p$. Since we can adjust the value of p (0 ≤ p < 1) in our framework, our algorithm can more flexibly control the sparseness of the feature representation, usually resulting in better performance than traditional ℓ1-norm sparse coding.
To solve the proposed problem of Eqn. (3), we adopt the alternating minimization algorithm to optimize it with respect to D, T_k, C_k, Θ_k, W respectively. The update rules for D, T_k, Θ_k, W are the same as for Eqn. (2); the only difference is in the update rule for C_k. Various algorithms have been proposed for ℓp-norm non-convex sparse coding [45]–[47]. In this paper, we adopt the Generalized Iterated Shrinkage Algorithm (GISA) [48] to solve the proposed problem.
Algorithm 2 Supervised Multi-Task ℓp-Norm Dictionary Learning
TABLE I
18 EVENTS OF MED10 AND MED11
We summarize our algorithm for solving Eqn. (3) as Algorithm 2.
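The core of GISA is a generalized soft-thresholding (GST) step for the scalar problem min_x (1/2)(x − y)² + λ|x|^p, applied elementwise. The sketch below follows our reading of [48] (threshold formula plus a fixed-point iteration); treat it as an illustration rather than a faithful reimplementation:

```python
import numpy as np

def gst(y, lam, p, n_iter=10):
    """Generalized soft-thresholding, the inner solver of GISA [48], for
    min_x 0.5*(x - y)^2 + lam*|x|^p applied elementwise to the array y."""
    y = np.asarray(y, dtype=float)
    # threshold below which the minimizer is exactly zero
    tau = (2.0 * lam * (1.0 - p)) ** (1.0 / (2.0 - p)) \
        + lam * p * (2.0 * lam * (1.0 - p)) ** ((p - 1.0) / (2.0 - p))
    x = np.abs(y).astype(float)
    for _ in range(n_iter):  # fixed-point iteration x <- |y| - lam*p*x^(p-1)
        x = np.abs(y) - lam * p * np.maximum(x, 1e-12) ** (p - 1.0)
        x = np.maximum(x, 0.0)
    x[np.abs(y) <= tau] = 0.0
    return np.sign(y) * x
```

For p = 1 the operator reduces to ordinary soft thresholding, and for p = 0 to hard thresholding, which matches the flexibility argument above.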
After the optimized Θ_k is obtained, the final classification of a test video can be obtained from its sparse coefficient $c_k^i$, which carries the discriminative information. We can simply apply the linear classifier $c_k^i \Theta_k$ to obtain the predicted score of the video.
V. EXPERIMENTS
In this section, we conduct extensive experiments to test our proposed method using large-scale real-world datasets.
A. Datasets
The TRECVID MED10 (P001–P003) and MED11 (E001–E015) datasets are used in our experiments. The datasets consist of 9746 videos from 18 events of interest, with 100–200 examples per event; the rest of the videos belong to the background class. The details are listed in Table I. The TRECVID Semantic Indexing Task (SIN) [7] contains annotations for 346 semantic concepts on 400,000 keyframes from web videos. The 346 concepts are related to objects, actions,
TABLE II
15 GROUPS OF SIN 346 VISUAL CONCEPTS (THE NUMBER OF CONCEPTS FOR EACH GROUP IS IN PARENTHESES)
scenes, attributes and non-visual concepts, which are all basic elements of an event, e.g., kitchen, boy, girl, bus. For the sake of better understanding and easier concept selection, we manually divide the 346 visual concepts into 15 groups, which are listed in Table II.
B. Evaluation Metrics
1) Average Precision (AP): a measure combining recall and precision for ranked retrieval results. The average precision is the mean of the precision scores taken after each relevant sample is retrieved. A higher number indicates better performance.
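The AP definition above can be sketched as follows (illustrative code, not from the paper):

```python
import numpy as np

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each relevant
    (positive) sample in the score-sorted list."""
    order = np.argsort(-np.asarray(scores))          # descending by score
    hits = np.asarray(labels)[order] == 1
    ranks = np.flatnonzero(hits) + 1                 # 1-based ranks of positives
    precisions = np.cumsum(hits)[ranks - 1] / ranks  # precision at each hit
    return precisions.mean() if len(ranks) else 0.0
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 0] give precisions 1/1 and 2/3 at the two positives, so AP = 5/6.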
2) PMiss@TER = 12.5: an official evaluation metric for event detection defined by NIST [38]. It is the probability of Missed Detection at the operating point where the ratio between the probability of Missed Detection and the probability of False Alarm is 12.5:1. A lower number indicates better performance.
3) Normalized Detection Cost (NDC): an official evaluation metric for event detection defined by NIST [38]. It is a weighted linear combination of the system's Missed Detection and False Alarm probabilities. NDC measures the performance of a detection system in the context of an application profile, using error-rate estimates calculated on a test set. A lower number indicates better performance.
C. Experiment Settings and Comparison Methods
There are 3104 videos used for training and 6642 videos used for testing in our experiments. We use three representative features: SIFT, Color SIFT (CSIFT) and Motion SIFT (MOSIFT) [4]. SIFT and CSIFT describe the gradient and color information of images, while MOSIFT describes both the optical flow and gradient information of video clips. Finally, 768-D SIFT-BoW, CSIFT-BoW and MOSIFT-BoW features are extracted to represent each video. We set the regularization parameters in the range {0.01, 0.1, 1, 10, 100}. The subspace dimensionality s is set by grid search over {200, 400, 600}. For the experiments in this paper, we try three dictionary sizes: {768, 1024, 1280}.
To evaluate the multi-task ℓp-norm dictionary learning algorithm, the parameter p is tuned in the range {0.2, 0.4, 0.6, 0.8, 1}. We compare our proposed event oriented dictionary learning method with the following important baselines:
• Support Vector Machine (SVM): SVM has been widely used by several research groups for MED and has shown its robustness [8], [49], [50], so we use it as one of the comparison algorithms (RBF, Histogram Intersection and χ² kernels are used respectively);
Fig. 5. Example results of the semantic concept selection proposed in Section III on the events (top left) Attempting board trick, (top right) Feeding animal, (bottom left) Flash mob gathering, (bottom right) Making a sandwich. The font size reflects the ranking value of a concept (the larger the font, the higher the value).
TABLE III
AP PERFORMANCE FOR EACH MED EVENT USING TEXT (T), VISUAL (V) AND TEXT + VISUAL (T + V) INFORMATION FOR CONCEPT SELECTION. THE LAST COLUMN SHOWS THE NUMBER OF CONCEPTS IN THE TOP 10 THAT COINCIDE WITH THE GROUND TRUTH
• Single Task Supervised Dictionary Learning (ST-SDL): performing supervised dictionary learning on each task separately;
• Pooling Tasks Supervised Dictionary Learning (PT-SDL): performing single-task supervised dictionary learning by simply aggregating data from all tasks;
• Multiple Kernel Transfer Learning (MKTL) [51]: a method incorporating prior features into a multiple kernel learning framework. We use the code provided by the authors¹;
• Dirty Model Multi-Task Learning (DMMTL) [24]: a state-of-the-art multi-task learning method imposing ℓ1/ℓq-norm regularization. We use the code provided by the MALSAR toolbox²;
• Multiple Kernel Learning Latent Variable Approach (MKLLVA) [52]: a multiple kernel learning latent variable approach for complex video event detection;
• Random Concept Selection Strategy (RCSS): performing our proposed supervised multi-task dictionary learning
¹ http://homes.esat.kuleuven.be/~ttommasi/source_code_ICCV11.html
² http://www.public.asu.edu/~jye02/Software/MALSAR/
TABLE IV
COMPARISON OF AVERAGE DETECTION ACCURACY OF DIFFERENT METHODS FOR THE SIFT FEATURE. BEST RESULTS ARE HIGHLIGHTED IN BOLD
Fig. 6. Comparison of AP performance of different methods for each MED event. (Figure is best viewed in color and under zoom).
without involving the concept selection strategy (leveraging random samples);
• Simple Concept Selection Strategy (SCSS): each concept is independently used as a predictor for each event, yielding a certain average precision for each event. The concepts are ranked according to the AP measurement for each event, and the top 10 concepts are retained.
D. Experimental Results
We first calculate the covariance matrix of the 346 SIN concepts, shown in Fig. 4, where the visual concepts are grouped and sorted as in Table II. As shown in Fig. 4, the covariance matrix exhibits high within-cluster correlations along the diagonal and also relatively high correlations between contextually related clusters (in red), such as 'G4: car', 'G7: nature' and 'G11: urban-scene'. One can also easily observe negative correlations between contextually unrelated clusters (in blue), such as 'G7: nature' and 'G8: indoor'. This gives us the intuition that visual concept co-occurrence exists. Therefore, removing redundant concepts and selecting related concepts for each event is expected to be helpful for the event detection task.
Fig. 5 shows the results of our concept selection strategy for the events 'Attempting board trick', 'Feeding animal', 'Flash mob gathering' and 'Making a sandwich'. From Fig. 5, we can observe that the selected concepts are reasonably consistent with human selections.
To better assess the effectiveness of our proposed concept selection strategy, we compare our selected top 10 concepts with the ground truth (we use the ranking list of human-labeled concepts as the ground truth for each MED event). The results are listed in the last column of Table III, showing the number of concepts in the top 10 that coincide with the ground truth. The AP performance for event detection based on text information, visual information and their combination is also shown in Table III. The benefit of using both text and visual information for concept selection can be concluded from Table III.
Table IV shows the average detection results over the 18 MED events for the different comparison methods. We have the following observations: (1) Comparing ST-SDL with SVM, we observe that performing supervised dictionary learning is better than SVM, which shows the effectiveness of dictionary learning for MED. (2) Comparing PT-SDL with ST-SDL, leveraging knowledge from the SIN dataset improves the performance for MED. (3) Our concept selection strategy for semantic dictionary learning performs the best for MED among all the comparison methods. (4) Our proposed method outperforms SVM by 8%, 6% and 10% with respect to AP, PMiss@TER=12.5 and MinNDC, respectively. (5) Considering the difficulty of the MED dataset and the typically low AP performance on MED, the absolute 8% AP improvement is very significant.
Fig. 6 shows the AP results for each MED event. Our proposed method achieves the best performance on 13 out of the 18 events. It is also interesting to notice that the larger improvements in Fig. 6, such as 'E004: Wedding ceremony', 'E005: Working on a woodworking project' and 'E009: Getting a vehicle unstuck', usually correspond to a higher number of selected concepts coinciding with the ground truth. This gives us evidence of the effectiveness of our proposed automatic concept selection strategy.
Fig. 7 illustrates the MAP performance of the different methods based on the SIFT, CSIFT and MOSIFT features. It can be easily observed that our proposed supervised multi-task dictionary learning with our concept selection strategy outperforms SVM by more than 8%.
Moreover, we evaluate our proposed method with respect to different numbers of selected concepts, dictionary sizes and
Fig. 7. Comparison of MAP performance of different methods for different types of features.
Fig. 8. MAP performance variation with respect to the number of selected concepts.
Fig. 9. MAP performance variation with respect to (left) different dictionary size; (right) different subspace dimensionality.
different subspace dimensionality settings based on the SIFT feature. Fig. 8 shows that the proposed method achieves the best MAP results when the number of selected concepts is 10. Fig. 9 (left) shows that the proposed method achieves the best MAP results when the dictionary size is 1024; a dictionary that is too large or too small tends to hamper the performance. Fig. 9 (right) shows that the best MAP result is achieved when the subspace dimensionality is 400 (dictionary size = 1024); a subspace dimensionality that is too large or too small also degrades the performance.
Finally, we also study the parameter sensitivity of the proposed method in Fig. 10. Here, we fix λ3 = 1 (fixing the contribution of the discriminative information) and p = 0.6, and analyze the regularization parameters λ1 and λ2. As shown in Fig. 10 (left), we observe that the proposed method is more sensitive to λ2 than to λ1, which demonstrates the importance of the subspace for multi-task dictionary learning. Moreover, to understand the influence of the parameter p on our proposed supervised ℓp-norm dictionary learning algorithm, we also perform an experiment on its sensitivity. Fig. 10 (right) demonstrates that the best performance of the supervised ℓp-norm dictionary learning algorithm is achieved when p = 0.6: more than 2% MAP improvement can be obtained by adopting the ℓp-norm model compared with the fixed ℓ1-norm model. This shows the suboptimality of traditional ℓ1-norm sparse coding compared with the flexible ℓp-norm sparse coding.
Fig. 10. Sensitivity study of parameters on (left) λ1 and λ2 with fixed p and λ3; (right) p and λ3 with fixed λ1 and λ2.
Fig. 11. Convergence rate study on (left) supervised multi-task dictionary learning and (right) supervised multi-task ℓp-norm dictionary learning.
The proposed iterative approaches monotonically decrease the objective function values of Eqn. (2) and Eqn. (3) until convergence. Fig. 11 shows the convergence rate curves of our algorithms. It can be observed that the objective function values converge quickly: our approaches usually converge within at most 6 iterations for supervised multi-task dictionary learning and 7 iterations for supervised multi-task ℓp-norm dictionary learning (precision = 10⁻⁶). Regarding the computational cost of our proposed algorithm for supervised multi-task ℓp-norm dictionary learning, we train our model on the TRECVID MED dataset in 5 hours with cross-validation on a workstation with Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz × 17 processors. Our proposed event oriented dictionary learning approach can be easily parallelized on multi-core computers due to its 'event oriented' nature. This means that our algorithms would be scalable for large-scale problems.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have first investigated the possibility of automatically selecting semantically meaningful concepts for complex event detection based on both the MED events-kit text descriptions and the high-level concept feature descriptions. Then, we learn an event oriented dictionary representation based on the selected semantic concepts. To this aim, we leverage training samples of the selected concepts from the SIN dataset in a novel jointly supervised multi-task dictionary learning framework.
Extensive experimental results on the TRECVID MED dataset showed that our proposed method outperformed several important baseline methods. Our proposed method outperformed SVM by up to 8% MAP, which showed the effectiveness of dictionary learning for TRECVID MED. More than 6% and 3% MAP improvements were achieved compared with ST-SDL and PT-SDL respectively, which showed the advantage of the multi-task setting in our proposed framework. To show the benefit of the concept selection strategy, we compared RCSS to our method and showed that RCSS achieves 4% less MAP.
For some sparse coding problems, non-convex ℓp-norm minimization (0 ≤ p < 1) can often obtain better results than convex ℓ1-norm minimization. Inspired by this, we extended our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model. We evaluated the influence of the ℓp-norm parameter p in our proposed problem and found that more than 2% MAP improvement can be achieved by adopting the more flexible ℓp-norm model compared with the fixed ℓ1-norm model.
Overall, the proposed multi-task dictionary learning solutions are novel in the context of complex event detection, which is a relevant and important research problem in applications such as image and video understanding and surveillance. Future research involves (i) integration of knowledge from multiple sources (video, audio, text) and incorporation of kernel learning in our framework, and (ii) the use of deep structures instead of a shallow single-layer model in the proposed problem, since deep learning has achieved great success in many different fields of image processing and computer vision.
ACKNOWLEDGMENT
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ARO, the National Science Foundation or the U.S. Government.
REFERENCES
[1] A. Tamrakar et al., "Evaluation of low-level features and their combinations for complex event detection in open source videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3681–3688.
[2] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann, "Complex event detection via multi-source video attributes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2627–2633.
[3] Y. Yan, Y. Yang, H. Shen, D. Meng, G. Liu, N. Sebe, and A. G. Hauptmann, "Complex event detection via event oriented dictionary learning," in Proc. AAAI Conf. Artif. Intell., 2015.
[4] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-09-161, 2009.
[5] C. G. M. Snoek and A. W. M. Smeulders, "Visual-concept search solved?" Computer, vol. 43, no. 6, pp. 76–78, Jun. 2010.
[6] C. Sun and R. Nevatia, "ACTIVE: Activity concept transitions in video event classification," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 913–920.
[7] SIN. (2013). NIST TRECVID Semantic Indexing. [Online]. Available: http://www-nlpir.nist.gov/projects/tv2013/tv2013.html#sin
[8] Z.-Z. Lan et al., "Informedia E-lamp @ TRECVID 2013 multimedia event detection and recounting (MED and MER)," in Proc. NIST TRECVID Workshop, 2013.
[9] L. Jiang, A. G. Hauptmann, and G. Xiang, "Leveraging high-level and low-level features for multimedia event detection," in Proc. ACM Int. Conf. Multimedia, 2012, pp. 449–458.
[10] P. Natarajan et al., "Multimodal feature fusion for robust event detection in web videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1298–1305.
[11] K. Tang, D. Koller, and L. Fei-Fei, "Learning latent temporal structure for complex event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1250–1257.
[12] Z.-Z. Lan, L. Bao, S.-I. Yu, W. Liu, and A. G. Hauptmann, "Double fusion for multimedia event detection," in Proc. Int. Conf. Multimedia Modeling, 2012, pp. 173–185.
[13] S. Oh et al., "Multimedia event detection with multimodal feature fusion and temporal concept localization," Mach. Vis. Appl., vol. 25, no. 1, pp. 49–69, Jan. 2014.
[14] L. Brown et al., "IBM Research and Columbia University TRECVID-2013 multimedia event detection (MED), multimedia event recounting (MER), surveillance event detection (SED), and semantic indexing (SIN) systems," in Proc. NIST TRECVID Workshop, 2013.
[15] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[16] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[18] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[19] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[20] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1697–1704.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 689–696.
[22] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 109–117.
[23] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2007, pp. 1–8.
[24] A. Jalali, P. K. Ravikumar, S. Sanghavi, and C. Ruan, "A dirty model for multi-task learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2010.
[25] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proc. Int. Conf. Mach. Learn., 2009, pp. 457–464.
[26] J. Chen, J. Liu, and J. Ye, "Learning incoherent sparse and low-rank patterns from multiple tasks," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, p. 22.
[27] L. Jacob, F. R. Bach, and J.-P. Vert, "Clustered multi-task learning: A convex formulation," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2008.
[28] J. Zhou, J. Chen, and J. Ye, "Clustered multi-task learning via alternating structure optimization," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2011, pp. 702–710.
[29] J. Chen, J. Zhou, and J. Ye, "Integrating low-rank and group-sparse structures for robust multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 42–50.
[30] A. Maurer, M. Pontil, and B. Romera-Paredes, "Sparse coding for multitask and transfer learning," in Proc. Int. Conf. Mach. Learn., 2013, pp. 343–351.
[31] X.-T. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.
[32] Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe, "No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1177–1184.
[33] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2042–2049.
[34] Y. Yan, E. Ricci, R. Subramanian, G. Liu, and N. Sebe, "Multitask linear discriminant analysis for view invariant action recognition," IEEE Trans.
[35] Y. Yan, E. Ricci, G. Liu, and N. Sebe, "Recognizing daily activities from first-person videos with multi-task clustering," in Proc. Asian Conf. Comput. Vis., 2014, pp. 1–16.
[36] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, 1998.
[37] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proc. AAAI Conf. Artif. Intell., 2006, pp. 1–6.
[38] NIST. (2013). NIST TRECVID Multimedia Event Detection. [Online]. Available: http://www.nist.gov/itl/iad/mig/med13.cfm
[39] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. Mach. Learn., 1998, pp. 296–304.
[40] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 448–453.
[41] J. Luo, C. Papin, and K. Costello, "Towards extracting semantically meaningful key frames from personal video clips: From humans to computers," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 289–301, Feb. 2009.
[42] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc., Ser. B, vol. 67, no. 2, pp. 301–320, Apr. 2005.
[43] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2691–2698.
[44] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[45] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. Signal Process., vol. 45, no. 3, pp. 600–616, Mar. 1997.
[46] Y. She, "Thresholding-based iterative selection procedures for model selection and shrinkage," Electron. J. Statist., vol. 3, pp. 384–415, Nov. 2009.
[47] D. Krishnan and R. Fergus, "Fast image deconvolution using hyper-Laplacian priors," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2009, pp. 1033–1041.
[48] W. Zuo, D. Meng, L. Zhang, X. Feng, and D. Zhang, "A generalized iterated shrinkage algorithm for non-convex sparse coding," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 217–224.
[49] L. Cao et al., "IBM Research and Columbia University TRECVID-2011 multimedia event detection (MED) system," in Proc. NIST TRECVID Workshop, Dec. 2011.
[50] D. Oneata et al., "AXES at TRECVid 2012: KIS, INS, and MED," in Proc. NIST TRECVID Workshop, 2012.
[51] L. Jie, T. Tommasi, and B. Caputo, "Multiclass transfer learning from unconstrained priors," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1863–1870.
[52] A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim, "Compositional models for video event detection: A multiple kernel learning latent variable approach," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1185–1192.
Yan Yan received the Ph.D. degree from the University of Trento in 2014. He is currently a Post-Doctoral Researcher with the MHUG Group, University of Trento. His research interests include machine learning and its application to computer vision and multimedia analysis.
Yi Yang received the Ph.D. degree in computer science from Zhejiang University. He was a Post-Doctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, from 2011 to 2013. He is currently a Senior Lecturer with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Sydney. His research interests include multimedia and computer vision.
Deyu Meng (M'13) received the B.Sc., M.Sc., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, PA, USA, from 2012 to 2014. He is currently an Associate Professor with the Institute for Information and System Sciences, School of Mathematics and Statistics, Xi'an Jiaotong University. His current research interests include principal component analysis, nonlinear dimensionality reduction, feature extraction and selection, compressed sensing, and sparse machine learning methods.
Gaowen Liu received the B.S. degree in automation from Qingdao University, China, in 2006, and the M.S. degree in system engineering from the Nanjing University of Science and Technology, China, in 2008. She is currently pursuing the Ph.D. degree with the MHUG Group, University of Trento, Italy. Her research interests include machine learning and its application to computer vision and multimedia analysis.
Wei Tong received the Ph.D. degree in computer science from Michigan State University in 2010. He is currently a Researcher with the General Motors Research and Development Center. His research interests include machine learning, computer vision, and autonomous driving.
Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1991. He worked on speech and machine translation from 1984 to 1994, when he joined the Informedia project for digital video analysis and retrieval, and led the development and evaluation of news-on-demand applications. He is currently a Faculty Member with the Department of Computer Science and the Language Technologies Institute, CMU. His research interests span several areas, including man-machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning.
Nicu Sebe (M'01–SM'11) received the Ph.D. degree from Leiden University, The Netherlands, in 2001. He is currently with the Department of Information Engineering and Computer Science, University of Trento, Italy, where he leads the research in the areas of multimedia information retrieval and human behavior understanding. He is a Senior Member of the Association for Computing Machinery and a Fellow of the International Association for Pattern Recognition. He was the General Co-Chair of FG 2008 and ACM Multimedia 2013, and the Program Chair of CIVR 2007 and 2010, and ACM Multimedia 2007 and 2011. He will be the Program Chair of ECCV 2016 and ICCV 2017.