IEEE
Proof
Event Oriented Dictionary Learning
for Complex Event Detection
Yan Yan, Yi Yang, Deyu Meng, Member, IEEE, Gaowen Liu, Wei Tong,
Alexander G. Hauptmann, and Nicu Sebe, Senior Member, IEEE
Abstract— Complex event detection is a retrieval task with the goal of finding videos of a particular event in a large-scale unconstrained Internet video archive, given example videos and text descriptions. Nowadays, different multimodal fusion schemes of low-level and high-level features are extensively investigated and evaluated for the complex event detection task. However, how to effectively select semantically meaningful high-level concepts from a large pool to assist complex event detection is rarely studied in the literature. In this paper, we propose a novel strategy to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. Moreover, we introduce a novel event oriented dictionary representation based on the selected semantic concepts. Toward this goal, we leverage training images (frames) of selected concepts from the Semantic Indexing dataset, with a pool of 346 concepts, in a novel supervised multi-task $\ell_p$-norm dictionary learning framework. Extensive experimental results on the TRECVID multimedia event detection dataset demonstrate the efficacy of our proposed method.

Index Terms— Complex event detection, concept selection, event oriented dictionary learning, supervised multi-task dictionary learning.
I. INTRODUCTION
Complex event detection in unconstrained videos has received much attention in the research community recently [1]–[3]. It is a retrieval task with the goal of detecting videos of a particular event in a large-scale Internet video archive, given an event-kit. An event-kit consists of example videos and text descriptions of the event. Unlike traditional
Manuscript received July 7, 2014; revised October 17, 2014 and February 11, 2015; accepted March 4, 2015. This work was supported in part by the MIUR Cluster Project Active Ageing at Home, in part by the European Commission Project xLiMe, in part by the Australian Research Council Discovery Projects, in part by the U.S. Army Research Office under Grant W911NF-13-1-0277, and in part by the National Science Foundation under Grant IIS-1251187. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gang Hua.
Y. Yan, G. Liu, and N. Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Trento 38123, Italy (e-mail: yan@disi.unitn.it; gaowen.liu@unitn.it; sebe@disi.unitn.it).
Y. Yang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology at Sydney, Sydney, NSW 2007, Australia (e-mail: yee.i.yang@gmail.com).
D. Meng is with the Institute for Information and System Sciences, Xi’an Jiaotong University, Xi’an 710049, China (e-mail: dymeng@mail. xjtu.edu.cn).
W. Tong is with the General Motors Research and Development Center, Warren, MI 48090 USA (e-mail: tongweig@gmail.com).
A. G. Hauptmann is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: alex@cs.cmu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2015.2413294
action recognition of atomic actions from videos, such as ‘walking’ or ‘jumping’, complex event detection aims to detect more complex events such as ‘Birthday party’, ‘Changing a vehicle tire’, etc.

An event is a higher level semantic abstraction of video sequences than a concept and consists of many concepts. For example, a ‘Birthday party’ event can be described by multiple concepts, such as objects (e.g., boy, cake), actions (e.g., talking, walking) and scenes (e.g., at home, in a restaurant). A concept can be detected in a shorter video sequence or even in a single frame, but an event is usually contained in a longer video clip.

Traditional approaches for complex event detection rely on fusing the classification outputs of multiple low-level features [1], e.g., SIFT, STIP, MoSIFT [4]. Recently, representing videos using high-level features, such as concept detectors [5], appears promising for the complex event detection task. However, state-of-the-art concept detector based approaches for complex event detection have not considered which concepts should be included in the training concept list. This induces redundancy of concepts [2], [6] in the concept list used for vocabulary construction. For example, some concepts are highly unlikely to help detect a certain event; e.g., ‘cows’ or ‘football’ are not helpful to detect events like ‘Landing a fish’ or ‘Working on a sewing project’. Therefore, removing the uncorrelated concepts from the vocabulary construction tends to eliminate such redundancy and potentially boosts the complex event detection performance.

Intuitively, complex event detection is expected to be more accurate and faster when we build a specific dictionary representation for each event. In this paper, we investigate how to learn a concept-driven event oriented representation for complex event detection. There are mainly two important issues to be considered for accomplishing this goal. The first issue is which concepts should be included in the vocabulary construction of the learning framework. Since we want to learn an event oriented dictionary representation, how to properly select qualified concepts for each event in the learning framework is the key issue. This raises the problem of how to optimally select the necessary and meaningful concepts from a large pool of concepts for each event. The second issue is how to design an effective dictionary learning framework to seamlessly learn the common knowledge from both the low-level features and the high-level concept features.

To facilitate reading, we first introduce the abbreviations used in the paper. SIN stands for the Semantic Indexing dataset [7] containing 346 different categories (concepts)
Fig. 1. The event oriented dictionary learning framework.

of images, such as car, adult, etc. SIN-MED stands for
the high-level concept features using the SIN concept list, representing each MED video by a 346D feature (each dimension represents a concept).
The overview of our framework is shown in Fig. 1. Firstly, we design a novel method to automatically select semantically meaningful concepts for each MED event based on both the MED event-kit text descriptions and the SIN-MED high-level concept feature representations. Then, we leverage training samples of the selected concepts from the SIN dataset in a jointly supervised multi-task dictionary learning framework. An event specific, semantically meaningful dictionary is learned through embedding the feature representations of the original datasets (both the MED dataset and the SIN dataset) into a hidden shared subspace. We add label information to the learning framework to facilitate the event oriented dictionary learning process. Therefore, the learned sparse codes carry intrinsic discriminative information and naturally lead to effective complex event detection. Moreover, a novel $\ell_p$-norm multi-task dictionary learning is proposed to strengthen the flexibility of the traditional $\ell_1$-norm dictionary learning problem.
To summarize, the contributions of this paper are as follows:
• We propose a novel approach for concept selection and present one of the first works making a comprehensive evaluation of automatic concept selection strategies for event detection;
• We propose the event oriented dictionary learning for event detection;
• We construct a supervised multi-task dictionary learning framework which is capable of learning an event oriented dictionary via leveraging information from selected semantic concepts;
• We propose a novel $\ell_p$-norm multi-task dictionary learning framework which is more flexible than the traditional $\ell_1$-norm dictionary learning problem.
II. RELATED WORK

To highlight our research contributions, we now review the related work on (a) Event Detection, (b) Dictionary Learning and (c) Multi-task Learning.
A. Event Detection
With the success of event detection in structured videos, complex event detection from general unconstrained videos, such as those obtained from Internet video sharing web sites like YouTube, has received increasing attention in recent years. Unlike traditional action recognition from videos of atomic actions, such as ‘walking’ or ‘jumping’, complex event detection aims to detect more complex events such as ‘Birthday party’, ‘Attempting a board trick’, ‘Changing a vehicle tire’, etc. Tamrakar et al. [1] and Lan et al. [8] evaluated different low-level appearance as well as spatio-temporal features, appropriately quantized and aggregated into Bag-of-Words (BoW) descriptors, for NIST TRECVID Multimedia Event Detection. Jiang et al. [9] proposed a method for high-level and low-level feature fusion based on collective classification in three steps: training a classifier from low-level features, encoding high-level features into graphs, and diffusing the scores on the established graph to obtain the final prediction. Natarajan et al. [10] evaluated a large set of low-level audio and visual features as well as high-level information from object detection, speech and video text OCR for event detection. They combined multiple features using a multi-stage feature fusion strategy, with feature level early fusion using multiple kernel learning (MKL), score level fusion using Bayesian model combination (BayCom), and weighted average fusion using video specific weights. Tang et al. [11] tackled the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. A conditional model was trained in a max-margin framework, able to automatically discover discriminative and interesting segments of video while simultaneously achieving competitive accuracies on difficult detection and recognition tasks.

Recently, representing video in terms of multi-modal low-level features, e.g., SIFT, STIP, Dense Trajectories, Mel-Frequency Cepstral Coefficients (MFCC), Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR), combined with early or late fusion schemes, is the state of the art [12] for event detection. Despite their good performance, low-level features are incapable of capturing the inherent semantic information in an event. Comparatively, high-level concept features were shown to be promising for event detection [5]. High-level concept representation approaches have become practical nowadays due to the availability of large labeled training collections such as ImageNet and TRECVID. However, there are still few research works on how to automatically select useful concepts for event detection. Oh et al. [13] used Latent SVMs for concept weighting and Brown et al. [14] ordered concepts by their discriminative power for each event-kit query. Different from [13] and [14], we focus on learning an event oriented dictionary from the perspective of adaptation of the selected concepts.
B. Dictionary Learning
Dictionary learning (also called sparse coding) has been shown to be able to find succinct representations of stimuli and to model data vectors as a linear combination of a few elements from a dictionary. Dictionary learning has recently been applied successfully to a variety of problems in computer vision. Yang et al. [15] proposed a spatial pyramid matching approach based on SIFT sparse codes for image classification. The method used selective sparse coding instead of traditional vector quantization to extract salient properties of appearance descriptors of local image patches. Elad and Aharon [16] addressed the image denoising problem, where zero-mean white homogeneous Gaussian additive noise was to be removed from a given image. The approach taken was based on sparse and redundant representations over trained dictionaries. Using the K-SVD algorithm, the authors obtained a dictionary that described the image content effectively. For the image segmentation problem, Mairal et al. [17] proposed an energy formulation with both sparse reconstruction and class discrimination components, jointly optimized during dictionary learning. The approach improved over the state of the art in image segmentation experiments.

Different optimization algorithms have also been proposed to solve dictionary learning problems. Aharon et al. [18] proposed the K-SVD algorithm for adapting dictionaries in order to achieve sparse signal representations. K-SVD is an iterative method that alternates between sparse coding of the examples based on the current dictionary and updating the dictionary atoms to better fit the data. The update of the dictionary columns was combined with an update of the sparse representations, thereby accelerating convergence. Lee et al. [19] presented efficient sparse coding algorithms based on iteratively solving two convex optimization problems: an $\ell_1$-regularized least squares problem and an $\ell_2$-constrained least squares problem. To learn a discriminative dictionary for sparse coding, a label consistent K-SVD (LC-KSVD) algorithm was proposed in [20]. In addition to using class labels of training data, the authors also associated label information with each dictionary item (the columns of the dictionary matrix) to enforce discriminability in the sparse codes during the dictionary learning process. More specifically, a new label consistent constraint was introduced and combined with the reconstruction error and the classification error to form a unified objective function. To effectively handle very large training sets and dynamic training data changing over time, Mairal et al. [21] proposed an online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up to large datasets with millions of training samples.

However, as far as we know, there is no research work on how to learn a dictionary representation at the event level for event detection, nor on how to simultaneously leverage semantic information to learn an event oriented dictionary.
C. Multi-Task Learning
Multi-task learning [22] methods aim to simultaneously learn classification/regression models for a set of related tasks. This typically leads to better models as compared to a learner that does not account for task relationships. One way to capture the task relatedness from multiple related tasks is to constrain all models to share a common set of features. This motivates group sparsity, i.e., $\ell_{2,p}$-norm regularized learning [23]. Joint feature learning using $\ell_{2,p}$-norm regularization performs well in ideal cases. In practical applications, however, simply using $\ell_{2,p}$-norm regularization may not be effective for dealing with dirty data which may not fall into a single structure. To this end, the dirty model for multi-task learning was proposed in [24]. Another way to capture the task relationship is to constrain the models from different tasks to share a low-dimensional subspace via the trace norm [25]. The assumption that all models share a common low-dimensional subspace is too restrictive in some applications. To this end, an extension that learns incoherent sparse and low-rank patterns simultaneously was proposed in [26].

Many multi-task learning algorithms assume that all learning tasks are related. In practical applications, however, the tasks may exhibit a more sophisticated group structure, where the models of tasks from the same group are closer to each other than those from a different group. There have been many works along this line of research [27], [28], known as clustered multi-task learning (CMTL). Moreover, most multi-task learning formulations assume that all tasks are relevant, which is however not the case in many real-world applications. Robust multi-task learning (RMTL) aims at identifying irrelevant (outlier) tasks when learning from multiple tasks [29].

However, there is little work on multi-task learning applied to the dictionary learning problem. The only related theoretical work is [30], where theoretical bounds are provided on the generalization error of dictionary learning for multi-task learning and transfer learning. Multi-task learning has received considerable attention in the computer vision community and has been successfully applied to many computer vision problems, such as image classification [31], head-pose estimation [32], visual tracking [33], multi-view action recognition [34] and egocentric activity recognition [35]. However, to our knowledge, no previous multi-task learning work has considered the problem of complex event detection.

III. BUILDING AN EVENT SPECIFIC CONCEPT POOL

The concepts, which are related to objects, actions, scenes, attributes, etc., are usually the basic elements for the description of an event. Given the availability of large labeled training collections such as ImageNet and TRECVID, there exists a large pool of concept detectors for event descriptions. However, selecting important concepts is the key issue for concept vocabulary construction. For example, the event ‘Landing a fish’ is composed of concepts such as ‘adult’, ‘waterscape’, ‘outdoor’ and ‘fish’. If the concepts related to the event can be accurately selected, redundancy and unrelated information will be suppressed, potentially improving the event recognition performance. In order to select useful concepts for a specific event, we propose a novel concept selection strategy based on the combination of text and visual information provided by the event-kit descriptions, namely: (i) Text-based semantic relatedness from linguistic knowledge
Fig. 2. Linguistic-based concept selection strategy with an example of ‘E007: Changing a vehicle tire’ in the MED event-kit text description and a corresponding example video provided by NIST.
of the MED event-kit text description and (ii) Elastic-Net feature selection from the visual high-level representation.
A. Linguistic: Text-Based Semantic Relatedness
The most widely used resources in Natural Language Processing (NLP) to calculate the semantic relatedness of concepts are WordNet [36] and Wikipedia [37]. Detailed event-kit text descriptions for each MED event are provided by NIST [38]. In this paper, we explore the semantic similarity between each term in the event-kit text description and the 346 SIN visual concept names based on WordNet. Fig. 2 shows an example of the event-kit text description for ‘Changing a vehicle tire’.

Intuitively, the more information two concepts share in common, the more similar they are, and the information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. As illustrated in Fig. 2, we calculate the similarity between each term in the event-kit text descriptions and the 346 SIN visual concept names based on the similarity measurement proposed in [39]. This measurement defines the similarity of two words $w_{1i}$ and $w_{2j}$ as:

$$sim(w_{1i}, w_{2j}) = \frac{2\,\pi(lcs)}{\pi(w_{1i}) + \pi(w_{2j})}$$
where $w_{1i} \in \{$event-kit text descriptions$\}$, $i = 1, \ldots, N_{event\_kit}$, and $w_{2j} \in \{$SIN visual concept names$\}$, $j = 1, \ldots, 346$. $lcs$ denotes the lowest common subsumer of the two words in the WordNet hierarchy. $\pi$ denotes the information content of a word and is computed as $\pi(w) = -\log p(w)$, where $p(w)$ is the probability of encountering an instance of $w$ in the union of WordNet and the event-kit text descriptions. The probability $p(w) = freq(w)/N$ can be estimated from the relative frequency of $w$ and the total number of words $N$ in the union of WordNet and the event-kit text descriptions [40]. In this way, we expect to properly capture the semantic similarity between subjects (e.g., human, crowd) and objects (e.g., animal, vehicle) based on the WordNet hierarchy. Finally, we construct a 346D event-level feature vector representation for each event (each dimension corresponds to a SIN visual concept name) using the MED event-kit text description. A threshold is set (thr = 0.5 in our experiments) to select useful concepts into our final semantic concept list.

Fig. 3. Visual high-level semantic representation with Elastic-Net concept selection.
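The similarity measure above can be sketched on a toy taxonomy. This is a minimal, self-contained illustration: the hypernym links and frequency counts below are made up for the example and stand in for WordNet and the event-kit corpus, which the paper actually uses.

```python
import math

# Toy taxonomy + corpus frequencies standing in for WordNet and the event-kit
# text; all names and counts here are illustrative, not from the SIN list.
parents = {"car": "vehicle", "truck": "vehicle", "vehicle": "entity",
           "dog": "animal", "animal": "entity", "entity": None}
freq = {"car": 30, "truck": 20, "vehicle": 60, "dog": 25, "animal": 40, "entity": 200}
N = sum(freq.values())

def ic(w):
    """Information content pi(w) = -log p(w) with p(w) = freq(w) / N."""
    return -math.log(freq[w] / N)

def ancestors(w):
    chain = []
    while w is not None:
        chain.append(w)
        w = parents[w]
    return chain

def lcs(w1, w2):
    """Lowest common subsumer: lowest shared ancestor in the toy hierarchy."""
    a1 = set(ancestors(w1))
    for a in ancestors(w2):
        if a in a1:
            return a

def sim(w1, w2):
    """Lin similarity: 2 * pi(lcs) / (pi(w1) + pi(w2))."""
    return 2.0 * ic(lcs(w1, w2)) / (ic(w1) + ic(w2))

# Concepts passing the paper's threshold (thr = 0.5) would be kept.
kept = [c for c in ("truck", "dog") if sim("car", c) >= 0.5]   # -> ["truck"]
```

‘car’ and ‘truck’ share the nearby subsumer ‘vehicle’ and score high; ‘car’ and ‘dog’ only share the root ‘entity’, whose low information content pushes the score below the threshold.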
B. Visual High-Level Representation: Elastic-Net Concept Selection
Concept detectors provide a high-level semantic representation for videos with complex contents, which is beneficial for developing powerful retrieval or filtering systems for consumer media [5]. In our case, we first use the SIN dataset to train 346 semantic concept models. We adopt the approach of [41] to extract keyframes from the MED dataset. The trained 346 semantic concept models are used to predict the 346 semantic concepts present in the keyframes of MED videos. Once we have the prediction score of each concept on each keyframe, the keyframe can be represented as a 346D SIN-MED feature indicating the predicted concept probabilities. Finally, the video-level SIN-MED feature is computed as the average of the keyframe-level SIN-MED features.
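The keyframe-to-video aggregation is a simple mean over keyframe score vectors; a minimal sketch (toy shapes, 3 concepts instead of the paper's 346):

```python
import numpy as np

# Sketch: turn per-keyframe concept scores into a video-level SIN-MED feature.
# Shapes are illustrative; in the paper each keyframe carries 346 scores.
def video_level_feature(keyframe_scores: np.ndarray) -> np.ndarray:
    """keyframe_scores: (num_keyframes, num_concepts) prediction scores.
    Returns the video-level feature as the mean over keyframes."""
    return keyframe_scores.mean(axis=0)

scores = np.array([[0.9, 0.1, 0.0],
                   [0.7, 0.3, 0.2]])   # 2 keyframes, 3 concepts (toy)
feat = video_level_feature(scores)     # -> [0.8, 0.2, 0.1]
```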
To select the useful concepts for each specific event, we adopt Elastic-Net [42] concept selection, as illustrated in Fig. 3, given the intuition that the learner generally would like to choose the most representative SIN-MED feature dimensions (concepts) to differentiate events. Elastic-Net is formulated as follows:

$$\min_{u}\; \|l - Fu\|_2^2 + \alpha_1 \|u\|_1 + \alpha_2 \|u\|_2^2$$

where $l \in \{0,1\}^n$ indicates the event labels, $F \in \mathbb{R}^{n \times b}$ is the SIN-MED feature matrix ($n$ is the number of samples
Fig. 4. The correlation demonstration between SIN visual concepts. High correlations between contextually related clusters ‘G4: car’, ‘G7: nature’, ‘G11: urban-scene’ (in red) and negative correlations between contextually unrelated clusters ‘G7: nature’, ‘G8: indoor’ (in blue) can be easily observed. (Figure is best viewed in color and under zoom.)
and $b$ is the SIN-MED feature dimension) and $u \in \mathbb{R}^b$ is the parameter to be optimized. Each dimension of $u$ corresponds to one semantic concept when $F$ is the high-level SIN-MED feature. $\alpha_1$ and $\alpha_2$ are the regularization parameters. We use Elastic-Net instead of LASSO due to the high correlation between concepts in the SIN concept list [7] (see Fig. 4). While LASSO (when $\alpha_2 = 0$) tends to select only a small number of variables from a group and ignore the others, Elastic-Net is capable of automatically taking such correlation information into account by adding the quadratic term $\|u\|_2^2$ to the penalty. We can adjust the value of $\alpha_1$ to control the sparsity degree, i.e., how many semantic concepts are selected in our problem. The selected concepts are the dimensions of $u$ with non-zero values.
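The Elastic-Net objective above can be minimized by coordinate descent. The sketch below is a minimal hand-rolled solver for illustration (in practice one would use an off-the-shelf Elastic-Net implementation); the data, regularization values and iteration count are all assumptions, not the paper's settings.

```python
import numpy as np

def soft(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(F, l, a1, a2, iters=200):
    """Coordinate descent for min_u ||l - F u||^2 + a1*||u||_1 + a2*||u||_2^2."""
    n, b = F.shape
    u = np.zeros(b)
    col_sq = (F ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(b):
            r = l - F @ u + F[:, j] * u[j]           # residual excluding dim j
            u[j] = soft(F[:, j] @ r, a1 / 2) / (col_sq[j] + a2)
    return u

rng = np.random.default_rng(0)
labels = np.array([1., 1., 0., 0., 1., 0.])          # toy event labels l
F = 0.01 * rng.standard_normal((6, 5))               # toy SIN-MED features
F[:, 0] += labels                                    # concept 0 tracks the event
u = elastic_net(F, labels, a1=0.5, a2=0.1)
selected = np.flatnonzero(u)                         # non-zero dims = concepts kept
```

With the informative dimension dominating, the $\ell_1$ term zeroes out the noise dimensions and only concept 0 survives, mirroring how $\alpha_1$ controls how many concepts are selected.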
C. Union of Selected Concepts
To sum up, we form the union of the semantic concepts selected by the text-based semantic relatedness described in Section III-A and by the visual high-level semantic representation described in Section III-B as the final list of selected concepts for each MED event. All the confidence values used are normalized to 0-1 using Z-score normalization and are used for ranking the concepts. In our paper, we give equal weights to the textual and visual selection methods for the final late fusion of confidence scores. The top 10 ranked concepts are finally selected to adapt semantic information. Since we select the most important concepts from the pool, the top 10 ranked concepts are usually already enough to describe the event (see Fig. 8). To evaluate the effectiveness of our proposed strategy, we compare the selected concepts with the groundtruth (we use a human labeled concept list as the groundtruth for each MED event).
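The fusion step above can be sketched as follows. The concept names and confidence values are invented for illustration, and the snippet keeps the top 3 rather than the paper's top 10:

```python
import numpy as np

# Late fusion sketch: z-score normalize each modality's confidences, average
# with equal weights, then keep the top-ranked concepts. Toy names/scores.
def zscore(x):
    return (x - x.mean()) / x.std()

concepts = np.array(["adult", "fish", "waterscape", "kitchen", "car"])
text_conf = np.array([0.9, 0.8, 0.7, 0.1, 0.2])    # text-based relatedness
visual_conf = np.array([0.6, 0.9, 0.8, 0.2, 0.1])  # Elastic-Net scores
fused = 0.5 * zscore(text_conf) + 0.5 * zscore(visual_conf)
top = concepts[np.argsort(-fused)][:3]             # top-k (paper uses top 10)
```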
IV. EVENT ORIENTED DICTIONARY LEARNING

After we select semantically meaningful concepts for each event, we can leverage training samples of the selected concepts from the SIN dataset in a supervised multi-task dictionary learning framework. In this section, we investigate how to learn an event oriented dictionary representation. To accomplish this goal, we first propose our multi-task dictionary learning framework and then introduce its supervised setting.
A. Multi-Task Dictionary Learning
Given $K$ tasks (e.g., $K = 2$ in our case: one task is the MED dataset and the other is the subset of the SIN dataset whose samples are collected from the selected concepts for each event), each task consists of data samples denoted by $X_k = \{x_k^1, x_k^2, \ldots, x_k^{n_k}\} \in \mathbb{R}^{n_k \times d}$ $(k = 1, \ldots, K)$, where $x_k^i \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $n_k$ is the number of samples in the $k$-th task. We are going to learn a shared subspace across all tasks, obtained by an orthonormal projection $W \in \mathbb{R}^{d \times s}$, where $s$ is the dimensionality of the subspace. In this learned subspace, the data distributions of all tasks should be similar to each other. Therefore, we can code all tasks together in the shared subspace and achieve better coding quality. The benefits of this strategy are: (i) we can improve each individual coding quality by transferring knowledge across all tasks; (ii) we can discover the relationship among different datasets via coding analysis. This purpose can be realized through the following optimization problem:

$$\min_{T_k, C_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2$$
$$\text{s.t.} \quad W^T W = I, \quad (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l, \quad D_{j\cdot} D_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l \qquad (1)$$
where $T_k \in \mathbb{R}^{l \times d}$ is an overcomplete dictionary ($l > d$) with $l$ prototypes for the $k$-th task, $(T_k)_{j\cdot}$ in the constraints denotes the $j$-th row of $T_k$, and $C_k \in \mathbb{R}^{n_k \times l}$ corresponds to the sparse representation coefficients of $X_k$. In the third term of Eqn. (1), $X_k$ is projected by $W$ into the subspace to explore the relationship among different tasks. $D \in \mathbb{R}^{l \times s}$ is the dictionary learned in the datasets' shared subspace. $D_{j\cdot}$ in the constraints denotes the $j$-th row of $D$. $I$ is the identity matrix. $(\cdot)^T$ denotes the transpose operator. $\lambda_1$ and $\lambda_2$ are the regularization parameters. The first constraint guarantees the learned $W$ to be orthonormal, and the second and third constraints prevent the learned dictionaries from becoming arbitrarily large. In our objective function, we learn a dictionary $T_k$ for each task $k$ and one shared dictionary $D$ among the $K$ tasks. Since one task in our model uses samples from the SIN dataset of selected semantically meaningful concepts, the shared learned dictionary $D$ is the event oriented dictionary. When $\lambda_2 = 0$, Eqn. (1) reduces to the traditional dictionary learning problem for each individual task.
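To make the roles of the variables in Eqn. (1) concrete, the sketch below evaluates the objective for $K = 2$ toy tasks. All shapes and values are illustrative assumptions; in practice $T_k$, $C_k$, $W$ and $D$ are produced by the alternating solver, not drawn at random.

```python
import numpy as np

# Toy evaluation of the Eqn (1) objective for K = 2 tasks (MED + SIN subset).
rng = np.random.default_rng(0)
d, s, l_atoms = 8, 3, 12                 # feature dim, subspace dim, #atoms
n = {0: 5, 1: 6}                         # samples per task
X = {k: rng.standard_normal((n[k], d)) for k in n}
T = {k: rng.standard_normal((l_atoms, d)) for k in n}   # per-task dictionaries T_k
C = {k: rng.standard_normal((n[k], l_atoms)) for k in n}  # sparse codes C_k
W = np.linalg.qr(rng.standard_normal((d, s)))[0]        # orthonormal W (W^T W = I)
D = rng.standard_normal((l_atoms, s))                   # shared dictionary D
lam1, lam2 = 0.1, 1.0

def objective():
    val = 0.0
    for k in n:
        val += np.linalg.norm(X[k] - C[k] @ T[k], "fro") ** 2          # reconstruction
        val += lam1 * np.abs(C[k]).sum()                               # l1 sparsity
        val += lam2 * np.linalg.norm(X[k] @ W - C[k] @ D, "fro") ** 2  # shared subspace
    return val
```

Taking $W$ from a QR factorization enforces the first constraint $W^T W = I$ by construction.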
B. Supervised Multi-Task Dictionary Learning
It is well known that the traditional dictionary learning framework is not directly applicable to classification: the learned dictionary has merely been used for signal reconstruction [17]. To circumvent this problem, researchers have developed several algorithms to learn a classification-oriented dictionary in a supervised fashion by exploiting the label information. In this subsection, we extend our proposed multi-task dictionary learning of Eqn. (1) to be suitable for event detection.
Assuming that the $k$-th task has $m_k$ classes, the label information of the $k$-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ $(k = 1, \ldots, K)$, with $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the $k$-th task classifier. Inspired by [43], we consider the following optimization problem:

$$\min_{T_k, C_k, \Theta_k, W, D} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$
$$\text{s.t.} \quad W^T W = I, \quad (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l, \quad D_{j\cdot} D_{j\cdot}^T \le 1 \;\; \forall j = 1, \ldots, l \qquad (2)$$
Compared with Eqn. (1), we add the last term in Eqn. (2) to enforce that the model involves discriminative information for classification. This objective function can simultaneously achieve a desired dictionary with good representation power and support optimal discrimination of the classes in the multi-task setting.
1) Optimization: To solve the proposed objective of Eqn. (2), we adopt an alternating minimization algorithm, optimizing with respect to $D$, $T_k$, $C_k$, $\Theta_k$ and $W$ respectively in five steps as follows:
Step 1 (Fixing $T_k$, $C_k$, $W$, $\Theta_k$, Optimize $D$): If we stack $X = [X_1^T, \ldots, X_K^T]^T$ and $C = [C_1^T, \ldots, C_K^T]^T$, Eqn. (2) is equivalent to:

$$\min_{D} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{D} \|XW - CD\|_F^2 \quad \text{s.t.}\; D_{j\cdot} D_{j\cdot}^T \le 1,\; \forall j = 1, \ldots, l$$

This is equivalent to the dictionary update stage in a traditional dictionary learning algorithm. We adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.
Step 2 (Fixing $D$, $C_k$, $W$, $\Theta_k$, Optimize $T_k$): Eqn. (2) is equivalent to:

$$\min_{T_k} \|X_k - C_k T_k\|_F^2 \quad \text{s.t.}\; (T_k)_{j\cdot} (T_k)_{j\cdot}^T \le 1,\; \forall j = 1, \ldots, l$$

This is also equivalent to the dictionary update stage in traditional dictionary learning, solved for each of the $K$ tasks. We again adopt the dictionary update strategy of [21, Algorithm 2] to efficiently solve it.
Step 3 (Fixing $T_k$, $W$, $D$, $\Theta_k$, Optimize $C_k$): Eqn. (2) is equivalent to:

$$\min_{C_k} \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_1 + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2$$

This formulation can be decoupled into $(n_1 + n_2 + \cdots + n_K)$ distinct problems:

$$\min_{c_k^i} \sum_{k=1}^{K} \sum_{i=1}^{n_k} \left( \|x_k^i - c_k^i T_k\|_2^2 + \lambda_1 \|c_k^i\|_1 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2 \right)$$

We adopt the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [44] to solve this problem. FISTA solves optimization problems of the form $\min_{\mu} f(\mu) + r(\mu)$, where $f(\mu)$ is convex and smooth and $r(\mu)$ is convex but non-smooth. We adopt FISTA since it is a popular tool for solving many convex smooth/non-smooth problems and its effectiveness has been verified in many applications. In our setting, the smooth part is $f(c_k^i) = \|x_k^i - c_k^i T_k\|_2^2 + \lambda_2 \|x_k^i W - c_k^i D\|_2^2 + \lambda_3 \|y_k^i - c_k^i \Theta_k\|_2^2$ and the non-smooth part is $r(c_k^i) = \lambda_1 \|c_k^i\|_1$.
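The FISTA step for a single code vector can be sketched as follows. Everything here is a toy instance of the Step 3 split (shapes, random data, regularization values and the iteration count are illustrative assumptions); the Lipschitz constant is taken from the quadratic's Hessian rather than estimated by line search.

```python
import numpy as np

# FISTA sketch for one sparse code c (a row vector), mirroring Step 3:
# smooth f(c) = ||x - cT||^2 + lam2*||xW - cD||^2 + lam3*||y - c Th||^2,
# non-smooth r(c) = lam1*||c||_1.
rng = np.random.default_rng(1)
d, s, m, l_atoms = 6, 2, 2, 10
x = rng.standard_normal((1, d))                 # one sample's feature vector
y = np.array([[1.0, 0.0]])                      # its one-hot label
T = rng.standard_normal((l_atoms, d))           # task dictionary T_k
D = rng.standard_normal((l_atoms, s))           # shared dictionary
Th = rng.standard_normal((l_atoms, m))          # task classifier Theta_k
W = np.linalg.qr(rng.standard_normal((d, s)))[0]
lam1, lam2, lam3 = 0.2, 0.5, 0.5

def f(c):                                       # smooth part
    return (np.sum((x - c @ T) ** 2)
            + lam2 * np.sum((x @ W - c @ D) ** 2)
            + lam3 * np.sum((y - c @ Th) ** 2))

def grad_f(c):
    return (2 * (c @ T - x) @ T.T
            + 2 * lam2 * (c @ D - x @ W) @ D.T
            + 2 * lam3 * (c @ Th - y) @ Th.T)

# Lipschitz constant of grad_f: largest eigenvalue of the quadratic's Hessian.
L = np.linalg.eigvalsh(2 * (T @ T.T + lam2 * D @ D.T + lam3 * Th @ Th.T)).max()

def soft(z, t):                                 # prox of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

c = np.zeros((1, l_atoms)); z = c.copy(); t = 1.0
start_obj = f(c) + lam1 * np.abs(c).sum()
for _ in range(200):
    c_new = soft(z - grad_f(z) / L, lam1 / L)   # proximal gradient step
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
    z = c_new + ((t - 1) / t_new) * (c_new - c) # momentum extrapolation
    c, t = c_new, t_new
final_obj = f(c) + lam1 * np.abs(c).sum()
```

The soft-thresholding step is exactly the proximal operator of the $\ell_1$ term, which is why FISTA fits the smooth/non-smooth split described above.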
Step 4 (Fixing $D$, $C_k$, $W$, $T_k$, Optimize $\Theta_k$): Eqn. (2) is equivalent to:

$$\min_{\Theta_k} \|Y_k - C_k \Theta_k\|_F^2$$

Setting $\partial/\partial \Theta_k = 0$, we obtain $\Theta_k = (C_k^T C_k)^{-1} C_k^T Y_k$.
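The closed-form update in Step 4 is an ordinary least-squares solve. A minimal sketch (toy shapes; it assumes $C_k^T C_k$ is invertible, i.e., $n_k \ge l$ with full column rank):

```python
import numpy as np

# Closed-form Theta_k update: Theta = (C^T C)^{-1} C^T Y, done with a linear
# solve instead of an explicit matrix inverse. Shapes are illustrative.
rng = np.random.default_rng(2)
n_k, l_atoms, m_k = 20, 5, 3
C = rng.standard_normal((n_k, l_atoms))          # sparse codes C_k
Y = rng.standard_normal((n_k, m_k))              # label matrix Y_k
Theta = np.linalg.solve(C.T @ C, C.T @ Y)

# Theta is the least-squares minimizer, so the gradient C^T (C Theta - Y) = 0:
grad = C.T @ (C @ Theta - Y)
```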
Step 5 (Fixing T_k, C_k, D, Θ_k, optimize W): If we stack X = [X_1^T, ..., X_K^T]^T and C = [C_1^T, ..., C_K^T]^T, Eqn. (2) is equivalent to:

$$\min_{W} \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 = \min_{W} \|XW - CD\|_F^2 \quad \text{s.t. } W^T W = I$$

Substituting D = (C^T C)^{-1} C^T X W back into the above function, we obtain

$$\min_{W} \|(I - C(C^T C)^{-1} C^T) X W\|_F^2 = \min_{W} \operatorname{tr}\!\big(W^T X^T (I - C(C^T C)^{-1} C^T) X W\big) \quad \text{s.t. } W^T W = I$$

The optimal W is composed of the eigenvectors of the matrix X^T (I - C(C^T C)^{-1} C^T) X corresponding to the s smallest eigenvalues.
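A sketch of this W update (illustrative names; assumes C^T C is invertible):

```python
import numpy as np

def update_W(X, C, s):
    """W = eigenvectors of M = X^T (I - C (C^T C)^{-1} C^T) X for the s
    smallest eigenvalues. The projector removes the part of X that is
    already explained by the codes C."""
    n = X.shape[0]
    P = np.eye(n) - C @ np.linalg.solve(C.T @ C, C.T)  # residual projector
    M = X.T @ P @ X
    M = (M + M.T) / 2.0              # symmetrize against round-off for eigh
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return eigvecs[:, :s]                  # s smallest -> orthonormal W
```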
We summarize our algorithm for solving Eqn. (2) as Algorithm 1.
Algorithm 1 Supervised Multi-Task Dictionary Learning
C. Supervised Multi-Task ℓp-Norm Dictionary Learning
In the literature it has been shown that non-convex ℓp-norm minimization (0 ≤ p < 1) can often yield better results than convex ℓ1-norm minimization. Inspired by this, we extend our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model.
Assuming that the k-th task has $m_k$ classes, the label information of the k-th task is $Y_k = \{y_k^1, y_k^2, \ldots, y_k^{n_k}\} \in \mathbb{R}^{n_k \times m_k}$ ($k = 1, \ldots, K$), where $y_k^i = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the position of the non-zero element indicates the class). $\Theta_k \in \mathbb{R}^{l \times m_k}$ is the parameter of the k-th task classifier. We formulate our supervised multi-task ℓp-norm dictionary learning problem as follows:

$$\begin{aligned} \min_{T_k, C_k, \Theta_k, W, D} \ & \sum_{k=1}^{K} \|X_k - C_k T_k\|_F^2 + \lambda_1 \sum_{k=1}^{K} \|C_k\|_p^p + \lambda_2 \sum_{k=1}^{K} \|X_k W - C_k D\|_F^2 + \lambda_3 \sum_{k=1}^{K} \|Y_k - C_k \Theta_k\|_F^2 \\ \text{s.t.} \ & W^T W = I, \quad (T_k)_{j\cdot}(T_k)_{j\cdot}^T \le 1, \quad D_{j\cdot} D_{j\cdot}^T \le 1, \quad \forall j = 1, \ldots, l \end{aligned} \tag{3}$$
Compared with Eqn. (2), we replace the traditional sparse coding ℓ1-norm term $\|C_k\|_1$ with the more flexible ℓp-norm term $\|C_k\|_p^p$. Since we can adjust the value of p (0 ≤ p < 1) in our framework, our algorithm can more flexibly control the sparseness of the feature representation, usually resulting in better performance than traditional ℓ1-norm sparse coding.
To solve the proposed problem of Eqn. (3), we adopt the alternating minimization algorithm to optimize it with respect to D, T_k, C_k, Θ_k, W respectively. The update rules for D, T_k, Θ_k, W are the same as for Eqn. (2); the only difference is in the update rule for C_k. Various algorithms have been proposed for ℓp-norm non-convex sparse coding [45]–[47]. In this paper, we adopt the Generalized Iterated Shrinkage Algorithm (GISA) [48] to solve the proposed problem.
Algorithm 2 Supervised Multi-Task ℓp-Norm Dictionary Learning
TABLE I
18 EVENTS OF MED10 AND MED11
We summarize our algorithm for solving Eqn. (3) as Algorithm 2.
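The core of GISA is a generalized soft-thresholding (GST) step for the scalar problem min_x (1/2)(x − y)² + λ|x|^p, applied elementwise. The sketch below follows our reading of [48] (threshold formula plus a fixed-point iteration); treat it as an illustration rather than a faithful reimplementation:

```python
import numpy as np

def gst(y, lam, p, n_iter=10):
    """Generalized soft-thresholding, the inner solver of GISA [48], for
    min_x 0.5*(x - y)^2 + lam*|x|^p applied elementwise to the array y."""
    y = np.asarray(y, dtype=float)
    # threshold below which the minimizer is exactly zero
    tau = (2.0 * lam * (1.0 - p)) ** (1.0 / (2.0 - p)) \
        + lam * p * (2.0 * lam * (1.0 - p)) ** ((p - 1.0) / (2.0 - p))
    x = np.abs(y).astype(float)
    for _ in range(n_iter):  # fixed-point iteration x <- |y| - lam*p*x^(p-1)
        x = np.abs(y) - lam * p * np.maximum(x, 1e-12) ** (p - 1.0)
        x = np.maximum(x, 0.0)
    x[np.abs(y) <= tau] = 0.0
    return np.sign(y) * x
```

For p = 1 the operator reduces to ordinary soft thresholding, and for p = 0 to hard thresholding, which matches the flexibility argument above.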
After the optimized Θ_k is obtained, the final classification of a test video can be obtained from its sparse coefficient $c_k^i$, which carries the discriminative information. We can simply apply the linear classifier $c_k^i \Theta_k$ to obtain the predicted score of the video.
V. EXPERIMENTS
In this section, we conduct extensive experiments to test our proposed method using large-scale real-world datasets.
A. Datasets
The TRECVID MED10 (P001–P003) and MED11 (E001–E015) datasets are used in our experiments. The datasets consist of 9746 videos from 18 events of interest, with 100–200 examples per event; the rest of the videos belong to the background class. The details are listed in Table I. The TRECVID Semantic Indexing Task (SIN) [7] contains annotations for 346 semantic concepts on 400,000 keyframes from web videos. The 346 concepts are related to objects, actions,
TABLE II
15 GROUPS OF SIN 346 VISUAL CONCEPTS (THE NUMBER OF CONCEPTS FOR EACH GROUP IS IN PARENTHESES)
scenes, attributes and non-visual concepts, which are all basic elements of an event, e.g., kitchen, boy, girl, bus. For the sake of better understanding and easier concept selection, we manually divide the 346 visual concepts into 15 groups, which are listed in Table II.
B. Evaluation Metrics
1) Average Precision (AP): a measure combining recall and precision for ranked retrieval results. The average precision is the mean of the precision scores taken after each relevant sample is retrieved. A higher number indicates better performance.
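The AP definition above can be sketched as follows (illustrative code, not from the paper):

```python
import numpy as np

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each relevant
    (positive) sample in the score-sorted list."""
    order = np.argsort(-np.asarray(scores))          # descending by score
    hits = np.asarray(labels)[order] == 1
    ranks = np.flatnonzero(hits) + 1                 # 1-based ranks of positives
    precisions = np.cumsum(hits)[ranks - 1] / ranks  # precision at each hit
    return precisions.mean() if len(ranks) else 0.0
```

For example, scores [0.9, 0.8, 0.7, 0.6] with labels [1, 0, 1, 0] give precisions 1/1 and 2/3 at the two positives, so AP = 5/6.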
2) PMiss@TER = 12.5: an official evaluation metric for event detection defined by NIST [38]. It is the probability of Missed Detection at the operating point where the ratio between the probability of Missed Detection and the probability of False Alarm is 12.5:1. A lower number indicates better performance.
3) Normalized Detection Cost (NDC): an official evaluation metric for event detection defined by NIST [38]. It is a weighted linear combination of the system's Missed Detection and False Alarm probabilities. NDC measures the performance of a detection system in the context of an application profile, using error-rate estimates calculated on a test set. A lower number indicates better performance.
C. Experiment Settings and Comparison Methods
There are 3104 videos used for training and 6642 videos used for testing in our experiments. We use three representative features: SIFT, Color SIFT (CSIFT) and Motion SIFT (MOSIFT) [4]. SIFT and CSIFT describe the gradient and color information of images, while MOSIFT describes both the optical flow and gradient information of video clips. Finally, 768-D SIFT-BoW, CSIFT-BoW and MOSIFT-BoW features are extracted to represent each video. We set the regularization parameters in the range {0.01, 0.1, 1, 10, 100}. The subspace dimensionality s is set by grid search over {200, 400, 600}. For the experiments in this paper, we try three dictionary sizes: {768, 1024, 1280}.
To evaluate the multi-task ℓp-norm dictionary learning algorithm, the parameter p is tuned in the range {0.2, 0.4, 0.6, 0.8, 1}. We compare our proposed event oriented dictionary learning method with the following important baselines:
• Support Vector Machine (SVM): SVM has been widely used by several research groups for MED and has shown its robustness [8], [49], [50], so we use it as one of the comparison algorithms (RBF, Histogram Intersection and χ² kernels are used respectively);
Fig. 5. Example results of the semantic concept selection proposed in Section III on the events (top left) Attempting board trick, (top right) Feeding animal, (bottom left) Flash mob gathering, (bottom right) Making a sandwich. The font size reflects the ranking value of a concept (the larger the font, the higher the value).
TABLE III
AP PERFORMANCE FOR EACH MED EVENT USING TEXT (T), VISUAL (V) AND TEXT + VISUAL (T + V) INFORMATION FOR CONCEPT SELECTION. THE LAST COLUMN SHOWS THE NUMBER OF CONCEPTS IN THE TOP 10 THAT COINCIDE WITH THE GROUND TRUTH
• Single Task Supervised Dictionary Learning (ST-SDL): performing supervised dictionary learning on each task separately;
• Pooling Tasks Supervised Dictionary Learning (PT-SDL): performing single-task supervised dictionary learning by simply aggregating data from all tasks;
• Multiple Kernel Transfer Learning (MKTL) [51]: a method incorporating prior features into a multiple kernel learning framework. We use the code provided by the authors¹;
• Dirty Model Multi-Task Learning (DMMTL) [24]: a state-of-the-art multi-task learning method imposing ℓ1/ℓq-norm regularization. We use the code provided by the MALSAR toolbox²;
• Multiple Kernel Learning Latent Variable Approach (MKLLVA) [52]: a multiple kernel learning latent variable approach for complex video event detection;
• Random Concept Selection Strategy (RCSS): performing our proposed supervised multi-task dictionary learning
¹ http://homes.esat.kuleuven.be/~ttommasi/source_code_ICCV11.html
² http://www.public.asu.edu/~jye02/Software/MALSAR/
TABLE IV
COMPARISON OF AVERAGE DETECTION ACCURACY OF DIFFERENT METHODS FOR THE SIFT FEATURE. BEST RESULTS ARE HIGHLIGHTED IN BOLD
Fig. 6. Comparison of AP performance of different methods for each MED event. (Figure is best viewed in color and under zoom).
without involving the concept selection strategy (leveraging random samples);
• Simple Concept Selection Strategy (SCSS): each concept is independently used as a predictor for each event, yielding a certain average precision for each event. The concepts are ranked according to the AP measurement for each event, and the top 10 concepts are retained.
D. Experimental Results
We first calculate the covariance matrix of the 346 SIN concepts, shown in Fig. 4, where the visual concepts are grouped and sorted as in Table II. As shown in Fig. 4, the covariance matrix exhibits high within-cluster correlations along the diagonal and also relatively high correlations between contextually related clusters (in red), such as 'G4: car', 'G7: nature' and 'G11: urban-scene'. One can also easily observe negative correlations between contextually unrelated clusters (in blue), such as 'G7: nature' and 'G8: indoor'. This gives us the intuition that visual concept co-occurrence exists. Therefore, removing redundant concepts and selecting related concepts for each event is expected to be helpful for the event detection task.
Fig. 5 shows the results of our concept selection strategy for the events 'Attempting board trick', 'Feeding animal', 'Flash mob gathering' and 'Making a sandwich'. From Fig. 5, we can observe that the selected concepts are reasonably consistent with human selections.
To better assess the effectiveness of our proposed concept selection strategy, we compare our selected top 10 concepts with the ground truth (we use the ranking list of human-labeled concepts as the ground truth for each MED event). The results are listed in the last column of Table III, showing the number of concepts in the top 10 that coincide with the ground truth. The AP performance for event detection based on text information, visual information and their combination is also shown in Table III. The benefit of using both text and visual information for concept selection can be concluded from Table III.
Table IV shows the average detection results over the 18 MED events for the different comparison methods. We have the following observations: (1) Comparing ST-SDL with SVM, we observe that performing supervised dictionary learning is better than SVM, which shows the effectiveness of dictionary learning for MED. (2) Comparing PT-SDL with ST-SDL, leveraging knowledge from the SIN dataset improves the performance for MED. (3) Our concept selection strategy for semantic dictionary learning performs the best for MED among all the comparison methods. (4) Our proposed method outperforms SVM by 8%, 6% and 10% with respect to AP, PMiss@TER=12.5 and MinNDC, respectively. (5) Considering the difficulty of the MED dataset and the typically low AP performance on MED, the absolute 8% AP improvement is very significant.
Fig. 6 shows the AP results for each MED event. Our proposed method achieves the best performance on 13 out of the 18 events. It is also interesting to notice that the larger improvements in Fig. 6, such as 'E004: Wedding ceremony', 'E005: Working on a woodworking project' and 'E009: Getting a vehicle unstuck', usually correspond to a higher number of selected concepts coinciding with the ground truth. This gives us evidence of the effectiveness of our proposed automatic concept selection strategy.
Fig. 7 illustrates the MAP performance of the different methods based on the SIFT, CSIFT and MOSIFT features. It can be easily observed that our proposed supervised multi-task dictionary learning with our concept selection strategy outperforms SVM by more than 8%.
Moreover, we evaluate our proposed method with respect to different numbers of selected concepts, dictionary sizes and
Fig. 7. Comparison of MAP performance of different methods for different types of features.
Fig. 8. MAP performance variation with respect to the number of selected concepts.
Fig. 9. MAP performance variation with respect to (left) different dictionary size; (right) different subspace dimensionality.
different subspace dimensionality settings based on the SIFT feature. Fig. 8 shows that the proposed method achieves the best MAP results when the number of selected concepts is 10. Fig. 9 (left) shows that the proposed method achieves the best MAP results when the dictionary size is 1024; a dictionary that is too large or too small tends to hamper the performance. Fig. 9 (right) shows that the best MAP result is achieved when the subspace dimensionality is 400 (dictionary size = 1024); a subspace dimensionality that is too large or too small also degrades the performance.
Finally, we also study the parameter sensitivity of the proposed method in Fig. 10. Here, we fix λ3 = 1 (fixing the contribution of the discriminative information) and p = 0.6, and analyze the regularization parameters λ1 and λ2. As shown in Fig. 10 (left), we observe that the proposed method is more sensitive to λ2 than to λ1, which demonstrates the importance of the subspace for multi-task dictionary learning. Moreover, to understand the influence of the parameter p on our proposed supervised ℓp-norm dictionary learning algorithm, we also perform an experiment on its sensitivity. Fig. 10 (right) demonstrates that the best performance of the supervised ℓp-norm dictionary learning algorithm is achieved when p = 0.6: more than 2% MAP improvement can be obtained by adopting the ℓp-norm model compared with the fixed ℓ1-norm model. This shows the suboptimality of traditional ℓ1-norm sparse coding compared with the flexible ℓp-norm sparse coding.
Fig. 10. Sensitivity study of parameters on (left) λ1 and λ2 with fixed p and λ3; (right) p and λ3 with fixed λ1 and λ2.
Fig. 11. Convergence rate study on (left) supervised multi-task dictionary learning and (right) supervised multi-task ℓp-norm dictionary learning.
The proposed iterative approaches monotonically decrease the objective function values of Eqn. (2) and Eqn. (3) until convergence. Fig. 11 shows the convergence rate curves of our algorithms. It can be observed that the objective function values converge quickly: our approaches usually converge within at most 6 iterations for supervised multi-task dictionary learning and 7 iterations for supervised multi-task ℓp-norm dictionary learning (precision = 10⁻⁶). Regarding the computational cost of our proposed algorithm for supervised multi-task ℓp-norm dictionary learning, we train our model on the TRECVID MED dataset in 5 hours with cross-validation on a workstation with Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz × 17 processors. Our proposed event oriented dictionary learning approach can be easily parallelized on multi-core computers due to its 'event oriented' nature. This means that our algorithms would be scalable for large-scale problems.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have first investigated the possibility of automatically selecting semantically meaningful concepts for complex event detection based on both the MED events-kit text descriptions and the high-level concept feature descriptions. Then, we learn an event oriented dictionary representation based on the selected semantic concepts. To this aim, we leverage training samples of the selected concepts from the SIN dataset in a novel jointly supervised multi-task dictionary learning framework.
Extensive experimental results on the TRECVID MED dataset showed that our proposed method outperformed several important baseline methods. Our proposed method outperformed SVM by up to 8% MAP, which showed the effectiveness of dictionary learning for TRECVID MED. More than 6% and 3% MAP improvements were achieved compared with ST-SDL and PT-SDL respectively, which showed the advantage of the multi-task setting in our proposed framework. To show the benefit of the concept selection strategy, we compared RCSS to our method and showed that RCSS achieves 4% less MAP.
For some sparse coding problems, non-convex ℓp-norm minimization (0 ≤ p < 1) can often obtain better results than convex ℓ1-norm minimization. Inspired by this, we extended our supervised multi-task dictionary learning model to a supervised multi-task ℓp-norm dictionary learning model. We evaluated the influence of the ℓp-norm parameter p in our proposed problem and found that more than 2% MAP improvement can be achieved by adopting the more flexible ℓp-norm model compared with the fixed ℓ1-norm model.
Overall, the proposed multi-task dictionary learning solutions are novel in the context of complex event detection, which is a relevant and important research problem in applications such as image and video understanding and surveillance. Future research involves (i) integration of knowledge from multiple sources (video, audio, text) and incorporation of kernel learning in our framework, and (ii) the use of deep structures instead of a shallow single-layer model in the proposed problem, since deep learning has achieved great success in many different fields of image processing and computer vision.
ACKNOWLEDGMENT
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of ARO, the National Science Foundation or the U.S. Government.
REFERENCES
[1] A. Tamrakar et al., "Evaluation of low-level features and their combinations for complex event detection in open source videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3681–3688.
[2] Z. Ma, Y. Yang, Z. Xu, S. Yan, N. Sebe, and A. G. Hauptmann, "Complex event detection via multi-source video attributes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2627–2633.
[3] Y. Yan, Y. Yang, H. Shen, D. Meng, G. Liu, N. Sebe, and A. G. Hauptmann, "Complex event detection via event oriented dictionary learning," in Proc. AAAI Conf. Artif. Intell., 2015.
[4] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-09-161, 2009.
[5] C. G. M. Snoek and A. W. M. Smeulders, "Visual-concept search solved?" Computer, vol. 43, no. 6, pp. 76–78, Jun. 2010.
[6] C. Sun and R. Nevatia, "ACTIVE: Activity concept transitions in video event classification," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 913–920.
[7] SIN. (2013). NIST TRECVID Semantic Indexing. [Online]. Available: http://www-nlpir.nist.gov/projects/tv2013/tv2013.html#sin
[8] Z.-Z. Lan et al., "Informedia E-lamp @ TRECVID 2013 multimedia event detection and recounting (MED and MER)," in Proc. NIST TRECVID Workshop, 2013.
[9] L. Jiang, A. G. Hauptmann, and G. Xiang, "Leveraging high-level and low-level features for multimedia event detection," in Proc. ACM Int. Conf. Multimedia, 2012, pp. 449–458.
[10] P. Natarajan et al., "Multimodal feature fusion for robust event detection in web videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1298–1305.
[11] K. Tang, D. Koller, and L. Fei-Fei, "Learning latent temporal structure for complex event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1250–1257.
[12] Z.-Z. Lan, L. Bao, S.-I. Yu, W. Liu, and A. G. Hauptmann, "Double fusion for multimedia event detection," in Proc. Int. Conf. Multimedia Modeling, 2012, pp. 173–185.
[13] S. Oh et al., "Multimedia event detection with multimodal feature fusion and temporal concept localization," Mach. Vis. Appl., vol. 25, no. 1, pp. 49–69, Jan. 2014.
[14] L. Brown et al., "IBM Research and Columbia University TRECVID-2013 multimedia event detection (MED), multimedia event recounting (MER), surveillance event detection (SED), and semantic indexing (SIN) systems," in Proc. NIST TRECVID Workshop, 2013.
[15] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1794–1801.
[16] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[18] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[19] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2006, pp. 801–808.
[20] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1697–1704.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 689–696.
[22] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 109–117.
[23] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2007, pp. 1–8.
[24] A. Jalali, P. K. Ravikumar, S. Sanghavi, and C. Ruan, "A dirty model for multi-task learning," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2010.
[25] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proc. Int. Conf. Mach. Learn., 2009, pp. 457–464.
[26] J. Chen, J. Liu, and J. Ye, "Learning incoherent sparse and low-rank patterns from multiple tasks," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, p. 22.
[27] L. Jacob, F. R. Bach, and J.-P. Vert, "Clustered multi-task learning: A convex formulation," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2008.
[28] J. Zhou, J. Chen, and J. Ye, "Clustered multi-task learning via alternating structure optimization," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2011, pp. 702–710.
[29] J. Chen, J. Zhou, and J. Ye, "Integrating low-rank and group-sparse structures for robust multi-task learning," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 42–50.
[30] A. Maurer, M. Pontil, and B. Romera-Paredes, "Sparse coding for multitask and transfer learning," in Proc. Int. Conf. Mach. Learn., 2013, pp. 343–351.
[31] X.-T. Yuan and S. Yan, "Visual classification with multi-task joint sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3493–3500.
[32] Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe, "No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1177–1184.
[33] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2042–2049.
[34] Y. Yan, E. Ricci, R. Subramanian, G. Liu, and N. Sebe, "Multitask linear discriminant analysis for view invariant action recognition," IEEE Trans.
[35] Y. Yan, E. Ricci, G. Liu, and N. Sebe, "Recognizing daily activities from first-person videos with multi-task clustering," in Proc. Asian Conf. Comput. Vis., 2014, pp. 1–16.
[36] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, 1998.
[37] M. Strube and S. P. Ponzetto, "WikiRelate! Computing semantic relatedness using Wikipedia," in Proc. AAAI Conf. Artif. Intell., 2006, pp. 1–6.
[38] NIST. (2013). NIST TRECVID Multimedia Event Detection. [Online]. Available: http://www.nist.gov/itl/iad/mig/med13.cfm
[39] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. Mach. Learn., 1998, pp. 296–304.
[40] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 448–453.
[41] J. Luo, C. Papin, and K. Costello, "Towards extracting semantically meaningful key frames from personal video clips: From humans to computers," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 289–301, Feb. 2009.
[42] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc., Ser. B, vol. 67, no. 2, pp. 301–320, Apr. 2005.
[43] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2691–2698.
[44] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[45] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. Signal Process., vol. 45, no. 3, pp. 600–616, Mar. 1997.
[46] Y. She, "Thresholding-based iterative selection procedures for model selection and shrinkage," Electron. J. Statist., vol. 3, pp. 384–415, Nov. 2009.
[47] D. Krishnan and R. Fergus, "Fast image deconvolution using hyper-Laplacian priors," in Proc. Conf. Adv. Neural Inf. Process. Syst., 2009, pp. 1033–1041.
[48] W. Zuo, D. Meng, L. Zhang, X. Feng, and D. Zhang, "A generalized iterated shrinkage algorithm for non-convex sparse coding," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 217–224.
[49] L. Cao et al., "IBM Research and Columbia University TRECVID-2011 multimedia event detection (MED) system," in Proc. NIST TRECVID Workshop, Dec. 2011.
[50] D. Oneata et al., "AXES at TRECVid 2012: KIS, INS, and MED," in Proc. NIST TRECVID Workshop, 2012.
[51] L. Jie, T. Tommasi, and B. Caputo, "Multiclass transfer learning from unconstrained priors," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 1863–1870.
[52] A. Vahdat, K. Cannons, G. Mori, S. Oh, and I. Kim, "Compositional models for video event detection: A multiple kernel learning latent variable approach," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1185–1192.
Yan Yan received the Ph.D. degree from the University of Trento in 2014. He is currently a Post-Doctoral Researcher with the MHUG Group, University of Trento. His research interests include machine learning and its application to computer vision and multimedia analysis.
Yi Yang received the Ph.D. degree in computer science from Zhejiang University. He was a Post-Doctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, from 2011 to 2013. He is currently a Senior Lecturer with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Sydney. His research interests include multimedia and computer vision.
Deyu Meng (M'13) received the B.Sc., M.Sc., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, PA, USA, from 2012 to 2014. He is currently an Associate Professor with the Institute for Information and System Sciences, School of Mathematics and Statistics, Xi'an Jiaotong University. His current research interests include principal component analysis, nonlinear dimensionality reduction, feature extraction and selection, compressed sensing, and sparse machine learning methods.
Gaowen Liu received the B.S. degree in automation from Qingdao University, China, in 2006, and the M.S. degree in system engineering from the Nanjing University of Science and Technology, China, in 2008. She is currently pursuing the Ph.D. degree with the MHUG Group, University of Trento, Italy. Her research interests include machine learning and its application to computer vision and multimedia analysis.
Wei Tong received the Ph.D. degree in computer science from Michigan State University in 2010. He is currently a Researcher with the General Motors Research and Development Center. His research interests include machine learning, computer vision, and autonomous driving.
Alexander G. Hauptmann received the B.A. and M.A. degrees in psychology from Johns Hopkins University, Baltimore, MD, the degree in computer science from the Technische Universität Berlin, Berlin, Germany, in 1984, and the Ph.D. degree in computer science from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1991. He worked on speech and machine translation from 1984 to 1994, when he joined the Informedia project for digital video analysis and retrieval, and led the development and evaluation of news-on-demand applications. He is currently a Faculty Member with the Department of Computer Science and the Language Technologies Institute, CMU. His research interests span several areas, including man-machine communication, natural language processing, speech understanding and synthesis, video analysis, and machine learning.
Nicu Sebe (M'01–SM'11) received the Ph.D. degree from Leiden University, The Netherlands, in 2001. He is currently with the Department of Information Engineering and Computer Science, University of Trento, Italy, where he leads the research in the areas of multimedia information retrieval and human behavior understanding. He is a Senior Member of the Association for Computing Machinery and a Fellow of the International Association for Pattern Recognition. He was the General Co-Chair of FG 2008 and ACM Multimedia 2013, and the Program Chair of CIVR 2007 and 2010, and ACM Multimedia 2007 and 2011. He will be the Program Chair of ECCV 2016 and ICCV 2017.