Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Organizing egocentric videos of daily living activities

Alessandro Ortis a, Giovanni M. Farinella a,∗, Valeria D'Amico b, Luca Addesso b, Giovanni Torrisi b, Sebastiano Battiato a

a Image Processing Laboratory, Dipartimento di Matematica e Informatica, Università degli Studi di Catania, Viale A. Doria 6, Catania - 95125, Italy
b TIM - Telecom, JOL WAVE, Viale A. Doria 6, Catania - 95125, Italy

Article info

Article history:
Received 4 October 2016
Revised 3 May 2017
Accepted 7 July 2017
Available online 14 July 2017

Keywords:
First person vision
Video summarization
Video indexing

Abstract

Egocentric videos are becoming popular thanks to the possibility of observing the scene flow from the user's point of view (First Person Vision). Among the different applications of egocentric vision is the daily living monitoring of a user wearing the camera. We propose a system able to automatically organize egocentric videos acquired by the user over different days. Through an unsupervised temporal segmentation, each egocentric video is divided into chapters by considering the visual content. The obtained video segments related to the different days are hence connected according to the scene context in which the user acts. Experiments on a challenging egocentric video dataset demonstrate the effectiveness of the proposed approach, which outperforms the state of the art in accuracy and computational time by a good margin.

© 2017 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2017.07.010

1. Introduction and motivations

In the last years there has been a rapid emergence of wearable devices, including body sensors, smart clothing and wearable cameras. These technologies can have a significant impact on our lives if the acquired data are used to assist the users in tasks related to the monitoring of the quality of life [1–4]. In particular, egocentric cameras enabled the design and development of useful systems that can be organized into three general categories with respect to the assistive tasks:

(i) User Aware - systems able to understand what the user is doing and what/how he interacts with, by recognizing actions and behaviours from the first person perspective.

(ii) Environment/Objects Aware - systems able to understand what objects surround the user, where they are with respect to the user's perspective and what the environment looks like.

(iii) Target Aware - systems able to understand what others are doing, and how they interact with the user that is wearing the device.

The egocentric monitoring of a person's daily activities can help to stimulate the memory of users that suffer from memory disorders [5]. Several works on recognition and indexing of daily living activities of patients with dementia have been recently proposed [6–9]. The exploitation of aids for people with memory problems has proved to be one of the most effective ways to aid rehabilitation [10]. Furthermore, the recording and organization of daily habits performed by a patient can help a doctor to form a better opinion with respect to the specific patient's behaviour and hence his health needs. To this aim, a set of egocentric videos recorded among different days with a camera worn by a patient can be analysed by experts to monitor the user's daily living activities for assistive purposes. Live recording for lifelogging applications poses challenges on how to perform automatic indexing and summarization of big personal multimedia data [11].

Beside assistive technologies, the segmentation and semantic organization of egocentric videos is useful in many application scenarios where wearable cameras have recently become popular, including lifelogging [11], law enforcement [12] and social cameras [13]. Other applications of egocentric video analysis are related to action and activity recognition from egocentric videos [14–16], and recognition of interaction-level activities from videos in first-person view [17] (e.g., recognition of human-robot interactions). For all these applications, the segmentation of daily egocentric videos into meaningful chapters and the semantic organization of such video segments related to different days is an important first step which adds structure to egocentric videos, enabling tools for the indexing, browsing and summarization of egocentric video sets.

∗ Corresponding author.
E-mail addresses: ortis@dmi.unict.it (A. Ortis), gfarinella@dmi.unict.it (G.M. Farinella), valeria1.damico@telecomitalia.it (V. D'Amico), luca.addesso@telecomitalia.it (L. Addesso), giovanni.torrisi@telecomitalia.it (G. Torrisi), battiato@dmi.unict.it (S. Battiato).

In the last years, several papers have addressed different problems related to vision tasks from the first person perspective. The work in [18] proposes a temporal segmentation method of egocentric videos with respect to 12 different activities organized hierarchically upon cues based on the wearer's motion (e.g., static, sitting, standing, walking, etc.). A benchmark study considering different wearable devices for context recognition with a rejection mechanism is presented. The system discussed in [19] aims at segmenting unstructured egocentric videos to highlight the presence of given personal contexts of interest. In [4] the authors perform a benchmark on the main representations and wearable devices used for the task of context recognition from egocentric videos. A method that takes a long input video and returns a set of video subshots depicting the essential moments is detailed in [20]. The method proposed in [21] learns the sequences of actions involved in a set of habitual daily activities to predict the next actions and generate notifications if there are missing actions in the sequence. The framework presented in [13] (RECfusion) is able to automatically process multiple video flows from different mobile devices to understand the most popular scenes for a group of end-users. The output is a video which represents the most popular scenes organized over time.

In this paper, we build on the RECfusion method [13], improving it for the context of daily living monitoring from egocentric videos. In RECfusion multiple videos are analysed by using two algorithms: the former is used to segment the different scenes over time (intraflow analysis), whereas the latter is employed to group the videos related to the involved devices over time. As reported in the experimental results of [13], the intraflow analysis of RECfusion suffers when applied to egocentric videos because they are highly unstable due to the user's movements.

The framework proposed in this paper achieves better performance for egocentric video organization, in terms of both segmentation accuracy and computational cost. The proposed method takes a set of egocentric videos regarding the daily living of a user among different days, and performs an unsupervised segmentation of them. The obtained video segments among different days are then organized by contents. The video segments of the different days sharing the same contents are then visualized by exploiting an interactive web-based user interface. In our framework we use a unique representation of the frames, based on CNN features [22] (for both intraflow and between flows analysis), instead of the two different representations based on SIFT and color histograms as proposed in RECfusion. Experiments show that the proposed framework outperforms RECfusion for daily living egocentric video organization. Moreover, the approach obtains better accuracy than RECfusion for the popularity estimation task for which RECfusion has been designed.

The rest of the paper is organized as follows. Section 2 presents the designed framework. Section 3 describes the considered wearable dataset. The discussion on the obtained results is reported in Section 4. In Section 5 the developed system is compared with respect to RECfusion [13] on mobile videos. Finally, Section 6 concludes the paper and gives insights for further works.

2. Proposed framework

The proposed framework performs two main steps on the videos acquired by a wearable camera: temporal segmentation and segment organization. Fig. 1 shows the scheme of the overall pipeline of our system.

Starting from a set of egocentric videos recorded among multiple days (Fig. 1(a)), the first step performs an intraflow analysis of each video to segment it with respect to the different scenes observed by the user (Fig. 1(b)). Each video is segmented by exploiting temporal and visual correlations between frames of the same video (Fig. 1(b): the colour of each block identifies a scene observed within the same video). Then, the segments obtained over different days and referring to the same contents are grouped by means of a between flows analysis, which implements an unsupervised clustering procedure aimed at semantically associating video segments obtained from different videos (Fig. 1(c): the colour of each block now identifies video segments associated to the same visual content, observed in different days). The system hence produces sets of video clips related to each location where the user performs daily activities (e.g., the set of the clips over days where the user washes dishes in the kitchen, the set related to the activity of the user playing piano, and so on). The clips are organized taking into account both visual and temporal correlations. Finally, the framework provides a web-based interface to enable browsing of the organized videos (Fig. 1(d)). In the following subsections the details of the different steps involved in the pipeline are given.

2.1. Intraflow analysis

The intraflow analysis performs the unsupervised temporal segmentation of each input video, and associates a scene ID to each video segment. Segments with the same content have to share the same scene ID. This problem has been addressed in [13] for videos acquired with mobile devices. To better explain the problem, in the following we focus our analysis on the issues related to the intraflow algorithm detailed in [13] when applied on first-person videos. This is useful to introduce the main problems of a classic feature-based matching approach for temporal segmentation in the wearable domain. Then we present our solution for the intraflow analysis. Furthermore, in Appendix A the pseudocode describing the proposed intraflow analysis approach is reported.

2.1.1. Issues related to SIFT-based templates

The intraflow analysis used in [13] compares two scenes considering the number of matchings between a reference template and the frame under consideration (current frame). The scene template is a set of SIFT descriptors that must accomplish specific properties of "reliability". When the algorithm detects a sudden decrease in the number of matchings, it refreshes the reference template by extracting a new set of SIFTs and splits the video. In order to detect such changes, the system computes the value of the slope of the matching function (i.e., the variation of the number of matchings in a range interval). When the slope is positive and over a threshold (which corresponds to a sudden decrease of the number of matchings between the SIFT descriptors) the algorithm finds a new template (i.e., a new block). When a new template is defined, it is compared with the past templates in order to understand if it regards a new scene or it is related to a known one (backward search phase). Although this method works very well with videos acquired with mobile cameras, it presents some issues when applied to videos acquired with wearable devices. In such egocentric videos, the camera is constantly moving due to the shake induced by the natural head motion of the wearer. This causes a continuous refresh of the reference template, which is not always matched with similar scenes during the backward search. Hence, the method produces an oversegmentation of the egocentric videos. Furthermore, it requires performing several SIFT descriptor extraction and matching operations (including geometric verifications) to exclude false positive matchings. This has a negative impact on the realtime performances. The first row of Fig. 2 shows the Ground Truth segmentation of a video acquired with a wearable camera in a home environment.1 An example of a temporal segmentation obtained with the SIFT-based intraflow analysis proposed in [13] on an egocentric video is reported in the second row of Fig. 2.

1 The video related to the example in Fig. 2 is available for visual inspection at the URL http://iplab.dmi.unict.it/dailylivingactivities/homeday.html.


Fig. 1. Overall scheme of the proposed framework.

Fig. 2. Output of the intraflow analysis using SIFT and CNN features when applied to the video Home Day 1. The first row is the Ground Truth of the video. The second row shows the result of the intraflow analysis with the method discussed in [13]. The third row shows the result of the intraflow analysis of our method, whereas the last two rows show the results of our method with the application of the proposed scene coherence and duration filtering criteria.

The algorithm works well when the scene is quite stable (e.g., when the user is watching a TV program), but it makes several errors when the scene is highly unstable due to head movements. In fact, in the middle of the video related to Fig. 2, the user is washing dishes at the sink, and he is continuously moving his head. In this example the intraflow approach based on SIFT features detects a total of 8 different scenes instead of 3. The algorithm cannot find the matchings between the current frame and the reference template due to two main reasons:

1. when the video is unstable, even though the scene content doesn't change, the matchings between local features are not reliable and stable along time;

2. in a closed space such as an indoor environment, the different objects of the scene can be very close to the viewer. Hence a small movement of the user's head is enough to cause a high number of mismatches between local features.

2.1.2. CNN-based image representation

To deal with the issues described in the previous section, we exploit a holistic feature to describe the whole image rather than an approach based on local features. In particular, in the intraflow analysis we represent frames by using features extracted with a Convolutional Neural Network (CNN) [23]. Specifically, we consider the CNN proposed in [22] (AlexNet). In our experiments, we exploit the representation obtained considering the output of the last hidden layer of AlexNet, which consists of a 4096-dimensional feature vector (fc7 features). We decided to use the AlexNet representation since it has been successfully used as a general image representation for classification purposes in the last few years [24,25]. Moreover, the features extracted by AlexNet have been successfully used for transfer learning [26–28]. Finally, the AlexNet architecture is a small network compared to others (e.g., VGG [29]). Thus, it allows performing the feature extraction very quickly. The proposed solution computes the similarity between scenes by comparing a pair of fc7 features with the cosine similarity measure. The cosine similarity of two vectors measures the cosine of the angle between them. This measure is independent of the magnitude of the vectors, and is well suited to compare high dimensional sparse vectors, such as the fc7 features. The cosine similarity of the two fc7 feature vectors v1 and v2 is computed as follows:

$$\mathrm{CosSimilarity}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\|\,\|\mathbf{v}_2\|} \quad (1)$$
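For concreteness, the following is a minimal sketch of this frame comparison, assuming the fc7 features have already been extracted as 4096-dimensional NumPy vectors; the extract_fc7 name in the usage comment is a hypothetical placeholder for any AlexNet fc7 extractor, and 0.60 is the forward threshold Tf reported in Section 2.1.3.

import numpy as np

def cos_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Cosine of the angle between two fc7 feature vectors (Eq. (1)).
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return 0.0  # degenerate case: at least one all-zero vector
    return float(np.dot(v1, v2) / denom)

# Usage: compare a reference template against the current sampled frame.
# template_fc7 = extract_fc7(reference_frame)   # hypothetical AlexNet fc7 extractor
# frame_fc7    = extract_fc7(current_frame)
# same_scene_candidate = cos_similarity(template_fc7, frame_fc7) >= 0.60  # Tf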

2.1.3. Proposed intraflow analysis

During the intraflow analysis, the proposed algorithm computes the cosine similarity between the current reference template and the fc7 features extracted from the frames following the reference. When the algorithm detects a sudden decrease in the cosine similarity, it refreshes the reference template, selecting a new fc7 feature corresponding to a stable frame. As in [13], to detect such changes the system computes the value of the slope (i.e., the variation of the cosine similarity in a range interval). When the slope has a positive peak (which corresponds to a sudden decrease of the cosine similarity) the algorithm finds a new template and produces a new video segment. There are two cases in which the intraflow analysis compares two templates:

1. a template is compared with the features extracted from the forward frames, when the algorithm has to check its eligibility to be a reference template for the involved scene;

2. a template is compared with the past templates during the backward checking phase to establish the scene ID.

In the first case, the elapsed time between the two compared frames depends on the sampling rate of the frames (in our experiments, we sampled every 20 frames for videos recorded at 30 fps). Differently, the frames compared during the backward checking could be rather different due to the elapsed time between them. For this reason, when we compare a new template with a past template, we assign the templates to the same scene ID by using a weaker condition with respect to the one used in the forward verification. In the forward process, the algorithm assigns the same scene ID to the corresponding frames if the cosine similarity between their descriptors is higher than a threshold Tf (equal to 0.60 in our experiments). When the algorithm compares two templates in the backward process, it assigns the same scene to the corresponding frames if the similarity is higher than a threshold Tb (equal to 0.53 in our experiments).
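As an illustration of the refresh trigger, here is a small sketch under the assumption that the "slope" is computed as the drop of the cosine similarity over a short window of sampled frames (so that a sudden decrease produces a positive peak); the window size and the peak threshold below are illustrative values, not the paper's.

from typing import List

def similarity_drop(sims: List[float], t: int, window: int = 5) -> float:
    # Drop of the similarity signal over the last `window` samples;
    # sims[t] holds CosSimilarity(reference fc7, fc7 of the t-th sampled frame).
    if t < window:
        return 0.0
    return sims[t - window] - sims[t]

def needs_new_template(sims: List[float], t: int,
                       window: int = 5, peak_thr: float = 0.15) -> bool:
    # A sharp fall of the similarity to the reference template starts a new segment.
    return similarity_drop(sims, t, window) > peak_thr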

Besides the image representation, our intraflow algorithm introduces two additional rules:


1. In [13] each video frame is assigned to a scene ID or it is classified as Noise and hence rejected. Our approach is able to distinguish, among the rejected frames, the ones caused by the movement of the head of the user (Head Noise) from the frames related to the transition between two scenes in the video (Transition). When a new template is defined after a group of consecutive rejected frames, the frames belonging to the rejected group are considered as "Transition" if the duration of the block is longer than 3 s (i.e., head movements are faster than 3 s); otherwise they are classified as "Head Noise". In case of noise, the algorithm doesn't assign a new scene ID to the frame that follows the "Head Noise" video segment, because the noise is related to head movements but the user is in the same context. When the user changes his location, he changes his position in the environment; thus, the transition between different scenes involves a longer time interval.

2. The second rule is related to the backward verification. In [13] it is performed starting from the last found template and proceeding backward. The search stops when the process finds the first past template that has a good matching score with the new template. Such an approach is quite appropriate for the method in [13] because it compares sets of local features and relies on the number of achieved matchings. The approach proposed in this paper compares pairs of feature vectors instead of sets of local descriptors, and selects the past template that yields the best similarity to the new one. In particular, the method compares the new template with all the previous ones, considering all the past templates that yield a cosine similarity greater than Tb. From this set of positive cases, the algorithm selects the one that achieves the maximum similarity, even if it is not the most recent in time (see the sketch below).

Considering the example in Fig. 2, the segmentation results achieved with the proposed intraflow approach (third row) are much better than the ones obtained using SIFT features [13] (second row).

After the discussed intraflow analysis, a segmentation refinement is performed as detailed in the following subsection.

2.1.4. Intraflow segmentation refinement

Starting from the result of the proposed intraflow analysis (see the third row of Fig. 2), we can easily distinguish between "Transition" and "Noise" blocks among all the rejected frames. A block of rejected frames is a "Noise" block if both the previous and the next detected scenes are related to the same visual content, otherwise it is marked as a "Transition" block. We refer to this criterion as Scene Coherence. The result of this step on the example considered in Fig. 2 is shown in the fourth row. When comparing the segmentation with the Ground Truth (first row), the improvement with respect to [13] (second row) is evident. Moreover, many errors of the proposed intraflow analysis (third row) are removed. The second step of the segmentation refinement consists in keeping only the blocks related to user activity in a location with a duration longer than 30 s (Duration Filtering), unless they are related to the same scene as the previous block (i.e., after a period of noise, the scene is the same as before but it has a limited duration). We applied this criterion because, in the context of Activities of Daily Living (ADL), we are interested in detecting the activities of a person in a location that have a significant duration, in order to be able to observe the behavior. This refinement step follows the Scene Coherence one. The final result on the considered example is shown in the last row of Fig. 2. Although some frames are incorrectly rejected (during scene changes), the proposed pipeline is much more robust than [13] (compare the first, second and fifth rows of Fig. 2). This outcome is quantitatively demonstrated in the experimental section of this paper on a dataset composed of 13 egocentric videos.
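The two refinement rules can be sketched as follows, under the assumption that a segmented video is available as a list of (label, start, end) blocks in seconds, with rejected frames already grouped into blocks labelled "REJECTED"; the data layout and helper names are illustrative.

from typing import List, Tuple, Union

Label = Union[int, str]
Segment = Tuple[Label, float, float]  # (scene label or "REJECTED", start s, end s)

def scene_coherence(segments: List[Segment]) -> List[Segment]:
    # Rejected blocks surrounded by the same scene become "NOISE" (the user is
    # still in the same context); otherwise they are marked as "TRANSITION".
    out: List[Segment] = []
    for i, (lab, s, e) in enumerate(segments):
        if lab == "REJECTED":
            prev_lab = segments[i - 1][0] if i > 0 else None
            next_lab = segments[i + 1][0] if i + 1 < len(segments) else None
            lab = "NOISE" if prev_lab is not None and prev_lab == next_lab else "TRANSITION"
        out.append((lab, s, e))
    return out

def duration_filtering(segments: List[Segment], min_dur: float = 30.0) -> List[Segment]:
    # Keep scene blocks lasting at least `min_dur` seconds, or shorter blocks
    # that continue the same scene as the previously kept block.
    kept: List[Segment] = []
    for lab, s, e in segments:
        continues_prev = bool(kept) and kept[-1][0] == lab
        if isinstance(lab, int) and (e - s >= min_dur or continues_prev):
            kept.append((lab, s, e))
    return kept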

2.2. Between video flows analysis

When each egocentric video is segmented, the successive challenge is to determine which video segments among the different days are related to the same content. Given a segment block b_{v_A} extracted from the egocentric video v_A, we compare it with respect to all the segment blocks extracted from the other egocentric videos v_{B_j}. To represent each segment, we consider again the CNN fc7 features extracted from one of the frames of the video segment. This frame is selected considering the template which achieved the longest stability time during the intraflow analysis. For each block b_{v_A}, our approach assigns the ID of b_{v_A} (obtained during the intraflow analysis) to all the blocks b^{*}_{v_{B_j}} extracted from the other egocentric videos v_{B_j} such that

$$b^{*}_{v_{B_j}} = \operatorname*{arg\,max}_{b_{v_{B_j}} \in\, v_{B_j}} \left\{ \mathrm{CosSimilarity}(b_{v_A}, b_{v_{B_j}}) \;\middle|\; \sigma(b_{v_A}, v_{B_j}) \geq T_{\sigma} \right\} \quad \forall\, v_{B_j} \quad (2)$$

where σ(b_{v_A}, v_{B_j}) is the standard deviation of the cosine similarity values obtained by considering the segment block b_{v_A} and all the blocks of v_{B_j}, and T_σ is the decision threshold. This procedure is performed for all the segment blocks b_{v_A} of the video v_A. When all the blocks of v_A have been considered, the algorithm takes into account a video of another day, and the matching process between video segments of the different days is repeated until all the video segments in the pool are processed. In this way a scene ID is assigned to all the blocks of all the considered videos. The pairs of blocks with the same ID are associated to the same scene, even if they belong to different videos, and all the segments are connected in a graph with multiple connected components (as in Fig. 1(c)). When there is a high variability in the cosine similarity values (i.e., the value of σ(b_{v_A}, v_{B_j}) is high), the system assigns the scene ID to the segment block that achieved the maximum similarity. When a block matches with two blocks related to two different scene IDs, the system assigns the scene ID related to the block which achieved the highest similarity value. When a block isn't matched, it means that all the similarity values of the mismatched blocks are similar. This causes low values of σ and helps the system to understand that the searched scene ID is not present (i.e., scenes with only one instance among all the considered videos). To better explain the proposed approach, the pseudocode related to the above described between video flows analysis is reported in Appendix A.
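A compact sketch of this matching step, assuming each segmented video is available as a list of blocks represented by the fc7 vector of their most stable template; the Block container, function name and return convention are illustrative.

from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class Block:
    fc7: np.ndarray              # fc7 feature of the most stable template
    scene_id: Optional[int] = None

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(b_vA: Block, other_video: List[Block],
               T_sigma: float) -> Optional[Tuple[Block, float]]:
    # Eq. (2): pick the most similar block of the other video, but only when the
    # similarity values are spread enough (std >= T_sigma); a flat profile means
    # the searched scene is probably not present in that video.
    sims = np.array([_cos(b_vA.fc7, b.fc7) for b in other_video])
    if sims.size == 0 or sims.std() < T_sigma:
        return None
    j = int(sims.argmax())
    return other_video[j], float(sims[j])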

3. Dataset

To demonstrate the effectiveness of the proposed approach, we have considered a set of representative egocentric videos to perform the experiments. The egocentric videos are related to different days and have been acquired using a Looxcie LX2 wearable camera with a resolution of 640 × 480 pixels. The duration of each video is about 10 min. The videos are related to the following scenarios:

Home Day: a set of four egocentric videos taken in a home environment. In this scenario, the user performs typical home activities such as cooking, washing dishes, watching TV, and playing piano. This set of videos has been borrowed from the 10 contexts dataset proposed in [19], which is available at the following URL: http://iplab.dmi.unict.it/PersonalLocations/segmentation/.

Office Day: a set of six egocentric videos taken in a home and in different office environments. This set of videos concerns several activities performed during a typical day. Also this set of videos belongs to the 10 contexts dataset [19].

(5)

Table 1
Intraflow performances on the considered egocentric video dataset obtained using [13] and the proposed approach. Each test is evaluated considering the accuracy of the temporal segmentation (Q), the computation time (Time) and the number of scenes detected by the algorithm (Scenes). The accuracy is measured as the percentage of correctly classified frames with respect to the Ground Truth. The measured time includes the feature extraction process. "Proposed" denotes the proposed intraflow approach; "refined" denotes the proposed intraflow approach with segmentation refinement.

Video | Scenario | Scenes (GT) | Q [13] | Scenes [13] | Time [13] | Q (proposed) | Scenes (proposed) | Q (refined) | Scenes (refined) | Time (refined)
1 | HomeDay1 | 3 | 62.5% | 8 | 20'45" | 77.5% | 4 | 92.5% | 3 | 1'23"
2 | HomeDay2 | 3 | 71.6% | 3 | 20'18" | 80.3% | 4 | 94.5% | 3 | 1'46"
3 | HomeDay3 | 3 | 64.3% | 5 | 19'03" | 79.7% | 5 | 94.3% | 3 | 1'21"
4 | HomeDay4 | 3 | 84.4% | 3 | 8'36" | 91.8% | 3 | 85.4% | 2 | 36"
5 | WorkingDay1 | 4 | 95.7% | 5 | 16'16" | 98.4% | 5 | 99.5% | 4 | 1'22"
6 | WorkingDay2 | 4 | 82.5% | 5 | 15'15" | 98.9% | 5 | 100% | 4 | 1'08"
7 | WorkingDay3 | 5 | 98.7% | 6 | 19'02" | 99.2% | 6 | 99.4% | 5 | 1'29"
8 | OfficeDay1 | 3 | 23.0% | 5 | 24'8" | 55.3% | 19 | 66.9% | 2 | 2'39"
9 | OfficeDay2 | 2 | 59.7% | 2 | 10'25" | 90.0% | 3 | 98.7% | 2 | 1'26"
10 | OfficeDay3 | 3 | 57.2% | 4 | 13'28" | 83.6% | 10 | 96.3% | 3 | 1'49"
11 | OfficeDay4 | 3 | 52.0% | 4 | 11'37" | 79.5% | 5 | 84.1% | 4 | 1'41"
12 | OfficeDay5 | 3 | 70.7% | 3 | 8'35" | 86.7% | 4 | 95.9% | 3 | 1'21"
13 | OfficeDay6 | 3 | 78.8% | 3 | 9'33" | 61.5% | 5 | 94.5% | 4 | 1'34"
Average | | | 69.4% | | 15'9" | 83.3% | | 91.9% | | 1'31"

Working Day: a set of three videos taken in a laboratory environment. The activities performed by the user in this scenario regard reading a book, working in a laboratory, sitting in front of a computer, etc.

Each video has been manually segmented to define the blocks of frames to be detected in the intraflow analysis. Moreover, the segments have been labeled with the scene ID to build the Ground Truth for the between video analysis. The Ground Truth is used to evaluate the performances of the proposed framework. The used egocentric videos, as well as the Ground Truth, are available at the following URL: http://iplab.dmi.unict.it/dailylivingactivities/.

4. Experimental results

In this section we report the temporal segmentation and the between flows video analysis results obtained on the considered dataset.

4.1. Temporal segmentation results

Table 1 shows the performances of the proposed temporal segmentation method (see Section 2.1). We compared our solution with respect to the one adopted by RECfusion [13]. For each method we computed the quality of the segmentation as the percentage of correctly classified frames (Q), the number of detected scenes and the computational time.2 The proposed approach obtains strong improvements (up to over 30%) in segmentation quality with respect to RECfusion (e.g., results at the eighth and ninth rows of Table 1). Furthermore, the application of the segmentation refinements provides improvements up to 43% in segmentation quality (results at row eight in Table 1). In the fourth row of Table 1 (related to the analysis of the video Home Day 4), we can observe that the application of the segmentation refinements causes a decrease in performances. This video is very short compared to the other videos of the dataset. It has a duration of just 4'28" and consists of a sequence of three different scenes (piano, sink and TV). The scene blocks are correctly detected by the proposed intraflow approach, which finds exactly 3 different scenes and achieves 91.8% accuracy without refinement. However, the middle scene (sink) has a duration of just 22 s according to the Ground Truth used in [19], thus the refinement process rejects this block due to the application of the Duration Filtering criterion. Considering the mean performances (last row in Table 1), our system achieves an improvement of over 14% without segmentation refinements, with over 22% of margin after the segmentation refinement. The proposed method also reduces the computational time by more than 21 min in some cases (eighth row in Table 1). It has an average computational time saving of about 13 min with respect to the compared approach [13]. The results of Table 1 show that the application of the Scene Coherence and the Duration Filtering criteria used in the segmentation refinement step (Section 2.1.4) allows detecting the correct number of scenes.

2 The experiments have been performed with Matlab 2015a, running on a 64-bit Windows 10 OS, on a machine equipped with an Intel i7-3537U CPU and 8 GB RAM.

In sum, considering the qualitative and quantitative results reported respectively in Fig. 2 and Table 1, the proposed system is demonstrated to be robust for the temporal segmentation of egocentric videos, and it provides high performance with a lower computational time with respect to [13].

4.2. Between flows video analysis results

In our experiments, all the segments related to the Home Day and Working Day scenarios have been correctly connected among the different days without errors. Fig. 3 shows two timelines related to the videos of the scenario Working Day, whereas Fig. 4 shows the timelines related to the scenario Home Day. The first timeline in Figs. 3 and 4 shows the Ground Truth labeling. In this timeline, the black blocks indicate the transition intervals (to be rejected). The second timeline shows the result obtained by our framework. In this case, the black blocks indicate the frames automatically rejected by the algorithm. In order to better assess the results obtained by the proposed system, the reader can perform a visual inspection of all the results produced by our approach at the following URL: http://iplab.dmi.unict.it/dailylivingactivities/. Through the web interface the different segments can be explored.

Differently than the Home Day and Working Day scenarios, some matching errors occurred in the between flow analysis of the Office Day scenario (see Fig. 5). Since we are grouping video blocks by contents, to better evaluate the performance of the proposed between flow analysis we considered three quality scores usually used in clustering theory. The simplest clustering evaluation measure is the Purity of the clustering: each cluster of video segments obtained after the between flow analysis is assigned to the most frequent class in the cluster. The Purity measure is hence the mean of the number of correctly assigned video blocks within the clusters:


Fig. 3. The two timelines show the Ground Truth segmentation of egocentric videos related to the Working Day scenario and their organization (top). The inferred segmentation and organization obtained by the proposed method is reported in the bottom.

Fig. 4. The two timelines show the Ground Truth segmentation of egocentric videos related to the Home Day scenario and their organization (top). The inferred segmentation and organization obtained by the proposed method is reported in the bottom.

Fig. 5. The two timelines show the Ground Truth segmentation of egocentric videos related to the Office Day scenario and their organization (top). The inferred segmentation and organization obtained by the proposed method is reported in the bottom. In this case, the blocks related to the scenes ‘office’, ‘studio’ and ‘laboffice’ are clustered in the same cluster (identified by the color red), with the exception of only one ‘laboffice’ block.


Table 2
Between video clustering results.

Dataset | # of Videos | Original GT: Purity | RI | F1 | F2 | New GT: Purity | RI | F1 | F2
WorkingDay | 3 | 1 | 1 | 1 | 1 | – | – | – | –
HomeDay | 4 | 1 | 1 | 1 | 1 | – | – | – | –
OfficeDay | 6 | 0.68 | 0.81 | 0.49 | 0.57 | 0.95 | 0.89 | 0.80 | 0.75
Office + Home | 10 | 0.58 | 0.83 | 0.41 | 0.54 | 0.74 | 0.87 | 0.61 | 0.67

$$\mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} \left| C_i \cap T_j \right| \quad (3)$$

where N is the number of video blocks, k is the number of clusters, C_i is the ith cluster and T_j is the set of elements of the class j that are present in the cluster C_i. The clustering task (i.e., the grouping performed by the between flow analysis in our case) can be viewed as a series of decisions, one for each of the N(N − 1)/2 pairs of the elements to be clustered [30].

The algorithm obtains a true positive decision (TP) if it assigns two videos of the same class to the same cluster, whereas a true negative decision (TN) if it assigns two videos of different classes to different clusters. Similarly, a false positive decision (FP) assigns two different videos to the same cluster and a false negative decision (FN) assigns two similar videos to different clusters. With the above formalization we can compute the confusion matrix associated to the pairing task.

From the confusion matrix we compute the Rand Index (RI), which measures the percentage of correct decisions (i.e., the pairing accuracy) of the between flow analysis:

$$RI = \frac{TP + TN}{TP + FP + FN + TN} \quad (4)$$

In order to take into account both precision and recall, we also considered the F_β measure, defined as follows:

$$F_{\beta} = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \quad (5)$$
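As a minimal sketch of how these three scores can be computed from the block labels, the following assumes two parallel lists: the ground truth class of each video block and the cluster (graph component) it was assigned to. The pair counting follows the N(N−1)/2 decision view above; function names are illustrative and degenerate cases (e.g., no positive pairs) are not handled.

from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple

def purity(classes: Sequence, clusters: Sequence) -> float:
    # Each cluster is credited with its most frequent ground-truth class (Eq. (3)).
    total = 0
    for c in set(clusters):
        members = [classes[i] for i, cl in enumerate(clusters) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(classes)

def pairwise_scores(classes: Sequence, clusters: Sequence, beta: float = 1.0) -> Tuple[float, float]:
    # Count the four kinds of pairwise decisions, then Rand Index (Eq. (4))
    # and F_beta (Eq. (5)).
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    ri = (tp + tn) / (tp + tn + fp + fn)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return ri, f_beta

# Example: Rand Index and F2 (recall weighted higher than precision).
# ri, f2 = pairwise_scores(gt_classes, predicted_clusters, beta=2.0)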

Specifically, we considered the F1 measure, which weights precision and recall equally, and the F2 measure, which weights recall higher than precision and, therefore, penalizes false negatives more than false positives. Table 2 shows the results of the proposed between flow analysis approach considering the aforementioned measures. For the Home Day and Working Day scenarios our approach achieves the maximum scores for all the considered evaluation measures (column "Original GT" of Table 2). Regarding the Office Day scenario, we obtain lower scores of Purity and Rand Index. The obtained values of the F1 and F2 measures indicate that the proposed between flow approach achieves higher recall than precision. This can be further verified by observing Fig. 6. This figure shows the co-occurrence matrix obtained considering the Office Day scenario. The columns of the matrix represent the computed graph components (clusters) obtained with the between flow analysis, whereas each row represents a scene class according to the Ground Truth. The values of this matrix express the percentage of video blocks belonging to a specific class that are assigned to each graph component (according to the Ground Truth used in [19]). This figure shows that even if different blocks are included in the same graph component (FP), the majority of the blocks belonging to the same class are assigned to the same graph component (TP). The second column of the co-occurrence matrix in Fig. 6 shows that the "laboffice", "office" and "studio" blocks have been assigned to the graph component C2. This error is due to a strong ambiguity in the visual content of these three classes. Fig. 8 shows some examples of the frames belonging to these classes.

Fig. 6. Co-occurrence matrix obtained considering the Office Day scenario.

Fig. 7. Co-occurrence matrix obtained considering the Office Day scenario with the "New GT".

We can observe that these scenes are all related to the same activity (i.e., working at a PC) and the visual content is very similar. We repeated the tests considering an alternative Ground Truth that considers the three above mentioned scene classes to be a unique class (Laboffice). The results of this test are reported in the column labeled as "New GT" of Table 2. In this case the Office Day scenario result achieves significant improvements for all the considered evaluation measures. Fig. 7 shows the co-occurrence matrix obtained considering the new Ground Truth.

The last row of Table 2 reports the results obtained by applying the proposed approach on the set of 10 videos obtained by considering only the Home Day and the Office Day sets of videos. This test has been done to analyse the robustness of the proposed approach considering more contexts and also scenes that appear only in some videos. Furthermore, to better assess the effectiveness of the proposed approach, we performed a number of incremental tests by varying the number of input videos from 2 to 10 videos of the considered set. This corresponds to considering from 2 to 10 days of monitoring. The obtained scores, reported in Fig. 9, show that the evaluation measures are quite stable. We also evaluated the performance of the proposed approach by varying the threshold Tσ value (see Fig. 10). Also in this case the results demonstrate that the method is robust.


Fig. 8. Some first person view images from the Office Day scenario depicting frames belonging to the classes “laboffice”, “office” and “studio”.

Fig. 9. Evaluation measures obtained by varying the number of videos.

Fig. 10. Evaluation measures obtained by varying the threshold Tσ.

We also tested the between flows analysis approach by using the color histogram and the distance function used in RECfusion [13]. When the system uses such an approach there is a high variance in the obtained distance values even if there isn't any segmentation block to be matched among videos. This causes matchings between uncorrelated blocks and hence errors.

5. Popularity estimation

Considering the improvements achieved by the proposed method in the context of egocentric videos, we have tested the approach to solve the popularity estimation problem defined in [13]. The popularity of the observed scene in an instant of time depends on how many devices are looking at that scene. In the context of organizing egocentric videos, the popularity can be useful to estimate the most popular scene over days.

To properly test our approach for popularity estimation we have considered the dataset introduced in [13]. This allows a fair comparison with the state of the art approaches for this problem, because the results can be directly compared with the ones in [13].

Table 3
Experimental results on the popularity estimation problem.

Dataset | dev | [13]: Pa/Pr | Pg/Pr | Po/Pr | Proposed method: Pa/Pr | Pg/Pr | Po/Pr
Foosball | 4 | 1.023 | 1 | 0.023 | 1 | 1 | 0
Meeting | 2 | 1.011 | 0.991 | 0.020 | 0.991 | 0.991 | 0
Meeting | 4 | 0.988 | 0.955 | 0.033 | 0.930 | 0.930 | 0
Meeting | 5 | 0.886 | 0.755 | 0.131 | 0.698 | 0.704 | 0.006
SAgata | 7 | 1.049 | 1 | 0.050 | 0.989 | 0.989 | 0
Differently than [13], after processing the videos with the intraflow analysis and segmentation refinement proposed in this paper, we have used the proposed CNN features and the clustering approach of [13]. To evaluate the performances of the compared methods we used three measures. For each clustering step we compute:

Pr: ground truth popularity score (number of cameras looking at the most popular scene) obtained from manual labelling;

Pa: popularity score computed by the algorithm (number of the elements in the popular cluster);

Pg: number of correct videos in the popular cluster (inliers).

From the above scores, the weighted means of the ratios Pa/Pr and Pg/Pr over all the clustering steps are computed [13]. The ratio Pa/Pr provides a score for the popularity estimation, whereas the ratio Pg/Pr assesses the visual content of the videos in the popular cluster. These two scores focus only on the popularity estimation and the number of inliers in the most popular cluster. Since the aim is to infer the popularity of the scenes, it is useful to look also at the number of outliers in the most popular cluster. In fact, the results reported in [13] show that when the algorithm works with a low number of input videos, the most popular cluster is sometimes affected by outliers. These errors could affect the popularity estimation of the clusters and, therefore, the final output. Thus, we introduced a third evaluation measure that takes into account the number of outliers in the most popular cluster. Let Po be the number of wrong videos in the popular cluster (outliers). From this score, we compute the weighted mean of the ratio Po/Pr over all the clustering steps, where the weights are given by the length of the segmented blocks (i.e., the weights are computed as suggested in [13]). This value can be considered as a percentage of the presence of outliers in the most popular cluster inferred by the system. The aim is to have this value as low as possible.
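A small sketch of how these weighted ratios can be aggregated, assuming the per-step counts Pr, Pa, Pg, Po and the length of the corresponding segmented block are available as tuples; the tuple layout and function name are illustrative.

from typing import List, Tuple

# Each step: (Pr, Pa, Pg, Po, weight), where the weight is the length of the
# segmented block at that clustering step.
Step = Tuple[int, int, int, int, int]

def popularity_scores(steps: List[Step]) -> Tuple[float, float, float]:
    # Weighted means of Pa/Pr, Pg/Pr and Po/Pr over all clustering steps.
    w_total = sum(s[4] for s in steps)
    pa_pr = sum(w * (pa / pr) for pr, pa, _g, _o, w in steps) / w_total
    pg_pr = sum(w * (pg / pr) for pr, _a, pg, _o, w in steps) / w_total
    po_pr = sum(w * (po / pr) for pr, _a, _g, po, w in steps) / w_total
    return pa_pr, pg_pr, po_pr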

Table 3 shows the results of the popularity estimation obtained by the compared approaches. The first column group shows the results obtained by the approach proposed in [13]. Although this method achieves good performance in terms of popularity estimation and inliers rate, the measure Po/Pr highlights that the method suffers from the presence of outliers in the most popular cluster. This means that the popularity ratio (Pa/Pr) is affected by the presence of some outliers in the most popular cluster.


Fig. 11. Examples of clustering performed by our system and the approach in [13]. Each column shows the input frames taken from different devices. The color of the border of each frame identifies the cluster. The proposed method has correctly identified the three clusters.

Indeed, in most cases the Pa/Pr value is higher than 1 and Po/Pr is greater than 0. The second column group of Table 3 is related to the results obtained with the proposed approach. We obtained values of popularity estimation and inliers rate comparable with respect to [13]. In this case, the values of the popularity score are all lower than 1, which means that sometimes the clustering approach loses some inliers.

However, it is worth noticing that for all the experiments performed with the proposed approach, the outlier ratio Po/Pr is very close to zero and lower with respect to the values obtained by Ortis et al. [13]. This means that the most popular cluster obtained by the proposed approach is not affected by outliers, and this assures us that the output video belongs to the most popular scene.

By performing a visual assessment of the results, we observed that using the proposed CNN features in the clustering phase yields a fine-grained clustering, which better isolates the input videos during the transitions between two scenes or during a noise time interval. This behaviour is not observed in the outputs obtained by [13]. Using the color histogram, indeed, the system defines a limited number of clusters that are, therefore, affected by outliers. Some examples of this behaviour are shown in Fig. 11 and Fig. 12. In these figures, each column shows the frames taken from the considered devices at the same instant. The border of each frame identifies the cluster. The first column of each figure shows the result of the proposed approach, and the second column shows the result of the method proposed in [13]. Specifically, Fig. 11 shows an example of clustering performed by the compared approaches during the analysis of the scenario Foosball. In this example, the first and the third devices are viewing the same scene (the library), the second device is viewing the sofa, whereas the fourth device is performing a transition between the two scenes.

In such a case, the proposed method creates a different cluster for the device that is performing the transition, whereas the method proposed in [13] includes the scene in the same cluster of the second device. Fig. 12 shows an example of the popularity clustering performed during the analysis of the scenario Meeting.

Fig. 12. Examples of clustering performed by our system and the approach in [13] . Each column shows the input frames taken from different devices. The color of the border of each frame identifies the cluster. The proposed method produces better results in terms of outlier detection.

In this case both approaches fail to insert the second frame in the popular cluster due to the huge difference in scale, but the method in [13] creates a cluster that includes the second and the fifth device (despite they are viewing two different scenes), whereas the proposed method can distinguish the second image from the fifth one.

Besides the improvement in terms of performance, the proposed approach also provides an outstanding reduction of the computation time. In fact, the time needed to extract a color histogram as suggested in [13] is about 1.5 s, whereas the time needed to extract the CNN feature used in our approach is about 0.08 s. Regarding the popularity intraflow analysis on the mobile video dataset, the proposed approach achieves similar segmentation results compared to the SIFT based approach. However, as in the wearable domain test, the proposed segmentation method strongly outperforms [13] when the system has to deal with noise (e.g., hand tremors).

6. Conclusions and future works

This paper proposes a complete framework to segment and organize a set of egocentric videos of daily living scenes. The experimental results show that the proposed system outperforms the state of the art in terms of segmentation accuracy and computational costs, both on wearable and mobile video datasets. Furthermore, our approach obtains an improvement on outlier rejection on the popularity estimation problem [13]. Future works can consider extending the system to perform recognition of contexts from egocentric images [2,31–33] and to recognize the activities performed by the user [14].


Appendix A. Pseudocodes

Algorithm 1 describes the intraflow analysis process (see Section 2.1), exploiting the involved variables and utility functions as reported in the following:

- A template is represented by a Template Object. It keeps information about the time of the related image frame, the fc7 feature, and the scene ID assigned during the procedure.

- A new Template Object is created by the function newTemplateAt(t). This function initializes a Template Object with the information related to the time t.

- The flag SEEK is True if the procedure is performing the search for a new stable template. In this case, the Template Object C represents the current template candidate, which is assigned to the current template instance T if its stability has been verified according to the conditions defined in [13]. In this case, the new template instance T is added to the Templates Set TS. If the flag SEEK is False, the procedure compares the current reference template T with the forward frames until a peak in the slope sequence is detected.

- The function sceneAssignment() performs the scene ID assignment after a new template is defined. In particular, it assigns Rejected to all the frames in the interval between the instant when the new template has been requested and the time when it has been defined. Then, the function finds the scene ID of the scene depicted by the newly defined template, eventually performing the backward procedure (as described in Section 2.1.3).

- After the segmentation of the input video, the refinements described in Section 2.1.4 are applied by the function applySegmentationRefinements().

Data: Input Video
Result: Segmented Video
C ← newTemplateAt(0); SEEK ← True;
t ← 0;
while t ≤ videoLength − step do
    t ← t + step;
    fc7_t ← extractFeatureAt(t);
    if SEEK then
        sim ← cosSimilarity(C.fc7, fc7_t);
        if sim < Tf then
            C ← newTemplateAt(t);
        else
            if C is a stable template then
                SEEK ← False;
                TS ← TS ∪ {C};
                T ← C;
                T.scene_id ← sceneAssignment();
            end
        end
    else
        sim ← cosSimilarity(T.fc7, fc7_t);
        slope ← slopeAt(t);
        if the slope function has a peak then
            SEEK ← True;
            C.fc7 ← fc7_t;
            T ← C;
        end
        assignIntervalSceneId(T.scene_id, t − step, t − 1);
    end
end
applySegmentationRefinements();

Algorithm 1: Intraflow Analysis Pseudocode.

Algorithm 2 describes the between video flows analysis process (see Section 2.2). This procedure takes as input a set S of videos, previously processed with the intraflow analysis described in Algorithm 1. To describe this procedure we defined a data structure called linkStrength. This data structure is an array indexed with the blocks extracted from the analysed videos. Considering as example a block segment bv, the value of linkStrength[bv] is equal to zero if the block bv has not yet been assigned to a scene ID; otherwise, it is equal to the cosine similarity value between bv and the matched block which caused the assignment. Indeed, all the link strengths are initialized to zero. Then, if the scene ID assigned to bv changes, the value of linkStrength[bv] is updated with the new cosine similarity value, as well as the scene ID value assigned to bv.

Data: Set S of Segmented Videos
Result: Clusters of the Segments
initialize all link strengths to zero;
foreach video vA in S do
    foreach block bvA in vA do
        if linkStrength[bvA] != 0 then
            scene_Id ← bvA.scene_Id;
        else
            assign a new scene_Id to bvA;
        end
        foreach video vBj in S \ {vA} do
            b*vBj ← argmax over bvBj in vBj of { CosSimilarity(bvA, bvBj) | σ(bvA, vBj) ≥ Tσ };
            if CosSimilarity(bvA, b*vBj) > linkStrength[b*vBj] then
                assign scene_Id to b*vBj;
                linkStrength[b*vBj] ← CosSimilarity(bvA, b*vBj);
            end
        end
    end
end

Algorithm 2: Between Flows Video Analysis Pseudocode.
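The linkStrength bookkeeping can be sketched in a few lines, assuming blocks are identified by simple (video index, block index) keys; the dictionary-based layout and function name are illustrative, not the authors' implementation.

from typing import Dict, Tuple

BlockKey = Tuple[int, int]  # (video index, block index)

def update_assignment(link_strength: Dict[BlockKey, float],
                      scene_of: Dict[BlockKey, int],
                      target: BlockKey, scene_id: int, similarity: float) -> None:
    # A block in another video takes the propagated scene ID only when the new
    # match is stronger than the one that produced its current assignment.
    if similarity > link_strength.get(target, 0.0):
        scene_of[target] = scene_id
        link_strength[target] = similarity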

References

[1] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, W.W. Mayol-Cuevas, You-do, I-learn: discovering task relevant objects and their modes of interaction from multi-user egocentric video, in: British Machine Vision Conference, 2014.
[2] A. Furnari, G.M. Farinella, S. Battiato, Recognizing personal contexts from egocentric images, in: Proceedings of the IEEE International Conference on Computer Vision Workshops - Assistive Computer Vision and Robotics, 2015, pp. 393–401.
[3] T. Kanade, Quality of life technology, Proc. IEEE 100 (8) (2012) 2394–2396.
[4] A. Furnari, G.M. Farinella, S. Battiato, Recognizing personal locations from egocentric videos, IEEE Trans. Hum. Mach. Syst. 47 (1) (2017) 6–18.
[5] M.L. Lee, A.K. Dey, Lifelogging memory appliance for people with episodic memory impairment, in: The 10th ACM International Conference on Ubiquitous Computing, 2008, pp. 44–53.
[6] V. Buso, L. Hopper, J. Benois-Pineau, P.-M. Plans, R. Megret, Recognition of activities of daily living in natural "at home" scenario for assessment of Alzheimer's disease patients, in: IEEE International Conference on Multimedia & Expo Workshops, IEEE, 2015, pp. 1–6.
[7] S. Karaman, J. Benois-Pineau, V. Dovgalecs, R. Mégret, J. Pinquier, R. André-Obrecht, Y. Gaëstel, J.-F. Dartigues, Hierarchical hidden Markov model in detecting activities of daily living in wearable videos for studies of dementia, Multimed. Tools Appl. 69 (3) (2014) 743–771.
[8] Y. Gaëstel, S. Karaman, R. Megret, O.-F. Cherifa, T. Francoise, B.-P. Jenny, J.-F. Dartigues, Autonomy at home and early diagnosis in Alzheimer's disease: utility of video indexing applied to clinical issues, the IMMED project, in: Alzheimer's Association International Conference on Alzheimer's Disease, 2011, p. S245.
[9] J. Pinquier, S. Karaman, L. Letoupin, P. Guyot, R. Mégret, J. Benois-Pineau, Y. Gaëstel, J.-F. Dartigues, Strategies for multiple feature fusion with hierarchical HMM: application to activity recognition from wearable audiovisual sensors, in: 21st International Conference on Pattern Recognition, IEEE, 2012, pp. 3192–3195.
[10] N. Kapur, E.L. Glisky, B.A. Wilson, External memory aids and computers in memory rehabilitation, in: The Essential Handbook of Memory Disorders for Clinicians, 2004, pp. 301–321.
[11] C. Gurrin, A.F. Smeaton, A.R. Doherty, Lifelogging: personal big data, Found. Trends Inf. Retrieval 8 (1) (2014) 1–125.
[12] M.D. White, Police officer body-worn cameras: assessing the evidence, Washington, DC: Office of Community Oriented Policing Services, 2014.
[13] A. Ortis, G.M. Farinella, V. D'Amico, L. Addesso, G. Torrisi, S. Battiato, RECfusion: automatic video curation driven by visual content popularity, in: ACM Multimedia, 2015.
[14] A. Fathi, A. Farhadi, J.M. Rehg, Understanding egocentric activities, in: IEEE International Conference on Computer Vision, 2011, pp. 407–414.
[15] H. Pirsiavash, D. Ramanan, Detecting activities of daily living in first-person camera views, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2847–2854.
[16] D. Ravi, C. Wong, B. Lo, G.-Z. Yang, Deep learning for human activity recognition: a resource efficient implementation on low-power devices, in: Wearable and Implantable Body Sensor Networks (BSN), 2016 IEEE 13th International Conference on, IEEE, 2016, pp. 71–76.
[17] M.S. Ryoo, L. Matthies, First-person activity recognition: what are they doing to me?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2730–2737.
[18] Y. Poleg, C. Arora, S. Peleg, Temporal segmentation of egocentric videos, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[19] A. Furnari, G.M. Farinella, S. Battiato, Temporal segmentation of egocentric videos to highlight personal locations of interest, in: International Workshop on Egocentric Perception, Interaction, and Computing - European Conference on Computer Vision Workshops, 2016.
[20] Z. Lu, K. Grauman, Story-driven summarization for egocentric video, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[21] B. Soran, A. Farhadi, L. Shapiro, Generating notifications for missing actions: don't forget to turn the lights off!, in: International Conference on Computer Vision, 2015.
[22] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Trans. Signal Inf. Process. 3 (2014) e2.
[24] P. Agrawal, R. Girshick, J. Malik, Analyzing the performance of multilayer neural networks for object recognition, in: European Conference on Computer Vision, 2014, pp. 329–344.
[25] S. Bell, P. Upchurch, N. Snavely, K. Bala, Material recognition in the wild with the Materials in Context database, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[26] J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[27] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[28] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[29] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.
[30] C.D. Manning, P. Raghavan, H. Schütze, et al., Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008.
[31] G.M. Farinella, S. Battiato, Scene classification in compressed and constrained domain, IET Comput. Vision 5 (5) (2011) 320–334.
[32] G.M. Farinella, D. Ravì, V. Tomaselli, M. Guarnera, S. Battiato, Representing scenes for real-time context classification on mobile devices, Pattern Recognit. 48 (4) (2015) 1086–1100.
[33] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene recognition using Places database, in: Advances in Neural Information Processing Systems, 2014, pp. 487–495.
