Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Organizing egocentric videos of daily living activities

Alessandro Ortis a, Giovanni M. Farinella a,∗, Valeria D'Amico b, Luca Addesso b, Giovanni Torrisi b, Sebastiano Battiato a

a Image Processing Laboratory, Dipartimento di Matematica e Informatica, Università degli Studi di Catania, Viale A. Doria 6, Catania - 95125, Italy
b TIM - Telecom, JOL WAVE, Viale A. Doria 6, Catania - 95125, Italy

Article info

Article history:
Received 4 October 2016
Revised 3 May 2017
Accepted 7 July 2017
Available online 14 July 2017

Keywords:
First person vision
Video summarization
Video indexing

Abstract

Egocentric videos are becoming popular thanks to the possibility of observing the scene flow from the user's point of view (First Person Vision). Among the different applications of egocentric vision is the daily living monitoring of a user wearing the camera. We propose a system able to automatically organize egocentric videos acquired by the user over different days. Through an unsupervised temporal segmentation, each egocentric video is divided into chapters by considering the visual content. The obtained video segments related to the different days are hence connected according to the scene context in which the user acts. Experiments on a challenging egocentric video dataset demonstrate the effectiveness of the proposed approach, which outperforms the state of the art in accuracy and computational time by a good margin.

© 2017 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2017.07.010

1. Introduction and motivations

In the last years there has been a rapid emergence of wearable devices, including body sensors, smart clothing and wearable cameras. These technologies can have a significant impact on our lives if the acquired data are used to assist the users in tasks related to the monitoring of the quality of life [1–4]. In particular, egocentric cameras enabled the design and development of useful systems that can be organized into three general categories with respect to the assistive tasks:

(i) User Aware - systems able to understand what the user is doing and what/how he interacts with, by recognizing actions and behaviours from the first person perspective.

(ii) Environment/Objects Aware - systems able to understand what objects surround the user, where they are with respect to the user's perspective and what the environment looks like.

(iii) Target Aware - systems able to understand what others are doing, and how they interact with the user that is wearing the device.

The egocentric monitoring of a person's daily activities can help to stimulate the memory of users that suffer from memory disorders [5]. Several works on recognition and indexing of daily living activities of patients with dementia have been recently proposed [6–9]. The exploitation of aids for people with memory problems has proved to be one of the most effective ways to aid rehabilitation [10]. Furthermore, the recording and organization of daily habits performed by a patient can help a doctor to form a better opinion with respect to the specific patient's behaviour and hence his health needs. To this aim, a set of egocentric videos recorded among different days with a camera worn by a patient can be analysed by experts to monitor the user's daily living activities for assistive purposes. Live recording for lifelogging applications poses challenges on how to perform automatic indexing and summarization of big personal multimedia data [11].

Beside assistive technologies, the segmentation and semantic organization of egocentric videos is useful in many application scenarios where wearable cameras have recently become popular, including lifelogging [11], law enforcement [12] and social cameras [13]. Other applications of egocentric video analysis are related to action and activity recognition from egocentric videos [14–16], and recognition of interaction-level activities from videos in first-person view [17] (e.g., recognition of human-robot interactions). For all these applications, the segmentation of daily egocentric videos into meaningful chapters and the semantic organization of such video segments related to different days is an important first step which adds structure to egocentric videos, enabling tools for the indexing, browsing and summarization of egocentric video sets.

∗ Corresponding author.
E-mail addresses: ortis@dmi.unict.it (A. Ortis), gfarinella@dmi.unict.it (G.M. Farinella), valeria1.damico@telecomitalia.it (V. D'Amico), luca.addesso@telecomitalia.it (L. Addesso), giovanni.torrisi@telecomitalia.it (G. Torrisi), battiato@dmi.unict.it (S. Battiato).

In the last years, several papers have addressed different problems related to vision tasks from the first person perspective. The work in [18] proposes a temporal segmentation method of egocentric videos with respect to 12 different activities organized hierarchically upon cues based on the wearer's motion (e.g., static, sitting, standing, walking, etc.). A benchmark study considering different wearable devices for context recognition with a rejection mechanism is presented. The system discussed in [19] aims at segmenting unstructured egocentric videos to highlight the presence of given personal contexts of interest. In [4] the authors perform a benchmark on the main representations and wearable devices used for the task of context recognition from egocentric videos. A method that takes a long input video and returns a set of video subshots depicting the essential moments is detailed in [20]. The method proposed in [21] learns the sequences of actions involved in a set of habitual daily activities to predict the next actions and generate notifications if there are missing actions in the sequence. The framework presented in [13] (RECfusion) is able to automatically process multiple video flows from different mobile devices to understand the most popular scenes for a group of end-users. The output is a video which represents the most popular scenes organized over time.

In this paper, we build on the RECfusion method [13], improving it for the context of daily living monitoring from egocentric videos. In RECfusion multiple videos are analysed by using two algorithms: the former is used to segment the different scenes over time (intraflow analysis), whereas the latter is employed to group the videos related to the involved devices over time. As reported in the experimental results of [13], the intraflow analysis of RECfusion suffers when applied to egocentric videos because they are highly unstable due to the user's movements.

The framework proposed in this paper achieves better performance for egocentric video organization, in terms of both segmentation accuracy and computational cost. The proposed method takes a set of egocentric videos regarding the daily living of a user among different days, and performs an unsupervised segmentation of them. The obtained video segments among different days are then organized by contents. The video segments of the different days sharing the same contents are then visualized by exploiting an interactive web-based user interface. In our framework we use a unique representation of the frames, based on CNN features [22] (for both intraflow and between flows analysis), instead of the two different representations based on SIFT and color histograms as proposed in RECfusion. Experiments show that the proposed framework outperforms RECfusion for daily living egocentric video organization. Moreover, the approach obtains better accuracy than RECfusion for the popularity estimation task for which RECfusion has been designed.

The rest of the paper is organized as follows. Section 2 presents the designed framework. Section 3 describes the considered wearable dataset. The discussion on the obtained results is reported in Section 4. In Section 5 the developed system is compared with respect to RECfusion [13] on mobile videos. Finally, Section 6 concludes the paper and gives insights for further works.

2. Proposed framework

The proposed framework performs two main steps on the videos acquired by a wearable camera: temporal segmentation and segment organization. Fig. 1 shows the scheme of the overall pipeline of our system.

Starting from a set of egocentric videos recorded among multiple days (Fig. 1(a)), the first step performs an intraflow analysis of each video to segment it with respect to the different scenes observed by the user (Fig. 1(b)). Each video is segmented by exploiting temporal and visual correlations between frames of the same video (Fig. 1(b): the colour of each block identifies a scene observed within the same video). Then, the segments obtained over different days and referring to the same contents are grouped by means of a between flows analysis, which implements an unsupervised clustering procedure aimed at semantically associating video segments obtained from different videos (Fig. 1(c): the colour of each block now identifies video segments associated to the same visual content, observed in different days). The system hence produces sets of video clips related to each location where the user performs daily activities (e.g., the set of the clips over days where the user washes dishes in the kitchen, the set related to the activity of the user playing piano, and so on). The clips are organized taking into account both visual and temporal correlations. Finally, the framework provides a web-based interface to enable browsing of the organized videos (Fig. 1(d)). In the following subsections the details of the different steps involved in the pipeline are given.

2.1. Intraflow analysis

The intraflow analysis performs the unsupervised temporal segmentation of each input video, and associates a scene ID to each video segment. Segments with the same content have to share the same scene ID. This problem has been addressed in [13] for videos acquired with mobile devices. To better explain the problem, in the following we focus our analysis on the issues related to the intraflow algorithm detailed in [13] when applied on first-person videos. This is useful to introduce the main problems of a classic feature-based matching approach for temporal segmentation in the wearable domain. Then we present our solution for the intraflow analysis. Furthermore, in Appendix A the pseudocode describing the proposed intraflow analysis approach is reported.

2.1.1. Issues related to SIFT-based templates

The intraflow analysis used in [13] compares two scenes considering the number of matchings between a reference template and the frame under consideration (current frame). The scene template is a set of SIFT descriptors that must accomplish specific properties of "reliability". When the algorithm detects a sudden decrease in the number of matchings, it refreshes the reference template by extracting a new set of SIFTs and splits the video. In order to detect such changes, the system computes the value of the slope of the matching function (i.e., the variation of the number of matchings in a range interval). When the slope is positive and over a threshold (which corresponds to a sudden decrease of the number of matchings between the SIFT descriptors) the algorithm finds a new template (i.e., a new block). When a new template is defined, it is compared with the past templates in order to understand if it regards a new scene or it is related to a known one (backward search phase). Although this method works very well with videos acquired with mobile cameras, it presents some issues when applied to videos acquired with wearable devices. In such egocentric videos, the camera is constantly moving due to the shake induced by the natural head motion of the wearer. This causes a continuous refresh of the reference template, which is not always matched with similar scenes during the backward search. Hence, the method produces an oversegmentation of the egocentric videos. Furthermore, it requires performing several SIFT descriptor extraction and matching operations (including geometric verifications) to exclude false positive matchings. This has a negative impact on the realtime performances. The first row of Fig. 2 shows the Ground Truth segmentation of a video acquired with a wearable camera in a home environment.1 An example of a temporal segmentation obtained with the SIFT-based intraflow analysis proposed in [13] on an egocentric video is reported in the second row of Fig. 2.

1 The video related to the example in Fig. 2 is available for visual inspection at the URL http://iplab.dmi.unict.it/dailylivingactivities/homeday.html.


Fig. 1. Overall scheme of the proposed framework.

Fig. 2. Output of the intraflow analysis using SIFT and CNN features when applied to the video Home Day 1. The first row is the Ground Truth of the video. The second row shows the result of the intraflow analysis with the method discussed in [13]. The third row shows the result of the intraflow analysis of our method, whereas the last two rows show the results of our method with the application of the proposed scene coherence and duration filtering criteria.

The algorithm works well when the scene is quite stable (e.g., when the user is watching a TV program), but it makes several errors when the scene is highly unstable due to head movements. In fact, in the middle of the video related to Fig. 2, the user is washing dishes at the sink, and he is continuously moving his head. In this example the intraflow approach based on SIFT features detects a total of 8 different scenes instead of 3. The algorithm cannot find the matchings between the current frame and the reference template due to two main reasons:

1. when the video is unstable, even though the scene content doesn't change, the matchings between local features are not reliable and stable along time;

2. in a closed space such as an indoor environment, the different objects of the scene can be very close to the viewer. Hence a small movement of the user's head is enough to cause a high number of mismatches between local features.

2.1.2. CNN-based image representation

To deal with the issues described in the previous section, we exploit a holistic feature to describe the whole image rather than an approach based on local features. In particular, in the intraflow analysis we represent frames by using features extracted with a Convolutional Neural Network (CNN) [23]. Specifically, we consider the CNN proposed in [22] (AlexNet). In our experiments, we exploit the representation obtained considering the output of the last hidden layer of AlexNet, which consists of a 4096-dimensional feature vector (fc7 features). We decided to use the AlexNet representation since it has been successfully used as a general image representation for classification purposes in the last few years [24,25]. Moreover, the features extracted by AlexNet have been successfully used for transfer learning [26–28]. Finally, the AlexNet architecture is a small network compared to others (e.g., VGG [29]). Thus, it allows performing the feature extraction very quickly. The proposed solution computes the similarity between scenes by comparing a pair of fc7 features with the cosine similarity measure. The cosine similarity of two vectors measures the cosine of the angle between them. This measure is independent of the magnitude of the vectors, and is well suited to compare high dimensional sparse vectors, such as the fc7 features. The cosine similarity of the two fc7 feature vectors v1 and v2 is computed as follows:

$$\mathrm{CosSimilarity}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\|\,\|\mathbf{v}_2\|} \quad (1)$$
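For concreteness, the following is a minimal sketch of this frame comparison, assuming the fc7 features have already been extracted as 4096-dimensional NumPy vectors; the extract_fc7 name in the usage comment is a hypothetical placeholder for any AlexNet fc7 extractor, and 0.60 is the forward threshold Tf reported in Section 2.1.3.

import numpy as np

def cos_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Cosine of the angle between two fc7 feature vectors (Eq. (1)).
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return 0.0  # degenerate case: at least one all-zero vector
    return float(np.dot(v1, v2) / denom)

# Usage: compare a reference template against the current sampled frame.
# template_fc7 = extract_fc7(reference_frame)   # hypothetical AlexNet fc7 extractor
# frame_fc7    = extract_fc7(current_frame)
# same_scene_candidate = cos_similarity(template_fc7, frame_fc7) >= 0.60  # Tf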

2.1.3. Proposed intraflow analysis

During the intraflow analysis, the proposed algorithm computes the cosine similarity between the current reference template and the fc7 features extracted from the frames following the reference. When the algorithm detects a sudden decrease in the cosine similarity, it refreshes the reference template, selecting a new fc7 feature corresponding to a stable frame. As in [13], to detect such changes the system computes the value of the slope (i.e., the variation of the cosine similarity in a range interval). When the slope has a positive peak (which corresponds to a sudden decrease of the cosine similarity) the algorithm finds a new template and produces a new video segment. There are two cases in which the intraflow analysis compares two templates:

1. a template is compared with the features extracted from the forward frames, when the algorithm has to check its eligibility to be a reference template for the involved scene;

2. a template is compared with the past templates during the backward checking phase to establish the scene ID.

In the first case, the elapsed time between the two compared frames depends on the sampling rate of the frames (in our experiments, we sampled every 20 frames for videos recorded at 30 fps). Differently, the frames compared during the backward checking could be rather different due to the elapsed time between them. For this reason, when we compare a new template with a past template, we assign the templates to the same scene ID by using a weaker condition with respect to the one used in the forward verification. In the forward process, the algorithm assigns the same scene ID to the corresponding frames if the cosine similarity between their descriptors is higher than a threshold Tf (equal to 0.60 in our experiments). When the algorithm compares two templates in the backward process, it assigns the same scene to the corresponding frames if the similarity is higher than a threshold Tb (equal to 0.53 in our experiments).
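As an illustration of the refresh trigger, here is a small sketch under the assumption that the "slope" is computed as the drop of the cosine similarity over a short window of sampled frames (so that a sudden decrease produces a positive peak); the window size and the peak threshold below are illustrative values, not the paper's.

from typing import List

def similarity_drop(sims: List[float], t: int, window: int = 5) -> float:
    # Drop of the similarity signal over the last `window` samples;
    # sims[t] holds CosSimilarity(reference fc7, fc7 of the t-th sampled frame).
    if t < window:
        return 0.0
    return sims[t - window] - sims[t]

def needs_new_template(sims: List[float], t: int,
                       window: int = 5, peak_thr: float = 0.15) -> bool:
    # A sharp fall of the similarity to the reference template starts a new segment.
    return similarity_drop(sims, t, window) > peak_thr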

Besides the image representation, our intraflow algorithm introduces two additional rules:


1. In [13] each video frame is assigned to a scene ID or it is classified as Noise and hence rejected. Our approach is able to distinguish, among the rejected frames, the ones caused by the movement of the head of the user (Head Noise) from the frames related to the transition between two scenes in the video (Transition). When a new template is defined after a group of consecutive rejected frames, the frames belonging to the rejected group are considered as "Transition" if the duration of the block is longer than 3 s (i.e., head movements are faster than 3 s); otherwise they are classified as "Head Noise". In case of noise, the algorithm doesn't assign a new scene ID to the frame that follows the "Head Noise" video segment, because the noise is related to head movements but the user is in the same context. When the user changes his location, he changes his position in the environment; thus, the transition between different scenes involves a longer time interval.

2. The second rule is related to the backward verification. In [13] it is performed starting from the last found template and proceeding backward. The search stops when the process finds the first past template that has a good matching score with the new template. Such an approach is quite appropriate for the method in [13] because it compares sets of local features and relies on the number of achieved matchings. The approach proposed in this paper compares pairs of feature vectors instead of sets of local descriptors, and selects the past template that yields the best similarity to the new one. In particular, the method compares the new template with all the previous ones, considering all the past templates that yield a cosine similarity greater than Tb. From this set of positive cases, the algorithm selects the one that achieves the maximum similarity, even if it is not the most recent in time (see the sketch below).

Considering the example in Fig. 2, the segmentation results achieved with the proposed intraflow approach (third row) are much better than the ones obtained using SIFT features [13] (second row).

After the discussed intraflow analysis, a segmentation refinement is performed as detailed in the following subsection.

2.1.4. Intraflow segmentation refinement

Starting from the result of the proposed intraflow analysis (see the third row of Fig. 2), we can easily distinguish between "Transition" and "Noise" blocks among all the rejected frames. A block of rejected frames is a "Noise" block if both the previous and the next detected scenes are related to the same visual content, otherwise it is marked as a "Transition" block. We refer to this criterion as Scene Coherence. The result of this step on the example considered in Fig. 2 is shown in the fourth row. When comparing the segmentation with the Ground Truth (first row), the improvement with respect to [13] (second row) is evident. Moreover, many errors of the proposed intraflow analysis (third row) are removed. The second step of the segmentation refinement consists in keeping only the blocks related to user activity in a location with a duration longer than 30 s (Duration Filtering), unless they are related to the same scene as the previous block (i.e., after a period of noise, the scene is the same as before but it has a limited duration). We applied this criterion because, in the context of Activities of Daily Living (ADL), we are interested in detecting the activities of a person in a location that have a significant duration, in order to be able to observe the behavior. This refinement step follows the Scene Coherence one. The final result on the considered example is shown in the last row of Fig. 2. Although some frames are incorrectly rejected (during scene changes), the proposed pipeline is much more robust than [13] (compare the first, second and fifth rows of Fig. 2). This outcome is quantitatively demonstrated in the experimental section of this paper on a dataset composed of 13 egocentric videos.
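The two refinement rules can be sketched as follows, under the assumption that a segmented video is available as a list of (label, start, end) blocks in seconds, with rejected frames already grouped into blocks labelled "REJECTED"; the data layout and helper names are illustrative.

from typing import List, Tuple, Union

Label = Union[int, str]
Segment = Tuple[Label, float, float]  # (scene label or "REJECTED", start s, end s)

def scene_coherence(segments: List[Segment]) -> List[Segment]:
    # Rejected blocks surrounded by the same scene become "NOISE" (the user is
    # still in the same context); otherwise they are marked as "TRANSITION".
    out: List[Segment] = []
    for i, (lab, s, e) in enumerate(segments):
        if lab == "REJECTED":
            prev_lab = segments[i - 1][0] if i > 0 else None
            next_lab = segments[i + 1][0] if i + 1 < len(segments) else None
            lab = "NOISE" if prev_lab is not None and prev_lab == next_lab else "TRANSITION"
        out.append((lab, s, e))
    return out

def duration_filtering(segments: List[Segment], min_dur: float = 30.0) -> List[Segment]:
    # Keep scene blocks lasting at least `min_dur` seconds, or shorter blocks
    # that continue the same scene as the previously kept block.
    kept: List[Segment] = []
    for lab, s, e in segments:
        continues_prev = bool(kept) and kept[-1][0] == lab
        if isinstance(lab, int) and (e - s >= min_dur or continues_prev):
            kept.append((lab, s, e))
    return kept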

2.2. Between video flows analysis

When each egocentric video is segmented, the successive challenge is to determine which video segments among the different days are related to the same content. Given a segment block b_{v_A} extracted from the egocentric video v_A, we compare it with respect to all the segment blocks extracted from the other egocentric videos v_{B_j}. To represent each segment, we consider again the CNN fc7 features extracted from one of the frames of the video segment. This frame is selected considering the template which achieved the longest stability time during the intraflow analysis. For each block b_{v_A}, our approach assigns the ID of b_{v_A} (obtained during the intraflow analysis) to all the blocks b^{*}_{v_{B_j}} extracted from the other egocentric videos v_{B_j} such that

$$b^{*}_{v_{B_j}} = \operatorname*{arg\,max}_{b_{v_{B_j}} \in\, v_{B_j}} \left\{ \mathrm{CosSimilarity}(b_{v_A}, b_{v_{B_j}}) \;\middle|\; \sigma(b_{v_A}, v_{B_j}) \geq T_{\sigma} \right\} \quad \forall\, v_{B_j} \quad (2)$$

where σ(b_{v_A}, v_{B_j}) is the standard deviation of the cosine similarity values obtained by considering the segment block b_{v_A} and all the blocks of v_{B_j}, and T_σ is the decision threshold. This procedure is performed for all the segment blocks b_{v_A} of the video v_A. When all the blocks of v_A have been considered, the algorithm takes into account a video of another day, and the matching process between video segments of the different days is repeated until all the video segments in the pool are processed. In this way a scene ID is assigned to all the blocks of all the considered videos. The pairs of blocks with the same ID are associated to the same scene, even if they belong to different videos, and all the segments are connected in a graph with multiple connected components (as in Fig. 1(c)). When there is a high variability in the cosine similarity values (i.e., the value of σ(b_{v_A}, v_{B_j}) is high), the system assigns the scene ID to the segment block that achieved the maximum similarity. When a block matches with two blocks related to two different scene IDs, the system assigns the scene ID related to the block which achieved the highest similarity value. When a block isn't matched, it means that all the similarity values of the mismatched blocks are similar. This causes low values of σ and helps the system to understand that the searched scene ID is not present (i.e., scenes with only one instance among all the considered videos). To better explain the proposed approach, the pseudocode related to the above described between video flows analysis is reported in Appendix A.
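A compact sketch of this matching step, assuming each segmented video is available as a list of blocks represented by the fc7 vector of their most stable template; the Block container, function name and return convention are illustrative.

from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class Block:
    fc7: np.ndarray              # fc7 feature of the most stable template
    scene_id: Optional[int] = None

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(b_vA: Block, other_video: List[Block],
               T_sigma: float) -> Optional[Tuple[Block, float]]:
    # Eq. (2): pick the most similar block of the other video, but only when the
    # similarity values are spread enough (std >= T_sigma); a flat profile means
    # the searched scene is probably not present in that video.
    sims = np.array([_cos(b_vA.fc7, b.fc7) for b in other_video])
    if sims.size == 0 or sims.std() < T_sigma:
        return None
    j = int(sims.argmax())
    return other_video[j], float(sims[j])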

3. Dataset

To demonstrate the effectiveness of the proposed approach, we have considered a set of representative egocentric videos to perform the experiments. The egocentric videos are related to different days and have been acquired using a Looxcie LX2 wearable camera with a resolution of 640 × 480 pixels. The duration of each video is about 10 min. The videos are related to the following scenarios:

Home Day: a set of four egocentric videos taken in a home environment. In this scenario, the user performs typical home activities such as cooking, washing dishes, watching TV, and playing piano. This set of videos has been borrowed from the 10 contexts dataset proposed in [19], which is available at the following URL: http://iplab.dmi.unict.it/PersonalLocations/segmentation/.

Office Day: a set of six egocentric videos taken in a home and in different office environments. This set of videos concerns several activities performed during a typical day. Also this set of videos belongs to the 10 contexts dataset [19].

(5)

Table 1
Intraflow performances on the considered egocentric video dataset obtained using [13] and the proposed approach. Each test is evaluated considering the accuracy of the temporal segmentation (Q), the computation time (Time) and the number of scenes detected by the algorithm (Scenes). The accuracy is measured as the percentage of correctly classified frames with respect to the Ground Truth. The measured time includes the feature extraction process. "Proposed" denotes the proposed intraflow approach; "refined" denotes the proposed intraflow approach with segmentation refinement.

Video | Scenario | Scenes (GT) | Q [13] | Scenes [13] | Time [13] | Q (proposed) | Scenes (proposed) | Q (refined) | Scenes (refined) | Time (refined)
1 | HomeDay1 | 3 | 62.5% | 8 | 20'45" | 77.5% | 4 | 92.5% | 3 | 1'23"
2 | HomeDay2 | 3 | 71.6% | 3 | 20'18" | 80.3% | 4 | 94.5% | 3 | 1'46"
3 | HomeDay3 | 3 | 64.3% | 5 | 19'03" | 79.7% | 5 | 94.3% | 3 | 1'21"
4 | HomeDay4 | 3 | 84.4% | 3 | 8'36" | 91.8% | 3 | 85.4% | 2 | 36"
5 | WorkingDay1 | 4 | 95.7% | 5 | 16'16" | 98.4% | 5 | 99.5% | 4 | 1'22"
6 | WorkingDay2 | 4 | 82.5% | 5 | 15'15" | 98.9% | 5 | 100% | 4 | 1'08"
7 | WorkingDay3 | 5 | 98.7% | 6 | 19'02" | 99.2% | 6 | 99.4% | 5 | 1'29"
8 | OfficeDay1 | 3 | 23.0% | 5 | 24'8" | 55.3% | 19 | 66.9% | 2 | 2'39"
9 | OfficeDay2 | 2 | 59.7% | 2 | 10'25" | 90.0% | 3 | 98.7% | 2 | 1'26"
10 | OfficeDay3 | 3 | 57.2% | 4 | 13'28" | 83.6% | 10 | 96.3% | 3 | 1'49"
11 | OfficeDay4 | 3 | 52.0% | 4 | 11'37" | 79.5% | 5 | 84.1% | 4 | 1'41"
12 | OfficeDay5 | 3 | 70.7% | 3 | 8'35" | 86.7% | 4 | 95.9% | 3 | 1'21"
13 | OfficeDay6 | 3 | 78.8% | 3 | 9'33" | 61.5% | 5 | 94.5% | 4 | 1'34"
Average | | | 69.4% | | 15'9" | 83.3% | | 91.9% | | 1'31"

Working Day: a set of three videos taken in a laboratory environment. The activities performed by the user in this scenario regard reading a book, working in a laboratory, sitting in front of a computer, etc.

Each video has been manually segmented to define the blocks of frames to be detected in the intraflow analysis. Moreover, the segments have been labeled with the scene ID to build the Ground Truth for the between video analysis. The Ground Truth is used to evaluate the performances of the proposed framework. The used egocentric videos, as well as the Ground Truth, are available at the following URL: http://iplab.dmi.unict.it/dailylivingactivities/.

4. Experimental results

In this section we report the temporal segmentation and the between flows video analysis results obtained on the considered dataset.

4.1. Temporal segmentation results

Table 1 shows the performances of the proposed temporal segmentation method (see Section 2.1). We compared our solution with respect to the one adopted by RECfusion [13]. For each method we computed the quality of the segmentation as the percentage of correctly classified frames (Q), the number of detected scenes and the computational time.2 The proposed approach obtains strong improvements (up to over 30%) in segmentation quality with respect to RECfusion (e.g., results at the eighth and ninth rows of Table 1). Furthermore, the application of the segmentation refinements provides improvements up to 43% in segmentation quality (results at row eight in Table 1). In the fourth row of Table 1 (related to the analysis of the video Home Day 4), we can observe that the application of the segmentation refinements causes a decrease in performances. This video is very short compared to the other videos of the dataset. It has a duration of just 4'28" and consists of a sequence of three different scenes (piano, sink and TV). The scene blocks are correctly detected by the proposed intraflow approach, which finds exactly 3 different scenes and achieves 91.8% accuracy without refinement. However, the middle scene (sink) has a duration of just 22 s according to the Ground Truth used in [19], thus the refinement process rejects this block due to the application of the Duration Filtering criterion. Considering the mean performances (last row in Table 1), our system achieves an improvement of over 14% without segmentation refinements, with over 22% of margin after the segmentation refinement. The proposed method also reduces the computational time by more than 21 min in some cases (eighth row in Table 1). It has an average computational time saving of about 13 min with respect to the compared approach [13]. The results of Table 1 show that the application of the Scene Coherence and the Duration Filtering criteria used in the segmentation refinement step (Section 2.1.4) allows detecting the correct number of scenes.

2 The experiments have been performed with Matlab 2015a, running on a 64-bit Windows 10 OS, on a machine equipped with an Intel i7-3537U CPU and 8 GB RAM.

In sum, considering the qualitative and quantitative results reported respectively in Fig. 2 and Table 1, the proposed system is demonstrated to be robust for the temporal segmentation of egocentric videos, and it provides high performance with a lower computational time with respect to [13].

4.2. Between flows video analysis results

In our experiments, all the segments related to the Home Day and Working Day scenarios have been correctly connected among the different days without errors. Fig. 3 shows two timelines related to the videos of the scenario Working Day, whereas Fig. 4 shows the timelines related to the scenario Home Day. The first timeline in Figs. 3 and 4 shows the Ground Truth labeling. In this timeline, the black blocks indicate the transition intervals (to be rejected). The second timeline shows the result obtained by our framework. In this case, the black blocks indicate the frames automatically rejected by the algorithm. In order to better assess the results obtained by the proposed system, the reader can perform a visual inspection of all the results produced by our approach at the following URL: http://iplab.dmi.unict.it/dailylivingactivities/. Through the web interface the different segments can be explored.

Differently than the Home Day and Working Day scenarios, some matching errors occurred in the between flow analysis of the Office Day scenario (see Fig. 5). Since we are grouping video blocks by contents, to better evaluate the performance of the proposed between flow analysis we considered three quality scores usually used in clustering theory. The simplest clustering evaluation measure is the Purity of the clustering: each cluster of video segments obtained after the between flow analysis is assigned to the most frequent class in the cluster. The Purity measure is hence the mean of the number of correctly assigned video blocks within the clusters:


Fig. 3. The two timelines show the Ground Truth segmentation of egocentric videos related to the Working Day scenario and their organization (top). The inferred segmentation and organization obtained by the proposed method is reported in the bottom.

Fig. 4. The two timelines show the Ground Truth segmentation of egocentric videos related to the Home Day scenario and their organization (top). The inferred segmentation and organization obtained by the proposed method is reported in the bottom.

Fig. 5. The two timelines show the Ground Truth segmentation of egocentric videos related to the Office Day scenario and their organization (top). The inferred segmentation and organization obtained by the proposed method is reported in the bottom. In this case, the blocks related to the scenes ‘office’, ‘studio’ and ‘laboffice’ are clustered in the same cluster (identified by the color red), with the exception of only one ‘laboffice’ block.


Table 2
Between video clustering results.

Dataset | # of Videos | Original GT: Purity | RI | F1 | F2 | New GT: Purity | RI | F1 | F2
WorkingDay | 3 | 1 | 1 | 1 | 1 | – | – | – | –
HomeDay | 4 | 1 | 1 | 1 | 1 | – | – | – | –
OfficeDay | 6 | 0.68 | 0.81 | 0.49 | 0.57 | 0.95 | 0.89 | 0.80 | 0.75
Office + Home | 10 | 0.58 | 0.83 | 0.41 | 0.54 | 0.74 | 0.87 | 0.61 | 0.67

$$\mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} \left| C_i \cap T_j \right| \quad (3)$$

where N is the number of video blocks, k is the number of clusters, C_i is the ith cluster and T_j is the set of elements of the class j that are present in the cluster C_i. The clustering task (i.e., the grouping performed by the between flow analysis in our case) can be viewed as a series of decisions, one for each of the N(N − 1)/2 pairs of the elements to be clustered [30].

The algorithm obtains a true positive decision (TP) if it assigns two videos of the same class to the same cluster, whereas a true negative decision (TN) if it assigns two videos of different classes to different clusters. Similarly, a false positive decision (FP) assigns two different videos to the same cluster and a false negative decision (FN) assigns two similar videos to different clusters. With the above formalization we can compute the confusion matrix associated to the pairing task.

From the confusion matrix we compute the Rand Index (RI), which measures the percentage of correct decisions (i.e., the pairing accuracy) of the between flow analysis:

$$RI = \frac{TP + TN}{TP + FP + FN + TN} \quad (4)$$

In order to take into account both precision and recall, we also considered the F_β measure, defined as follows:

$$F_{\beta} = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \quad (5)$$
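As a minimal sketch of how these three scores can be computed from the block labels, the following assumes two parallel lists: the ground truth class of each video block and the cluster (graph component) it was assigned to. The pair counting follows the N(N−1)/2 decision view above; function names are illustrative and degenerate cases (e.g., no positive pairs) are not handled.

from collections import Counter
from itertools import combinations
from typing import Sequence, Tuple

def purity(classes: Sequence, clusters: Sequence) -> float:
    # Each cluster is credited with its most frequent ground-truth class (Eq. (3)).
    total = 0
    for c in set(clusters):
        members = [classes[i] for i, cl in enumerate(clusters) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(classes)

def pairwise_scores(classes: Sequence, clusters: Sequence, beta: float = 1.0) -> Tuple[float, float]:
    # Count the four kinds of pairwise decisions, then Rand Index (Eq. (4))
    # and F_beta (Eq. (5)).
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    ri = (tp + tn) / (tp + tn + fp + fn)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return ri, f_beta

# Example: Rand Index and F2 (recall weighted higher than precision).
# ri, f2 = pairwise_scores(gt_classes, predicted_clusters, beta=2.0)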

Specifically, we considered the F1 measure, which weights precision and recall equally, and the F2 measure, which weights recall higher than precision and, therefore, penalizes false negatives more than false positives. Table 2 shows the results of the proposed between flow analysis approach considering the aforementioned measures. For the Home Day and Working Day scenarios our approach achieves the maximum scores for all the considered evaluation measures (column "Original GT" of Table 2). Regarding the Office Day scenario, we obtain lower scores of Purity and Rand Index. The obtained values of the F1 and F2 measures indicate that the proposed between flow approach achieves higher recall than precision. This can be further verified by observing Fig. 6. This figure shows the co-occurrence matrix obtained considering the Office Day scenario. The columns of the matrix represent the computed graph components (clusters) obtained with the between flow analysis, whereas each row represents a scene class according to the Ground Truth. The values of this matrix express the percentage of video blocks belonging to a specific class that are assigned to each graph component (according to the Ground Truth used in [19]). This figure shows that even if different blocks are included in the same graph component (FP), the majority of the blocks belonging to the same class are assigned to the same graph component (TP). The second column of the co-occurrence matrix in Fig. 6 shows that the "laboffice", "office" and "studio" blocks have been assigned to the graph component C2. This error is due to a strong ambiguity in the visual content of these three classes. Fig. 8 shows some examples of the frames belonging to these classes.

Fig. 6. Co-occurrence matrix obtained considering the Office Day scenario.

Fig. 7. Co-occurrence matrix obtained considering the Office Day scenario with the "New GT".

We can observe that these scenes are all related to the same activity (i.e., working at a PC) and the visual content is very similar. We repeated the tests considering an alternative Ground Truth that considers the three above mentioned scene classes to be a unique class (Laboffice). The results of this test are reported in the column labeled as "New GT" of Table 2. In this case the Office Day scenario result achieves significant improvements for all the considered evaluation measures. Fig. 7 shows the co-occurrence matrix obtained considering the new Ground Truth.

The last row of Table 2 reports the results obtained by applying the proposed approach on the set of 10 videos obtained by considering only the Home Day and the Office Day sets of videos. This test has been done to analyse the robustness of the proposed approach considering more contexts and also scenes that appear only in some videos. Furthermore, to better assess the effectiveness of the proposed approach, we performed a number of incremental tests by varying the number of input videos from 2 to 10 videos of the considered set. This corresponds to considering from 2 to 10 days of monitoring. The obtained scores, reported in Fig. 9, show that the evaluation measures are quite stable. We also evaluated the performance of the proposed approach by varying the threshold Tσ value (see Fig. 10). Also in this case the results demonstrate that the method is robust.


Fig. 8. Some first person view images from the Office Day scenario depicting frames belonging to the classes “laboffice”, “office” and “studio”.

Fig. 9. Evaluation measures obtained by varying the number of videos.

Fig. 10. Evaluation measures obtained by varying the threshold Tσ.

We also tested the between flows analysis approach by using the color histogram and the distance function used in RECfusion [13]. When the system uses such an approach there is a high variance in the obtained distance values even if there isn't any segmentation block to be matched among videos. This causes matchings between uncorrelated blocks and hence errors.

5. Popularity estimation

Considering the improvements achieved by the proposed method in the context of egocentric videos, we have tested the approach to solve the popularity estimation problem defined in [13]. The popularity of the observed scene in an instant of time depends on how many devices are looking at that scene. In the context of organizing egocentric videos, the popularity can be useful to estimate the most popular scene over days.

To properly test our approach for popularity estimation we have considered the dataset introduced in [13]. This allows a fair comparison with the state of the art approaches for this problem, because the results can be directly compared with the ones in [13].

Table 3
Experimental results on the popularity estimation problem.

Dataset | dev | [13]: Pa/Pr | Pg/Pr | Po/Pr | Proposed method: Pa/Pr | Pg/Pr | Po/Pr
Foosball | 4 | 1.023 | 1 | 0.023 | 1 | 1 | 0
Meeting | 2 | 1.011 | 0.991 | 0.020 | 0.991 | 0.991 | 0
Meeting | 4 | 0.988 | 0.955 | 0.033 | 0.930 | 0.930 | 0
Meeting | 5 | 0.886 | 0.755 | 0.131 | 0.698 | 0.704 | 0.006
SAgata | 7 | 1.049 | 1 | 0.050 | 0.989 | 0.989 | 0
Differently than [13], after processing the videos with the intraflow analysis and segmentation refinement proposed in this paper, we have used the proposed CNN features and the clustering approach of [13]. To evaluate the performances of the compared methods we used three measures. For each clustering step we compute:

Pr: ground truth popularity score (number of cameras looking at the most popular scene) obtained from manual labelling;

Pa: popularity score computed by the algorithm (number of the elements in the popular cluster);

Pg: number of correct videos in the popular cluster (inliers).

From the above scores, the weighted means of the ratios Pa/Pr and Pg/Pr over all the clustering steps are computed [13]. The ratio Pa/Pr provides a score for the popularity estimation, whereas the ratio Pg/Pr assesses the visual content of the videos in the popular cluster. These two scores focus only on the popularity estimation and the number of inliers in the most popular cluster. Since the aim is to infer the popularity of the scenes, it is useful to look also at the number of outliers in the most popular cluster. In fact, the results reported in [13] show that when the algorithm works with a low number of input videos, the most popular cluster is sometimes affected by outliers. These errors could affect the popularity estimation of the clusters and, therefore, the final output. Thus, we introduced a third evaluation measure that takes into account the number of outliers in the most popular cluster. Let Po be the number of wrong videos in the popular cluster (outliers). From this score, we compute the weighted mean of the ratio Po/Pr over all the clustering steps, where the weights are given by the length of the segmented blocks (i.e., the weights are computed as suggested in [13]). This value can be considered as a percentage of the presence of outliers in the most popular cluster inferred by the system. The aim is to have this value as low as possible.
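A small sketch of how these weighted ratios can be aggregated, assuming the per-step counts Pr, Pa, Pg, Po and the length of the corresponding segmented block are available as tuples; the tuple layout and function name are illustrative.

from typing import List, Tuple

# Each step: (Pr, Pa, Pg, Po, weight), where the weight is the length of the
# segmented block at that clustering step.
Step = Tuple[int, int, int, int, int]

def popularity_scores(steps: List[Step]) -> Tuple[float, float, float]:
    # Weighted means of Pa/Pr, Pg/Pr and Po/Pr over all clustering steps.
    w_total = sum(s[4] for s in steps)
    pa_pr = sum(w * (pa / pr) for pr, pa, _g, _o, w in steps) / w_total
    pg_pr = sum(w * (pg / pr) for pr, _a, pg, _o, w in steps) / w_total
    po_pr = sum(w * (po / pr) for pr, _a, _g, po, w in steps) / w_total
    return pa_pr, pg_pr, po_pr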

Table 3 shows the results of the popularity estimation obtained by the compared approaches. The first column group shows the results obtained by the approach proposed in [13]. Although this method achieves good performance in terms of popularity estimation and inliers rate, the measure Po/Pr highlights that the method suffers from the presence of outliers in the most popular cluster. This means that the popularity ratio (Pa/Pr) is affected by the presence of some outliers in the most popular cluster.


Fig. 11. Examples of clustering performed by our system and the approach in [13]. Each column shows the input frames taken from different devices. The color of the border of each frame identifies the cluster. The proposed method has correctly identified the three clusters.

Indeed, in most cases the Pa/Pr value is higher than 1 and Po/Pr is greater than 0. The second column group of Table 3 is related to the results obtained with the proposed approach. We obtained values of popularity estimation and inliers rate comparable with respect to [13]. In this case, the values of the popularity score are all lower than 1, which means that sometimes the clustering approach loses some inliers.

However, it is worth noticing that for all the experiments performed with the proposed approach, the outlier ratio Po/Pr is very close to zero and lower with respect to the values obtained by Ortis et al. [13]. This means that the most popular cluster obtained by the proposed approach is not affected by outliers, and this assures us that the output video belongs to the most popular scene.

By performing a visual assessment of the results, we observed that using the proposed CNN features in the clustering phase yields a fine-grained clustering, which better isolates the input videos during the transitions between two scenes or during a noise time interval. This behaviour is not observed in the outputs obtained by [13]. Using the color histogram, indeed, the system defines a limited number of clusters that are, therefore, affected by outliers. Some examples of this behaviour are shown in Fig. 11 and Fig. 12. In these figures, each column shows the frames taken from the considered devices at the same instant. The border of each frame identifies the cluster. The first column of each figure shows the result of the proposed approach, and the second column shows the result of the method proposed in [13]. Specifically, Fig. 11 shows an example of clustering performed by the compared approaches during the analysis of the scenario Foosball. In this example, the first and the third devices are viewing the same scene (the library), the second device is viewing the sofa, whereas the fourth device is performing a transition between the two scenes.

In such a case, the proposed method creates a different cluster for the device that is performing the transition, whereas the method proposed in [13] includes the scene in the same cluster of the second device. Fig. 12 shows an example of the popularity clustering performed during the analysis of the scenario Meeting.

Fig. 12. Examples of clustering performed by our system and the approach in [13] . Each column shows the input frames taken from different devices. The color of the border of each frame identifies the cluster. The proposed method produces better results in terms of outlier detection.

In this case both approaches fail to insert the second frame in the popular cluster due to the huge difference in scale, but the method in [13] creates a cluster that includes the second and the fifth device (despite they are viewing two different scenes), whereas the proposed method can distinguish the second image from the fifth one.

Besides the improvement in terms of performance, the proposed approach also provides an outstanding reduction of the computation time. In fact, the time needed to extract a color histogram as suggested in [13] is about 1.5 s, whereas the time needed to extract the CNN feature used in our approach is about 0.08 s. Regarding the popularity intraflow analysis on the mobile video dataset, the proposed approach achieves similar segmentation results compared to the SIFT based approach. However, as in the wearable domain test, the proposed segmentation method strongly outperforms [13] when the system has to deal with noise (e.g., hand tremors).

6. Conclusions and future works

This paper proposes a complete framework to segment and organize a set of egocentric videos of daily living scenes. The experimental results show that the proposed system outperforms the state of the art in terms of segmentation accuracy and computational costs, both on wearable and mobile video datasets. Furthermore, our approach obtains an improvement on outlier rejection on the popularity estimation problem [13]. Future works can consider extending the system to perform recognition of contexts from egocentric images [2,31–33] and to recognize the activities performed by the user [14].


Appendix A. Pseudocodes

Algorithm 1 describes the intraflow analysis process (see Section 2.1), exploiting the involved variables and utility functions as reported in the following:

- A template is represented by a Template Object. It keeps information about the time of the related image frame, the fc7 feature, and the scene ID assigned during the procedure.

- A new Template Object is created by the function newTemplateAt(t). This function initializes a Template Object with the information related to the time t.

- The flag SEEK is True if the procedure is performing the search for a new stable template. In this case, the Template Object C represents the current template candidate, which is assigned to the current template instance T if its stability has been verified according to the conditions defined in [13]. In this case, the new template instance T is added to the Templates Set TS. If the flag SEEK is False, the procedure compares the current reference template T with the forward frames until a peak in the slope sequence is detected.

- The function sceneAssignment() performs the scene ID assignment after a new template is defined. In particular, it assigns Rejected to all the frames in the interval between the instant when the new template has been requested and the time when it has been defined. Then, the function finds the scene ID of the scene depicted by the newly defined template, eventually performing the backward procedure (as described in Section 2.1.3).

- After the segmentation of the input video, the refinements described in Section 2.1.4 are applied by the function applySegmentationRefinements().

Data: Input Video
Result: Segmented Video
C ← newTemplateAt(0); SEEK ← True;
t ← 0;
while t ≤ videoLength − step do
    t ← t + step;
    fc7_t ← extractFeatureAt(t);
    if SEEK then
        sim ← cosSimilarity(C.fc7, fc7_t);
        if sim < Tf then
            C ← newTemplateAt(t);
        else
            if C is a stable template then
                SEEK ← False;
                TS ← TS ∪ {C};
                T ← C;
                T.scene_id ← sceneAssignment();
            end
        end
    else
        sim ← cosSimilarity(T.fc7, fc7_t);
        slope ← slopeAt(t);
        if the slope function has a peak then
            SEEK ← True;
            C.fc7 ← fc7_t;
            T ← C;
        end
        assignIntervalSceneId(T.scene_id, t − step, t − 1);
    end
end
applySegmentationRefinements();

Algorithm 1: Intraflow Analysis Pseudocode.

Algorithm 2 describes the between video flows analysis process (see Section 2.2). This procedure takes as input a set S of videos, previously processed with the intraflow analysis described in Algorithm 1. To describe this procedure we defined a data structure called linkStrength. This data structure is an array indexed with the blocks extracted from the analysed videos. Considering as example a block segment bv, the value of linkStrength[bv] is equal to zero if the block bv has not yet been assigned to a scene ID; otherwise, it is equal to the cosine similarity value between bv and the matched block which caused the assignment. Indeed, all the link strengths are initialized to zero. Then, if the scene ID assigned to bv changes, the value of linkStrength[bv] is updated with the new cosine similarity value, as well as the scene ID value assigned to bv.

Data: Set S of Segmented Videos
Result: Clusters of the Segments
initialize all link strengths to zero;
foreach video vA in S do
    foreach block bvA in vA do
        if linkStrength[bvA] != 0 then
            scene_Id ← bvA.scene_Id;
        else
            assign a new scene_Id to bvA;
        end
        foreach video vBj in S \ {vA} do
            b*vBj ← argmax over bvBj in vBj of { CosSimilarity(bvA, bvBj) | σ(bvA, vBj) ≥ Tσ };
            if CosSimilarity(bvA, b*vBj) > linkStrength[b*vBj] then
                assign scene_Id to b*vBj;
                linkStrength[b*vBj] ← CosSimilarity(bvA, b*vBj);
            end
        end
    end
end

Algorithm 2: Between Flows Video Analysis Pseudocode.
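The linkStrength bookkeeping can be sketched in a few lines, assuming blocks are identified by simple (video index, block index) keys; the dictionary-based layout and function name are illustrative, not the authors' implementation.

from typing import Dict, Tuple

BlockKey = Tuple[int, int]  # (video index, block index)

def update_assignment(link_strength: Dict[BlockKey, float],
                      scene_of: Dict[BlockKey, int],
                      target: BlockKey, scene_id: int, similarity: float) -> None:
    # A block in another video takes the propagated scene ID only when the new
    # match is stronger than the one that produced its current assignment.
    if similarity > link_strength.get(target, 0.0):
        scene_of[target] = scene_id
        link_strength[target] = similarity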

References

[1] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, W.W. Mayol-Cuevas, You-do, I-learn: discovering task relevant objects and their modes of interaction from multi-user egocentric video, in: British Machine Vision Conference, 2014.
[2] A. Furnari, G.M. Farinella, S. Battiato, Recognizing personal contexts from egocentric images, in: Proceedings of the IEEE International Conference on Computer Vision Workshops - Assistive Computer Vision and Robotics, 2015, pp. 393–401.
[3] T. Kanade, Quality of life technology, Proc. IEEE 100 (8) (2012) 2394–2396.
[4] A. Furnari, G.M. Farinella, S. Battiato, Recognizing personal locations from egocentric videos, IEEE Trans. Hum. Mach. Syst. 47 (1) (2017) 6–18.
[5] M.L. Lee, A.K. Dey, Lifelogging memory appliance for people with episodic memory impairment, in: The 10th ACM International Conference on Ubiquitous Computing, 2008, pp. 44–53.
[6] V. Buso, L. Hopper, J. Benois-Pineau, P.-M. Plans, R. Megret, Recognition of activities of daily living in natural "at home" scenario for assessment of Alzheimer's disease patients, in: IEEE International Conference on Multimedia & Expo Workshops, IEEE, 2015, pp. 1–6.
[7] S. Karaman, J. Benois-Pineau, V. Dovgalecs, R. Mégret, J. Pinquier, R. André-Obrecht, Y. Gaëstel, J.-F. Dartigues, Hierarchical hidden Markov model in detecting activities of daily living in wearable videos for studies of dementia, Multimed. Tools Appl. 69 (3) (2014) 743–771.
[8] Y. Gaëstel, S. Karaman, R. Megret, O.-F. Cherifa, T. Francoise, B.-P. Jenny, J.-F. Dartigues, Autonomy at home and early diagnosis in Alzheimer's disease: utility of video indexing applied to clinical issues, the IMMED project, in: Alzheimer's Association International Conference on Alzheimer's Disease, 2011, p. S245.
[9] J. Pinquier, S. Karaman, L. Letoupin, P. Guyot, R. Mégret, J. Benois-Pineau, Y. Gaëstel, J.-F. Dartigues, Strategies for multiple feature fusion with hierarchical HMM: application to activity recognition from wearable audiovisual sensors, in: 21st International Conference on Pattern Recognition, IEEE, 2012, pp. 3192–3195.
[10] N. Kapur, E.L. Glisky, B.A. Wilson, External memory aids and computers in memory rehabilitation, in: The Essential Handbook of Memory Disorders for Clinicians, 2004, pp. 301–321.
[11] C. Gurrin, A.F. Smeaton, A.R. Doherty, Lifelogging: personal big data, Found. Trends Inf. Retrieval 8 (1) (2014) 1–125.
[12] M.D. White, Police officer body-worn cameras: assessing the evidence, Washington, DC: Office of Community Oriented Policing Services, 2014.
[13] A. Ortis, G.M. Farinella, V. D'Amico, L. Addesso, G. Torrisi, S. Battiato, RECfusion: automatic video curation driven by visual content popularity, in: ACM Multimedia, 2015.
[14] A. Fathi, A. Farhadi, J.M. Rehg, Understanding egocentric activities, in: IEEE International Conference on Computer Vision, 2011, pp. 407–414.
[15] H. Pirsiavash, D. Ramanan, Detecting activities of daily living in first-person camera views, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2847–2854.
[16] D. Ravi, C. Wong, B. Lo, G.-Z. Yang, Deep learning for human activity recognition: a resource efficient implementation on low-power devices, in: Wearable and Implantable Body Sensor Networks (BSN), 2016 IEEE 13th International Conference on, IEEE, 2016, pp. 71–76.
[17] M.S. Ryoo, L. Matthies, First-person activity recognition: what are they doing to me?, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2730–2737.
[18] Y. Poleg, C. Arora, S. Peleg, Temporal segmentation of egocentric videos, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[19] A. Furnari, G.M. Farinella, S. Battiato, Temporal segmentation of egocentric videos to highlight personal locations of interest, in: International Workshop on Egocentric Perception, Interaction, and Computing - European Conference on Computer Vision Workshops, 2016.
[20] Z. Lu, K. Grauman, Story-driven summarization for egocentric video, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[21] B. Soran, A. Farhadi, L. Shapiro, Generating notifications for missing actions: don't forget to turn the lights off!, in: International Conference on Computer Vision, 2015.
[22] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] L. Deng, A tutorial survey of architectures, algorithms, and applications for deep learning, APSIPA Trans. Signal Inf. Process. 3 (2014) e2.
[24] P. Agrawal, R. Girshick, J. Malik, Analyzing the performance of multilayer neural networks for object recognition, in: European Conference on Computer Vision, 2014, pp. 329–344.
[25] S. Bell, P. Upchurch, N. Snavely, K. Bala, Material recognition in the wild with the Materials in Context database, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[26] J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[27] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[28] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[29] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.
[30] C.D. Manning, P. Raghavan, H. Schütze, et al., Introduction to Information Retrieval, Cambridge University Press, Cambridge, 2008.
[31] G.M. Farinella, S. Battiato, Scene classification in compressed and constrained domain, IET Comput. Vision 5 (5) (2011) 320–334.
[32] G.M. Farinella, D. Ravì, V. Tomaselli, M. Guarnera, S. Battiato, Representing scenes for real-time context classification on mobile devices, Pattern Recognit. 48 (4) (2015) 1086–1100.
[33] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva, Learning deep features for scene recognition using Places database, in: Advances in Neural Information Processing Systems, 2014, pp. 487–495.
