R. Leonardi, P. Migliorati
University of Brescia
{leon,pier}@ing.unibs.it
Abstract
The aim of this work is to propose a dual approach to the semantic indexing of audio-visual documents. We present two different algorithms based respectively on a bottom-up and a top-down strategy. For the top-down approach, we propose an algorithm which implements a finite-state machine and uses low-level motion indices extracted from an MPEG compressed bit-stream. Simulation results show that the proposed method can effectively detect the presence of relevant events in sport programs. For the bottom-up approach, the indexing is performed by means of Hidden Markov Models (HMM), with an innovative approach: the input signal is considered as a non-stationary stochastic process, modeled by a HMM in which each state is associated with a different property of the audio-visual material. Several samples from the MPEG-7 content set have been analyzed using the proposed scheme, demonstrating the ability of the overall approach to provide insights about the content of audio-visual programmes. Moreover, what appears quite attractive is to use low-level descriptors to provide feedback to non-expert users about the content of the described audio-visual programme. The experiments have demonstrated that, with adequate visualization or presentation, low-level features instantly carry semantic information about the programme content, given a certain programme category, which may thus help the viewer to use such low-level information for navigation or retrieval of relevant events.
1 Introduction
Effective navigation through multimedia documents is necessary to enable widespread use of and access to richer and novel information sources. The design of efficient indexing techniques to retrieve relevant information is another important requirement. Allowing for possible automatic procedures to semantically index audio-video material therefore represents a very important challenge. Ideally, such methods should be designed to create indices of the audio-visual material which characterize the temporal structure of a multimedia document from a semantic point of view [1].
Traditionally, the most common approach to creating an index of an audio-visual document has been based on the automatic detection of shot changes and of the types of editing effects involved. This kind of approach has generally demonstrated satisfactory performance and leads to a good low-level temporal characterization of the visual content. However, the semantic characterization reached remains poor, since the description is very fragmented given the high number of shot transitions occurring in typical audio-visual programs.
Alternatively, there have been recent research efforts to base the analysis of audio-visual documents on joint audio and video processing, so as to provide a higher-level organization of information [2] [3]. In [3] these two sources of information have been jointly considered for the identification of simple scenes that compose an audio-visual program. The video analysis associated with cross-modal procedures can be very computationally intensive (by relying, for example, on identifying correlation between non-consecutive shots).
In this paper we propose and compare the performance of two different approaches for semantic indexing of audio-visual documents. In the first approach, the problem of identifying relevant situations in audio-visual programmes within specific contexts (e.g., sport) has been addressed by relating semantic features and motion indices associated with a video sequence, establishing a sequence of transitions between camera motion characteristics associated with series of consecutive shots, or portions of shots. For soccer video sequences, the semantic content can be identified with the presence of interesting events such as, for example, goals, shots towards goal, and so on; these events can be found at the beginning or at the end of a game action. A good semantic index of a soccer video sequence can therefore be established by a summary made up of a list of all game actions, each characterized by its beginning and ending event [4].
In the second part of the work, attention has been focused on suitably combining, using a complementary bottom-up approach, audio and visual descriptors associated with individual shots/audio segments to extract higher-level semantic entities: scenes or even individual programme items [5]. In particular, the indexing is performed by means of Hidden Markov Models (HMM), used in an innovative approach: the input signal is considered as a non-stationary stochastic process, modeled by a HMM in which each state stands for a different class of the signal [6]. From the obtained results, it appears that a good characterization can also be achieved by providing direct feedback of low-level audio and visual descriptors to the viewer, rather than trying to automatically define the associated semantics of the information.
The paper is organized as follows. In Sections 2 and 3 the proposed algorithms are presented, whereas the simulation results are discussed in Section 4. Conclusions are drawn in Section 5.
2 A top-down approach for Soccer video indexing using motion information
As previously mentioned, the problem of semantic video indexing is of great interest for its usefulness in the field of efficient navigation and retrieval from multimedia databases (description and retrieval of images or shots, ...). This task, which seems very simple for a human being, is not trivial to implement in an automatic manner.
After the extraction of some low-level features, a decision-making algorithm is needed for semantic indexing, and this algorithm can be implemented in two ways: modeling the human cognitive behaviour, or using statistical approaches to relate the features of some data descriptor with the outcomes of the human decision process.
Our work is based on the latter approach; namely, we attempt to perform semantic indexing of a video sequence starting from some descriptors of the motion field extracted from it. In particular, we have chosen three low-level motion indices and we have studied the correlation between these indices and the semantic events defined above [4].
Low-level motion descriptors
Motion data related to a video sequence are typically the motion vectors associated with each image. In our work, the motion information is directly extracted from the compressed MPEG-2 domain, where each frame is divided into MacroBlocks (MB), and for each MB the following are given: the MB type, and a motion vector obtained with a block-matching algorithm with respect to a reference frame. The MB type is strictly related to the motion vectors as follows: if a MB is Intra-coded, no motion vector is given; if a MB is No-Motion-coded, the motion vector is null; otherwise a non-null motion vector is given. The motion field can be represented with various types of descriptors. These compact representations should anyway be combined with other indices suitable to state their reliability or to add other useful information.
In our work we have considered three low-level indices which represent the following characteristics: (i) lack of motion, (ii) camera operations (represented by pan and zoom parameters) and (iii) the presence of shot-cuts.
Lack of motion has been evaluated by thresholding the mean value of the motion vector module. Camera motion parameters, represented by horizontal "pan" and "zoom" factors, have been estimated from the MPEG motion vectors (pan and zoom factor) [4].

Figure 1: The proposed goal-finding algorithm (finite-state machine with states SI, S1, S2 and SF; transitions are triggered by fast pan or zoom, lack of motion, shot-cuts and a timeout).

In our implementation, shot-cuts have been detected using only motion information
too. In particular, we have used the sharp variation of the above-mentioned motion indices and of the number of Intra-coded MBs of P-frames [7]. To evaluate the sharp variation of the motion field, we have used the difference between the average values of the motion vector modules of two adjacent P-frames. This parameter assumes significantly high values in the presence of a shot-cut characterized by an abrupt change in the motion vector field between the two considered shots. This information regarding the sharp change in the motion vector field has been suitably combined with the number of Intra-coded MBs of the current P-frame. When this parameter is greater than a prefixed threshold value, we assume that there is a shot-cut [4].
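The shot-cut test just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold values and the exact combination rule are assumptions.

```python
# Sketch of the shot-cut test described above: a sharp jump in the mean
# motion-vector module between adjacent P-frames, combined with a high
# number of Intra-coded MBs. Thresholds are illustrative assumptions.
def detect_shot_cut(mean_motion, intra_mb_count,
                    motion_jump_thr=8.0, intra_thr=100):
    """mean_motion: average motion-vector module per P-frame.
    intra_mb_count: number of Intra-coded macroblocks per P-frame."""
    cuts = []
    for t in range(1, len(mean_motion)):
        # sharp variation of the motion field between adjacent P-frames
        motion_jump = abs(mean_motion[t] - mean_motion[t - 1])
        # combined with many Intra-coded MBs in the current P-frame
        if motion_jump > motion_jump_thr and intra_mb_count[t] > intra_thr:
            cuts.append(t)
    return cuts
```

A frame is reported as a cut only when both indicators exceed their thresholds, which mirrors the combination described in the text.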
The proposed goal-finding algorithm
As can be seen, the above-mentioned low-level indices are not sufficient, individually, to reach satisfactory results. To find particular events, such as, for example, goals or shots toward goal, we have tried to exploit the temporal evolution of the motion indices in correspondence with such events. We have noticed that in correspondence with goals we can find fast pan or zoom, followed by lack of motion, followed by a shot-cut. The concatenation of these low-level events has therefore been identified and exploited using the finite-state machine shown in Fig. 1.
From the initial state SI, the machine goes into state S1 if fast pan or fast zoom is detected for at least 3 consecutive P-frames. Then, from state S1 the machine goes into the final state SF, where a goal is declared, if a shot-cut is present; from state S1 it goes into state S2 if lack of motion is detected for at least 3 consecutive P-frames. From state S2 the machine goes into the final state SF if a shot-cut is detected, while it returns to state S1 if fast pan or zoom is detected for at least 3 consecutive P-frames (in this case the game action is probably still going on). Two "timeouts" are used to return to the initial state SI from states S1 and S2 in case nothing happens for a certain number of P-frames (corresponding to about 1 minute of sequence).
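The state machine of Fig. 1 can be sketched in a few lines. The per-P-frame boolean inputs are assumed precomputed from the motion indices; the 3-frame persistence comes from the text, while the default timeout length is a placeholder.

```python
# Minimal sketch of the goal-finding finite-state machine of Fig. 1.
# fast_pan_zoom, lack_of_motion, shot_cut: per-P-frame booleans.
# timeout: number of P-frames (about 1 minute of sequence in the paper).
def find_goals(fast_pan_zoom, lack_of_motion, shot_cut, timeout=1500):
    state, run_pan, run_still, idle, goals = "SI", 0, 0, 0, []
    for t in range(len(shot_cut)):
        run_pan = run_pan + 1 if fast_pan_zoom[t] else 0
        run_still = run_still + 1 if lack_of_motion[t] else 0
        if state == "SI":
            if run_pan >= 3:               # fast pan/zoom for 3 P-frames
                state, idle = "S1", 0
        elif state in ("S1", "S2"):
            idle += 1
            if shot_cut[t]:                # -> SF: a goal is declared
                goals.append(t)
                state = "SI"
            elif state == "S1" and run_still >= 3:
                state = "S2"               # lack of motion for 3 P-frames
            elif state == "S2" and run_pan >= 3:
                state = "S1"               # action probably still going on
            elif idle > timeout:           # timeout: back to SI
                state = "SI"
    return goals
```

Reaching SF resets the machine so that the next game action can be tracked from SI again.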
3 A bottom-up approach: Content Based Indexing using
HMM
In the second part of this presentation the focus has been placed on providing tools for analyzing both the audio and visual streams, translating the signal samples into sequences of indices.
The whole processing system can be seen as composed of different steps of analysis, each of which extracts information at a defined level of semantic abstraction. First of all, the input stream is demultiplexed into its two main components, audio and video. An independent segmentation and classification of the two channels, audio and video, represents the next step of the analysis. In the video analysis channel, a feature vector is calculated by comparing each pair of adjacent frames, in terms of luminance histograms, motion vectors and pixel-to-pixel differences.
Each sequence of feature vectors extracted from the two streams is then classified by means of a Hidden Markov Model (HMM) [8], in an innovative approach: the input signal is considered as a non-stationary stochastic process, modelled by a HMM in which each state stands for a different class of the signal. Given a sequence of unsupervised feature vectors, the corresponding most likely sequence of indices identifying particular signal classes can be generated using the Viterbi algorithm [6].
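The Viterbi decoding step can be sketched generically as below. The model parameters (transition matrix, initial probabilities, emission likelihoods) are illustrative placeholders, not the trained values used in the paper.

```python
import numpy as np

# Generic Viterbi decoding: given per-frame emission log-likelihoods of
# each HMM state (signal class), recover the most likely state sequence.
def viterbi(log_lik, log_A, log_pi):
    """log_lik: (T, N) emission log-likelihoods; log_A: (N, N) transition
    log-probs; log_pi: (N,) initial log-probs. Returns the best path."""
    T, N = log_lik.shape
    delta = log_pi + log_lik[0]            # best score ending in each state
    psi = np.zeros((T, N), dtype=int)      # best predecessor pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]           # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Each decoded state index plays the role of a class label (e.g., an audio class, or transition/no-transition in the video case).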
For audio classification we consider four classes, namely: music, silence, speech and background noise. The result of the audio classification consists in the association of one of these classes to each previously extracted feature vector. In other words, at the end of this analysis we get a segmentation of the audio signal into these four classes with a temporal resolution of 0.5 seconds (the resolution is given by the shift in time of each clip with respect to the previous one). The first step of the analysis is the segmentation of the audio file into equal-length frames, partially overlapped in order to reduce the spectral distortion due to the windowing. The duration of each frame is N samples (typically N is set in order to have a time duration of 30-40 ms), and each frame is overlapped for 2/3 of its duration with the next frame. In this way it is possible to associate a label (music, silence, speech or background noise) with each frame, and these labels can change every N/3 samples. After the segmentation step, an audio class descriptor can be associated with each frame using an ergodic HMM.
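The framing scheme above (N-sample frames, 2/3 overlap, hence a hop of N/3 samples) can be sketched as follows. The sampling rate and the dummy signal are assumptions for the illustration; only the 30 ms frame length and the 2/3 overlap come from the text.

```python
# Sketch of the audio framing described above: frames of N samples, each
# overlapping the next by 2/3 of its duration, so the hop is N/3 samples
# and the class label may change every N/3 samples.
def frame_audio(signal, n):
    hop = n // 3
    return [signal[i:i + n]
            for i in range(0, len(signal) - n + 1, hop)]

fs = 16000                            # assumed sampling rate
n = int(0.030 * fs)                   # 30 ms -> N = 480 samples
frames = frame_audio([0.0] * fs, n)   # one second of (dummy) audio
```

Each frame would then be turned into a feature vector and labeled by the ergodic HMM.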
In the video analysis we consider the problem of segmenting a video signal into elementary units, i.e., video shots; with this aim, we train a two-state HMM classifier, in which state "1" is associated with "no detected transition" and the other state is associated with "detected shot transition". In this way the system is able to identify both abrupt transitions (cuts) and slow transitions (fades) between consecutive shots. Again, these are obtained by applying the Viterbi algorithm to the sequence of feature vectors: when the second state is reached, the system recognizes a shot transition. The same idea is used to establish a correlation between non-adjacent shots, in order to identify subsequent occurrences of the same visual content between non-consecutive shots.
The following step of the analysis focuses on extracting content from the segmented video shots, and then indexing them according to the initial audio and video classification. First, the results of the previous audio and video classifications must be time-aligned, obtaining a joint list of indices with audio and video classification information associated with each one. We introduce another semantic entity called "scene", which is composed of a group of consecutive shots. Scenes represent a level of semantic abstraction in which the audio and visual signals are jointly considered in order to extract some meaningful information about the associated stream of data.
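One way to sketch the time-alignment step is to assign each shot the dominant audio class over its duration, given audio labels at the 0.5 s resolution mentioned earlier. The majority-vote rule here is an assumption; the paper only states that the two descriptor streams are time-aligned.

```python
from collections import Counter

# Hypothetical alignment step: audio class labels arrive every 0.5 s,
# shots arrive as (start_s, end_s) intervals; each shot receives the
# dominant audio class over its duration, giving the joint index list.
def align(audio_labels, shots, clip_len=0.5):
    joint = []
    for start, end in shots:
        i0 = int(start / clip_len)
        i1 = max(int(end / clip_len), i0 + 1)   # at least one clip per shot
        votes = Counter(audio_labels[i0:i1])
        joint.append((start, end, votes.most_common(1)[0][0]))
    return joint
```

The resulting (start, end, audio class) triples are the joint descriptors on which scene rules can operate.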
A first approach to scene identification consists in the definition of four different types of scenes: dialogues, in which the audio signal is mostly speech, and the change of the associated visual information occurs in an alternated fashion (e.g., ABAB...); stories, in which the audio signal is mostly speech while the associated visual information exhibits the repetition of a given visual content, creating a shot pattern of the type ABCADEFAG...; actions, when the audio signal belongs mostly to one class (which is not speech), and the visual information exhibits a progressive pattern of shots with contrasting visual contents of the type ABCDEF...; finally, consecutive shots which do not belong to one of the aforementioned scenes but whose associated audio is of a consistent type are classified as a generic scene. Once we have defined these kinds of scenes, we can look for them in the time-aligned sequence of descriptors obtained as previously mentioned.
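The four scene types above can be sketched as simple rules over a window of shot labels plus the dominant audio class. The exact window length and decision rules are assumptions for the sketch; the paper defines the patterns only informally.

```python
# Illustrative rules for the four scene types, operating on a list of shot
# labels ("A", "B", ...) and the dominant audio class over the window.
def classify_scene(shot_labels, audio_class):
    distinct = len(set(shot_labels))
    if audio_class == "speech" and distinct == 2 and all(
            shot_labels[i] != shot_labels[i + 1]
            for i in range(len(shot_labels) - 1)):
        return "dialogue"                  # ABAB...: two alternating contents
    if audio_class == "speech" and distinct < len(shot_labels):
        return "story"                     # ABCADEFAG...: one content recurs
    if audio_class != "speech" and distinct == len(shot_labels):
        return "action"                    # ABCDEF...: all contents contrast
    return "generic"                       # consistent audio, no visual pattern
```

Scanning the time-aligned descriptor sequence with such rules yields the scene segmentation evaluated in Section 4.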
A second and more general approach to scene classification is represented by a statistical pattern recognition analysis, applying a clustering procedure on the basis of the sequence of descriptors obtained with a long-term analysis of the multimedia data. In this way we examine the linear separability of the different scene classes by evaluating two indices of disorder and the associated classification performance.
4 Experimental results
In this section the simulation results obtained using the proposed methods are described and discussed.
Top-down approach: Indexing of soccer game sequences
The performance of the proposed goal-finding algorithm has been tested on 2 hours of MPEG-2 sequences containing the semantic events reported in Table 1. Almost all live goals are detected, and the algorithm is able to detect some shots to goal too, while it gives poor results on free kicks. The number of false detections is quite high, but we have to take into account that these results are obtained using motion information only, so these false detections will probably be eliminated using other types of media information (e.g., audio information).
Events               Present                 Detected
                 Live  Replay  Total    Live  Replay  Total
Goals             20     14      34      18      7      25
Shots to goal     21     12      33       8      3      11
Penalties          6      1       7       1      0       1
Total             47     27      74      27     10      37
False detections: 116

Table 1: Performance of the proposed goal-finding algorithm.
Bottom-up approach: Content Based Indexing using HMM
Audio classification results
This analysis has been based on 60 minutes of audio, on which we have compared the supervised classification with the results obtained using the HMM. Each simulation has been carried out on audio segments with a duration of 3 minutes.
Table 2 shows the classification percentages for each class (music, silence, speech, and noise): the values on the main diagonal are the percentages of correct recognition.

          Rec. Music  Rec. Silence  Rec. Speech  Rec. Noise  Recall  Prec.
Music        80.5         3.6          11.4         4.5      0.805   0.843
Silence       2.4        95.8           1.8         0.0      0.958   0.867
Speech       11.3         2.3          85.4         1.0      0.854   0.856
Noise         3.1         2.9           2.5        91.5      0.915   0.928

Table 2: Recognition percentages, and Recall and Precision.
The classifier performance is summarized using a couple of indices evaluated for each class. These indices are called precision and recall, and are defined as follows:

precision = N_c / (N_c + N_f),    recall = N_c / (N_c + N_m),

where N_c is the number of correct detections, N_m is the number of missed detections and N_f is the number of wrong detections. Both indices lie in the range [0,1]. The various performance indices have been evaluated and the results are shown in Table 2.
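The per-class precision and recall definitions reduce to a one-line computation over the detection counts:

```python
# Precision and recall from the counts of correct (nc), false (nf) and
# missed (nm) detections, as defined in the text.
def precision_recall(nc, nf, nm):
    return nc / (nc + nf), nc / (nc + nm)
```

For example, with 8 correct, 2 false and 2 missed detections, both indices equal 0.8.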
It is clear from Table 2 that the algorithm shows the best performance on noise and silence, while the results for music and speech detection are poorer. This result is due to the high level of misclassification between music and voice, which can be caused by several problems.
Video classification results
The performance analysis of the video classifier is based on a video stream with a duration of 60 minutes. As in the audio analysis, each simulation has been carried out on video segments with a duration of 3 minutes, and all the results have then been combined to obtain the whole classification. The video classification is composed of two steps: shot segmentation by means of transition recognition, and then between-shot analysis in order to obtain the correlation among non-contiguous shots. While the initial segmentation is made using the HMM classifier, the between-shot analysis uses only the probability density functions of that model, where the vector elements of the initial state and of the transition matrix are set to 1/N.
The results of both analyses are shown in Table 3, where state 1 stands for "no detected transition" and state 0 stands for "detected shot transition".

          Rec. State 0  Rec. State 1
State 0       95             5
State 1        1.1          98.9

Table 3: Results of the video shot classification.
There are two possible errors: state 1 recognized as 0, and state 0 recognized as 1. As Table 3 indicates, it is more probable that the system reveals a false shot change than that it misses an existing one. The reduced value of precision is due to a relatively high number of false shot detections. These false detections are due to very fast camera motion, luminance changes and the motion of big objects in the scene.
Scene identification results
Using the results of the audio and video classification, it is possible to evaluate the performance of the scene identification system by means of segmentation rules, as described in the previous sections. The first step is to align the audio and video descriptors, creating a descriptor shot sequence. Using this sequence it is possible to search for the scene categories already defined (dialogues, actions, stories and generic scenes). For each kind of scene, four different performance indices have been calculated: recall, precision, recall_cover and precision_cover. The first two are
defined as in the case of the audio and video classification simulations, while the latter two have been introduced to consider situations in which some shots are recognized correctly and others are recognized improperly. A scene is considered to be correctly recognized if at least one of its shots is correctly identified (i.e., it belongs to the right kind of scene). Moreover, when two consecutive scenes are recognized as a unique scene, only one is considered correct (if it has been correctly classified) while the other is declared missed. The second couple of indices has been introduced because sometimes the identified scene only partially overlaps the real scene (some shots belonging to the identified scene do not belong to the real scene). Let
N_c be the number of shots belonging to one kind of correctly identified scene, N_a the number of shots belonging to this kind of scene, and N_r the whole number of shots belonging to the identified scenes of this kind; then we define

precision_cover = N_c / N_r,    recall_cover = N_c / N_a.
Table 4 shows the results for each type of scene and for each index.
The scene identification procedure is based on deterministic rules rather than on a stochastic classifier, and it provides limited results when used to understand the audio-visual assembly modality that the director used to create the scenes. The errors can be due to inaccuracies in the a priori rules, and obviously to errors in the previous steps of classification.
Figures 2, 3 and 4 show the classification results for three TV programmes, representing respectively 3 minutes of a talk show, of a music program, and of a scientific documentary. As can be seen, the different statistical evolution of the audio and video indices can be effectively used to infer the semantic content of the analyzed signals.
                 Recall  Precision  Recall_cover  Precision_cover
Action           0.884     0.801       0.789          0.743
Story            0.790     0.755       0.759          0.693
Generic scene    0.745     0.715       0.690          0.668

Table 4: Values of each index for each kind of scene, evaluated using the results of the previous steps of audio and video classification.
Figure 2: Audio and video classification results for 3 minutes of a talk show. In the video stream diagram, the shot label is associated with the visual content of the corresponding shot, in such a way that shots with similar visual content are denoted by the same label.
A full semantic characterization can be obtained only by having predefined, in a specific application context, the high-level semantic instances of the events of interest (e.g., goals in a soccer game). Only an intermediate semantic characterization is obtainable otherwise, identifying macro-scenes that define dialogue, story or action situations.
What appears quite attractive instead is to use low-level descriptors to provide feedback to non-expert users about the content of the described audio-visual programme. The experiments have demonstrated that, with adequate visualization or presentation, low-level features instantly carry semantic information about the programme content, given a certain programme category, which may thus help the viewer to use such low-level information for navigation or retrieval of relevant events.
5 Conclusion
In this paper we presented two different indexing algorithms based respectively on bottom-up and top-down approaches. Regarding the top-down approach, we proposed a semantic indexing algorithm based on finite-state machines and low-level motion indices extracted from the MPEG compressed bit-stream. Regarding the bottom-up approach, the signal classification is performed by means of Hidden Markov Models (HMM). Several samples from the MPEG-7 content set have been analyzed using the proposed classification schemes, demonstrating the ability of the overall approach to provide insights into the content of the audio-visual material. The classification results are quite satisfactory, and it appears that a joint visualization of audio and visual properties provides rich insights into the audio-visual programme content.
Figure 3: Audio and video classification results for 3 minutes of a music program. In the video stream diagram, the shot label is associated with the visual content of the corresponding shot, in such a way that shots with similar visual content are denoted by the same label.

Figure 4: Audio and video classification results for 3 minutes of a scientific documentary. In the video stream diagram, the shot label is associated with the visual content of the corresponding shot, in such a way that shots with similar visual content are denoted by the same label.

References

[1] N. Adami, A. Bugatti, R. Leonardi, P. Migliorati, L. Rossi, "Describing Multimedia
[2] Y. Wang, Z. Liu, J.-C. Huang, "Multimedia Content Analysis Using Audio and Visual Information", to appear in IEEE Signal Processing Magazine.
[3] C. Saraceno, R. Leonardi, "Indexing Audio-Visual Databases Through a Joint Audio and Video Processing", International Journal of Imaging Systems and Technology, Vol. 9, No. 5, pp. 320-331, Oct. 1998.
[4] A. Bonzanini, R. Leonardi, P. Migliorati, "Semantic Video Indexing Using MPEG Motion Vectors", Proc. EUSIPCO 2000, pp. 147-150, 4-8 Sept. 2000, Tampere, Finland.
[5] R. Zhao, W. I. Grosky, "From Features to Semantics: Some Preliminary Results", Proc. of IEEE International Conference ICME 2000, New York, NY, USA, 30 July-2 August 2000.
[6] F. Oppini, R. Leonardi, "Audiovisual Pattern Recognition Using HMM for Content-based Multimedia Indexing", Proc. of Packet Video 2000, Cagliari, Italy.
[7] Y. Deng, B. S. Manjunath, "Content-Based Search of Video Using Color, Texture, and Motion", Proc. of IEEE International Conference ICIP-97, Santa Barbara, California, USA, pp. 534-536, October 26-29, 1997.
[8] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, Feb. 1989.