R. Leonardi, P. Migliorati
University of Brescia
{leon,pier}@ing.unibs.it
Abstract
The aim of this work is to propose a dual approach to the semantic indexing of audio-visual documents. We present two different algorithms based respectively on a bottom-up and a top-down strategy. For the top-down approach, we propose an algorithm which implements a finite-state machine and uses low-level motion indices extracted from an MPEG compressed bit-stream. Simulation results show that the proposed method can effectively detect the presence of relevant events in sport programs. For the bottom-up approach, the indexing is performed by means of Hidden Markov Models (HMM), with an innovative approach: the input signal is considered as a non-stationary stochastic process, modeled by a HMM in which each state is associated with a different property of the audio-visual material. Several samples from the MPEG-7 content set have been analyzed using the proposed scheme, demonstrating the ability of the overall approach to provide insights about the content of audio-visual programmes. Moreover, what appears quite attractive is to use low-level descriptors to provide feedback to non-expert users about the content of the described audio-visual programme. The experiments have demonstrated that, with adequate visualization or presentation, low-level features instantly carry semantic information about the programme content, given a certain programme category, which may thus help the viewer to use such low-level information for navigation or retrieval of relevant events.
1 Introduction
Effective navigation through multimedia documents is necessary to enable widespread use of and access to richer and novel information sources. The design of efficient indexing techniques to retrieve relevant information is another important requirement. Allowing for possible automatic procedures to semantically index audio-video material therefore represents a very important challenge. Ideally, such methods should be designed to create indices of the audio-visual material which characterize the temporal structure of a multimedia document from a semantic point of view [1].
Traditionally, the most common approach to creating an index of an audio-visual document has been based on the automatic detection of shot changes and of the types of editing effects involved. This kind of approach has generally demonstrated satisfactory performance and leads to a good low-level temporal characterization of the visual content. However, the semantic characterization reached remains poor, since the description is very fragmented given the high number of shot transitions occurring in typical audio-visual programs.
Alternatively, there have been recent research efforts to base the analysis of audio-visual documents on joint audio and video processing, so as to provide a higher-level organization of information [2] [3]. In [3] these two sources of information have been jointly considered for the identification of simple scenes that compose an audio-visual program. The video analysis associated with cross-modal procedures can be very computationally intensive (by relying, for example, on identifying correlation between non-consecutive shots).
In this paper we propose and compare the performance of two different approaches for semantic indexing of audio-visual documents. In the first approach, the problem of identifying relevant situations in audio-visual programmes within specific contexts (e.g., sport) has been addressed by relating semantic features and motion indices associated with a video sequence, establishing a sequence of transitions between camera motion characteristics associated with series of consecutive shots, or portions of shots. For soccer video sequences, the semantic content can be identified with the presence of interesting events such as, for example, goals, shots towards goal, and so on; these events can be found at the beginning or at the end of a game action. A good semantic index of a soccer video sequence can therefore be established by a summary made up of a list of all game actions, each characterized by its beginning and ending event [4].
In the second part of the work, attention has been focused on suitably combining, using a complementary bottom-up approach, audio and visual descriptors associated with individual shots/audio segments to extract higher-level semantic entities: scenes or even individual programme items [5]. In particular, the indexing is performed by means of Hidden Markov Models (HMM), used in an innovative approach: the input signal is considered as a non-stationary stochastic process, modeled by a HMM in which each state stands for a different class of the signal [6]. From the obtained results, it appears that a good characterization can also be achieved by providing direct feedback of low-level audio and visual descriptors to the viewer, rather than trying to automatically define the associated semantics of the information.
The paper is organized as follows. In Sections 2 and 3 the proposed algorithms are presented, whereas the simulation results are discussed in Section 4. Conclusions are drawn in Section 5.
2 A top-down approach for Soccer video indexing using motion information
As previously mentioned, the problem of semantic video indexing is of great interest for its usefulness in the field of efficient navigation and retrieval from multimedia databases (description and retrieval of images or shots, ...). This task, which seems very simple for a human being, is not trivial to implement in an automatic manner.
After the extraction of some low-level features, a decision-making algorithm is needed for semantic indexing, and this algorithm can be implemented in two ways: modeling the human cognitive behaviour, or using statistical approaches to relate the features of some data descriptor with the outcomes of the human decision process.
Our work is based on the latter approach; namely, we attempt to perform semantic indexing of a video sequence starting from some descriptors of the motion field extracted from it. In particular, we have chosen three low-level motion indices and we have studied the correlation between these indices and the semantic events defined above [4].
Low-level motion descriptors
Motion data related to a video sequence are typically the motion vectors associated with each image. In our work, the motion information is directly extracted from the compressed MPEG-2 domain, where each frame is divided into MacroBlocks (MB), and for each MB the following are given: the MB type, and a motion vector obtained with a block-matching algorithm with respect to a reference frame. The MB type is strictly related to the motion vectors as follows: if a MB is Intra-coded, no motion vector is given; if a MB is No-Motion-coded, the motion vector is null; otherwise a non-null motion vector is given. The motion field can be represented with various types of descriptors. These compact representations should anyway be combined with other indices suitable to state their reliability or to add other useful information.
In our work we have considered three low-level indices which represent the following characteristics: (i) lack of motion, (ii) camera operations (represented by pan and zoom parameters) and (iii) the presence of shot-cuts.
Lack of motion has been evaluated by thresholding the mean value of the motion vector module. Camera motion parameters, represented by horizontal "pan" and "zoom" factors, have been estimated from the MPEG motion vectors (pan and zoom factor) [4].

Figure 1: The proposed goal-finding algorithm (finite-state machine with states SI, S1, S2 and SF; transitions are triggered by fast pan or zoom, lack of motion, shot-cuts and a timeout).

In our implementation, shot-cuts have been detected using only motion information
too. In particular, we have used the sharp variation of the above-mentioned motion indices and of the number of Intra-coded MBs of P-frames [7]. To evaluate the sharp variation of the motion field, we have used the difference between the average values of the motion vector modules of two adjacent P-frames. This parameter assumes significantly high values in the presence of a shot-cut characterized by an abrupt change in the motion vector field between the two considered shots. This information regarding the sharp change in the motion vector field has been suitably combined with the number of Intra-coded MBs of the current P-frame. When this parameter is greater than a prefixed threshold value, we assume that there is a shot-cut [4].
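The shot-cut test just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold values and the exact combination rule are assumptions.

```python
# Sketch of the shot-cut test described above: a sharp jump in the mean
# motion-vector module between adjacent P-frames, combined with a high
# number of Intra-coded MBs. Thresholds are illustrative assumptions.
def detect_shot_cut(mean_motion, intra_mb_count,
                    motion_jump_thr=8.0, intra_thr=100):
    """mean_motion: average motion-vector module per P-frame.
    intra_mb_count: number of Intra-coded macroblocks per P-frame."""
    cuts = []
    for t in range(1, len(mean_motion)):
        # sharp variation of the motion field between adjacent P-frames
        motion_jump = abs(mean_motion[t] - mean_motion[t - 1])
        # combined with many Intra-coded MBs in the current P-frame
        if motion_jump > motion_jump_thr and intra_mb_count[t] > intra_thr:
            cuts.append(t)
    return cuts
```

A frame is reported as a cut only when both indicators exceed their thresholds, which mirrors the combination described in the text.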
The proposed goal-finding algorithm
As can be seen, the above-mentioned low-level indices are not sufficient, individually, to reach satisfactory results. To find particular events, such as, for example, goals or shots toward goal, we have tried to exploit the temporal evolution of the motion indices in correspondence with such events. We have noticed that in correspondence with goals we can find fast pan or zoom, followed by lack of motion, followed by a shot-cut. The concatenation of these low-level events has therefore been identified and exploited using the finite-state machine shown in Fig. 1.
From the initial state SI, the machine goes into state S1 if fast pan or fast zoom is detected for at least 3 consecutive P-frames. Then, from state S1 the machine goes into the final state SF, where a goal is declared, if a shot-cut is present; from state S1 it goes into state S2 if lack of motion is detected for at least 3 consecutive P-frames. From state S2 the machine goes into the final state SF if a shot-cut is detected, while it returns to state S1 if fast pan or zoom is detected for at least 3 consecutive P-frames (in this case the game action is probably still going on). Two "timeouts" are used to return to the initial state SI from states S1 and S2 in case nothing happens for a certain number of P-frames (corresponding to about 1 minute of sequence).
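The state machine of Fig. 1 can be sketched in a few lines. The per-P-frame boolean inputs are assumed precomputed from the motion indices; the 3-frame persistence comes from the text, while the default timeout length is a placeholder.

```python
# Minimal sketch of the goal-finding finite-state machine of Fig. 1.
# fast_pan_zoom, lack_of_motion, shot_cut: per-P-frame booleans.
# timeout: number of P-frames (about 1 minute of sequence in the paper).
def find_goals(fast_pan_zoom, lack_of_motion, shot_cut, timeout=1500):
    state, run_pan, run_still, idle, goals = "SI", 0, 0, 0, []
    for t in range(len(shot_cut)):
        run_pan = run_pan + 1 if fast_pan_zoom[t] else 0
        run_still = run_still + 1 if lack_of_motion[t] else 0
        if state == "SI":
            if run_pan >= 3:               # fast pan/zoom for 3 P-frames
                state, idle = "S1", 0
        elif state in ("S1", "S2"):
            idle += 1
            if shot_cut[t]:                # -> SF: a goal is declared
                goals.append(t)
                state = "SI"
            elif state == "S1" and run_still >= 3:
                state = "S2"               # lack of motion for 3 P-frames
            elif state == "S2" and run_pan >= 3:
                state = "S1"               # action probably still going on
            elif idle > timeout:           # timeout: back to SI
                state = "SI"
    return goals
```

Reaching SF resets the machine so that the next game action can be tracked from SI again.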
3 A bottom-up approach: Content Based Indexing using
HMM
In the second part of this presentation the focus has been placed on providing tools for analyzing both the audio and visual streams, translating the signal samples into sequences of indices.
The whole processing system can be seen as composed of different steps of analysis, each of which extracts information at a defined level of semantic abstraction. First of all, the input stream is demultiplexed into its two main components, audio and video. An independent segmentation and classification of the two channels, audio and video, represents the next step of the analysis. In the video analysis channel, a feature vector is calculated by comparing each pair of adjacent frames, in terms of luminance histograms, motion vectors and pixel-to-pixel differences.
Each sequence of feature vectors extracted from the two streams is then classified by means of a Hidden Markov Model (HMM) [8], in an innovative approach: the input signal is considered as a non-stationary stochastic process, modelled by a HMM in which each state stands for a different class of the signal. Given a sequence of unsupervised feature vectors, the corresponding most likely sequence of indices identifying particular signal classes can be generated using the Viterbi algorithm [6].
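The Viterbi decoding step can be sketched generically as below. The model parameters (transition matrix, initial probabilities, emission likelihoods) are illustrative placeholders, not the trained values used in the paper.

```python
import numpy as np

# Generic Viterbi decoding: given per-frame emission log-likelihoods of
# each HMM state (signal class), recover the most likely state sequence.
def viterbi(log_lik, log_A, log_pi):
    """log_lik: (T, N) emission log-likelihoods; log_A: (N, N) transition
    log-probs; log_pi: (N,) initial log-probs. Returns the best path."""
    T, N = log_lik.shape
    delta = log_pi + log_lik[0]            # best score ending in each state
    psi = np.zeros((T, N), dtype=int)      # best predecessor pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]           # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

Each decoded state index plays the role of a class label (e.g., an audio class, or transition/no-transition in the video case).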
For audio classification we consider four classes, namely: music, silence, speech and background noise. The result of the audio classification consists in the association of one of these classes to each previously extracted feature vector. In other words, at the end of this analysis we get a segmentation of the audio signal into these four classes with a temporal resolution of 0.5 seconds (the resolution is given by the shift in time of each clip with respect to the previous one). The first step of the analysis is the segmentation of the audio file into equal-length frames, partially overlapped in order to reduce the spectral distortion due to the windowing. The duration of each frame is N samples (typically N is set in order to have a time duration of 30-40 ms), and each frame is overlapped for 2/3 of its duration with the next frame. In this way it is possible to associate a label (music, silence, speech or background noise) with each frame, and these labels can change every N/3 samples. After the segmentation step, an audio class descriptor can be associated with each frame using an ergodic HMM.
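The framing scheme above (N-sample frames, 2/3 overlap, hence a hop of N/3 samples) can be sketched as follows. The sampling rate and the dummy signal are assumptions for the illustration; only the 30 ms frame length and the 2/3 overlap come from the text.

```python
# Sketch of the audio framing described above: frames of N samples, each
# overlapping the next by 2/3 of its duration, so the hop is N/3 samples
# and the class label may change every N/3 samples.
def frame_audio(signal, n):
    hop = n // 3
    return [signal[i:i + n]
            for i in range(0, len(signal) - n + 1, hop)]

fs = 16000                            # assumed sampling rate
n = int(0.030 * fs)                   # 30 ms -> N = 480 samples
frames = frame_audio([0.0] * fs, n)   # one second of (dummy) audio
```

Each frame would then be turned into a feature vector and labeled by the ergodic HMM.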
In the video analysis we consider the problem of segmenting a video signal into elementary units, i.e., video shots; with this aim, we train a two-state HMM classifier, in which state "1" is associated with "no detected transition" and the other state is associated with "detected shot transition". In this way the system is able to identify both abrupt transitions (cuts) and slow transitions (fades) between consecutive shots. Again, these are obtained by applying the Viterbi algorithm to the sequence of feature vectors: when the second state is reached, the system recognizes a shot transition. The same idea is used to establish a correlation between non-adjacent shots, in order to identify subsequent occurrences of the same visual content between non-consecutive shots.
The following step of the analysis focuses on extracting content from the segmented video shots, and then indexing them according to the initial audio and video classification. First, the results of the previous audio and video classifications must be time-aligned, obtaining a joint list of indices with audio and video classification information associated with each one. We introduce another semantic entity called "scene", which is composed of a group of consecutive shots. Scenes represent a level of semantic abstraction in which the audio and visual signals are jointly considered in order to extract some meaningful information about the associated stream of data.
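One way to sketch the time-alignment step is to assign each shot the dominant audio class over its duration, given audio labels at the 0.5 s resolution mentioned earlier. The majority-vote rule here is an assumption; the paper only states that the two descriptor streams are time-aligned.

```python
from collections import Counter

# Hypothetical alignment step: audio class labels arrive every 0.5 s,
# shots arrive as (start_s, end_s) intervals; each shot receives the
# dominant audio class over its duration, giving the joint index list.
def align(audio_labels, shots, clip_len=0.5):
    joint = []
    for start, end in shots:
        i0 = int(start / clip_len)
        i1 = max(int(end / clip_len), i0 + 1)   # at least one clip per shot
        votes = Counter(audio_labels[i0:i1])
        joint.append((start, end, votes.most_common(1)[0][0]))
    return joint
```

The resulting (start, end, audio class) triples are the joint descriptors on which scene rules can operate.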
A first approach to scene identification consists in the definition of four different types of scenes: dialogues, in which the audio signal is mostly speech, and the change of the associated visual information occurs in an alternated fashion (e.g., ABAB...); stories, in which the audio signal is mostly speech while the associated visual information exhibits the repetition of a given visual content, creating a shot pattern of the type ABCADEFAG...; actions, when the audio signal belongs mostly to one class (which is not speech), and the visual information exhibits a progressive pattern of shots with contrasting visual contents of the type ABCDEF...; finally, consecutive shots which do not belong to one of the aforementioned scenes but whose associated audio is of a consistent type are classified as a generic scene. Once we have defined these kinds of scenes, we can look for them in the time-aligned sequence of descriptors obtained as previously mentioned.
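The four scene types above can be sketched as simple rules over a window of shot labels plus the dominant audio class. The exact window length and decision rules are assumptions for the sketch; the paper defines the patterns only informally.

```python
# Illustrative rules for the four scene types, operating on a list of shot
# labels ("A", "B", ...) and the dominant audio class over the window.
def classify_scene(shot_labels, audio_class):
    distinct = len(set(shot_labels))
    if audio_class == "speech" and distinct == 2 and all(
            shot_labels[i] != shot_labels[i + 1]
            for i in range(len(shot_labels) - 1)):
        return "dialogue"                  # ABAB...: two alternating contents
    if audio_class == "speech" and distinct < len(shot_labels):
        return "story"                     # ABCADEFAG...: one content recurs
    if audio_class != "speech" and distinct == len(shot_labels):
        return "action"                    # ABCDEF...: all contents contrast
    return "generic"                       # consistent audio, no visual pattern
```

Scanning the time-aligned descriptor sequence with such rules yields the scene segmentation evaluated in Section 4.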
A second and more general approach to scene classification is represented by a statistical pattern recognition analysis, applying a clustering procedure on the basis of the sequence of descriptors obtained with a long-term analysis of the multimedia data. In this way we examine the linear separability of the different scene classes by evaluating two indices of disorder and the associated classification performance.
4 Experimental results
In this section the simulation results obtained using the proposed methods are described and discussed.
Top-down approach: Indexing of soccer game sequences
The performance of the proposed goal-finding algorithm has been tested on 2 hours of MPEG-2 sequences containing the semantic events reported in Table 1. Almost all live goals are detected, and the algorithm is able to detect some shots to goal too, while it gives poor results on free kicks. The number of false detections is quite high, but we have to take into account that these results are obtained using motion information only, so these false detections will probably be eliminated using other types of media information (e.g., audio information).
Events               Present                 Detected
                 Live  Replay  Total    Live  Replay  Total
Goals             20     14      34      18      7      25
Shots to goal     21     12      33       8      3      11
Penalties          6      1       7       1      0       1
Total             47     27      74      27     10      37
False detections: 116

Table 1: Performance of the proposed goal-finding algorithm.
Bottom-up approach: Content Based Indexing using HMM
Audio classification results
This analysis has been based on 60 minutes of audio, on which we have compared the supervised classification with the results obtained using the HMM. Each simulation has been carried out on audio segments with a duration of 3 minutes.
Table 2 shows the classification percentages for each class (music, silence, speech, and noise): the values on the main diagonal are the percentages of correct recognition.

          Rec. Music  Rec. Silence  Rec. Speech  Rec. Noise  Recall  Prec.
Music        80.5         3.6          11.4         4.5      0.805   0.843
Silence       2.4        95.8           1.8         0.0      0.958   0.867
Speech       11.3         2.3          85.4         1.0      0.854   0.856
Noise         3.1         2.9           2.5        91.5      0.915   0.928

Table 2: Recognition percentages, and Recall and Precision.
The classifier performance is summarized using a couple of indices evaluated for each class. These indices are called precision and recall, and are defined as follows:

precision = N_c / (N_c + N_f),    recall = N_c / (N_c + N_m),

where N_c is the number of correct detections, N_m is the number of missed detections and N_f is the number of wrong detections. Both indices lie in the range [0,1]. The various performance indices have been evaluated and the results are shown in Table 2.
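The per-class precision and recall definitions reduce to a one-line computation over the detection counts:

```python
# Precision and recall from the counts of correct (nc), false (nf) and
# missed (nm) detections, as defined in the text.
def precision_recall(nc, nf, nm):
    return nc / (nc + nf), nc / (nc + nm)
```

For example, with 8 correct, 2 false and 2 missed detections, both indices equal 0.8.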
It is clear from Table 2 that the algorithm shows the best performance on noise and silence, while the results for music and speech detection are poorer. This result is due to the high level of misclassification between music and voice, which can be caused by several problems.
Video classification results
The performance analysis of the video classifier is based on a video stream with a duration of 60 minutes. As in the audio analysis, each simulation has been carried out on video segments with a duration of 3 minutes, and all the results have then been combined to obtain the whole classification. The video classification is composed of two steps: shot segmentation by means of transition recognition, and then between-shot analysis in order to obtain the correlation among non-contiguous shots. While the initial segmentation is made using the HMM classifier, the between-shot analysis uses only the probability density functions of that model, where the vector elements of the initial state and of the transition matrix are set to 1/N.
The results of both analyses are shown in Table 3, where state 1 stands for "no detected transition" and state 0 stands for "detected shot transition".

          Rec. State 0  Rec. State 1
State 0       95             5
State 1        1.1          98.9

Table 3: Results of the video shot classification.
There are two possible errors: state 1 recognized as 0, and state 0 recognized as 1. As Table 3 indicates, it is more probable that the system reveals a false shot change than that it misses an existing one. The reduced value of precision is due to a relatively high number of false shot detections. These false detections are due to very fast camera motion, luminance changes and the motion of big objects in the scene.
Scene identification results
Using the results of the audio and video classification, it is possible to evaluate the performance of the scene identification system by means of segmentation rules, as described in the previous sections. The first step is to align the audio and video descriptors, creating a descriptor shot sequence. Using this sequence it is possible to search for the scene categories already defined (dialogues, actions, stories and generic scenes). For each kind of scene, four different performance indices have been calculated: recall, precision, recall_cover and precision_cover. The first two are
defined as in the case of the audio and video classification simulations, while the latter two have been introduced to consider situations in which some shots are recognized correctly and others are recognized improperly. A scene is considered to be correctly recognized if at least one of its shots is correctly identified (i.e., it belongs to the right kind of scene). Moreover, when two consecutive scenes are recognized as a unique scene, only one is considered correct (if it has been correctly classified) while the other is declared missed. The second couple of indices has been introduced because sometimes the identified scene only partially overlaps the real scene (some shots belonging to the identified scene do not belong to the real scene). Let
N_c be the number of shots belonging to one kind of correctly identified scene, N_a the number of shots belonging to this kind of scene, and N_r the whole number of shots belonging to the identified scenes of this kind; then we define

precision_cover = N_c / N_r,    recall_cover = N_c / N_a.
Table 4 shows the results for each type of scene and for each index.
The scene identification procedure is based on deterministic rules rather than on a stochastic classifier, and it provides limited results when used to understand the audio-visual assembly modality that the director used to create the scenes. The errors can be due to inaccuracies in the a priori rules, and obviously to errors in the previous steps of classification.
Figures 2, 3 and 4 show the classification results for three TV programmes, representing respectively 3 minutes of a talk show, of a music program, and of a scientific documentary. As can be seen, the different statistical evolution of the audio and video indices can be effectively used to infer the semantic content of the analyzed signals.
                 Recall  Precision  Recall_cover  Precision_cover
Action           0.884     0.801       0.789          0.743
Story            0.790     0.755       0.759          0.693
Generic scene    0.745     0.715       0.690          0.668

Table 4: Values of each index for each kind of scene, evaluated using the results of the previous steps of audio and video classification.
Figure 2: Audio and video classification results for 3 minutes of a talk show. In the video stream diagram, the shot label is associated with the visual content of the corresponding shot, in such a way that shots with similar visual content are denoted by the same label.
A full semantic characterization can be obtained only by having predefined, in a specific application context, the high-level semantic instances of the events of interest (e.g., goals in a soccer game). Only an intermediate semantic characterization is obtainable otherwise, identifying macro-scenes that define dialogue, story or action situations.
What appears quite attractive instead is to use low-level descriptors to provide feedback to non-expert users about the content of the described audio-visual programme. The experiments have demonstrated that, with adequate visualization or presentation, low-level features instantly carry semantic information about the programme content, given a certain programme category, which may thus help the viewer to use such low-level information for navigation or retrieval of relevant events.
5 Conclusion
In this paper we presented two different indexing algorithms based respectively on bottom-up and top-down approaches. Regarding the top-down approach, we proposed a semantic indexing algorithm based on finite-state machines and low-level motion indices extracted from the MPEG compressed bit-stream. Regarding the bottom-up approach, the signal classification is performed by means of Hidden Markov Models (HMM). Several samples from the MPEG-7 content set have been analyzed using the proposed classification schemes, demonstrating the ability of the overall approach to provide insights into the content of the audio-visual material. The classification results are quite satisfactory, and it appears that a joint visualization of audio and visual properties provides rich insights into the audio-visual programme content.
Figure 3: Audio and video classification results for 3 minutes of a music program. In the video stream diagram, the shot label is associated with the visual content of the corresponding shot, in such a way that shots with similar visual content are denoted by the same label.

Figure 4: Audio and video classification results for 3 minutes of a scientific documentary. In the video stream diagram, the shot label is associated with the visual content of the corresponding shot, in such a way that shots with similar visual content are denoted by the same label.

References

[1] N. Adami, A. Bugatti, R. Leonardi, P. Migliorati, L. Rossi, "Describing Multimedia
[2] Y. Wang, Z. Liu, J.-C. Huang, "Multimedia Content Analysis Using Audio and Visual Information", to appear in IEEE Signal Processing Magazine.
[3] C. Saraceno, R. Leonardi, "Indexing Audio-Visual Databases Through a Joint Audio and Video Processing", International Journal of Imaging Systems and Technology, Vol. 9, No. 5, pp. 320-331, Oct. 1998.
[4] A. Bonzanini, R. Leonardi, P. Migliorati, "Semantic Video Indexing Using MPEG Motion Vectors", Proc. EUSIPCO 2000, pp. 147-150, 4-8 Sept. 2000, Tampere, Finland.
[5] R. Zhao, W. I. Grosky, "From Features to Semantics: Some Preliminary Results", Proc. of IEEE International Conference ICME 2000, New York, NY, USA, 30 July-2 August 2000.
[6] F. Oppini, R. Leonardi, "Audiovisual Pattern Recognition Using HMM for Content-based Multimedia Indexing", Proc. of Packet Video 2000, Cagliari, Italy.
[7] Y. Deng, B. S. Manjunath, "Content-Based Search of Video Using Color, Texture, and Motion", Proc. of IEEE International Conference ICIP-97, Santa Barbara, California, USA, pp. 534-536, October 26-29, 1997.
[8] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, Feb. 1989.