
Chapter 5

Experimental evaluation

We implemented Algorithms Top-k TUA, Top-k PUA, Top-k TUC and Top-k PUC, and experimentally evaluated both running time and accuracy on real-world video and cyber security datasets.

Figure 5.1: The prototype architecture for video context

The prototype architecture shown in figure 5.1 consists of the following components:

• an Image Processing Library

• a Video Labeler

• an Activity Detection Engine

• the UAP Engine implementing the algorithms described in chapter 4

In particular, the Image Processing Library analyzes the video captured by the sensors/cameras and returns the low level annotations of each video frame as output; the Video Labeler fills the semantic gap between the low level annotations captured for each frame and the high level annotations, producing the observation sequence as defined in chapter 3. The Activity Detection Engine then finds activity occurrences matching the well-known models, which can be classified into good and bad ones as defined in section 1.2: it takes as inputs the observation sequence produced by the Video Labeler and the stochastic activity models. Finally, our framework, called the UAP Engine in this architecture, takes as input the activity occurrences found in the previous step, with their associated probabilities, and the observation sequence, and discovers the Unexplained Video Activities.

All these components are described in subsubsections 5.1.1.1, 5.1.1.2, 5.1.1.3 and 5.1.1.4.
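To make the data flow between these components more concrete, the sketch below outlines one possible way of expressing the pipeline of figure 5.1 as Java interfaces; all type and method names are illustrative placeholders introduced here, not the actual classes of the prototype.

```java
import java.util.List;

/*
 * Minimal sketch of the prototype architecture of figure 5.1.
 * Every type and method name here is an illustrative placeholder,
 * not the real prototype implementation.
 */
interface ImageProcessingLibrary {
    /** Analyzes the captured video and returns the low level annotations per frame (XML). */
    String process(String videoPath);
}

interface VideoLabeler {
    /** Maps low level annotations to the observation sequence (action symbols + timestamps). */
    List<Observation> label(String lowLevelAnnotationsXml);
}

interface ActivityDetectionEngine {
    /** Finds occurrences of the well-known activity models, with their probabilities. */
    List<ActivityOccurrence> detect(List<Observation> observations, List<ActivityModel> models);
}

interface UapEngine {
    /** Discovers the top-k Unexplained Video Activities. */
    List<UnexplainedActivity> findUnexplained(List<Observation> observations,
                                              List<ActivityOccurrence> occurrences, int k);
}

// Placeholder data types referenced above.
record Observation(long timestamp, String actionSymbol) {}
record ActivityModel(String name) {}
record ActivityOccurrence(String activityName, long start, long end, double probability) {}
record UnexplainedActivity(long start, long end, double probability) {}
```

In the actual prototype the components exchange data through files (XML and comma-separated values), as described in the following subsubsections, rather than through in-memory calls.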

5.1.1.1 The Image Processing Library

With recent advances in computer technology, automated visual surveillance has become a popular area for research and development. Surveillance cameras are installed in many


public areas to improve safety, and computer-based image processing is a promising means to handle the vast amount of image data generated by large networks of cameras.

A number of algorithms to track people in camera images can be found in the literature, yet so far little research has gone into building realtime visual surveillance systems that integrate people trackers. The task of an integrated surveillance system is to warn an operator when it detects events which may require human intervention, for example to avoid possible accidents or vandalism. These warnings can only be reliable if the system can detect and understand human behaviour, and for this it must locate and track people reliably. Thus, a multitude of Computer Vision algorithms have been developed.

Unfortunately, a further issue with this kind of algorithm is false alarms and missed events, which can reduce the reliability of the system; such reliability is strongly related to the environmental conditions in which the system works. Basically, the contribution of Computer Vision to this kind of application is based on difference analysis against a more or less well-known model of reality. For instance, in the pixel-level identification of moving objects, the current image is subtracted from a background model. Likewise, when objects are tracked, the object shapes are compared with some reference models. A main aspect to be taken into account is how much the environmental variations deviate from the model. Such variations can in some cases be due to noise. While in indoor environments the presence of noise is quite limited, in outdoor environments it may be unrestrained and excessive and can cause a considerable degradation of performance. There are numerous applications used for these purposes. In any case, this kind of processing can be structured into different layers:

• pixel layer

• frame layer

• tracking layer

At the lowest layer is the pixel processing layer, which processes the images captured from the source strictly in the pixel domain. Through the base algorithms, a pixel identification is made in order to verify whether pixels belong to objects or to the background of the scene. Video surveillance applications have used different kinds of pixel processing algorithms over the years. Basically, pixel identification can be made either through background difference techniques, through frame difference ones, or through combinations of both. The knowledge available at this layer is limited to the image pixels: the nature and the number of the objects present are not considered. Moreover, noise sources have to be taken into account. As a matter of fact, a specific tuning phase is needed to restrict the presence of pixels caused by noise.

In the second processing layer (Frame Processing Layer), a first interpretation of the image supplied by the underlying layer is carried out. The relationships between pixels detected as zones of interest are considered. The techniques used in this context tend to cluster the pixels found at the underlying layer, trying to erase pixels that do not belong to the objects of interest and represent noise (through size filtering techniques), as well as pixels which, though representing movement, belong to uninteresting objects, such as shadows or areas due to lighting effects. The main aim of this layer is to cluster the pixels belonging to the objects of interest (blobs) through different segmentation techniques. Moreover, another Frame Processing task consists of the feature extraction of the detected objects. This information will be the basis for the next layer.
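As an illustration of the size filtering mentioned above, the following sketch removes connected groups of foreground pixels whose area is below a threshold from a binary motion mask; it is only a minimal example written for this discussion and does not reproduce the filtering code actually used in the layers described here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified size filter for a binary motion mask: connected groups of foreground
// pixels (4-connectivity) smaller than minArea are erased, since they most likely
// correspond to noise rather than to objects of interest.
public final class SizeFilter {

    public static void filter(boolean[][] mask, int minArea) {
        int h = mask.length, w = mask[0].length;
        boolean[][] visited = new boolean[h][w];

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (mask[y][x] && !visited[y][x]) {
                    // Collect the whole blob with a flood fill.
                    Deque<int[]> stack = new ArrayDeque<>();
                    Deque<int[]> blob = new ArrayDeque<>();
                    stack.push(new int[]{y, x});
                    visited[y][x] = true;
                    while (!stack.isEmpty()) {
                        int[] p = stack.pop();
                        blob.push(p);
                        int[][] neighbours = {{p[0] - 1, p[1]}, {p[0] + 1, p[1]},
                                              {p[0], p[1] - 1}, {p[0], p[1] + 1}};
                        for (int[] n : neighbours) {
                            int ny = n[0], nx = n[1];
                            if (ny >= 0 && ny < h && nx >= 0 && nx < w
                                    && mask[ny][nx] && !visited[ny][nx]) {
                                visited[ny][nx] = true;
                                stack.push(n);
                            }
                        }
                    }
                    // Erase blobs that are too small to be objects of interest.
                    if (blob.size() < minArea) {
                        for (int[] p : blob) {
                            mask[p[0]][p[1]] = false;
                        }
                    }
                }
            }
        }
    }
}
```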

The aim of the Tracking layer is the pursuit and classification of the objects discovered at the previous layer. It needs some information about the objects previously found: for instance, size, area, shape and all the properties that may help to identify a pixel aggregate present in different, often consecutive, frames of the image sequence as a single object. As far as video surveillance systems are concerned, there are many aspects to be taken into account for an effective implementation. Depending on the particular application domain, there are many constraints that may strongly affect the whole process. Depending on the application type, it is possible to identify some restricted zones of the image (Regions of Interest) on which the overall attention of the video surveillance process is focused.

There are specific applications for indoor and outdoor environments, and others specialized in the detection of particular classes of objects (persons, cars and so on).

The Image Processing Library used in our prototype implementation is the Reading People Tracker (RPT) [52], [53]. This library achieves good accuracy in object detection and tracking and is very easy to install and use on Unix systems. Moreover, it returns an XML-based output format that is very easy to understand and process.

More in detail, the Reading People Tracker is software for tracking people in camera images for visual surveillance purposes. It originates from research work on people tracking for automatic visual surveillance systems for crime detection and prevention.

It was built within the context of two PhD theses (by AM Baumberg and NT Siebel) and contains state-of-the-art image processing algorithms. It is easily maintainable and well documented; therefore it can be (and has already been) easily adapted to new requirements and different projects. The Reading People Tracker contains the necessary functionality to read video sequences from hard disk or a video camera (IEEE1394/DV), to manipulate the images with image filters and to analyse them with a number of detection and tracking modules. The Reading People Tracker as it exists today was developed by Nils T Siebel in the European Framework V research project ADVISOR. It is based on the Leeds People Tracker, which was developed by Adam Baumberg. Starting from there, it took 3 years of work on software and algorithms to develop what is now called the "Reading People Tracker". Now that the ADVISOR project has finished, the Reading People Tracker is still being maintained and available for download. Recent code changes include better support for newer compilers and a number of bugfixes. There has also been some support by the community in the form of bugfixes and a few small additions. This has increased stability and ease of use. The tracking functionality itself has not changed significantly since release 1.25 in 2003. However, an update to the newest version is strongly recommended for all current users.

Thus, in order for an automated visual surveillance system to operate effectively, it must locate and track people reliably and in realtime. The Reading People Tracker achieves this by combining four detection and tracking modules, each of medium to low complexity. As we said before, the Reading People Tracker can work either standalone or as a subsystem of the ADVISOR system. The focus here is on tracking, specifically on how a number of detection and tracking algorithms can be combined to achieve robust tracking of people in an indoor environment.

Automated visual surveillance systems have to operate in realtime and with a minimum of hardware requirements, if the system is to be economical and scalable. This limits the complexity of models that can be used for detection and tracking. Any attempt at designing a People Tracker for a surveillance system like ADVISOR therefore has to consider the realtime aspect during algorithm design.

Figure 5.2 shows the overall system layout, with individual subsystems for tracking, detection and analysis of events, together with storage and human-computer interface subsystems to meet the needs of the surveillance system operators.

Figure 5.2: People Tracking as one of six subsystems of ADVISOR

Each of these subsystems is designed to run in realtime on off-the-shelf PC hardware, with the ability to process video input from a number of cameras simultaneously. The connections between the subsystems are realised by Ethernet. Images are transferred across the network using JPEG image compression. Other data, such as the output of the People Tracker and the results of the Behaviour Analysis, are represented in XML formats defined by a number of XML Schemas.

The Reading People Tracker has been designed to run either as a subsystem of ADVISOR or in standalone mode. For its design, four detection and tracking modules of medium to low complexity have been chosen, improved and combined in a single tracking system.

Originally based on the Leeds People Tracker, the most important of the four modules is a slightly modified version of Adam Baumberg's Active Shape Tracker. The people tracker has been modified over time to increase tracking robustness, and adapted for use in ADVISOR. The tracker was ported from an SGI platform to a PC running GNU/Linux to facilitate economical system integration.

The People Tracking subsystem itself comprises four modules which cooperate to create the overall tracking output, aiding each other to increase tracking robustness and to overcome the limitations of the individual modules. The following passages will focus on


Figure 5.3: Overview of the four modules of the Reading People Tracker

those aspects of the tracking algorithms which are special to this system.

Figure 5.3 shows an overview of how the four modules comprise the People Tracker.

Here is a brief overview of the functionalities of the individual modules.

Motion Detector: This module models the background as an image with no people in it. The Background Image is subtracted pixelwise from the current video image and thresholded to yield the binary Motion Image. Regions with detected moving blobs are then extracted and written out as the output from this module.

Region Tracker: The Regions output by the Motion Detector are tracked over time by the Region Tracker. This includes region splitting and merging using predictions from the previous frame.

Head Detector: The Head Detector examines the areas of the binary Motion Image which correspond to moving regions tracked by the Region Tracker. The topmost points of the blobs in these region images that match certain criteria for size are output as possible positions of heads in these Regions.

Active Shape Tracker: This module uses an active shape model of 2D pedestrian outlines in the image to detect and track people. The initialisation of contour shapes is done from the output of the Region Tracker and the Head Detector.
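The pixelwise subtraction and thresholding performed by the Motion Detector can be illustrated with the following minimal sketch, which assumes grayscale images stored as integer matrices; it is only an illustration and not the Reading People Tracker's actual implementation.

```java
// Simplified illustration of background subtraction: the background image is
// subtracted pixelwise from the current frame and the absolute difference is
// thresholded to produce the binary Motion Image. Grayscale images are assumed,
// stored as int[height][width] with values in [0, 255].
public final class BackgroundSubtraction {

    public static boolean[][] motionImage(int[][] frame, int[][] background, int threshold) {
        int h = frame.length, w = frame[0].length;
        boolean[][] motion = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // A pixel is marked as moving if it differs enough from the background model.
                motion[y][x] = Math.abs(frame[y][x] - background[y][x]) > threshold;
            }
        }
        return motion;
    }
}
```

In practice, a threshold that is too low lets noise through to the Frame Processing layer, while one that is too high suppresses genuine motion; this is where the tuning phase mentioned earlier becomes necessary.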

The main goal of using more than one tracking module is to make up for deficiencies in the individual modules, thus achieving a better overall tracking performance than a single module could provide. Of course, when combining the information from different modules, it is important to be aware of the main sources of error for those modules. If two modules are subject to the same type of error, then there is little benefit in combining

the outputs. The new People Tracker has been designed keeping this aspect in mind, and using the redundancy introduced by the multiplicity of modules in an optimal manner.

These are the main features of the system:

• interaction between modules to avoid non- or mis-detection

• independent prediction in the two tracking modules, for greater robustness

• multiple hypotheses during tracking to recover from lost or mixed-up tracks

• all modules have camera calibration data available for their use

• through the use of software engineering principles for the software design, the system is scalable and extensible (new modules...) as well as highly maintainable and portable

Each of the modules has access to the output from the modules run previously, and to the long-term tracking history which includes past and present tracks, together with the full tracking status (visibility, type of object, whether it is static etc). For each tracked object, all measurements, tracking hypotheses and predictions generated by different modules are stored in one place.
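A minimal sketch of how such a shared per-object record could look is given below; the field names are placeholders chosen for illustration and do not mirror the actual Reading People Tracker data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative per-object tracking record: every module appends its measurements,
// hypotheses and predictions here, so that later modules (and later frames) can
// reuse them. Field names are placeholders, not the actual RPT classes.
class TrackedObject {
    enum Type { PERSON, PACKAGE, CAR, UNKNOWN }

    final int id;
    Type type = Type.UNKNOWN;
    boolean visible = true;          // currently visible in the scene?
    boolean isStatic = false;        // e.g. a package that has been put down
    final List<double[]> measurements = new ArrayList<>(); // observed positions per frame
    final List<double[]> predictions = new ArrayList<>();  // predicted positions per frame
    final List<String> hypotheses = new ArrayList<>();     // alternative track hypotheses

    TrackedObject(int id) { this.id = id; }
}
```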

So, for all the reasons described above, the Reading People Tracker has been chosen as the Image Processing Library of our prototype implementation. It is very easy to use: it takes the frame sequence of the video as input and returns an XML file describing the low level annotations captured in each frame, according to a standard schema defined in an XML Schema. We only made a few updates to the RPT's source code, in order to more easily obtain the type of each object detected in a frame (person, package, car). For instance, figure 5.5 shows the low level annotations associated with frame number 18 (figure 5.4) of a video belonging to the ITEA - CANDELA dataset (http://www.multitel.be/˜va/candela/abandon.html), which has been used to make some preliminary experiments.

As we can see in figure 5.5, the RPT correctly identifies two objects in the frame shown in figure 5.4: the former, identified by ID = 5, is a person, while the latter, identified by ID = 100, is a package.

However, we manually filtered out some errors found in the low level annotations, in order to enable a more reliable, correct and, above all, unconditional experimentation of


Figure 5.4: A video frame from ITEA-CANDELA dataset

Figure 5.5: The related low level annotations

our Unexplained Activity Detector on real-world datasets, which is the main goal of this thesis.

5.1.1.2 The Video Labeler

As we said before, the Video Labeler fills the semantic gap between the low level annotations captured for each frame and the high level annotations. So, through the Video Labeler, the observation IDs (as defined in chapter 3) with the related action symbols and timestamps are detected; thus, the output of the Video Labeler is the observation sequence related to the considered video source.

The Video Labeler has been implemented in the Java programming language: it uses the DOM libraries to parse the XML file containing the output of the Image Processing Library. The Video Labeler defines the rules that have to be checked to verify the presence of each high level atomic event of interest in the video. So, a Java method has been defined for each action symbol we want to detect, containing the related rules.
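The following sketch shows how such a labeling step could be structured around the standard Java DOM API; the element and attribute names ("frame", "number") are illustrative placeholders rather than the actual schema of the RPT output, and the rule methods are left as stubs, since the rules they would encode are listed next.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch of a Video Labeler skeleton: the XML produced by the Image Processing
// Library is parsed with the standard DOM API and each frame is passed to the
// rule-checking methods, one per action symbol. Element and attribute names are
// illustrative placeholders, not the actual schema used by the prototype.
public class VideoLabelerSketch {

    public List<String> label(File lowLevelAnnotations) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(lowLevelAnnotations);

        List<String> observationSequence = new ArrayList<>();
        NodeList frames = doc.getElementsByTagName("frame");
        for (int i = 0; i < frames.getLength(); i++) {
            Element frame = (Element) frames.item(i);
            // One rule-checking method per action symbol (see the rules listed below).
            if (detectsSymbolC(frame)) {
                observationSequence.add(frame.getAttribute("number") + ",C");
            }
            // ... further action symbols (A, B, D, E, F) are checked analogously.
        }
        return observationSequence;
    }

    private boolean detectsSymbolC(Element frame) {
        // Stub: the actual rules are described in the text below.
        return false;
    }
}
```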

Listed below are some examples of the rules defined to detect some atomic events (action symbols) in a video belonging to the ITEA-CANDELA dataset.

Action Symbol A: A person P goes into the central zone with the package

• There are at least two objects in the current frame

• At least one of the objects is a person

• At least one of the objects is a package

• The person identified appears on the scene for the first time

• The distance between the person's barycenter and that of the package is smaller than a given distance threshold

Action Symbol B: A person P leaves the package

• There are at least two objects in the current frame

• At least one of the objects is a person

• At least one of the objects is a package

• The person was previously holding a package

• The distance between the person's barycenter and that of the package is smaller than a given distance threshold

Action Symbol C: A person P goes into the central zone

• There is at least one object in the current frame

• At least one of the objects is a person

• The person identified appears on the scene for the first time

5.1 Video analysis 59

• If there are also some packages on the scene, their distances from the person are greater than a given distance threshold

Action Symbol D: A person P takes the package

• There are at least two objects in the current frame

• At least one of the objects is a person

• At least one of the objects is a package

• The distance between the person's barycenter and that of the package is smaller than a given distance threshold

• The person was not previously holding a package

Action Symbol E: A person P1 gives the package to another person P2

• There are at least three objects in the current frame

• At least two of the objects are persons

• At least one of the objects is a package

• P1 was previously holding a package

• In the current frame, the distances of both P1's and P2's barycenters from the package are smaller than a given distance threshold

• In the following frames, P1's distance from the package is greater than the threshold, while P2's is smaller (meaning that P2 has taken the package and P1 is no longer holding it)

Action Symbol F: A person P goes out of the central zone with the package

• This symbol is detected when a person holding a package does not appear anymore on the scene for a specified TTL

Thus, the output of the Video Labeler is the list of the action symbols detected in the video with the related timestamps; it is encoded in comma-separated value format.
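To illustrate how the rules above translate into code, the sketch below checks the conditions for Action Symbol D ("a person P takes the package") on the objects detected in a single frame; the DetectedObject type, the distance threshold and the set of persons already holding a package are placeholder constructs, not the prototype's actual classes.

```java
import java.util.List;
import java.util.Set;

// Illustrative check for Action Symbol D ("a person P takes the package") on the
// objects detected in a single frame. DetectedObject, the distance threshold and
// the set of person IDs already holding a package are placeholder constructs.
public final class ActionSymbolD {

    record DetectedObject(int id, String type, double barycenterX, double barycenterY) {}

    public static boolean detect(List<DetectedObject> frameObjects,
                                 Set<Integer> personsHoldingPackage,
                                 double distanceThreshold) {
        if (frameObjects.size() < 2) return false; // at least two objects in the frame
        for (DetectedObject person : frameObjects) {
            if (!person.type().equals("person")) continue;
            if (personsHoldingPackage.contains(person.id())) continue; // not previously holding a package
            for (DetectedObject pack : frameObjects) {
                if (!pack.type().equals("package")) continue;
                double dx = person.barycenterX() - pack.barycenterX();
                double dy = person.barycenterY() - pack.barycenterY();
                // The person's barycenter is close enough to the package's barycenter.
                if (Math.hypot(dx, dy) < distanceThreshold) return true;
            }
        }
        return false;
    }
}
```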

5.1.1.3 The Activity Detection Engine

An Activity Detection Engine finds activity occurrences matching the well-known models: such a module takes as inputs the observation sequence previously produced by the Video Labeler and the stochastic activity models. Since the activity models must follow the schema defined in chapter 3, a specific piece of software able to detect instances matching such models in time-stamped observation data has been used. This software, called tMAGIC, is the implementation of the theoretical model presented in [54].

As a matter of fact, the approach of [54] addresses the problem of efficiently detecting occurrences of high-level activities from such interleaved data streams. It proposes a temporal probabilistic graph, so that the elapsed time between observations also plays a role in defining whether a sequence of observations constitutes an activity. First, a data structure called temporal multiactivity graph is proposed to store multiple activities that need to be concurrently monitored. Then, an index called Temporal Multi-Activity Graph Index Creation (tMAGIC) is defined that, based on this data structure, examines and links observations as they occur. Algorithms for insertion and bulk insertion into the tMAGIC index are also defined and shown to be efficient. In this approach, the algorithms basically solve two problems: the evidence problem, which finds all occurrences of an activity (with probability over a threshold) within a given sequence of observations, and the identification problem, which finds the activity that best matches a sequence of observations. Complexity-reducing restrictions and pruning strategies are introduced to make the problem, which is intrinsically exponential, linear in the number of observations. It is demonstrated that tMAGIC has time and space complexity linear in the size of the input, and can efficiently retrieve instances of the monitored activities.
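To give a rough feel for the kind of model involved, the following is a heavily simplified sketch of a temporal probabilistic activity graph and of the probability of a candidate occurrence, computed as the product of edge probabilities subject to temporal gap constraints; it only illustrates the idea summarized above and is in no way the tMAGIC index or the algorithms of [54].

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Heavily simplified sketch of a temporal probabilistic activity model: nodes are
// action symbols and each edge carries a probability together with an allowed range
// for the elapsed time between the two observations. For illustration only; this is
// not the tMAGIC index of [54].
public class ActivityGraphSketch {

    record Edge(String from, String to, double probability, long minGap, long maxGap) {}
    record Observation(String actionSymbol, long timestamp) {}

    private final Map<String, Map<String, Edge>> edges = new HashMap<>();

    public void addEdge(Edge e) {
        edges.computeIfAbsent(e.from(), k -> new HashMap<>()).put(e.to(), e);
    }

    /**
     * Probability that a candidate sequence of observations is an occurrence of the
     * activity: the product of the edge probabilities, provided every consecutive pair
     * is connected and its temporal gap falls inside the allowed range.
     */
    public double occurrenceProbability(List<Observation> candidate) {
        double p = 1.0;
        for (int i = 1; i < candidate.size(); i++) {
            Observation prev = candidate.get(i - 1), cur = candidate.get(i);
            Edge e = edges.getOrDefault(prev.actionSymbol(), Map.of()).get(cur.actionSymbol());
            long gap = cur.timestamp() - prev.timestamp();
            if (e == null || gap < e.minGap() || gap > e.maxGap()) return 0.0;
            p *= e.probability();
        }
        return p;
    }
}
```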

Thus, the outputs of the Activity Detection Engine are the well-known activity occurrences matching the defined models and their probabilities: both are efficiently computed by the tMAGIC software.


5.1.1.4 The UAP Engine

As described in the previous chapters, the UAP Engine takes as input the activity occurrences previously found by the Activity Detection Engine, with the associated probabilities, and the observation sequence, and finally discovers the Unexplained Video Activities. This module has been developed in the Java programming language and provides the implementations of Algorithms Top-k TUA, Top-k PUA and Top-k TUC. As shown in figure 5.1, the Java module uses dedicated APIs to interact with a linear and a non-linear program solver. More in detail, the QSopt library has been used for solving linear programs and the Lingo library for non-linear ones.

Subsections 5.1.2 and 5.1.3 present the experimental evaluations we carried out in the video surveillance domain. In particular, we evaluated our framework on two video datasets: (i) a video we shot by monitoring a university parking lot, and (ii) a benchmark dataset about video surveillance in a train station [55]. The frame observations have been generated in a semi-automatic way, using both the prototype implementation we have just described and a few human interventions. We note that identifying frame observations via the development of image processing algorithms is an extremely challenging task: the goal of our work is to present a domain-independent way of identifying unexplained activities that builds upon domain-specific ways of recognizing actions in observation sequences. In contrast to the difficulty of detecting actions in video, in cyber-security it is easy to identify actions in an observation sequence, as they can simply be logged.

5.1.2 Parking lot surveillance video

The set A includes “known” normal activities such as parking a car, people passing, a person getting in a car and leaving the parking lot, and abnormal activities such as a person taking a package out of the car and leaving it in the parking lot before driving away, or a person taking an unattended package in the parking lot. For instance, a model of a well-known normal activity is shown in figure 5.6.

Examples of detected unexplained activities are two cars stopping next to each other in the middle of the parking lot with the drivers exchanging something before leaving the parking lot, or a person strolling around a car for a while before leaving the parking lot.

We compared Algorithms Top-k TUA and Top-k PUA against "naïve" algorithms
