Advanced algorithms for audio and image processing

Figure 5.7 Examples of classes of the ACIVW Dataset. From the top: razor, hair dryer, vacuum cleaner, shopping cart, traffic. Left: RGB frame, center: acoustic energy map overlaid on the acoustic frame, right: single microphone spectrogram.

5.7 ACIVW Performance Analysis

We show the confusion matrices of one run for the three supervised models, namely DualCamNet, HearNet and ResNet18, in Figures 5.8, 5.9 and 5.10, respectively. Both DualCamNet and HearNet classify well, as shown by their largely diagonal confusion matrices; DualCamNet only confuses drill with razor and vice versa. ResNet18, on the other hand, is often confused when classifying the classes drone, fountain, drill and, above all, hair dryer. This is because we collected our dataset in real scenarios, where the objects are not always easy to detect due to occlusions, distracting details in the scene and small object size (for example, a drone). In other cases the object is visually hard to classify because its appearance changes considerably from one video to another (for instance, the fountain), while hair dryer and drill sometimes appear alone in the scene and sometimes are used by a person.

In fact, as shown in Table 2 (top box) in the main chapter, we can classify both spectrograms and acoustic images very well, while video classification is more challenging.

Figure 5.8 Supervised DualCamNet confusion matrix.


Figure 5.10 Supervised ResNet18 confusion matrix.

5.8 Implementation Details

5.8.1 Data Preparation

We implemented all of our networks and our data processing pipeline using TensorFlow. In particular, we stored our dataset in multiple compressed TFRecord files, each of which contains 1 second of synchronized data from the three modalities: video images, raw audio waveforms and acoustic images. We use the tf.data API to retrieve these data and compose 2.0 s sequences at runtime, grouping contiguous TFRecord files into full audio-video sequences.
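As an illustration of this pipeline, the sketch below builds a tf.data input stream that parses compressed TFRecords and groups pairs of contiguous 1 s records into 2.0 s training sequences. The feature keys, tensor encodings and compression type are our assumptions, since the exact record layout is not specified here.

```python
import tensorflow as tf

# Assumed record layout: each compressed TFRecord holds 1 s of synchronized
# data; the feature keys and serialization are illustrative, not the thesis'.
FEATURES = {
    "video": tf.io.FixedLenFeature([], tf.string),
    "audio": tf.io.FixedLenFeature([], tf.string),
    "acoustic_image": tf.io.FixedLenFeature([], tf.string),
}

def parse_record(serialized):
    """Decode one 1 s record into its three modalities."""
    ex = tf.io.parse_single_example(serialized, FEATURES)
    video = tf.io.parse_tensor(ex["video"], out_type=tf.float32)
    audio = tf.io.parse_tensor(ex["audio"], out_type=tf.float32)
    acoustic = tf.io.parse_tensor(ex["acoustic_image"], out_type=tf.float32)
    return video, audio, acoustic

def make_dataset(file_pattern, batch_size=32):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    ds = tf.data.TFRecordDataset(files, compression_type="GZIP")
    ds = ds.map(parse_record, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(2, drop_remainder=True)          # two contiguous 1 s records -> 2.0 s
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)
```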

5.8.2 Hyper-parameters

We cross-validated the learning rate, the number of epochs and the margin, and we chose the largest batch size that could fit in the available GPU memory. We then employed the following hyper-parameters throughout all the experiments: learning rate $10^{-5}$, margin m = 0.2, 20 epochs and batch size 64 for the self-supervised case; learning rate $10^{-4}$, 100 epochs and batch size 32 for the supervised case and for [78], where we also chose α = 0.5 and T = 1, as indicated therein. We compare our results with [86], trained with learning rate $10^{-3}$, 100 epochs and batch size 16, and we also trained its audio and video sub-networks separately for 100 epochs with batch size 32 and learning rates $10^{-5}$ and $10^{-6}$, respectively. Results are averaged over 5 runs.


5.8.3 k-NN

For the k-NN classification accuracy results, we cross-validated the number of nearest neighbours k, considering odd values between 7 and 15.

5.8.4 Triplet Loss Margin

We cross-validated the margin m, choosing it among {0.2, 0.5, 1.0, 1.5}. m = 0.2 was the best performing option in the case of no distillation, and it is also the value usually chosen as default for the triplet loss [94]. We kept it fixed to m = 0.2 for the second setup as well, i.e. when distilling from DualCamNet.

5.8.5 Knowledge distillation

To perform distillation, we consider a pre-trained acoustic image network. This model was trained in advance, in a self-supervised manner, together with the video network using the correspondence pretext task. We restored the teacher model corresponding to the epoch in which the acoustic image network had the best classification results on the validation set. This was done separately for each of the five runs of distillation.

5.9 Cross-modal Retrieval

We show some retrieval examples in Figures 5.11 and 5.12. The first sample on the left, marked by a note, is the audio embedding; the images on its right, from the second column on, are the corresponding retrieved RGB images, ordered by increasing distance from k = 1 to k = 5. Samples framed in green belong to the same class, in blue to the same video, and in red to a different class. Given an audio embedding (of a single-microphone audio or of an acoustic image), we retrieve the corresponding video frame by matching the closest audio-visual embedding. We cannot do the opposite, because the audio-visual embedding is a function of the audio and cannot be computed without its information.

Results for each class are shown in two rows: we plot the first 5 retrieved images for one audio embedding, considering the acoustic image in the first row and the single-microphone audio in the second row.

We are also able to retrieve RGB frames from different clips of the same class, not only samples which belong to the same video.


5.9.1 Input data

For the three modalities we consider temporal windows of 2.0 s, which represent a good compromise between information content and computational load.

Monaural Audio. The audio amplitude spectrogram is obtained from a 2 s audio waveform, upsampled to 22 kHz, by computing the Short-Time Fourier Transform (STFT) with a window length of 20 ms and half-window overlap. This produces 200 windows with 257 frequency bands. The resulting spectrogram is interpreted as a 200 × 1 × 257 dimensional signal, so that the frequency bands can be treated as the number of channels in the convolutions, as detailed in Figure 5.4.
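A minimal sketch of this preprocessing step is given below, assuming a 22050 Hz sample rate and a 512-point FFT (which yields 257 bins); the exact frame and hop sizes used here may differ slightly from the original implementation.

```python
import tensorflow as tf

def waveform_to_spectrogram(waveform, sample_rate=22050):
    """Amplitude spectrogram of a 2 s waveform, shaped (200, 1, 257)."""
    frame_length = int(0.020 * sample_rate)   # 20 ms window -> 441 samples
    frame_step = frame_length // 2            # half-window overlap
    stft = tf.signal.stft(waveform,
                          frame_length=frame_length,
                          frame_step=frame_step,
                          fft_length=512,
                          pad_end=True)
    spectrogram = tf.abs(stft)                # amplitude, not power
    spectrogram = spectrogram[:200, :]        # keep 200 time windows
    return spectrogram[:, tf.newaxis, :]      # (time, 1, frequency channels)
```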

Acoustic images. Acoustic images are generated with the frequency-domain implementation of the filter-and-sum beamforming algorithm [70]. They are volumes of size 36 × 48 × 512 (36 × 48 being the image size, with 512 frequency channels). These channels correspond to the frequency bins discretizing the frequency content of each pixel. A more comprehensive description of acoustic image generation can be found in [92]. However, handling acoustic images with 512 channels is computationally expensive, and most of the useful information is typically contained in the low frequencies. Consequently, we compressed the acoustic images along the frequency axis using Mel-Frequency Cepstral Coefficients (MFCC), which take human auditory perception into account [95]. We thus compute 12 MFCC, compressing from 512 to 12 channels, preserving most of the information while considerably reducing the computational complexity and the required memory.
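The following sketch shows one way to carry out this compression with TensorFlow's signal package; the mel filterbank size, the sample rate and the treatment of the 512 channels as a linear-frequency spectrum per pixel are our assumptions.

```python
import tensorflow as tf

def compress_acoustic_image(acoustic_image, sample_rate=12800, num_mfcc=12):
    """Compress a (36, 48, 512) acoustic image to (36, 48, 12) MFCC channels."""
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=64,                 # illustrative filterbank size
        num_spectrogram_bins=512,
        sample_rate=sample_rate)
    mel = tf.tensordot(acoustic_image, mel_matrix, axes=1)      # (36, 48, 64)
    log_mel = tf.math.log(mel + 1e-6)
    mfcc = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)
    return mfcc[..., :num_mfcc]                                  # keep 12 coefficients
```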

RGB video. RGB frames are 224 × 298 × 3 volumes obtained by scaling the original 360 × 480 × 3 video frames, preserving the original aspect ratio. The images are then normalized by subtracting the ImageNet mean [96].
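A corresponding preprocessing sketch is shown below; the ImageNet mean values are the commonly used ones and serve only as an example.

```python
import tensorflow as tf

IMAGENET_MEAN = tf.constant([123.68, 116.779, 103.939])  # commonly used RGB mean values

def preprocess_frame(frame):
    """Scale a 360x480x3 RGB frame to 224x298x3 and subtract the ImageNet mean."""
    frame = tf.image.resize(tf.cast(frame, tf.float32), [224, 298])
    return frame - IMAGENET_MEAN
```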

5.9.2 Single data stream models

The architectures chosen for self-supervised learning are depicted in Figure 5.4: ResNet18 for RGB frames [97], DualCamNet for acoustic images [78], and HearNet [98] for the single audio signal. All networks were slightly modified for our purposes, as illustrated in Figure 5.4. We consider ResNet18 since it can be trained from scratch on our dataset at relatively low computational cost and without the risk of overfitting. In fact, we do not want to rely on ImageNet pre-trained models, to avoid employing labels at all. HearNet draws inspiration from [98] and [78], but it has been modified to handle our sampling time interval of 2.0 s (instead of 5.0 s). This network takes a spectrogram as input and has a limited size, again making it suitable to be trained from scratch.


We cut ResNet18 and DualCamNet in order to obtain feature volumes and then compute a similarity score map between them via point-wise multiplication. In particular, we modified the original ResNet18 by removing the 4th block and the final average pooling, adding instead a 2D convolution at the end of the network. The feature volumes keep the same spatial relationships as the original RGB image and the acoustic image: these maps are proportional to the 224 × 298 RGB image and to the 36 × 48 acoustic image.

From HearNet we cannot obtain spatial feature maps, since the signal is one-dimensional as explained in Subsection 5.9.1, but only a 128D vector after 2 fully connected layers. In order to obtain the same similarity map as above, we propose to tile the audio feature vector along the two spatial dimensions, to match the dimensions of the video feature map. This then allows multiplying the two maps in a point-wise fashion, as in the case of the acoustic image.

To obtain baselines, ResNet18 and DualCamNet can be trained in a supervised way (Table 5.3 - top) by adding a simple average pooling layer followed by a fully connected layer. HearNet also requires adding one fully connected layer for supervised training, in order to match the number of classes.

5.9.3 Pretext task

We propose the self-supervised training procedure depicted in Fig. 5.5: we employ 2 trainable streams, from which 2 feature map volumes are extracted. We obtain a similarity score map by multiplying component-wise the 12 × 16 × 128 acoustic-image (or audio) and video feature maps. For monaural audio, the feature cuboid is obtained by replicating the 128D vector produced by HearNet at each spatial location. This is different from both [87; 89], as they use a dot product at each spatial location to obtain a scalar map. Keeping the original depth in the similarity score map allows retaining more information about the input.

The output of the architecture is one audio-visual feature vector and one audio vector, obtained either from DualCamNet or from HearNet. The former is a 128D vector obtained by a second element-wise product between the similarity score map and the video feature map itself, followed by a sum along the two spatial dimensions. This corresponds to a weighted sum, where the weights come from the similarity map. The 128D audio feature vector, instead, is obtained by a sum along the two spatial dimensions of the acoustic feature map in the case of DualCamNet, while no sum is needed in the case of HearNet, since it already outputs a 128D vector.
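The sketch below summarizes this fusion step under the shapes stated above; function and variable names are illustrative.

```python
import tensorflow as tf

def audio_visual_embedding(video_map, audio_feat):
    """Fuse a (B, 12, 16, 128) video feature map with either a (B, 12, 16, 128)
    acoustic-image feature map (DualCamNet) or a (B, 128) vector (HearNet)."""
    if audio_feat.shape.rank == 2:                       # HearNet: tile over space
        audio_embedding = audio_feat
        tiled = tf.tile(audio_feat[:, None, None, :], [1, 12, 16, 1])
    else:                                                # DualCamNet: sum over space
        audio_embedding = tf.reduce_sum(audio_feat, axis=[1, 2])
        tiled = audio_feat
    similarity = video_map * tiled                       # component-wise similarity map
    # Weighted sum of the video map, with weights given by the similarity map.
    av_embedding = tf.reduce_sum(similarity * video_map, axis=[1, 2])
    av_embedding = tf.math.l2_normalize(av_embedding, axis=-1)
    audio_embedding = tf.math.l2_normalize(audio_embedding, axis=-1)
    return av_embedding, audio_embedding
```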

The two feature vectors are normalized and fed to a triplet loss [94]:

$$TL(X, Y) = \sum_{i=1}^{N} \left[\, \big\| f_X(x_i^a) - f_Y(y_i^p) \big\|_2^2 \;-\; \big\| f_X(x_i^a) - f_Y(y_i^n) \big\|_2^2 + m \,\right]_+$$


where $[f(x)]_+$ represents $\max(f(x), 0)$, m is the margin between positive and negative pairs, and $f(x_i)$ are the normalized feature vectors. The triplet loss aims to separate the positive pairs from the negative ones by a distance margin. This is obtained by minimizing the distance between an anchor a and a positive p, both of which have the same identity, and maximizing the distance between the anchor a and a negative n of a different identity. In our case, we want an audio embedding $f_X(x_i^a)$ to have a small squared distance from video embeddings $f_Y(y_i^p)$ from the same clip, and a large one from video embeddings $f_Y(y_i^n)$ obtained from a different clip.

It is crucial to select triplets carefully to make the networks learn. In particular, we exploit a curriculum learning scheme [99]: in the first epochs we use all the triplets that contribute to the training (i.e., with a loss greater than zero), and later on only the hardest triplets: for each anchor, we select the hardest negative as the one with the smallest distance from the anchor, and the hardest positive as the one with the largest distance.
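A simplified batch-wise version of this mining strategy is sketched below, assuming one positive (same-clip) pair per anchor in the batch; it is an illustration, not the exact training code.

```python
import tensorflow as tf

def batch_triplet_loss(audio_emb, video_emb, margin=0.2, hardest=False):
    """Triplet loss over a batch of L2-normalized (B, 128) embeddings.

    Row i of audio_emb and video_emb comes from the same clip, so the
    diagonal of the distance matrix holds the positive pairs.
    """
    diff = audio_emb[:, None, :] - video_emb[None, :, :]
    d = tf.reduce_sum(tf.square(diff), axis=-1)            # (B, B) squared distances
    pos = tf.linalg.diag_part(d)
    off_diag = 1.0 - tf.eye(tf.shape(d)[0])
    if hardest:
        # Hardest negative: closest different-clip video embedding.
        neg = tf.reduce_min(d + (1.0 - off_diag) * 1e9, axis=1)
        loss = tf.maximum(pos - neg + margin, 0.0)
    else:
        # All triplets that still contribute a non-zero loss.
        loss = tf.maximum(pos[:, None] - d + margin, 0.0) * off_diag
    return tf.reduce_mean(loss)
```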

5.9.4 Knowledge distillation

Distillation is carried out by exploiting a self-supervised pre-trained DualCamNet, as depicted in the right part of Fig. 5.5. To this end, we exploit an additional triplet loss between the single-audio and acoustic-image embedding vectors, which we name Transfer Loss, $L_t = TL(H, D)$, where H stands for HearNet, D for DualCamNet and TL is the triplet loss. This loss tries to transfer the effective embeddings learned with DualCamNet to the monaural audio model.

The total loss is thus the weighted sum of the triplet loss between HearNet and ResNet18 embeddings and the transfer loss, which is calculated between the (previously trained) DualCamNet and HearNet embedding vectors:

$$L_{TOT} = \alpha L_t + (1 - \alpha)\, TL(H, R) \qquad (5.2)$$

where 0 ≤ α ≤ 1, H is HearNet and R is ResNet18.

The imitation parameter α measures how much the HearNet features will resemble the DualCamNet features. Note that in the limit case α = 0 we fall back to the standard self-supervised case with no knowledge transfer. We consider different values of the imitation parameter α to assess how much we have to weight the two losses.
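Using the batch triplet loss sketched in Section 5.9.3, the total objective of Eq. (5.2) can be written as below; the frozen DualCamNet teacher embeddings are assumed to be precomputed.

```python
def total_loss(hearnet_emb, resnet_emb, dualcam_emb, alpha=0.5, margin=0.2):
    """Weighted sum of transfer loss L_t = TL(H, D) and audio-visual loss TL(H, R)."""
    transfer = batch_triplet_loss(hearnet_emb, dualcam_emb, margin)      # teacher: DualCamNet
    audio_visual = batch_triplet_loss(hearnet_emb, resnet_emb, margin)   # pretext-task loss
    return alpha * transfer + (1.0 - alpha) * audio_visual
```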


5.10 Experiments

Classification and cross-modal retrieval are the downstream tasks used to evaluate the quality and generalization capability of the features learned with the proposed approach.

We compare our self-supervised method with the supervised distilled audio model [78] and with L3Net [86]. The features considered for our work and for [78] are 128D.

The correspondence accuracy of L3Net on our dataset is 0.8386 ± 0.0035. We consider both the self-supervised audio and video sub-networks obtained from L3Net to extract features, as well as audio and video supervised models trained separately by adding, after the final 512D feature vectors, a fully connected layer with size equal to the number of classes. Both the supervised and self-supervised models of L3Net [86] have 512D features. Training details are in the supplementary material.

5.10.1 Cross-modal retrieval

The goal of cross-modal retrieval is to choose one audio sample and search for the corresponding video frames of the same class. The audio sample comes either in the form of an acoustic image or of a spectrogram; we specify which one whenever needed. Given an audio sample, the corresponding audio-visual embeddings are ranked based on their distance in the common feature space. Rank K retrieval performance measures whether at least one sample of the correct class falls in the top K neighbours.

Fixing a query audio, we can compute audio-visual embeddings for any given video sample, while we cannot fix the audio-visual embedding, because its value will be different for different audio inputs. Thus, we perform cross-modal retrieval only from audio to images and not vice versa.
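A simplified sketch of the Rank-K (CMC) evaluation is given below; it assumes a precomputed gallery of audio-visual embeddings, whereas in the actual protocol the gallery is recomputed as a function of each query audio.

```python
import numpy as np

def cmc_scores(query_emb, gallery_emb, labels, ranks=(1, 2, 5, 10, 30)):
    """Fraction of queries whose top-K retrieved samples contain the correct class.

    query_emb, gallery_emb: (N, 128) arrays; labels: (N,) class labels.
    """
    # Pairwise Euclidean distances between queries and gallery items.
    d = np.linalg.norm(query_emb[:, None, :] - gallery_emb[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)                  # nearest gallery items first
    hits = labels[order] == labels[:, None]        # class match, in ranked order
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ranks}
```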

Results are presented in Table 5.2 and refer to the test set of the ACIVW dataset. They clearly show that the audio-visual representations learned with acoustic images (DualCamNet) are consistently better than those learned with monaural audio alone. Besides, the results are good in absolute terms, considering that random chance at Rank 1 is 10% and that the features learned with the pretext task proposed by [86] are less effective.

Table 5.2 CMC scores on ACIVW Dataset for k = 1, 2, 5, 10, 30.

Model Rank 1 Rank 2 Rank 5 Rank 10 Rank 30

DualCamNet 33.41±3.65 37.01±3.17 42.97±2.25 48.21±1.80 62.44±1.33
HearNet 28.95±2.15 34.40±3.27 42.43±4.88 48.08±5.77 61.43±4.94
L3 Audio Subnetwork [86] 9.74±0.33 11.91±4.09 24.23±9.02 26.78±8.60 30.14±10.00


5.10.2 Classification

For this task we use the trained models as feature extractors and classify the extracted features with a K-Nearest Neighbour (KNN) classifier. We consider both audio and audio-visual features, computed as explained in Subsection 5.9.3. We benchmark on the proposed ACIVW dataset as a reference and then test the generalization of the features on two additional datasets: the Audio-Visually Indicated Action Dataset (AVIA) [78] and Detection and Classification of Acoustic Scenes (DCASE - version 2018) [79]. AVIA is a multimodal dataset which provides synchronized data belonging to 3 modalities: acoustic images, audio and RGB frames. DCASE 2018 is a renowned audio dataset, containing recordings from six large European cities, in different locations for each scene class.
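For illustration, this evaluation step can be reproduced with an off-the-shelf KNN classifier as below (scikit-learn is used here only as an example; the original implementation is not specified), with k cross-validated over the odd values 7 to 15 as described in Section 5.8.3.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(train_feat, train_labels, test_feat, test_labels):
    """Classify frozen features with KNN, cross-validating the number of neighbours."""
    grid = GridSearchCV(KNeighborsClassifier(),
                        param_grid={"n_neighbors": [7, 9, 11, 13, 15]},
                        cv=5)
    grid.fit(train_feat, train_labels)
    return grid.score(test_feat, test_labels)     # test accuracy with the best k
```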

ACIVW Dataset.

Table 5.3 Accuracy results for models on ACIVW dataset. Results are averaged over 5 runs. H means obtained from HearNet model and D from DualCamNet model.

Features Training Test accuracy

L3 Audio Subnetwork [86] supervised 0.6424 ± 0.2857

HearNet supervised 0.8779 ± 0.0145

HearNet w/ transfer [78] supervised 0.8578 ± 0.0198

L3 Vision Subnetwork [86] supervised 0.4647 ± 0.0225

ResNet18 supervised 0.5123 ± 0.0521

DualCamNet supervised 0.8378 ± 0.0187

L3 Audio Subnetwork [86] self-supervised 0.3605 ± 0.0265

HearNet self-supervised w/o transfer 0.7573 ± 0.0278

HearNet self-supervised w/ transfer α = 0.1 0.7697 ± 0.0147
HearNet self-supervised w/ transfer α = 0.3 0.7896 ± 0.0092
HearNet self-supervised w/ transfer α = 0.5 0.7946 ± 0.0137
HearNet self-supervised w/ transfer α = 0.7 0.7810 ± 0.0206
HearNet self-supervised w/ transfer α = 0.9 0.7867 ± 0.0093
L3 Video Subnetwork [86] self-supervised 0.5444 ± 0.0839
Audio-visual (H) self-supervised w/o transfer 0.6670 ± 0.0446
Audio-visual (H) self-supervised w/ transfer α = 0.1 0.7061 ± 0.0496
Audio-visual (H) self-supervised w/ transfer α = 0.3 0.7144 ± 0.0223
Audio-visual (H) self-supervised w/ transfer α = 0.5 0.7125 ± 0.0200
Audio-visual (H) self-supervised w/ transfer α = 0.7 0.7191 ± 0.0285
Audio-visual (H) self-supervised w/ transfer α = 0.9 0.7322 ± 0.0070

Audio-visual (D) self-supervised 0.5837 ± 0.0468

DualCamNet self-supervised 0.7457 ± 0.0292

In Table 5.3 we report, in the top part, the fully supervised classification baseline accuracies for the single stream architectures described in Subsection 5.9.2. The bottom part lists instead the KNN classification accuracies for the models trained with the proposed self-supervised framework.

For supervised models we choose the model with the best validation accuracy and report its test performance. For self-supervised models we fix the number of iterations (20 epochs). Averages and standard deviations are computed over 5 independent runs. The results show that the videos in our dataset are quite challenging to classify. Audio models, instead, perform much better than video ones.

When training in a self-supervised manner, audio models naturally experience a drop in performance. This drop is partially recovered when training with the additional supervision of DualCamNet features: HearNet w/ transfer for α = 0.5 is indeed boosted by ∼4%. Audio-visual features, although obtained with self-supervision, are better than the visual features obtained with supervision using ResNet18. This is due to the fact that audio information can help to better discriminate the class. Also in this case the transfer is beneficial, increasing performance by ∼6% for α = 0.9. This is true also for the self-supervised video subnetwork [86], which performs better than the supervised one. This shows that when one modality is difficult to classify, self-supervision is able to improve accuracy.

Different values of the imitation parameter α ∈ {0.1, 0.3, 0.5, 0.7, 0.9} are investigated. We notice that both audio and audio-visual accuracies are always improved by the transfer, for all values of α.

In detail, our models perform better than both the supervised and self-supervised audio and video sub-networks of L3Net [86]. Our supervised audio network HearNet does not improve with distillation [78], possibly because our dataset is much more challenging than the AVIA Dataset presented in [78]. In fact, ACIVW data presents many different scenarios with different noise types, and as stated in [78], acoustic image distillation works well in cases with almost no noise.

AVIA Dataset.

Features learned on ACIVW are also tested on a public multimodal dataset containing acoustic images, namely the Audio-Visually Indicated Action Dataset (AVIA) [78]. We compare the results of the audio and audio-visual features extracted on this dataset in Table 5.4. We observe a general drop in accuracy because we are testing on a different dataset; however, in this case DualCamNet obtains the best results, proving better generalization than monaural features. The improvement brought by self-supervision w/ transfer is again confirmed. Different values of the imitation parameter α ∈ {0.1, 0.3, 0.5, 0.7, 0.9} are investigated. In particular, we notice that α = 0.5 for audio features and α = 0.9 for audio-visual features are still good values of α.

Table 5.4 Accuracy results for models trained on ACIVW dataset and tested on AVIA. Results are averaged over 5 runs. H means obtained from HearNet model and D from DualCamNet model.

Features Training Test accuracy

L3 Audio Subnetwork [86] supervised 0.3713 ± 0.0233

HearNet supervised 0.3108 ± 0.0114

HearNet w/ transfer [78] supervised 0.3556 ± 0.0181

L3 Vision Subnetwork [86] supervised 0.0287 ± 0.0013

ResNet18 supervised 0.0263 ± 0.0073

DualCamNet supervised 0.4783 ± 0.0224

L3 Audio Subnetwork [86] self-supervised 0.0571 ± 0.0175

HearNet self-supervised w/o transfer 0.4103 ± 0.0248

HearNet self-supervised w/ transfer α = 0.1 0.4393 ± 0.0097
HearNet self-supervised w/ transfer α = 0.3 0.4749 ± 0.0305
HearNet self-supervised w/ transfer α = 0.5 0.4817 ± 0.0165
HearNet self-supervised w/ transfer α = 0.7 0.4851 ± 0.0214
HearNet self-supervised w/ transfer α = 0.9 0.4592 ± 0.0271
L3 Vision Subnetwork [86] self-supervised 0.3347 ± 0.0638
Audio-visual (H) self-supervised w/o transfer 0.2660 ± 0.0309
Audio-visual (H) self-supervised w/ transfer α = 0.1 0.2759 ± 0.0163
Audio-visual (H) self-supervised w/ transfer α = 0.3 0.3200 ± 0.0204
Audio-visual (H) self-supervised w/ transfer α = 0.5 0.3070 ± 0.0294
Audio-visual (H) self-supervised w/ transfer α = 0.7 0.3091 ± 0.0351
Audio-visual (H) self-supervised w/ transfer α = 0.9 0.3162 ± 0.0310

Audio-visual (D) self-supervised 0.2927 ± 0.0234

DualCamNet self-supervised 0.5132 ± 0.0167

Self-supervised models generalize better than the supervised ones, apart from the L3 Audio Subnetwork [86]. In particular, self-supervised HearNet is more general than the one trained with distillation [78].

DCASE 2018.

In Table 5.5 we report the classification accuracies (KNN) for DCASE 2018. Features are extracted from models pre-trained on the ACIVW Dataset with supervised and self-supervised training. Self-supervised representations provide better accuracy, showing that learning from the co-occurrence of two modalities can lead to better generalization than learning from labels or with supervised distillation [78]. The transfer is useful to obtain more general features, and the best result is obtained for α = 0.3. For [86] this does not happen; however, even if its supervised model is better than its self-supervised sub-network, it still has lower accuracy than our audio models self-supervised with acoustic image transfer.

Table 5.5 Accuracy for audio models tested on DCASE 2018.

Features Training Test accuracy

L3 Audio Subnetwork [86] supervised 0.3576 ± 0.0127
HearNet w/ transfer [78] supervised 0.2989 ± 0.0106
HearNet supervised 0.3022 ± 0.0088
L3 Audio Subnetwork [86] self-supervised 0.3231 ± 0.0473
HearNet self-supervised 0.3535 ± 0.0188
HearNet self-supervised α = 0.1 0.3653 ± 0.0079
HearNet self-supervised α = 0.3 0.3757 ± 0.0094
HearNet self-supervised α = 0.5 0.3737 ± 0.0068
HearNet self-supervised α = 0.7 0.3696 ± 0.0098
HearNet self-supervised α = 0.9 0.3638 ± 0.0072

5.11 Conclusions

In this chapter, we have investigated the potential of acoustic images in a novel self-supervised learning framework, with the aid of a new multimodal dataset specifically acquired for this purpose. Evaluating the trained models on classification and cross-modal retrieval downstream tasks, we have shown that acoustic images are a powerful source of self-supervision, and that their information can be distilled into monaural audio and audio-visual representations to make them more robust and versatile. Moreover, features learned with the proposed method generalize better to other datasets than representations learned in a supervised setting. As a next step, we recorded outdoors a musical instruments dataset (9 instruments) played by musicians, containing both single instruments and pairs of instruments. We will use it in future work, aiming to distinguish two or more sounds in the same scene by associating each sound with its class in the video frame; this would improve audio-visual localization when many sound-producing objects are present in the scene.


Figure 5.11 Examples of ACIVW Dataset retrieved samples from the following classes. From top to bottom, two rows per class: train, boat, drone, fountain, drill.


Figure 5.12 Examples of ACIVW Dataset retrieved samples from the following classes. From top to bottom, two rows per class: razor, hair dryer, vacuum cleaner, shopping cart, traffic.


Chapter 6

Industrial exploitation

6.1 Introduction and Motivation

The proprietary Dual Cam technology is based on an innovative and unique sensor which generates optical-acoustic images of a scene. Included in the Dual Cam software package are advanced image analysis and Artificial Intelligence techniques that allow the automatic interpretation of the multi-modal scene (see https://www.youtube.com/watch?v=7lXsufflhkk), which is strategic in several applications: monitoring of traffic/vehicles, drone detection, industrial inspection, crime prevention and so on (Figure 6.1). The security and surveillance systems on the market mainly rely on the visual component provided by video surveillance cameras and simply store the acquired video streams on digital storage media, regardless of the presence or absence of audio events of interest in the scene. This functionality is not sufficient in all the applications that require a timely reaction from security personnel. Below is a rough quantification of the main markets of interest for the proposed innovation.

1. The global maritime surveillance market size was valued at $19.20 billion in 2018, and is projected to reach $40.61 billion by 2026, registering a CAGR of 9.5% from 2019 to 2026¹.

2. The global intelligent traffic management market was valued at $20.53 billion in 2018 and is expected to reach $40.22 billion by 2026, at a CAGR of 8.7%².

¹ https://www.alliedmarketresearch.com/press-release/maritime-surveillance-market.html

² https://www.globenewswire.com/news-release/2019/10/24/1935196/0/en/Intelligent-Traffic-Management-Market-To-Reach-USD-40-22-Billion-By-2026-Reports-And-Data.html


Figure 6.1 Potential Dual Cam markets.

3. The global video surveillance market was valued at $28.18 billion in 2017, and is projected to reach $87.36 billion by 2025, growing at a CAGR of 14.2% from 2018 to 2025³.

4. The Industry 4.0 market is estimated to be valued at $71.7 billion in 2019 and is expected to reach $156.6 billion by 2024, at a CAGR of 16.9% from 2019 to 2024⁴.

6.2 Problem/Opportunity and Potential Impact

Dual Cam can be applied in several heterogeneous security scenarios, summarized in Figure 6.2.

• Airport video surveillance: the drone market is growing fast. The Teal Group, an American military and aerospace consulting firm, has estimated that in ten years the world drone market will be worth about 91 billion US dollars. In the US alone, there are currently 325,000 registered drone pilots. Flights suspended at airports due to the presence of drones are an emerging problem with a strong economic and social impact⁵.

• Public events security: recent news events show that drones can be a security threat during public and entertainment events such as a sporting event, a concert or a political rally. A trend is therefore expected in which this type of activity will also be regulated; consequently, technologies must be adopted that allow detecting an infringement of the regulation in the most automatic way possible.

³ https://www.alliedmarketresearch.com/Video-Surveillance-market

⁴ https://www.prnewswire.com/news-releases/the-industry-4-0-market-is-estimated-to-be-valued-usd-71-7-billion-in-2019-and-is-expected-to-reach-usd-156-6-billion-by-2024--at-a-cagr-of-16-9-from-2019-to-2024--300976813.html


• Abnormal events security: another strategic sector of use for the Dual Cam sensor is the localization of anomalous events, in particular gunshots, snatches and attacks. In the US alone, there are around 81 gunshot deaths per day, and studies show that 80% of cases remain unsolved. Current security systems, typically based on the collection of high-resolution video information, do not allow real-time localization and identification of those who committed the crime, especially in public places and very crowded scenarios, as in the case of big events. The innovative Dual Cam sensor makes it possible to locate a shot using advanced artificial intelligence techniques and to instantly collect visual information in the direction in which the shot was detected.

• Television rights protection: Dual Cam is also a system to protect a public event recorded for television from the indiscreet eye of a drone. Legislation is slowly adapting, but not as fast as the spread of drones.

• Virtual microphone: this consists in the possibility of selecting one or more arbitrary listening directions, in real time or a posteriori, and spatially filtering the sound in order to isolate the sound source of interest, canceling or strongly attenuating all noise sources coming from other directions. This feature makes it possible to clearly hear a person speaking, even if he/she is at a considerable distance from the device and the environment is noisy (such as a busy road). The virtual microphone eliminates the need to choose a direction or a listening area in advance, unlike directional microphones or bugs. The operator can have an arbitrary number of listening directions and can use the visual information offered by the camera to decide the desired listening direction in real time, possibly modifying it if the person of interest moves in the scene during the acquisition. There is also the possibility of offering features such as real-time control of the pointing direction and noise filtering through a dedicated app on a tablet or smartphone (application in Manufacturing 4.0).

6.3 Technology, Product and IPR

The Dual Cam sensor provides a dual image stream, which is by its nature complementary: a video image stream (which can be optical, infra-red or thermal, based on the type of camera installed inside it), and a stream of acoustic images, spatially registered with it. An acoustic image shows, for each pixel, a measurement of the intensity of the sound coming from a particular direction, just as in a thermal image each pixel represents the temperature of a certain portion of the surrounding environment.


Figure 6.2 Main Dual Cam applications for safety & security scenario.

Compared to a conventional microphone, the acoustic image makes it possible to discriminate sounds from different directions more effectively and to locate them in space (see the video https://www.youtube.com/watch?v=5c8Ca7bR60I). Furthermore, sound has the main advantage of not being influenced by adverse weather and environmental conditions such as snow, fog or smoke; situations that pose serious difficulties for conventional automatic security systems based only on cameras. The Dual Cam business model (see http://alexosterwalder.com/, https://www.strategyzer.com/canvas) consists of:

• Production and sale of optical-acoustic sensors (different sensor size for different applications).

• Production and sale of software licenses for data processing (different versions are foreseen: basic, intermediate and advanced).

• Service and maintenance of the product.

• Hardware and software engineering services for system customization in other application scenarios (transport, for example rail: local flaws such as wheel flats, with significantly increasing noise generation, and so on).

• Licensing of software for the processing of digital acoustic-optical signals and their interpretation using artificial intelligence techniques.


Figure 6.3 New Dual Cam POC idea.

The TRL⁶ (technology readiness level) of the current Dual Cam prototype is 5, that is, the technology has been validated in a relevant environment. The main technical domains are ICT and Manufacturing 4.0. Dual Cam hardware can also be simplified to apply the technology in the Internet of Things (IoT) and within the process of digitalization of industries and machinery known as Industry 4.0: using a single/virtual microphone, Dual Cam can be used for continuous monitoring of machinery, engines or robots, in order to develop advanced artificial intelligence algorithms for preventive maintenance.

To go ahead with the project, Dual Cam needs the development of a compact, easy to use and portable device. Moreover, this renewed device needs to be interfaced with a portable computer connected to the internet and running deep learning algorithms on the combined audio/video stream. We need to transform the current POC (proof of concept), suitable for a laboratory test setup with multiple cable connections and not easy to use, into a more engineered, market oriented, easy to use device. It is particularly important to validate the new POC's unique feature of recognizing the different audio sources within the same video scene using deep learning algorithms. The physical size reduction (we want to reach 20 cm x 20 cm, see later on and Figure 6.3) implies the re-design of the algorithm used for designing the microphone displacement inside the camera, and it requires deep testing in order to quantify the performance compared to the current version (planar array of 45 cm x 45 cm size). Therefore, it is mandatory to perform preliminary audio simulations for the microphone displacement and spectrum analyses to define the hardware specifications: we need a comparative study between the signal to noise ratio (SNR) observed using the whole audio band (as in the current solution) and the SNR that would be observed in the same circumstances and in the same application domains (highway, open space, orchestra, drone, etc.) with the reduced sensor, since a change in its size reflects a different audio frequency band which can be sensed.


Figure 6.4 Technological proposal and our approach to the problem.


Figure 6.6 Dual Cam relevant patent landscape.

The innovation we want to explore is the operation of the system using only the upper harmonics of the audio signals, neglecting their fundamental frequency components. Using only the upper harmonics makes it possible to reduce the size of the array and, at the same time, to improve directivity, due to the increase of the minimum frequency of the considered band (for example, a 2 kHz lower frequency or even higher). The potential technological risks are evident: the technology needs investments to be validated, and the acquired audio signals would no longer be listenable by humans due to the absence of the fundamental frequency ranges (limiting some of the possible applications). The new POC will be a small, light and handy device, useful for generating a spatial mapping of the audio sources (noise, but also voice and sounds). The final targeted dimensions will make Dual Cam compatible with tablet PC interfacing and docking (see later on for further details). After development, we will test the new POC in an anechoic chamber using acoustic sound sources with controlled frequencies and positions (different musical instrument sounds vs spatial recognition), and we will compare the results with the current Dual Cam system. Dual Cam is protected by the following two patents: the first concerning the method for designing the microphone displacement for the planar array (hardware), the second the source tracking method (software).

1. IT GE2014A000118 Marco CROCCO, Samuele MARTELLI, Vittorio MURINO, Andrea TRUCCO “Metodo Di Tracciamento Di Una Sorgente Acustica Bersaglio” December 2014.


2. IT 0001415813/PCT/IB2014/058466 Marco CROCCO, Vittorio MURINO, Andrea TRUCCO “Metodo Per La Configurazione Di Disposizioni Planari Di Trasduttori Per L’elaborazione Di Segnali A Banda Larga Mediante Beamforming Tridimensionale E Sistemi Di Elaborazione Di Segnali Che Utilizzano Tale Metodo, In Particolare Telecamera Acustica” January 2013

6.4 Market and Competition

The market strategies will be carried out with the primary purpose of intercepting the market sectors in which the Dual Cam technology is technically relevant and profitable at the same time. Dual Cam, directly or through partner companies, will promote, distribute and sell the sensor. The operating methods of promotion and sale will be evaluated and chosen following a detailed analysis for each scenario and for each type of customer. In many scenarios the reference customers of Dual Cam will be security system integrators, security system planners and distributors of sensors and safety devices (Figure 6.7). Commercial agreements, possibly of an exclusive type, will be stipulated if the guaranteed volume of business is substantial, as is typically the case for companies producing video surveillance cameras or antennas. Furthermore, it is expected that, in strategic sectors, collaboration contracts will be stipulated with possibly multi-mandated sales agents, already selected and already operating in the highlighted sectors, aimed at promoting the new Dual Cam technology internationally and towards institutional customers otherwise particularly difficult to reach. For some vertical market sectors that the team considers more profitable and easier to attack directly with its own sales force, such as the safety of stadiums or critical infrastructures, Dual Cam will be able to provide the end customer with the complete turnkey system, integrating it with the security and video surveillance platform that may already be present. However, the installation, commissioning and integration of the system will be outsourced to partner companies (system integrators) with suitably qualified personnel, trained during events organized by Dual Cam. This solution will guarantee Dual Cam higher margins on the sale of the device, but also an additional cash flow from the design and sale of the complete security system. For the particular applications of drone localization and of the detection of shots and anomalous behavior during events and fairs, our business model is not based on a traditional approach that involves selling the customer the hardware and a perpetual software license; it is instead based on managed services, i.e. the customer pays an annual or time-limited fee and receives a service in return. In this way the customer does not have to bear excessive costs if it only needs a temporary exploitation of the Dual Cam technology.


Figure 6.7 Dual Cam Business Model Canvas.

The main goal of this strategy is to increase the share of the potential market, to spread the technology by retaining customers, and to exploit the market segments characterized by a substantial mass media presence as an advertising showcase for the Dual Cam technology.

6.4.1 Test case: marketing sector "Air protection/drone detection"

The growing incidence of security breaches by unidentified drones, together with the rise in terrorism and illicit activities, are key drivers for the potential growth of the anti-drone systems market. The global anti-drone market is expected to reach $1.14 billion by 2022, at a CAGR of 23.89% between 2017 and 2022. The military and defense segment is expected to account for the largest share of this market. However, the large number of civilian places, such as critical infrastructure, public facilities and spaces and other soft targets, that require protection from potential drone threats is increasing the potential market sectors of anti-drone systems. Security experts from all over the world (in Italy, the Ministero dell'Interno - Polizia di Stato and the Ministero della Difesa - Arma dei Carabinieri) have to deal with these new challenges posed by civil drones, and there will be a strong demand for efficient systems that guarantee protection against these threats. These considerations create the conditions to hypothesize that the expansion still underway in the civil drone market will be accompanied by the imminent explosion of the market for anti-drone technologies, which must be modular and flexible in order to meet the needs of different end users.


Figure 6.8 Dual Cam Competitors.

Figure 6.9 Competitor analysis for drone localization, referring to products based entirely or partially on acoustic technology.


Our analysis has led us to estimate a potential market of almost 100,000 system installations which, conservatively considering 4 Dual Cam sensors for each installation, corresponds to a potential market of approximately 400,000 sensors. From the collected data, we assumed a penetration of 30% of the total European potential market and of 20% of the US one. Currently, there are no commercial solutions on the market that guarantee air protection in a robust and low-cost way in sensitive areas such as ports, power plants, prisons and representative buildings, or civil structures and citizens' homes. Using the Dual Cam optical-acoustic technology, it is instead possible to create a security system capable of locating a drone more effectively than a standard video surveillance system, even in conditions of poor visibility such as fog, smoke or adverse weather conditions. Thanks to a proprietary software library based on the latest artificial intelligence and digital signal processing techniques, Dual Cam can in fact be considered an IoT smart sensor capable of automatically analyzing the scene in real time, providing new features that would not otherwise be possible with a video-based security sensor. Dual Cam can also be easily integrated with other complementary technologies and sensors (e.g. radar) and with existing security systems through standard communication protocols.

6.4.2 Target Market, Go-to-Market strategy and Impact

Based on research carried out through interviews with professionals in the security market and with potential buyers of the Dual Cam system, it has been hypothesized that a Dual Cam sensor can be sold to the end customer at the unit price of €10,000 which, assuming a deduction of about 40% for the system integrator/security system designer who concluded the deal, corresponds to an income for the Dual Cam startup of approximately €6,000 per product. This perspective is suggested by the fact that, in the first few years, Dual Cam must necessarily rely on distribution companies already operating in the security systems market, having to create a solid network for the marketing, distribution and installation of the product. The market linked to anti-drone security systems is constantly growing, and this is also evidenced by the fact that the three competitors Dedrone⁷, Drone Shield⁸ and Orelia⁹ obtained significant funding from Venture Capital funds in 2015, equal to US $12.9 million and US $950,000 respectively, compared to approximately US $50 million invested globally in start-ups related to the drone market.

⁷ http://www.dedrone.com/en/
⁸ https://www.droneshield.com
⁹ http://dronebouncer.com/en


Figure 6.10 Dual Cam SWOT Analysis.

Dedrone's CEO stated in an interview that he receives hundreds of purchase requests every day, and the single-microphone-based Drone Shield technology had a great media return, as it was tested during the Boston Marathon and has been employed to protect the movie set during the filming of Star Wars Episode VII. Also the European Commission, in 2016, within the Societal Challenges pillar of the Horizon 2020 research funding program, allocated €5 million to develop a system for the localization and neutralization of small UAVs capable of constituting a reliable and effective tool for law enforcement services. We want to apply for a grant of €50K (SME Instrument)¹⁰ for the engineering of a compact and market oriented 20 cm x 20 cm Dual Cam POC that transmits both audio and video data to a commercial tablet PC. The audio frames are captured by the audio sensors placed on the main Dual Cam module, and the synchronous video part is captured by a video and/or a thermal/infrared imaging camera. These modules (i.e., the video/thermal/infrared camera and the Dual Cam module with on-board FPGA) are interfaced to the tablet PC with multiple USB buses. In turn, the tablet can be remotely connected to the internet using Wi-Fi/LTE or the next generation 5G for control and sharing of the captured data. The Dual Cam system will be battery powered and will be dockable to the tablet PC using a mechanical coupling. The Dual Cam module will include power buttons, LEDs and debugging ports. The tablet display will be used to visualize and process the captured data using deep learning networks (see, for instance, the Microsoft Surface Pro¹¹ or the Dell Latitude 7220EX¹²).

¹⁰ https://ec.europa.eu/programmes/horizon2020/en/h2020-section/eic-accelerator-pilot

¹¹ https://www.microsoft.com/it-it/p/surface-pro-7/8n17j0m5zzqs?activetab=pivot%3aoverviewtab


6.5 Business Plan

We plan our activities to build the new POC in 1 year (52 weeks), identifying milestones and deliverables (Figure 6.11).

6.5.1 Milestones

• Audio/HW/SW/Mechanical POC integration

• Final Demo

6.5.2 Deliverables

• Audio/HW/SW/Mechanical POC integration

• Audio simulations and HW specifications/requirements

• FPGA/Firmware release

• SW Release

• Test and validation report

• Patentability analysis of the developed features report

• Scientific production (papers, communications at major congresses/conferences and so on, after the patentability analysis)

6.5.3 Resources and Budget

If we receive the grant, we will have to manage €50k to accomplish the project and its activities in 1 year, building the new POC and moving from TRL 5 to TRL 6 (Figure 6.12).

6.6 Feasibility study

We want to increase the minimum frequency of the considered bandwidth (for example, 2 kHz as the lowest frequency, or even more), thus reducing the size of the array and at the same time improving directivity. To simulate the new POC planar array, I followed the approach proposed in [93].


Figure 6.11 Timeline of the project.


Figure 6.13 Beam Pattern in 2D.

6.6.1 Planar array beamforming simulation

In the simulation of the beam pattern of a planar array of microphones we have two angles of arrival, θ and φ, two steering angles, $\theta_0$ and $\varphi_0$, and two coordinates, $x_n$ and $y_n$, for each microphone. The expression of the BP is:

$$BP(\theta, \varphi, f) = \sum_{n=0}^{N-1} W_n(f)\, e^{-j 2\pi f \left[ x_n \frac{\sin\theta - \sin\theta_0}{c} + y_n \frac{\sin\varphi - \sin\varphi_0}{c} \right]} \qquad (6.1)$$

where N is the number of microphones. In order to find the position of the microphones, we have to minimize a cost function:

$$J(\mathbf{w}, \mathbf{d}) = \int_{\theta_{0\,\min}}^{\theta_{0\,\max}} \int_{\varphi_{0\,\min}}^{\varphi_{0\,\max}} \int_{\theta_{\min}}^{\theta_{\max}} \int_{\varphi_{\min}}^{\varphi_{\max}} \int_{f_{\min}}^{f_{\max}} |B(\mathbf{w}, \mathbf{d}, \theta, \varphi, \theta_0, \varphi_0, f) - 1|^2 + C\, |B(\mathbf{w}, \mathbf{d}, \theta, \varphi, \theta_0, \varphi_0, f)|^2 \; df\, d\theta\, d\varphi\, d\theta_0\, d\varphi_0 \qquad (6.2)$$

where d is the vector of the microphone positions and w is the vector of the filter coefficients. The first term of the cost function is an adherence term, while the second one, through the joint optimization of weights and positions, accounts for superdirectivity and aperiodicity. The expression of the 2D BP then becomes:

$$B(\mathbf{w}, \mathbf{d}, \theta_0, \varphi_0, \theta, \varphi, f) = \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} w_{n,k}\, e^{-j 2\pi f \left[ x_n \frac{\sin\theta - \sin\theta_0}{c} + y_n \frac{\sin\varphi - \sin\varphi_0}{c} + k T_c \right]} \qquad (6.3)$$


where K is the length of the FIR filters and $T_c$ is the sampling period. We apply a change of variables:

$$\begin{cases} u = \sin\theta - \sin\theta_0 \\ v = \sin\varphi - \sin\varphi_0 \end{cases} \qquad (6.4)$$

A new cost function can be defined as follows:

$$J(\mathbf{w}, \mathbf{d}) = \int_{u_{\min}}^{u_{\max}} \int_{v_{\min}}^{v_{\max}} \int_{f_{\min}}^{f_{\max}} |B(\mathbf{w}, \mathbf{d}, u, v, f) - 1|^2 + C\, |B(\mathbf{w}, \mathbf{d}, u, v, f)|^2 \; df\, du\, dv \qquad (6.5)$$

The new cost function is a good approximation of the original one, allowing us to reduce the number of integrals.

Microphone errors and robustness

Once again (see Chapter 3) we perform an optimization of the mean performance, i.e. of the multiple integrals of the cost function over the sensors' gain and phase, $A_n = a_n\, e^{-j\gamma_n}$, considered as random variables, obtaining a "robust" cost function involving the PDF (probability density function) of the random variable $A_n$:

$$J_{tot}(\mathbf{w}, \mathbf{d}) = \int_{A_0} \cdots \int_{A_{N-1}} J(\mathbf{w}, \mathbf{d}, A_0, \ldots, A_{N-1}) \cdot f_A(A_0) \cdots f_A(A_{N-1})\; dA_0 \cdots dA_{N-1} \qquad (6.6)$$

Substituting u and v into the BP expression we obtain:

$$B(\mathbf{w}, \mathbf{d}, u, v, f) = \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} w_{n,k}\, e^{-j 2\pi f \left[ x_n \frac{u}{c} + y_n \frac{v}{c} + k T_c \right]} \qquad (6.7)$$
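As a numerical illustration of Eq. (6.7), the following sketch evaluates the beam pattern of a planar array with FIR filters; the sound speed and sampling period values are only examples.

```python
import numpy as np

def beam_pattern(w, x, y, u, v, f, c=343.0, Tc=1.0 / 12800):
    """Evaluate B(w, d, u, v, f) for one (u, v, f) point.

    w: (N, K) FIR coefficients; x, y: (N,) microphone coordinates in metres;
    u, v: direction cosines after steering compensation; f: frequency in Hz.
    """
    N, K = w.shape
    k_idx = np.arange(K)
    # Delay term x_n*u/c + y_n*v/c + k*Tc for every microphone n and tap k.
    delay = (x[:, None] * u + y[:, None] * v) / c + k_idx[None, :] * Tc    # (N, K)
    return np.sum(w * np.exp(-2j * np.pi * f * delay))
```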

A new cost function can be defined as follows:

$$J(\mathbf{w}, \mathbf{d}) = \int_{u_{\min}}^{u_{\max}} \int_{v_{\min}}^{v_{\max}} \int_{f_{\min}}^{f_{\max}} |B(\mathbf{w}, \mathbf{d}, u, v, f) - 1|^2 + C\, |B(\mathbf{w}, \mathbf{d}, u, v, f)|^2 \; df\, du\, dv \qquad (6.8)$$

where

$$\begin{cases} u_{\max} = \sin\theta_{\max} - \sin\theta_{0\,\min} \\ u_{\min} = \sin\theta_{\min} - \sin\theta_{0\,\max} \\ v_{\max} = \sin\varphi_{\max} - \sin\varphi_{0\,\min} \\ v_{\min} = \sin\varphi_{\min} - \sin\varphi_{0\,\max} \end{cases} \qquad (6.9)$$

The new cost function is a good approximation of the original one, allowing us to reduce the number of integrals. The vector w can be extracted from the multiple integrals in the robust cost function, obtaining a quadratic form in w.

$$J_{tot}(\mathbf{w}, \mathbf{d}) = \mathbf{w} \cdot M(\mathbf{d}) \cdot \mathbf{w}^T - 2\, \mathbf{w} \cdot r^T(\mathbf{d}) + s \qquad (6.10)$$

Under suitable hypotheses, the integrals over the variables $A_n$ can also be extracted from the matrix M and the vector r and calculated in closed form, obtaining:

$$J_{tot}(\mathbf{w}, \mathbf{d}) = \mathbf{w} \cdot \left( A \otimes \tilde{M}(\mathbf{d}) \right) \cdot \mathbf{w}^T - 2\, \mathbf{w} \cdot \left( a \otimes \tilde{r}^T(\mathbf{d}) \right) + s \qquad (6.11)$$

For each element of $\tilde{M}$ and $\tilde{r}$, the integral over frequency can be calculated in closed form; moreover, it can be demonstrated that the two integrals in u and v can be converted into a single integral.

Minimization strategy

For a fixed microphone displacement, the global minimum of the robust cost function can be calculated in closed form. On the contrary, the presence of local minima with respect to the microphone positions prevents the use of gradient-like iterative methods. The final solution is given by a hybrid analytic-stochastic strategy based on the Simulated Annealing algorithm, whose main steps are listed below (a code sketch follows the list):

• Iterative procedure aimed at minimizing an energy function $f(y)$.

• At each iteration, a random perturbation is induced in the current state $y_i$.

• If the new configuration, $y^*$, causes the value of the energy function to decrease, then it is accepted.

• If $y^*$ causes the value of the energy function to increase, it is accepted with a probability dependent on the system temperature, in accordance with the Boltzmann distribution.

• The temperature is a parameter that is gradually lowered, following the reciprocal of the logarithm of the number of iterations.

• The higher the temperature, the higher the probability of accepting a perturbation causing a cost increase and of escaping, in this way, from unsatisfactory local minima.
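A minimal sketch of this annealing loop is given below; the perturbation scale, initial temperature and cooling constant are illustrative, and the cost function is assumed to return the robust cost with the weights already optimized in closed form for the given layout.

```python
import numpy as np

def simulated_annealing(cost, d0, n_iter=10000, step=0.01, T0=1.0, seed=None):
    """Minimize cost(d) over the microphone positions d with simulated annealing."""
    rng = np.random.default_rng(seed)
    d, J = d0.copy(), cost(d0)
    for i in range(2, n_iter + 2):
        T = T0 / np.log(i)                                  # temperature ~ 1 / log(iteration)
        d_new = d + rng.normal(scale=step, size=d.shape)    # random perturbation of the state
        J_new = cost(d_new)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if J_new < J or rng.random() < np.exp(-(J_new - J) / T):
            d, J = d_new, J_new
    return d, J
```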

6.6.2 Tests and results


Figure 6.14 Flow chart of the simulated annealing algorithm for cost function minimization.

Figure 6.15 First attempt results: left, cost function versus number of iterations; right, final positions of the microphones.

• Planar bi-dimensional array 25 cm x 25 cm; d = 0.25 m

• N° of microphones = 32

• K = 11 taps (FIR length)

• Fsampling = 12800 Hz

• Sensor's bandwidth = (2000 ÷ 6400) Hz

• Few iterations (to quickly test the trend of the results)

We report the output in terms of the behaviour of the cost function versus the number of iterations and the final positions of the microphones (Figure 6.15). Even if the simulation has been run with few iterations, the results are aligned with the expectations. In the BPs (Figure 6.16), the presence of grating lobes becomes evident as the frequency increases. The aperiodic displacement (not pushed) of the microphones is consistent with the fact that the number of iterations is low.


Figure 6.16 First attempt results: ideal BPs at several frequencies.

The image quality will improve as the main lobe shrinks and the side lobes decrease, so we have to take this into account. A question arises: is the use of an aperiodic array still necessary with a reduced array dimension and bandwidth (Figure 6.19)?

Metrics of evaluation

As already seen in Chapter 3, we need some metrics to evaluate the algorithm's performance. Once again they are the Directivity and the WNG. For a planar array, the definition of the Directivity (measured in dB) is:

$$S_a(f) = \frac{|BP(\theta_0, \varphi_0, f)|^2}{\frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} |BP(\theta, \varphi, f)|^2 \sin\theta \; d\theta\, d\varphi} \qquad (6.12)$$

The WNG measures the robustness of the system towards imperfections in the array characteristics. For a planar array, the definition of the White Noise Gain (measured in dB) is:

$$S_s(f) = \frac{|BP(\theta_0, \varphi_0, f)|^2}{\sum_{n=0}^{N-1} |H_n(f)|^2} \qquad (6.13)$$

We introduce once again the expected beam pattern power (EBPP), which evaluates the expectation of the squared modulus of the perturbed beam pattern and allows forecasting the impact on the BP of a given variance in the array characteristics (gain and phase of the microphone responses):


Figure 6.17 First attempt results: Directivity and WNG.

$$B_e^2(\theta, \varphi, f) = E\!\left\{ |B_r(\theta, \varphi, f)|^2 \right\} = \int_{A_0} \cdots \int_{A_{N-1}} |B_r(\theta, \varphi, f)|^2 \cdot f_{A_0}(A_0) \cdots f_{A_{N-1}}(A_{N-1})\; dA_0 \cdots dA_{N-1} \qquad (6.14)$$

Under the assumption of small variances, it can be approximated as:

$$B_e^2(\theta, \varphi, f) = |B(\theta, \varphi, f)|^2 + \frac{1}{S_s(f)} \left( \sigma_g^2 + \sigma_\psi^2 \right) \qquad (6.15)$$

The second term sets a threshold below which the EBPP cannot decrease. We can then evaluate not only the Directivity and the WNG but also the EBPP of our first simulation, taking for the Gaussian distribution of the microphones' mismatches $\sigma_g = 0.03 = 3\%$ and $\sigma_\psi = 0.035\ \mathrm{rad} \cong 2°$ (Figure 6.18): this quick simulation gives overall good results towards reshaping the sensor for Dual Cam 3.0. We then ran a simulation with the same data and geometry, imposing in the cost function a periodic displacement of the microphones and obtaining the synthesis of the FIR filters to build up, once again, the BP, WNG and Directivity (Figure 6.19).

We can see that the aperiodic layout of the microphones, with the chosen setting, is still necessary to avoid grating lobes, especially at higher frequencies. The next step is to produce another simulation with a higher number of iterations, a higher number of microphones and a longer FIR filter (more taps), to compare with the current one. The final goal is to get a balanced simulation for a square array of ∼20 cm x 20 cm with good WNG, BP and Directivity. Starting from this set of parameters, I tried to reduce the planar array aperture (u and v range) to adjust the FOV (field of view) and to increase the number of iterations. I found the following setting:

• d = 21 cm


Figure 6.18 First attempt results: BPs vs EBPPs.


Figure 6.20 Second attempt: left, cost function versus number of iterations; right, final positions of the microphones.

• K = 31 (FIR length)

• u ∈ [-1.5; 1.5]

• v ∈ [-1.5; 1.5]

• 1000 iterations

However, even if we increase the number of iterations up to 40K, the simulation is not able to completely suppress the grating lobes at higher frequencies, even with an optimized FOV. The next step has then been to try to enlarge the main lobe dimensions, because in this way we change the area where the algorithm tries to lower everything, and if there are lobes it has more degrees of freedom to try to break them down. We also saw that if we increase the number of taps, we sometimes get an unstable and useless simulation. I played with these main lobe parameters (in addition to the tests already done) to see what happens. I report the results of the following simulation (Figures 6.20, 6.21, 6.22, 6.23):

• d = 21 cm

• N° of microphones = 32 mic

• K = 31 (FIR length)

• u ∈ [-1.5; 1.5]; v ∈ [-1.41; 1.41]

• N° of iterations = 10000

• uMainLobe_low = -0.2; uMainLobe_high = 0.2

• vMainLobe_low = -0.2; vMainLobe_high = 0.2

A sub-range of u and v optimizes the BPs, avoiding grating lobes at higher frequencies, even if this of course reduces the FOV of Dual Cam 3.0.


Figure 6.21 Second attempt: Directivity and WNG.

Figure 6.22 Second attempt: BP and EBPP.


Comparison with current device

We want to compare the performance of the current device (wider bandwidth) with the new one (narrower bandwidth and smaller dimensions). We therefore simulate the current experimental condition:

• d = 45 cm
• N° of microphones = 128
• K = 7 (FIR length)
• u ∈ [-1.5; 1.5]; v ∈ [-1.41; 1.41]
• N° of iterations = 10000

• uMainLobe,low = -0.06; uMainLobe,high = 0.06
• vMainLobe,low = -0.06; vMainLobe,high = 0.06
• Bandwidth = [500 : 6400] Hz
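As a rough reference point before looking at the results, the two geometries can be related to the classical grating-lobe condition for a uniform array, spacing ≤ λ/(1 + |u_max|). The sketch below is only an order-of-magnitude estimate under the assumption of an equivalent uniform square grid with the same aperture and microphone count; both actual layouts are aperiodic and optimized.

```python
import numpy as np

C = 343.0  # speed of sound [m/s]

def grating_lobe_freq(aperture_m, n_mics, u_max=1.0):
    """Frequency above which an equivalent *uniform* square grid with the same
    aperture and microphone count would show grating lobes in the visible
    region |u| <= u_max. Only a rough reference for the aperiodic layouts."""
    side = int(np.round(np.sqrt(n_mics)))        # equivalent grid side
    spacing = aperture_m / (side - 1)            # uniform inter-microphone spacing
    return C / (spacing * (1.0 + u_max))

# Current device (45 cm, 128 mics) vs. new POC (21 cm, 32 mics)
for name, d, n in [("Dual Cam 2.0", 0.45, 128), ("Dual Cam 3.0", 0.21, 32)]:
    print(name, f"{grating_lobe_freq(d, n) / 1000:.1f} kHz")   # ~3.8 kHz and ~4.1 kHz
```

Both equivalent grids would start aliasing well below the 6.4 kHz band edge, and at similar frequencies, which is consistent with the comparison discussed next.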

Dual Cam 2.0 has better WNG and Directivity, especially at low frequencies (Figure 6.25); the main lobe of the BP and EBPP is also sharper (Figure 6.26). The grating lobes at high frequencies, instead, are more or less the same. The next step is to test the effect of a wider bandwidth in the FIR synthesis of the new Dual Cam POC. We then simulate this set of parameters:

• d = 21 cm

• N° of microphones = 32
• K = 31 (FIR length)
• u ∈ [-1.5; 1.5]; v ∈ [-1.41; 1.41]
• N° of iterations = 10000
• uMainLobe,low = -0.2; uMainLobe,high = 0.2
• vMainLobe,low = -0.2; vMainLobe,high = 0.2
• Bandwidth = [500 : 6400] Hz


Figure 6.24 Current Dual Cam prototype: cost function minimization and microphone positions.

Figure 6.25 Current Dual Cam prototype: Directivity and WNG.

The results are very promising (Figures 6.27, 6.28, 6.29): we can use the current Dual Cam bandwidth for the new POC. We are able to keep the bandwidth of the previous device with compact dimensions. Of course we lose a little in terms of main-lobe resolution, and some grating lobes at higher frequencies become problematic. We try to improve the simulation by pushing the number of microphones up as much as possible to recover the lost resolution, but we need to shorten the FIR filter length to make the simulation converge, while also slightly reducing the FOV to cope with the grating lobes at higher frequencies.


Figure 6.27 Dual Cam 3.0 with larger audio bandwidth: cost function minimization and microphone positions.

Figure 6.28 Dual Cam 3.0 with larger audio bandwidth: Directivity and WNG.


6.7

Conclusions

Dual Cam is an innovative product, which offers a series of personalized services in different application scenarios, backed by a consolidated and close-knit group of actors and by one of the most active research institutes in Italy today, competitive with the major organizations of the same type at the world level. These peculiarities will make it possible to guarantee the development of the POC/product and to effectively address EU and non-EU markets, maintaining a considerable technological competitive advantage over time, which is necessary to extend the market globally and to better face the competition, especially from companies on the Asian continent, characterized by multinational organizational structures and aggressive pricing policies. From a technical point of view, we demonstrated that we are able to simulate a new compact device, and we planned the activities needed to build it in one year with an investment of 50k € (grant lump sum).


Chapter 7

Conclusions

7.1

Wrap-up and Future Developments

The objective of this thesis has been the development of a set of innovative algorithms around the topic of beamforming in the fields of acoustic imaging, audio and image processing.

In the first part, I presented new techniques to reconstruct ultrasound images using plane waves, which shifts the equipment platform from hardware-based to software-based. This opportunity is made possible by GPUs and by new modules able to handle the required computational load. I highlighted the advantages of performing the image reconstruction in the frequency domain, which led to a novel, patented beamforming algorithm based on seismic migration and on masking the data in the k_x–ω space. The natural evolution and future work is to develop and test the algorithm when the medium is not homogeneous (phase aberration and/or image degradation).

I then presented and compared the metrics of two different simulation methods, following two different philosophies, for synthesizing the FIR filter coefficients of an efficient and robust superdirective beamformer targeting audio applications in a real experimental scenario with a compact linear array of microphones. The main drawback of the two methods is the limitation on the choice of the cost function imposed by the convexity conditions. In particular, there is no way to differentiate between the main-lobe and side-lobe regions, or to impose a worst-case design by minimizing the maximum of the side lobes. Moreover, the cost functions are quadratic, so low-energy regions of the BP carry little weight in the cost. Working with different (e.g. logarithmic) representations of the BP could allow a better shaping of the low-energy regions. For all these reasons, a new and better synthesis method should modify the cost function and give up the convexity property. Then,


to face the related problem of local minima, it would be necessary to resort to heuristic algorithms such as genetic algorithms. The simulations presented point out, for the two compared design methods, the trade-off between performance (directivity), invariance, robustness (WNG) and sensor accuracy. They represent a starting point for further investigations the reader can perform, providing insight into the parameters to modify in order to achieve the desired performance.

Later in the thesis, I generalized the traditional experimental playground in which the notion of cross-correlation identity (CCI) is applied to the estimation of TDOAs using blind channel deconvolution methods, switching from the case N = 2 to N > 2. The analysis shows that, by simply allowing for an increased number of microphones, the very same state-of-the-art method IL1C can be sharply boosted in performance without requiring any change in the computational pipeline. We deem that our findings open up a novel research trend in which CCI identities are better combined with the case N > 2, so that improvements in the error metrics can come from two different, yet complementary, factors: advances on the optimization side and multiple CCI relationships. We warm up the research efforts in this direction with two simple modifications of IL1C, showing that, with respect to an incremental addition of microphones, practitioners should prefer a late-fusion ensemble mechanism.

Finally, I investigated the potential of acoustic images acquired through the Dual Cam device in a novel self-supervised learning framework, with the aid of a new multimodal dataset specifically acquired for this purpose. Evaluating the trained models on classification and cross-modal retrieval downstream tasks, we have shown that acoustic images are a powerful source of self-supervision and that their information can be distilled into monaural audio and audio-visual representations to make them more robust and versatile. Moreover, features learned with the proposed method generalize better to other datasets than representations learned in a supervised setting. As a next step, we recorded outdoors a musical instruments dataset (9 instruments), played by musicians, containing both single instruments and pairs of instruments. We will use it in future work to distinguish two or more sounds in the same scene, associating each sound with its class in the video frame, in order to improve audio-visual localization when many sound-producing objects are present in the scene.

On the exploitation side, as discussed in Chapter 6, Dual Cam is an innovative product offering personalized services in different application scenarios, backed by a consolidated and close-knit group of actors and by one of the most active research institutes in Italy today, competitive with the major organizations of the same type worldwide. From a technical point of view, we demonstrated that we are able to simulate a new compact device that docks the planar microphone array on the back of a commercial tablet/PC, matching its dimensions, and we planned the activities needed to build it in one year with an investment of 50k € in R&D.

