
POLITECNICO DI MILANO

Como Campus

M.Sc. in Computer Science and Engineering
Department of Electronics, Information and Bioengineering

BigEar: A LOW COST WIRELESS DISTRIBUTED AUDIO CAPTURING

SYSTEM FOR UBIQUITOUS AUTOMATIC SPEECH RECOGNITION

Advisor: Prof. Fabio Salice

Tutor: Eng. Hassan Saidinejad

Master Graduation Thesis of:

Stefano Gorla, ID 745603


POLITECNICO DI MILANO

Polo Regionale di Como

Corso di Laurea Magistrale in Ingegneria Informatica
Dipartimento di Elettronica, Informazione e Bioingegneria

BigEar: SISTEMA A BASSO COSTO, WIRELESS, DISTRIBUITO DI

ACQUISIZIONE AUDIO PER IL RICONOSCIMENTO AUTOMATICO DEL

PARLATO

Relatore: Prof. Fabio Salice

Tutore: Ing. Hassan Saidinejad

Tesi di Laurea di:

Stefano Gorla, matricola 745603


Ai miei genitori, che mi hanno sempre sostenuto con la pazienza, la comprensione e il sacrificio.


Acknowledgements

First of all I would like to deeply thank Prof. Fabio Salice for his valuable and helpful advice, support and encouragement during this work. Special thanks to Eng. Ashkan Saidinejad for his useful and important tutoring. Without their support and their help, the pages of this work would be blank.

I also have to mention Prof. Augusto Sarti and Prof. Giuseppe Bertuccio for their willingness to help; special thanks to Eng. Marco Leone and Eng. Carlo Bernaschina for their help in the realization and implementation of the prototype.

Thanks to Emanuele De Bernardi, Fabio Veronese, Simone Mangano, who shared with me important moments of social brainstorming, but also lunch breaks and a lot of coffee.

During my long study career, I met many people, and many of them shared part of my journey. I want to thank first my family, dad and mom, my brother Marco with his wife Sonia, my nephews Anna and Giulia, my sisters Elena with her husband Pietro and Agnese, with Eleonora, Lucrezia and of course in our hearts Mauro.

Thank you to the people with whom I share my passions: my choir, the diocesan school of music, don Rinaldo, don Simone, don Nicholas, Celeste, Andrea and Carlo; Filippo, Laura and the friends from Castel San Pietro. Thanks to Rita, Ugo and the friends of AdFontes for their infectious enthusiasm. Thank you to the Non-violent guys group, to the Benzoni family, and to don Felice Rainoldi: not only an outstanding teacher but especially a true friend.

Thank you to the music: a strange world where you can escape, but where, as always, the best inspirations are found.

The final thank goes to Paola. Thank you for having always believed in me and for giving me day by day the right encouragement to continue my work in many hard times. Thank you.


Abstract

The ageing of the world's population will raise the demand and challenges of elderly care in the coming years. Several approaches have been devised to deal with the needs of older people proactively. Assistive domotics represents a relatively recent effort in this direction; in particular, vocal interaction can be a favored way to control the smart home environment, provided that the interface fulfills requirements of transparency and unobtrusiveness. The absence of intrusive devices induces a more natural interaction model in which there is no need to wear a microphone or to issue commands to specific "hotspots" of the dwelling.

From these assumptions, a wireless, modular and low-cost speech capturing system has been implemented, in which a set of wireless audio sensors send the captured data to a Base Station, which in turn is responsible for aggregating the received data in order to rebuild the speech captured distributedly by the BigEar Audio Sensors. The reconstruction algorithm performs a first stage of energy and delay analysis of the audio streams coming from the sensors; this stage is needed to compensate the energy and delay differences due to the different source-sensor distances. Then, the streams are superposed in order to merge each contribution into a unique output stream. Each sensor generates an audio stream that, depending on the network interaction model, will not be complete, thus presenting sequences of silence (holes). Holes in the reconstructed signal drastically decrease the accuracy of the speech recognition procedure. Although the BigEar system, working with the best network parameters, ensures high chances of successful speech recognition, four different methods have been used to repair the audio signal before sending it to the speech recognition block.

The BigEar architecture has been simulated by means of a MATLAB-based simulator that allows the whole system behavior to be studied, from the acoustic simulation of the environment (room) up to the Network Interaction protocol. Once the best parameters have been pointed out by the simulation, a real-world prototype has been realized.

From the Results Analysis it can be seen that BigEar can be identified as a minimum-cost, wireless and distributed system. Moreover, the ubiquitous approach allows Data Intelligence mechanisms, e.g. performing a coarse-grain localization - using sensor signal power and delay information - in order to add informative content that could disambiguate context-free vocal commands ("Turn off the light" or - better - "Turn off this light" could be integrated with localization information in order to determine which light has to be switched off).


Sommario

L’

INVECCHIAMENTOdella popolazione mondiale nei prossimi anni causerà l’accrescimento

del-la domanda e le redel-lative sfide neldel-la cura degli anziani. Sono stati studiati numerosi approcci per far fronte alle esigenze delle persone meno giovani. La domotica assistiva rappresenta un passo - seppur relativamente recente - in questa direzione; in particolare, l’interazione vocale può essere una via preferenziale per controllare l’ambiente domestico gestito dalla domotica, a patto che l’interfaccia uomo-macchina soddisfi requisiti di trasparenza e non-intrusività. L’assenza di dispositi-vi invasidispositi-vi induce a una interazione più naturale in cui non sia necessario indossare un microfono o impartire comandi attraverso specifici punti di ascolto dell’abitazione.

A partire da queste premesse è stato implementato un sistema modulare, senza fili e a basso costo di cattura del parlato, in cui un insieme di sensori audio invia i dati acquisiti a una Base Station, la quale provvede ad assemblare i dati ricevuti in modo da ricostruire il parlato precedentemente acquisito, in modo distribuito, dai sensori. L'algoritmo di ricostruzione esegue anzitutto l'analisi energetica e dei ritardi dei flussi audio; questa operazione è necessaria alla compensazione delle differenze energetiche e dei ritardi dovuti alle distanze tra i sensori e la sorgente. Dopodiché, i flussi audio vengono sovrapposti in modo da fondere i singoli contributi in un unico flusso audio. A seconda del modello di interazione di rete, ogni sensore genera un flusso audio che presenta sequenze di silenzio (buchi). I buchi nel segnale ricostruito diminuiscono drasticamente l'accuratezza della procedura di riconoscimento del parlato. Sebbene il sistema BigEar, quando configurato per operare con i migliori parametri di rete, assicuri alte probabilità affinché il riconoscimento dia risultati positivi, sono stati testati quattro diversi metodi per riparare il segnale audio prima che esso venga inviato al blocco di riconoscimento del parlato.

L’architettura è stata testata tramite un simulatore basato su codice MATLAB che permette di studiare il comportamento dell’intero sistema, dalla simulazione acustica dell’ambiente fino ai protocolli usati per le interazioni di rete. A partire dai parametri ottimi indicati dal simulatore è stato realizzato un prototipo reale.

Dall'analisi dei risultati si può notare che BigEar può essere identificato quale sistema a costo minimo, senza fili e modulare. Inoltre, l'approccio distribuito al problema permette meccanismi di Intelligenza dei Dati, ad esempio eseguendo una localizzazione sommaria della sorgente (sfruttando le informazioni di ritardo e di energia del segnale) che permetta di aggiungere contenuto informativo in grado di disambiguare comandi vocali privi di contesto (i comandi "Spegni la luce" o - meglio - "Spegni questa luce" potrebbero essere integrati con una localizzazione in grado di determinare quale luce debba essere spenta).


Contents

Acknowledgements
Abstract
Sommario

1 Introduction
   1.1 BRIDGe - Behaviour dRift compensation for autonomous and InDependent livinG
   1.2 Ubiquitous Computing
   1.3 BigEar Application
   1.4 Related Works
   1.5 Structure of the Dissertation

2 BigEar Architecture
   2.1 Overview of the System
   2.2 Wixel Programmable USB Wireless Module
   2.3 Audio Interface
       2.3.1 Physical Characteristics of the Typical Situation
       2.3.2 Sensitivity of the microphone
       2.3.3 ADC characteristics
   2.4 BigEar Receiver and Base station
   2.5 Network Protocols
       2.5.1 ALOHA Protocol

3 BigEar Modeling and Simulation
   3.1 Overview
   3.2 Audio Model
       3.2.1 Physical Model
       3.2.2 MCROOMSIM toolbox
   3.3 Sensors network model
       3.3.1 Radio Transmission Model

4 Signal Reconstruction and Repair
   4.1 Overview
   4.2 Energy Compensation
       4.2.1 Bias Removal
       4.2.2 Normalization
   4.3 Streams Superposition
       4.3.1 Weighted Sum of Contributions
       4.3.2 Holes Replacement
   4.4 Cross-correlation
       4.4.1 Cross-correlation Drawbacks
       4.4.2 Envelopes Cross-correlation
   4.5 Healing Signal Fragmentation
       4.5.1 B-Spline Interpolation
       4.5.2 Noise Addition
       4.5.3 Packet Loss Concealing
       4.5.4 LSAR - Least Squares Auto-Regressive Interpolation
       4.5.5 Audio inpainting
       4.5.6 Healing methods comparison
   4.6 Summary

5 BigEar Implementation
   5.1 Overview
   5.2 Hardware Setup
       5.2.1 BigEar Audio Capture Board
       5.2.2 BigEar Receiver
   5.3 Software Implementation
       5.3.1 BigEar Audio Sensor
       5.3.2 BigEar Input Calibration
       5.3.3 BigEar Receiver
       5.3.4 BigEar Base Station

6 BigEar Results Analysis
   6.1 Overview
   6.2 Metrics Definition
       6.2.1 Reconstructed Signal Metrics
       6.2.2 Software Performance Metrics
   6.3 Simulation Setup
   6.4 On-field Setup
   6.5 Results Discussion
       6.5.1 Reconstructed Signal Comparison
       6.5.2 Software Performance Tests
       6.5.3 Coarse-grain localization

7 Conclusions and Future Work
   7.1 Conclusions
   7.2 Strengths and Weaknesses
       7.2.1 Strengths
       7.2.2 Weaknesses

A TI CC2511 MCU Key features
   A.1 CC2511F32 Logic Scheme
   A.2 CC2511F32 Radio Packet format

B BigEar simulator MATLAB implementation
   B.1 Radio Transmission Model
   B.2 N-Buffer Internal Model
   B.3 Signal Reconstruction block

C BigEar Reconstruction MATLAB Script

D Cross-correlation convergence test

E BigEar Implementation schematics and pictures
   E.1 BigEar Audio Capture board
   E.2 BigEar Receiver
   E.3 BigEar Audio Capture application
   E.4 BigEar Input Calibration
   E.5 BigEar Receiver application
   E.6 BigEar SerialPort application GUI

F BigEar Simulation
   F.1 Simulation setups

G Datasheets
   G.1 Kingstone KECG2738PBJ-A Electret Microphone
   G.2 Linear Technology LT1013 Dual Precision Op Amp

List of Figures

2.1 Overview of the BigEar architecture
2.2 Wixel module pinout and components
2.3 CC2511 ADC Block diagram
2.4 BigEar Receiver Logic
2.5 BigEar Application Reconstruction Logic
2.6 Aloha vs. Slotted Aloha throughput
3.1 Architecture model
3.2 Geometry of wave propagation from a point source x1 to a listening point x2
3.3 Point-to-point spherical wave simulator
3.4 Law of reflection
3.5 Geometry of an acoustic reflection caused by "multipath" propagation
3.6 Image source method using two walls, one source and one listening point
3.7 MCROOMSIM data flow from configuration stage to the end of simulation
3.8 Simulation and reconstruction flowchart
3.9 Flowchart of the radio transmission model
3.10 Buffer frames sliding
3.11 Flowchart of the N-buffer model
4.1 Flowchart of the signal reconstruction block
4.2 Alignment of audio packets considering their timestamp
4.3 Flowchart of the correlation analysis block
4.4 Representation of a holey signal
4.5 How summation of different biases might influence the reconstructed signal
4.6 Harmonic distortion due to unweighted sum of contributions
4.7 Cross-correlation test between noisy and holey signals
4.8 Cross-correlation analysis and alignment on signal envelopes
4.9 B-Spline method for healing the reconstructed signal
4.10 G.711.I frame erasure concealment algorithm
4.11 Gabor spectrograms of original, reconstructed and inpainted streams
5.1 BigEar Audio Capture Board biasing circuit
5.3 LT1013 OpAmp voltage vs. frequency gain
5.4 Frequency response of the simulated signal conditioning circuit
5.5 Flowcharts of the BigEar Audio Capture application
5.6 Signal and Saturation LED indicators policy
5.7 BigEar Data field structure
5.8 Flowchart of the BigEar Receiver application
5.9 BigEar SerialPort raw file formats
5.10 BigEar SerialPort Capture flowchart
5.11 BigEar SerialPort Raw-to-CSV decoding
5.12 BigEar SerialPort Class Diagram
5.13 BigEar Reconstructor flowchart
6.1 Reconstructed signal metrics plotted as a function of the TmaxDelay parameter
6.2 Fill_ratio as a function of TmaxDelay
6.3 BigEar Reconstructor RPR metric
6.4 BigEar Capture Board application timing analysis
6.5 BigEar Reconstructor PAR metric
6.6 Signal power vs. alignment delay plot for coarse-grain localization


List of Tables

2.1 Physical characterization of the Typical Situation
2.2 ADC: Speed of Conversion vs. Decimation Rate
4.1 Noise Addition for Speech Signal Repair
4.2 Healing methods comparison
5.1 Design Choices for Signal Conditioning stage
5.2 BigEar Audio Capture application Design Choices and Parameters


1 Introduction

Quality of life depends heavily on the efficiency, comfort and cosiness of the place an individual calls "home". Thus, a wide range of products and systems have been invented in the past to advance human control over the entire living space. Domotics is a field that specializes in automation techniques for private homes, often referred to as "home automation" or "smart home technology". In essence, home environment control and automation extend the various techniques typically used in building automation, such as light and climate control, control of doors and window shutters, surveillance systems, etc., through the networking of ICT in the home environment, including the integration of household appliances and devices. Such solutions not only offer comfort and security, but when serving an elderly person or a person with disability (a fragile person in general) they can also improve safety and individual independence. Assistive domotics represents a relatively recent effort in this direction that further specializes in the needs of people with disability, older persons, and people with little or no technical affinity, and which seeks to offer such residents new levels of safety, security and comfort, and thereby the chance to prolong their safe stay at home.

The ageing of the world's population will raise the demand and challenges of elderly care in the coming years. Based on a study of the US census, the number of people aged over 65 will increase by 101 percent between 2000 and 2030, at a rate of 2.3 percent each year; during that same period, the number of family members who can provide support for them will increase by only 25 percent, at a rate of 0.8 percent each year. Several approaches have been devised to deal with the needs of older people proactively.

1.1 BRIDGe - Behaviour dRift compensation for autonomous and InDependent livinG

The BRIDGe (Behaviour dRift compensation for autonomous InDependent livinG) project [17], being carried out at Politecnico di Milano - Polo di Como, aims to build strong connections between a person living independently at home and his or her social environment (family, caregivers, social services, and so on) by implementing a system that provides focused interventions according to the user's needs.

The target of BRIDGe is people with mild cognitive or physical impairments and, more generally, fragile people whose weakness threatens their autonomy, health or other important aspects of life. Fragile people need mutual reassurance: they typically want to be independent and autonomous, but they also know that somebody else must often be present to help them with unexpected needs.

At Bridge’s core is a wireless sensor-actuator network that supports house control and user behavior detection through a rich and flexible communication system between the person and his or her social environment, aimed at reassuring both the family and the user.

One of the simplest ways the user can meet his or her needs is to ask someone to do something: to switch off the light, to unlock the door, to increase the room temperature, etc. Inhabitants of the "smart home" can vocally control some parts of the dwelling (lights, doors, principal gate, and so on) by issuing commands to a Vocal Command Interface (VCI).

The inhabitant starts by activating the VCI, which is continuously listening and waiting for the personalized activation keyword. Activation is confirmed to the user through a vocal message, and the VCI is now ready to receive the inhabitant's vocal command (for example, "turn on the light in the kitchen"). The vocal command is analyzed in an ad hoc application and the result of the interaction is the execution of the vocal command.

1.2 Ubiquitous Computing

Ubiquitous computing is a paradigm in which the processing of information is linked with each activity or object as encountered. It involves connecting electronic devices, including embedded microprocessors, so that they can communicate information. Devices that use ubiquitous computing are constantly available and completely connected.

Ubiquitous computing is one of the keys to the success of the BRIDGe system: the more unobtrusive and transparent the sensor network is with respect to the interaction of the elder with his environment, the greater the sense of autonomy and independence felt by the user.

In the case of vocal interaction, a ubiquitous approach to the problem requires that the interaction take place as transparently as possible: there is no need to wear a microphone or to issue commands to specific "hotspots" of the dwelling. The absence of intrusive devices induces a more natural interaction model that in turn generates several positive aspects:

Learning curve performance: The elderly user has no need to wear devices in order to perform human-computer interaction; he/she only needs to ask someone to perform an action. In this way, the learning process will not present the typical high barrier that arises when the user approaches an interaction model foreign to his user experience.

Technology acceptance: According to the Technology Acceptance Model [2], the two most important attitudinal factors in explaining acceptance and usage of a new technology are perceived usefulness and perceived ease of use. Perceived usefulness is described as "the degree to which a person believes that using the particular technology would enhance his/her job performance". Perceived ease of use is defined as "the extent to which a person believes that using a technology is free of effort". A lack of ease of use, in particular when dealing with the elderly, can demotivate the user and bring him to consider a technology as useless or misleading. "Why do I have to wear a mobile phone for switching off the light? It would be easier if I could use a remote control directly".

Mastering of the living environment: As permanent medical devices can reduce the perceived quality of life, so intrusive assistive technologies can be perceived as a gilded cage that lowers the sense of freedom and induces the user to feel a stranger in his own house. Ubiquitous and transparent systems can help elderly or injured people to feel masters of their home.

1.3 BigEar Application

The target of this work is to model, realize and implement a distributed speech capturing system that meets the expectations of ubiquitousness and transparency described above. Behind the acronym BigEar (uBiquitous wIreless low-budGet spEech cApturing inteRface) is an audio acquisition system built following Wireless Sensor Network [3] requirements:

Minimum cost: The adopted technology (hardware and software) has to consider the economic possibilities of the people, according to the paradigm: "a good but costly solution is not a solution at all".

Wireless: The absence of power and/or signal cables is a strong requirement in order to lower costs for house adaptation. Moreover, wireless systems can ensure a higher degree of flexibility and configurability than wired systems.

Distributed: The key to pervasiveness is distributed computing; according to its definition [7, p. 2], the devices concur in building a result whose information content is more than the sum of the single contributions. Moreover, the sensors are completely independent, and the failure of one of them will not completely compromise the result of the sensor network collaboration.

Modular: The system should be implemented using a modular approach in order to be scalable and quickly configurable for matching environment characteristics and user’s needs.

Responsive: Speech recognition should be immediate, so speed of processing is a crucial requirement in order to give the user immediate feedback; assistive domotic interaction has to be as fast as possible, since the user will not wait for Microsoft Windows' hourglass.

1.4 Related Works

Literature focusing on Wireless Low-Cost Speech Capturing systems is limited. In general, the approach to the subject tends to emphasize one aspect (such as audio quality) at the expense of others (such as flexibility) and none of the examined projects can be considered as low cost.


SWEET-HOME Project

The SWEET-HOME project [15] aims at designing a smart home system based on audio technology, focusing on three main aspects: to provide assistance via natural man-machine interaction (voice and tactile command), to ease social e-inclusion and to provide security reassurance by detecting situations of distress. The targeted smart environments in which speech recognition must be performed thus include multi-room homes with one or more microphones per room set near the ceiling.

To achieve the project goals, a smart home was set up. It is a thirty square meter flat including a bathroom, a kitchen, a bedroom and a study; in order to acquire audio signals, seven microphones were set in the ceiling. All of the microphones were connected to a dedicated PC embedding an 8-channel input audio card.

Strength: The approach to multi-channel audio alignment proposed by the paper is the same as the one adopted in the BigEar Reconstruction Algorithm (Ch. 4); it is based on the alignment of each audio channel with the one having the highest SNR.

Weaknesses: The use of a multichannel audio card requires dedicated hardware, increases costs and reduces flexibility. Moreover, a wired approach is hard to implement, since it requires fixing wires onto walls or running them into existing electrical conduits.

Wireless Sensor Networks for Voice Capture in Ubiquitous Home Environments

The work introduces a voice capture application using a hierarchical wireless sensor network: nodes in the same area belong to the same group of nodes (cluster); in each cluster there is a special node that coordinates the cluster activities (clusterhead). Each clusterhead collects audio data from their cluster nodes and relays it to a base station instantiated by a personal computer. Each node is instantiated by a MicaZ mote.

In order to capture audio signals from the environment, sensor nodes have a high-sensitivity microphone, which is initially used to detect the presence of human voice in the environment; to do so, the audio signal is continuously sensed at a sampling frequency of 2 kHz and, if the sensed signal intensity exceeds a predetermined threshold, the node sends a notification through its wireless interface to the clusterhead, which in turn sends a capture command back to the sensor node. The sensor node then enters a High Frequency Sampling (HFS) mode to capture three seconds of audio and store each sample in the on-board EEPROM. Once the node finishes sampling, it exits the HFS mode and transfers the audio data to the clusterhead node, which in turn relays it to the base station, where a more resourceful computer processes the speech recognition tasks.

The High Frequency Sampling state involves shutting down the communications interface and sampling audio through the microphone at a sampling frequency of 8 kHz. The microphone in the motes is wired to a 10-bit resolution ADC, but only 8 bits per sample were used. It was imperative to shut down the communications interface to be able to sample at such a frequency, because both components cannot be enabled simultaneously due to the fact that these nodes have a resource-limited microcontroller on-board.

For speech recognition, the VR Stamp Toolkit from Sensory Inc. is used. The VR Stamp toolkit is an off-the-shelf tool that uses neural-network-based algorithms for voice pattern matching. It consists of a daughterboard with an integrated RSC-4128 speech processor, 1 Mbit of flash memory and a 128 kB serial EEPROM for data; this board is attached to the base station (PC) via a USB port.

Strength: The Low Frequency Sampling (LFS) state allows power saving and extends battery life.

Weaknesses: The system is not fully real-time, since speech recognition starts only after 3 seconds of audio data have been recorded and sent; MicaZ motes have a limited lifetime, since the on-board EEPROM can bear a number of read/write operations that is far lower than RAM memory; and the system is very expensive, since each MicaZ mote costs about 80 $, to which the costs of the microphones and of the VR Stamp toolkit have to be added.

1.5 Structure of the Dissertation

This work is organized as follows: in Chapter 2 the proposed architecture of the BigEar system is described, focusing on the characterization of the context of use, on the features of the Wixel prototyping board (core of the system) and on the communication protocol between the components of the system. The following chapter illustrates the simplified model that allows the architecture's capabilities to be investigated by means of a simulation environment. The core of the system is represented by the BigEar Reconstruction Algorithm described in Chapter 4; in that chapter all of the methods used for reconstructing a speech signal starting from the audio data captured by each sensor are described. A real-world working prototype of the system has then been built following the guidelines and the design choices discussed in Chapter 5. The audio data captured by means of the BigEar prototype have been compared with the ones generated by the simulator, and the results of this comparison are discussed in Chapter 6. Finally, in Chapter 7 final remarks conclude the dissertation and outline future work.


2 BigEar Architecture

Introduction

In this chapter the architecture of the system is described. Basically, it is composed of a set of wireless audio sensors connected to a master receiver that collects data from the sensors and sends audio packets to the base station via USB. The base station is responsible for aggregating received data in order to rebuild the speech captured distributedly by the sensors.

2.1 Overview of the System

The system is composed of a network of audio sensors that capture audio in a room distributedly. The speech is sent to a main receiver, which basically acts as an interface that converts speech packets received via the radio channel into serial data to be sent to the base station. The base station contains the application logic for handling speech packets. Since the audio sensors basically perform a space-time sampling of the audio inside a room, the application logic tries to reconstruct a good-quality speech stream starting from packets that arrive at the base station with different timestamps and different physical characteristics. Indeed, each sensor samples the audio signal that reaches the microphone after undergoing variations due to the physical model of the environment: different delays and amplitudes that depend on the position of the person with respect to the sensors, and reflections/diffusions due to the geometry of the room and the materials of the walls and furniture. The granularity of the space-time sampling is influenced by:

• Number of audio sensors w.r.t. the dimensions of the room: the larger the number of sensors spread in the room, the finer the granularity of the space sampling.

• Audio sensor internal characteristics and constraints: each sensor needs time in order to sample data (depending on ADC type), store them in buffers and send them to the main receiver.

• Network communication protocol characteristics and constraints: the number of packets sent to the main receiver is affected by the number of collisions that may happen on the channel and also by the protocols themselves (handshaking, request-response timings, timeslot allocations).

Figure 2.1: Overview of the BigEar architecture

Figure 2.2: Wixel module pinout and components (from www.pololu.com/wixel)

In the following sections the different components of the BigEar architecture will be explained.

2.2 Wixel Programmable USB Wireless Module

The leaf nodes of the architecture are the sensors, which are built using Wixel Programmable USB Wireless Modules.

The Pololu Wixel is a general-purpose programmable module featuring a 2.4 GHz radio and USB port. The Wixel is built around the CC2511F32 microcontroller from Texas Instruments, which has an integrated radio transceiver, 32 KB of flash memory, 4 KB of RAM, and a full-speed USB interface.

The overall Wixel pinout and peripherals are depicted in Figure 2.2. A total of 15 general-purpose I/O lines are available, including 6 analog inputs, and the 2.54 mm (0.1") pin spacing makes the Wixel easy to use with breadboards and perfboards.

The Wixel can connect to a computer's USB port via a USB A to Mini-B cable or a USB A to Mini-B adapter. The USB connection is used to configure the Wixel and also to transmit and receive data. The USB connection may also provide power to the Wixel.

On the side of the board opposite the USB connector, the Wixel has a 2.4 GHz PCB trace antenna. This antenna, along with the other RF circuitry, forms a radio that allows the Wixel to send and receive data packets in the 2.4 GHz band. The CC2511F32 core makes the Wixel compatible with the CC2500 transceiver, the CC2510Fx family, and the CC2511Fx family of chips from Texas Instruments. The Wixel’s radio is not compatible with Wi-Fi, Zigbee, or Bluetooth. The antenna is a “meandered Inverted F” design that provides optimal performance despite its small size.

TI CC2511F32 Microcontroller Unit

The CC2511F32 is a true low-cost 2.4 GHz system-on-chip (SoC) designed for low-power wireless applications. The CC2511F32 combines the excellent performance of the state-of-the-art RF transceiver CC2500 with an industry-standard enhanced 8051 MCU, up to 32 kB of in-system programmable flash memory and 4 kB of RAM, and many other powerful features. The small 6x6 mm package makes it very suited for applications with size limitations.

The CC2511F32 is highly suited for systems where very low power consumption is required. This is ensured by several advanced low-power operating modes. The CC2511F32 adds a full-speed USB controller to the feature set of the CC2510F32. Interfacing to a PC using the USB interface is quick and easy, and the high data rate (12 Mbps) of the USB interface avoids the bottlenecks of RS-232 or low-speed USB interfaces.

2.3 Audio Interface

The analog stage that converts pressure waves into an electric signal and conditions it for feeding the ADC has been designed to match the physical characteristics of a speech signal and the constraints due to the MCU features and capabilities. Pressure waves are captured by a small microphone that converts them into an electric signal whose amplitude depends on both the intensity of the pressure wave and the sensitivity of the microphone itself. The electric signal then has to be amplified in order to make it readable (the output of the microphone is much smaller than the quantization step) and to minimize the quantization error. Moreover, the signal has to be biased, since the ADC works between 0 and a positive voltage Vref while the microphone output swings symmetrically around 0.

2.3.1 Physical Characteristics of the Typical Situation

In telephony, the usable voice frequency band ranges from approximately 300 Hz to 3400 Hz [8]. It is for this reason that the ultra low frequency band of the electromagnetic spectrum between 300 and 3000 Hz is also referred to as voice frequency, being the electromagnetic energy that represents acoustic energy at baseband. The bandwidth allocated for a single voice-frequency transmission channel is usually 4 kHz, including guard bands, allowing a sampling rate of 8 kHz to be used as the basis of the pulse code modulation system used in telephony.


Here the simplest case of a single person speaking in a room is considered. According to SPL tables [9], the Sound Pressure Level (SPL) is taken to be about 60 dB one meter away from the source. Typically the sensors are attached to the ceiling of the room, so it can be assumed that they will be placed at an average height of 2.80 m above the floor. According to the standard heights used in architecture [31, p. 109], the mouth of a person is considered to be located between 1.12 m and 1.53 m above the floor, which gives an average vertical mouth-ceiling distance of 1.48 m. Taking into account that the person will probably not stand exactly below a sensor, an arbitrary mouth-sensor distance of about 2 m has been considered.

Sound in free field

A field is defined as free when all the reflections are negligible (e.g. an anechoic chamber, or a sound field region with no adjacent reflecting surfaces). In this situation, the sound pressure halves with every doubling of distance (the SPL decreases by about 6 dB). From the Hopkins-Stryker equation [22, p. 508]:

$$L_p = L_w + 10\log_{10}\underbrace{\left(\frac{Q}{4\pi r^2}\right)}_{\text{distance attenuation}} \qquad (2.1)$$

where:

$L_w$ is the source sound power level,
$Q$ its directivity,
$r$ is the distance between source and receiver.

Evaluating the distance attenuation in Equation 2.1, it can be noted that doubling r increases the distance attenuation by about 6 dB. From these values, the effective pressure that reaches the microphone capsule can be calculated:

$$L_p = 20\log_{10}\left(\frac{p}{p_0}\right) \;\Rightarrow\; p = p_0 \cdot 10^{L_p/20}, \quad \text{with } p_0 = 20\ \mu\mathrm{Pa} \qquad (2.2)$$

$$L_p = 60 - 6 = 54\ \mathrm{dB} \qquad p_{speech} = 20 \times 10^{-6}\ \mathrm{Pa} \cdot 10^{54/20} = 10.02\ \mathrm{mPa} \qquad (2.3)$$

where p is the sound pressure expressed in Pa and $L_p$ is its corresponding level expressed in dB.


Voice frequency bandwidth: from 300 to 3000 Hz (BW = 2700 Hz)
Speech intensity: 60 dB at one meter away from the mouth
Typical height of a room: 2.80 m
Mouth height (seated - upright position): 1.12 - 1.53 m
Considered mouth-sensor distance: 2 m
Speech pressure at 2 m: from 10.02 mPa up to 20 mPa

Table 2.1: Physical characterization of the Typical Situation

Semi-reverberant fields

A field is said to be semi-reverberant when it contains both free-field zones (near the source, where the direct sound prevails) and reverberant field zones (near the walls, where the reflected field prevails). Normally sized rooms can be assumed to be semi-reverberant fields. In this situation, the Sound Pressure Level can be defined by the Hopkins-Stryker equation as:

$$L_p = L_w + 10\log_{10}\left(\frac{Q}{4\pi r^2} + \frac{4}{A}\right) \qquad (2.4)$$

$$A = \bar{\alpha}\,S = \sum_i \alpha_i \cdot S_i$$

where:

$L_w$ is the source sound power level,
$Q$ its directivity,
$r$ is the distance between source and receiver,
$\bar{\alpha}$ is the average absorption coefficient,
$S$ is the total interior surface.

In a semi-reverberant acoustic field, the sound energy density at a point is therefore given by the sum of the direct and indirect acoustic fields. Depending on the acoustic absorption coefficients and the size of the room, the overall attenuation could thus be much lower than 6 dB per distance doubling, so the amplifier design will take this range of pressure values into account in order to provide the best signal to the ADC stage. Table 2.1 summarizes the physical characteristics of the situation described above.
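As a numerical check of Equations 2.1-2.4, the following MATLAB sketch compares the free-field and semi-reverberant sound pressure at the assumed 2 m mouth-sensor distance; the room dimensions and the absorption coefficient are illustrative values, not measurements from this work.

```matlab
% Free-field vs. semi-reverberant speech pressure at the sensor (Eq. 2.1-2.4).
Q  = 1;                                % omnidirectional source
r  = 2;                                % mouth-sensor distance [m]
Lw = 60 - 10*log10(Q/(4*pi*1^2));      % source power level giving 60 dB SPL at 1 m

% Hypothetical 4 x 5 x 2.8 m room with average absorption coefficient 0.3
S     = 2*(4*5 + 4*2.8 + 5*2.8);       % total interior surface [m^2]
alpha = 0.3;
A     = alpha * S;                     % room constant A = alpha*S [m^2]

Lp_free = Lw + 10*log10(Q/(4*pi*r^2));          % Eq. 2.1 (free field)
Lp_semi = Lw + 10*log10(Q/(4*pi*r^2) + 4/A);    % Eq. 2.4 (semi-reverberant)

p0     = 20e-6;                                 % reference pressure [Pa]
p_free = p0 * 10^(Lp_free/20);                  % Eq. 2.2
p_semi = p0 * 10^(Lp_semi/20);
fprintf('Free field: %.1f dB (%.2f mPa); semi-reverberant: %.1f dB (%.2f mPa)\n', ...
        Lp_free, 1e3*p_free, Lp_semi, 1e3*p_semi);
```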

2.3.2 Sensitivity of the microphone

The sensitivity of a microphone is the electrical response at its output pins to a given standard acoustic input [16]. This is expressed as the ratio of the electrical output to the input pressure. The standard reference input signal for microphone sensitivity measurements is a 1 kHz sine wave at 94 dB SPL, or 1 Pa. For analog microphones, the sensitivity is typically specified in units of dBV, that is, decibels with reference to 1.0 V rms. Analog microphone sensitivity can also be expressed in mV/Pa units:

$$S_{dBV} = 20\log_{10}\left(\frac{S_{mV/Pa}}{Out_{REF}}\right), \quad \text{where } Out_{REF} = 1000\ \mathrm{mV/Pa} \text{ is the reference output voltage} \qquad (2.5)$$

$$S_{mV/Pa} = Out_{REF} \cdot 10^{S_{dBV}/20} \qquad (2.6)$$

The microphones used for the prototypes have a sensitivity of -38 dB, so their sensitivity expressed in mV/Pa units can be derived:

$$S_{dBV} = -38\ \mathrm{dB} \quad\Rightarrow\quad S_{mV/Pa} = 1000\ \mathrm{mV/Pa} \cdot 10^{-38/20} = 12.59\ \mathrm{mV/Pa} \qquad (2.7)$$

Using equations 2.3 and 2.7 we can assert that, in the presence of a typical speech signal, the microphone capsule will produce an output signal whose amplitude will be:

$$V_{mic} = p_{speech} \cdot S_{mV/Pa} = 10.02 \times 10^{-3}\ \mathrm{Pa} \cdot 12.59\ \mathrm{mV/Pa} = 0.126\ \mathrm{mV} \qquad (2.8)$$
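The same chain of conversions can be written as a few MATLAB lines (a minimal sketch of Equations 2.5-2.8, using the values above).

```matlab
% From datasheet sensitivity to expected capsule output (Eq. 2.5-2.8).
S_dBV   = -38;                          % microphone sensitivity [dBV re 1 V/Pa]
Out_ref = 1000;                         % reference output [mV/Pa]
S_mVPa  = Out_ref * 10^(S_dBV/20);      % Eq. 2.6 -> about 12.59 mV/Pa

p_speech = 10.02e-3;                    % speech pressure at 2 m [Pa], from Eq. 2.3
V_mic    = p_speech * S_mVPa;           % Eq. 2.8 -> about 0.126 mV
fprintf('S = %.2f mV/Pa, Vmic = %.3f mV\n', S_mVPa, V_mic);
```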

2.3.3 ADC characteristics

The Analog-to-Digital Converter of the CC2511F32 MCU, whose block diagram is depicted in Fig. 2.3, is capable of converting an analog input into a digital representation with up to 12 bits of resolution. The ADC includes an analog multiplexer with up to six individually configurable channels and a reference voltage generator. All references to VDD apply to the voltage on the pin AVDD. The main features of the ADC are as follows:

• Selectable decimation rates, which also set the resolution (7 to 12 bits)
• Six individual input channels, single-ended or differential
• Reference voltage selectable as internal, external single-ended, external differential, or VDD
• Interrupt request generation

Time of conversion

The time required to perform a conversion, Tconv, depends on the selected decimation rate:

$$T_{conv} = (\text{decimation rate} + 16) \cdot \tau \qquad (2.9)$$

where τ = 0.25µs.

When, for instance, the decimation rate is set to 128, the decimation filter uses exactly 128 ADC clock periods to calculate the result. When a conversion is started, the input multiplexer is allowed 16 ADC clock periods to settle in case the channel has been changed since the previous conversion; this 16-cycle settling time applies to all decimation rates [12, p. 139].


Figure 2.3: CC2511 ADC Block diagram

Dec. rate   Resolution   Time of conversion   Max. allowed frequency
64          7 bits       16 µs                62.50 kHz
128         9 bits       32 µs                31.25 kHz
256         10 bits      64 µs                15.63 kHz
512         12 bits      128 µs               7.81 kHz

Table 2.2: ADC: Speed of Conversion vs. Decimation Rate. It is assumed that conversions are always performed on the same channel, so the multiplexer does not need to allocate clock periods for changing the input channel.

The times required to perform a conversion as a function of the decimation rate are summarized in Table 2.2. The third column shows the maximum sampling frequency allowed by the ADC decimation timings. It is worth considering the speed of conversion together with the ADC resolution, since it represents an important design choice: the higher the resolution, the lower the quantization error. On one hand, according to Table 2.2, low resolutions allow a high sampling frequency and thus a higher temporal resolution, with the drawback of a lower number of quantization levels; on the other hand, high resolutions reduce the quantization error granularity, which is an important factor in signal post-processing. ADC resolution and sampling frequency will be discussed in depth in sections 5.2.1 and 5.3.1.
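Equation 2.9 and the single-channel figures of Table 2.2 can be reproduced with a short MATLAB sketch; as in the table, the 16-cycle settling time is dropped because conversions are assumed to stay on the same channel.

```matlab
% ADC conversion time and maximum sampling frequency vs. decimation rate (Eq. 2.9).
tau      = 0.25e-6;                  % ADC clock period [s]
dec_rate = [64 128 256 512];         % selectable decimation rates
res_bits = [7 9 10 12];              % corresponding resolutions

Tconv_same_ch = dec_rate * tau;          % same channel: no 16-cycle settling (as in Table 2.2)
Tconv_new_ch  = (dec_rate + 16) * tau;   % Eq. 2.9, when the input channel has changed
fs_max        = 1 ./ Tconv_same_ch;      % maximum allowed sampling frequency [Hz]

for k = 1:numel(dec_rate)
    fprintf('dec %3d: %2d bits, Tconv %6.2f us, fs_max %6.2f kHz\n', ...
            dec_rate(k), res_bits(k), 1e6*Tconv_same_ch(k), 1e-3*fs_max(k));
end
```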

2.4 BigEar Receiver and Base station

The Base Station is the device that collects data from the sensors and arranges packets in order to produce a clear and intelligible speech signal. It consists of one Wixel connected via USB port to a machine that runs the application logic.

The only task of the Wixel (BigEar Receiver) is to act as a wireless-to-USB interface (and vice versa). It mainly receives radio packets from the sensors, transforms them into hexadecimal nibbles and sends them to the machine via the USB port. When the computer needs to send commands to the sensors (or to reply to protocol messages), the BigEar Receiver receives hexadecimal nibbles through the USB port, converts them into bytes and sends them using the radio module of the CC2511 SoC (Fig. 2.4).

Figure 2.4: BigEar Receiver Logic

The Wixel Programmable Module has been programmed to handle up to 256 different radio channels [6]. All of the sensors share the same channel used by the Wixel receiver, so if the network architecture requires channel separation (e.g. for reducing the number of collisions), a second Wixel connected to another USB port of the machine is needed.

The application logic (illustrated in Figure 2.5) that runs on the machine basically receives radio packets (Fig. 2.5a), each containing a set of buffered audio samples tagged with a timestamp and the sensor ID; for each sensor, the audio samples are arranged according to their timestamp. In this way, a coherent but incomplete stream is obtained for each sensor: coherent because the audio samples are in the right time position with respect to the sensor timestamp, incomplete because there is no guarantee that the transmission time is less than or at most equal to the sampling time. On the contrary, the procedures for elaborating, buffering and sending samples require a time overhead that does not allow a real-time transfer rate.

Once the samples have been arranged, the application performs a time delay-or-advance of the audio streams (Fig. 2.5b) coming from the sensors, in order to remove the delays caused by the different distances between the mouth of the user and the sensors. In-phase audio contributions are thus obtained; they can be summed together in order to produce a seamless stream (Fig. 2.5c).

During the alignment process it is necessary to take into account the different energy contributions of the sensors: the closer the sensor is to the user, the bigger the signal amplitude, and vice versa. Moreover, the sum of the contributions is weighted in order to prevent amplitude distortions in the resulting signal.
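The arrangement, alignment and weighted superposition described here can be illustrated with a minimal MATLAB sketch; the variable names, the example delays and the simple energy weighting are illustrative only and do not correspond to the actual BigEar Reconstructor code (described in Chapter 4).

```matlab
% Minimal sketch: arrange packets by timestamp, align streams, superpose.
% Assumed input: a struct array 'packets' with fields id, t0 (first-sample
% timestamp expressed in samples) and data (column vector of audio samples).
Fs = 8000; N = 3; L = 2*Fs;                 % three sensors, 2 s of audio (example values)
streams = zeros(L, N);                      % one coherent-but-incomplete stream per sensor
for k = 1:numel(packets)
    idx = packets(k).t0 + (1:numel(packets(k).data));
    streams(idx, packets(k).id) = packets(k).data;   % samples placed at their timestamps
end

delays = [0 12 35];                         % per-sensor alignment delays in samples (illustrative)
gains  = 1 ./ max(sqrt(mean(streams.^2)), eps);      % crude energy compensation per stream
aligned = zeros(L, N);
for s = 1:N
    aligned(:, s) = gains(s) * circshift(streams(:, s), -delays(s));
end
reconstructed = sum(aligned, 2) / N;        % weighted contributions merged into one stream
```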

2.5 Network Protocols

Network protocols have a big impact on the efficiency of the whole system: collisions, granularity of the network of sensors and presence of protocol messages can affect the number of audio packets transmitted and received successfully.


Figure 2.5: BigEar Application Reconstruction Logic. (a) Data received from the sensors cannot be summed directly; (b) audio packets are arranged and aligned; (c) audio packets are summed into a unique stream.


At the same time, the number of service messages for implementing synchronization mechanisms has to be increased in order to reduce the number of collisions. This can be done at the expense of channel availability and software complexity. The CC2511 offers neither Carrier Sense nor Collision Detection, so the scenario can be seen as a pure broadcast domain. Moreover, each Wixel receives everything that is being transmitted on its channel, and each transmission reaches every Wixel that is listening on the same radio channel.

2.5.1 ALOHA Protocol

The simplest protocol that can be adopted in this scenario is Aloha [1]; this name refers to a simple communications scheme in which each source (transmitter) in a network sends data whenever there is a frame to send. Aloha also specifies that if the frame successfully reaches the destination (receiver), the next frame is sent; otherwise, if the frame fails to be received at the destination, it is sent again. Since audio data are time-dependent, for the purposes of the BigEar application it is worthless to retransmit audio packets, so the transmitter-side application will not wait for any acknowledgment from the base station.

Advantages:

• It doesn't need any protocol or service message.
• It adapts to a varying number of stations.

Disadvantages:

• Theoretically proven maximum throughput of 18.4% (throughput: number of packets successfully transmitted, i.e. without collision, per unit time, where the unit time is the fixed transmission time of a single packet).

By making a small restriction in the transmission freedom of the individual stations, the throughput of the Aloha protocol can be doubled. Assuming constant length packets, transmission time is broken into slots equivalent to the transmission time of a single packet. Stations are only allowed to transmit at slot boundaries. When packets collide they will overlap completely instead of partially. This has the effect of doubling the efficiency of the Aloha protocol and has come to be known as Slotted Aloha [20] (Fig. 2.6).
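For reference, these figures follow from the classical (textbook) throughput expressions, which are not derived in this work; with G the average number of packets offered per packet transmission time:

$$S_{\mathrm{pure}} = G\,e^{-2G}, \qquad \max_{G=0.5} S_{\mathrm{pure}} = \frac{1}{2e} \approx 0.184$$

$$S_{\mathrm{slotted}} = G\,e^{-G}, \qquad \max_{G=1} S_{\mathrm{slotted}} = \frac{1}{e} \approx 0.368$$

The slotted variant thus doubles the achievable throughput, consistently with the 18.4% maximum quoted above.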

The BigEar application has been tested using the pure ALOHA protocol (without acknowledgment) in order to examine the system capabilities with the simplest communication protocol.

Figure 2.6: Aloha vs. Slotted Aloha throughput

3 BigEar Modeling and Simulation

Introduction

In this chapter the methods and structures used for simulating the entire architecture within its environment are described. The dissertation starts by illustrating the physical model of the propagation of sound within a room and the tools used for simulating it, then it continues by describing a model that simulates the behavior of the network of sensors arranged in the room.

3.1 Overview

Figure 3.1 shows a schematic representation of the BigEar simulator. It is composed of four interconnected modules, at the end of which the reconstructed signal can be obtained.

The Audio model block consists of the simulation of the environment: the room dimensions (and optionally other parameters such as the reflection and diffraction coefficients of the walls) and the locations of the sensors and of the audio source are set. Then an audio file is provided as input. The block produces as many audio streams as there are sensors. Each stream differs from the others in terms of amplitude, delay and diffusion, which change depending on the acoustic characteristics of the room and the positions of the sensor and the source.

The second block, Radio network model, simulates the behavior of the transmitters according to a chosen network protocol. In the simplest case of the ALOHA protocol (see 2.5.1), it simulates the random transmission instants of each transmitter and tags each packet as received or collided. This kind of information is important because it affects the behavior of the following block, which models the internal buffer structure of the sensors.

The N-buffer model block implements the buffering system internal to each transmitter. In the real world, data are continuously sampled and buffered by each transmitter so as to be ready to send when needed; during the simulation, instead, the time instants at which transmission occurs are known, but the buffering structures need to be modeled in order to know which data are ready to be transmitted. This block outputs the audio samples packed as if they were coming from real transmitters:


Figure 3.1: Architecture model

each packet contains the ID of the transmitter, the timestamp of the first sample of the packet (the timestamps of the other samples can be inferred as $\tau_i = i \cdot \frac{1}{F_s}$, where i is the 0-based index of the i-th sample in the packet) and a number of samples that corresponds to the frame size of each buffer.
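A possible MATLAB representation of such a packet, together with the per-sample timestamps derived from the first one, is sketched below (field names and values are illustrative, not the simulator's actual data structures).

```matlab
% Illustrative packet produced by the N-buffer model (field names are not
% necessarily the simulator's actual ones).
Fs         = 8000;                      % sampling frequency [Hz]
frame_size = 16;                        % samples per buffer frame (example value)

pkt.id        = 3;                      % transmitter ID
pkt.timestamp = 0.125;                  % time of the first sample [s]
pkt.samples   = randn(frame_size, 1);   % buffered audio samples (placeholder data)

% Timestamps of the other samples: tau_i = i / Fs, i = 0 .. frame_size-1
t = pkt.timestamp + (0:frame_size-1)' / Fs;
```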

The fourth block has been reported here for completeness but it will be discussed in Chapter 4. It concerns the reconstruction of the source signal starting from the output of the previous block.

3.2 Audio Model

In order to create an audio model of the typical use case, some assumptions have to be made about the typical environment where the system will work. The assumed room model is a rectangular enclosure, a three-dimensional space bounded by six surfaces (walls, ceiling and floor); in section 3.2.2 this type of room will also be called an "empty shoebox". Each room surface has its own absorption and scattering (diffusion) coefficients. The scattering coefficients set the ratio, as a function of frequency, between the amount of specular and diffuse reflection. Sound scattering due to furniture and other objects in the room can be approximated by higher levels of overall room diffuseness. This type of room has been chosen because it is the most common room shape, and also because of its ease of modeling and computability.

Other assumptions concern the directivity of the source and of the sensors. It is assumed that the directivity of the source is negligible, because in small-medium rooms the reflections give a big contribution to the diffusion of the sound. Small microphone capsules have an omnidirectional polar pattern, so directionality factors can be neglected.


Figure 3.2: Geometry of wave propagation from a point source $x_1$ to a listening point $x_2$.

3.2.1 Physical Model

A point source produces a spherical wave in an ideal isotropic (uniform) medium such as air [24]. To a good first approximation, wave energy is conserved as it propagates through the air. In a spherical pressure wave of radius r, the energy of the wavefront is spread out over the spherical surface area $4\pi r^2$. Therefore, the energy per unit area of an expanding spherical pressure wave decreases as $1/r^2$. This is called spherical spreading loss. Since energy is proportional to amplitude squared, an inverse square law for energy translates to a 1/r decay law for amplitude.

The sound-pressure amplitude of a traveling wave is proportional to the square root of its energy per unit area. Therefore, in a spherical traveling wave, the acoustic amplitude is proportional to 1/r, where r is the radius of the sphere. In terms of Cartesian coordinates, the amplitude $p(\mathbf{x}_2)$ at the point $\mathbf{x}_2 = (x_2, y_2, z_2)$ due to a point source located at $\mathbf{x}_1 = (x_1, y_1, z_1)$ is given by

$$p(\mathbf{x}_2) = \frac{p_1}{r_{12}}$$

where $p_1$ is defined as the pressure amplitude one radial unit away from the point source located at $\mathbf{x} = \mathbf{x}_1$, and $r_{12}$ denotes the distance from the point $\mathbf{x}_1$ to $\mathbf{x}_2$:

$$r_{12} \triangleq \|\mathbf{x}_2 - \mathbf{x}_1\| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

This geometry is depicted for the 2D case in Fig. 3.2.

In summary, every point of a radiating sound source emits spherical traveling waves in all directions which decay as 1/r , where r is the distance from the source. The amplitude-decay by 1/r can be considered a consequence of energy conservation for propagating waves. (The energy spreads out over the surface of an expanding sphere.) We can visualize such waves as “rays” emanating from the source, and we can simulate them as a delay line along with a 1/r scaling coefficient (see Fig. 3.3).

Figure 3.3: Point-to-point spherical wave simulator. In addition to the propagation delay, there is an attenuation by $g = 1/r$.
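A minimal MATLAB rendering of the scheme in Fig. 3.3, assuming a sampling frequency Fs and a source-listener distance r (the values below are illustrative):

```matlab
% Point-to-point spherical wave: propagation delay of M samples and 1/r gain.
Fs = 8000;                       % sampling frequency [Hz]
c  = 343;                        % speed of sound [m/s]
r  = 2;                          % source-listener distance [m]

M = round(r / c * Fs);           % integer propagation delay z^-M
g = 1 / r;                       % spherical spreading loss
x = randn(Fs, 1);                % 1 s of test signal (placeholder)
y = g * [zeros(M, 1); x];        % delayed and attenuated copy at the listening point
```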


Figure 3.4: The law of reflection states that the angle of incidence $\theta_i$ of a wave or stream of particles reflecting from a boundary, conventionally measured from the normal to the interface (not the surface itself), is equal to the angle of reflection $\theta_r$, measured from the same interface.

Figure 3.5: Geometry of an acoustic reflection caused by "multipath" propagation. A direct signal and a "floor bounce" are received from the source S at the listening point L.

Reflection of Sound Waves

When a spreading spherical wave reaches a wall or other obstacle, it is either reflected or scattered. A wavefront is reflected when it impinges on a surface which is flat over at least a few wavelengths in each direction (to determine whether a surface is effectively flat, it may first be smoothed so that variations smaller than a wavelength are ignored). Reflected wavefronts can be easily mapped using ray tracing, i.e., the reflected ray leaves at an angle to the surface equal to the angle of incidence (see Fig. 3.4). Wavefront reflection is also called specular reflection (especially when considering light waves).

A wave is scattered when it encounters a surface which has variations on the scale of the spatial wavelength. A scattering reflection is also called a diffuse reflection.

The distinction between specular and diffuse reflections is dependent on frequency. Since sound travels approximately 0.34 m per millisecond, a cube 0.34 m on each side will "specularly reflect" directed "beams" of sound energy above 1 kHz, and will "diffuse" or scatter sound energy below 1 kHz.


Room Acoustics

The pressure wave that reaches the listening point has to be considered as the superposition of the direct sound and the reflected waves due to “multipath” wave propagation.

Each contribution will have a different delay and amplitude, since these quantities depend on the distance traveled by the wave. In Fig. 3.5 a simple example is depicted, considering only the direct wave and one reflection. The acoustic source is denoted by 'S', the listener by 'L', and they are at the same height h meters above a reflecting surface. The direct path is d meters long, while the length of the single reflection path is 2r meters. These quantities are of course related by the Pythagorean theorem:

$$r^2 = h^2 + \left(\frac{d}{2}\right)^2$$

The DIRECT SOUND reaches the listener delayed by $\Delta_d = d/c$ [s], with c the speed of propagation, and attenuated by a gain factor $g_d = 1/d$, due to the SL = d distance and the spherical spreading loss described at the beginning of this section.

The FIRST REFLECTION reaches the listener delayed by $\Delta_r = 2r/c$ [s], with $r = \sqrt{h^2 + (d/2)^2}$, and attenuated by a gain factor $g_r = 1/(2r)$.

This process should be iterated for every reflection, obtaining a set of delay coefficients ∆i and gain coefficients gi. Every (∆i, gi) couple represents a delayed and attenuated version of the original pressure wave.

In the digital domain, after converting the pressure wave into an electric signal and sampling it at a certain frequency Fs, the pressure wave is rendered by a sequence of numbers x(n). The (∆i, gi) contributions can be condensed into a finite impulse response sequence h(n) in which the i-th stem represents the i-th (∆i, gi) couple: the height of the stem is given by gi, and its position along the time axis is given by ∆i. According to signal theory, in order to obtain the resultant sound perceived by the sensor at position L we convolve the source signal x(n) with the room impulse response h(n):

\[
y(n) = x(n) \otimes h(n) = \sum_{k=-\infty}^{+\infty} x(k)\, h(n-k) \tag{3.1}
\]
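The following sketch (Python/NumPy; names and numeric values are illustrative assumptions, not the thesis code) builds h(n) from the two (∆i, gi) couples of Fig. 3.5 and applies Eq. (3.1) to a source signal.

```python
import numpy as np

# Illustrative sketch: direct sound + single floor bounce (geometry of Fig. 3.5).
Fs = 16000               # sampling frequency [Hz] (assumed)
c = 343.0                # speed of sound [m/s]
h_geom, d = 1.0, 2.0     # source/listener height and distance [m] (assumed)

r = np.sqrt(h_geom**2 + (d / 2)**2)   # half-length of the reflected path
delays = [d / c, 2 * r / c]           # Delta_d, Delta_r [s]
gains = [1.0 / d, 1.0 / (2 * r)]      # g_d, g_r

# Build the FIR room impulse response h(n): one scaled stem per (Delta_i, g_i).
n_taps = int(np.ceil(max(delays) * Fs)) + 1
h = np.zeros(n_taps)
for delta, g in zip(delays, gains):
    h[int(round(delta * Fs))] += g    # nearest-sample placement of each stem

# Source signal x(n): here simply a short burst of white noise.
x = np.random.randn(Fs // 10)

# Eq. (3.1): the received signal is the convolution of x(n) with h(n).
y = np.convolve(x, h)
```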

Calculations for the first-order reflections can easily be done, but the resulting impulse response will be very poor, since it contains only 7 contributions: one for the direct sound and 6 for the first-order reflections of the sound against the walls. In order to obtain a richer model, the image source method has to be adopted. In the image source method the sound source is reflected at each surface to produce image sources which represent the corresponding reflection paths. The next section will explain the operating principles of this method, whose representation is illustrated in Fig. 4.11.

3.2.2 MCROOMSIM toolbox

Multichannel Room Acoustics Simulator (MCROOMSIM) [30] is a simulator which can simulate recordings of arbitrary microphone arrays within an echoic shoebox room. The software is written in the C programming language and is freely available from the authors.

²To determine whether a surface is effectively flat, it may first be smoothed so that variations less than a wavelength in size are removed.


Figure (image source construction): (a) 0th-order image (direct sound); (b) 1st-order images (reflections of S over walls a and b); (c) 2nd-order images (reflections of Sa over b and Sb over a); (d) sum of the 0th-, 1st- and 2nd-order reflections.



MCROOMSIM simulates both specular and diffuse reflections in an empty shoebox environment. Effects of air absorption and distance attenuation on sound are also modeled by the simulator. The empty shoebox room geometry simplifies the calculations needed during simulation, which, in addition to the simulator being implemented in C, enables a significant reduction in the computation time of room impulse responses, resulting in a fast simulation package. The simplification of the room geometry to a shoebox model is suitable when, as in this case, the importance lies in creating realistic room impulse responses for evaluating three-dimensional audio applications in echoic environments, rather than in modeling the architectural complexities of the room.

MCROOMSIM is designed to interface with MATLAB. The simulator receives all setup data from MATLAB. Once the simulation has completed, MCROOMSIM returns the simulated room impulse responses for all configured source and receiver combinations back to the MATLAB workspace. A set of high-level MATLAB functions is provided to configure the simulator, making it easier for the user to set up the room, sources and receivers with specific properties, as well as to configure specific properties of the simulation routines. Figure 3.7 illustrates the dataflow of the toolbox.

Room simulation method

Impulse responses are simulated between each source and receiver combination configured in the echoic room. MCROOMSIM uses two geometrical acoustic simulation algorithms to generate the impulse response between a source and a receiver, with each algorithm simulating a different part of the reverberant sound field. The main or specular reflections are simulated using the image source algorithm [4], and the diffuse sound field is simulated using the diffuse rain algorithm [11, 21]. The outputs of both algorithms are combined to form the final response.

Image Source algorithm
The image source algorithm provides accurate direction and timing of the main reflections. The algorithm starts with the zeroth-order sound, which corresponds to the direct sound. For first-order reflections, the sound source is mirrored in the room's surfaces to create a series of virtual sources, which are then traced back to the receiver along with the inclusion of distance attenuation and air absorption. These virtual sources are then expanded to create more virtual sources of higher order. By knowing the distance of the virtual source from the receiver and the absorption of the surfaces that it reflects off, the contribution of the corresponding sound propagation path to the room impulse response can be calculated exactly. This contribution and the receiver's directional response are then convolved, the output being the room response to that propagation path as 'recorded' by the receiver. The image source algorithm continues to expand the order of virtual sources up to a maximum order set by the user, or ceases when the energy of all virtual sources of the same order drops below a predefined threshold.

Diffuse rain algorithm
The diffuse rain algorithm is a fast stochastic ray tracing method. It aims to provide a good approximation of the diffuse sound field in the room. This simulation technique models the propagation of sound from the source to the receiver by a series of discrete rays that are traced around the room.
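To make the expansion concrete, the sketch below (Python/NumPy; a simplified illustration with assumed room data, frequency-independent walls and no air absorption, not MCROOMSIM's code) generates the direct sound plus the six first-order image sources of a shoebox room and converts each into a delay/gain couple.

```python
import numpy as np

# Illustrative first-order image-source expansion for a shoebox room.
c = 343.0
room = np.array([5.0, 4.0, 3.0])   # room dimensions Lx, Ly, Lz [m] (assumed)
src = np.array([1.0, 2.0, 1.5])    # source position (assumed)
rcv = np.array([4.0, 1.0, 1.2])    # receiver position (assumed)
beta = 0.9                         # wall reflection coefficient, all walls (assumed)

images = [(src, 1.0)]              # 0th order: the direct sound
for axis in range(3):              # mirror the source in each pair of walls
    for wall in (0.0, room[axis]):
        img = src.copy()
        img[axis] = 2 * wall - src[axis]   # reflect source across the wall plane
        images.append((img, beta))         # one bounce -> one reflection loss

# Each image contributes a delayed, attenuated stem to the impulse response.
for pos, refl in images:
    dist = np.linalg.norm(pos - rcv)
    delay, gain = dist / c, refl / dist    # Delta_i [s], g_i (spreading * wall loss)
    print(f"delay = {delay*1e3:6.2f} ms, gain = {gain:.3f}")
```

Higher orders are obtained by mirroring the image sources themselves, exactly as described above.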


The setup data passed from MATLAB to MCROOMSIM are organized into four groups:

1. Room setup: the size, frequency-dependent absorption and scattering coefficients of the walls/ceilings, room temperature and humidity.

2. General simulation options: here the user is provided with an opportunity to configure various features of the simulator, such as the maximum order that the image source algorithm will iterate up to, the minimum energy threshold for virtual image sources or rays, etc.

3. Receiver setup: the number of receivers to simulate, their locations, orientation, number of channels and directivity. For receivers with custom directivity, a list of directional gains or impulse responses along with the matching direction list is also included here.

4. Source setup: same as the receiver setup, with the limitation that sources can only be single channel.

Once the MCROOMSIM function is invoked, MATLAB combines all of the data, including all user-defined directional responses (if provided), into a single structure which it then passes to MCROOMSIM. Once the simulation has completed, all of the room impulse response data are provided as output to the MATLAB workspace. The output of MCROOMSIM is a time-domain room impulse response for each source and receiver combination. In the case of a multichannel receiver array, a separate response is provided for each channel.

Flow Of Software

Once the input configuration data is received by MCROOMSIM, the simulator first performs a validity check on the parameters to ensure that the configuration is not erroneous. Once this has completed successfully, the simulator uses these parameters to configure itself. All operations are performed in the frequency domain, which improves the efficiency of convolutions. Therefore, if a source or receiver has custom directivity defined with directional impulse responses, MCROOMSIM transforms these responses into the frequency domain. Next the simulator estimates the maximum possible length for the room impulse response by calculating the longest time for the virtual sources to decay below the energy threshold, as set in the general simulation options. This is performed so that simulation memory can be pre-allocated.
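The efficiency gain of frequency-domain processing comes from replacing time-domain convolution with a pointwise product of spectra; a minimal sketch of this standard fast-convolution idea (Python/NumPy, not MCROOMSIM's internals):

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via the FFT: zero-pad both signals to the full
    output length, multiply the spectra, and transform back."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()              # next power of two for the FFT
    Y = np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft)
    return np.fft.irfft(Y, nfft)[:n]
```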

MCROOMSIM can be configured to run in one of three ways: specular-only simulation, diffuse-only simulation, or both. The simulation algorithms are then executed to generate the room impulse responses between the defined sources and receivers. Each algorithm outputs its results in the time domain, so there is no need for transformation. If both algorithms have been executed, the outputs of each are combined to create the overall room impulse responses. Upon completion, the resulting simulated room impulse responses are provided as output to the MATLAB workspace. Fig. 3.7 highlights the flow of the software from configuration to the end of simulation.

Example Applications of the Software

Realistic in-room microphone array simulation
Microphone arrays are used in many three-dimensional audio applications. Some examples include spatial sound reproduction [2], and measurement and analysis of directional properties of reverberant sound fields using beamforming techniques [8] or plane wave decomposition [9]. In all of these situations MCROOMSIM can be used to generate realistic three-dimensional room impulse responses for the microphone array under test. All that is needed are the anechoic directional impulse responses of the microphone array and the properties of the room under simulation, i.e. size, location of source(s), location of the microphone array, etc. Moreover, sources with specific directivity patterns can also be simulated in the system, enabling the user to compare the effects of various sources on the reverberant sound field.


3D binaural rendering of echoic sound scenes
MCROOMSIM can be used to directly simulate realistic Binaural Room Impulse Responses (BRIRs). This is done by configuring the simulation software with a pair of sensors whose directional impulse responses are anechoic head related impulse responses (HRIRs), measured for a number of directions in the free field for the particular user. This is useful for quickly generating 3D BRIRs for many environments, without the need to perform time-consuming recordings within each environment. From this the user could then, for instance, listen to the 3D reverberant sound fields created by sources with different directivities.
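Once a left/right BRIR pair has been simulated, binaural rendering of an anechoic source amounts to two convolutions; a minimal sketch (Python/NumPy, with assumed array contents):

```python
import numpy as np

def render_binaural(x, brir_left, brir_right):
    """Convolve a mono (anechoic) source with a simulated BRIR pair and
    return a two-column stereo signal for headphone playback."""
    left = np.convolve(x, brir_left)
    right = np.convolve(x, brir_right)
    return np.stack([left, right], axis=1)
```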

Figure 3.7: MCROOMSIM data flow from the configuration stage to the end of simulation.

