IACMA – International Advanced Course on Musical Acoustics Bologna, Italy, July 18-22, 2005

ROOM ACOUSTICS MEASUREMENTS AND AURALIZATION

Lamberto Tronchin (1), Angelo Farina (2)

(1) DIENCA - CIARM, University of Bologna Viale Risorgimento, 2 - 40136 Bologna, Italy

Email: lamberto.tronchin@unibo.it URL: www.ciarm.ing.unibo.it

(2) Industrial Engineering Dept., University of Parma Via delle Scienze 181/A, 43100 Parma, Italy

Email: angelo.farina@unipr.it URL: www.angelofarina.it

Abstract

This document is divided into two parts: the objective assessment of the acoustical quality of a room, based on measurements, and the perceptual assessment of the musical listening experience in a room, based on the auralization technique.

The basic quantity measured inside a room is its Impulse Response. Therefore, the concept of Impulse Response is explained, and its limits clearly stated. The best techniques for measuring Impulse Responses in typical rooms employed for musical performances are then shown, together with how the concept of Impulse Response can also be applied to evaluating recording and reproduction rooms and the equipment employed for the electroacoustical delivery of live or recorded music. Similarly, the electroacoustical devices employed in room acoustic measurements (sound sources, microphones, digital playback/recording equipment, software) are surveyed.

Finally, the usage of measured (or computer-simulated) impulse responses for performing listening tests through the auralization technique is presented, in its various technical embodiments. The realization of a virtual listening room is presented, and the pros and cons of each recording/reproduction approach are comparatively evaluated.


1 INTRODUCTION TO DIGITAL SOUND PROCESSING

Whether employing advanced measurement techniques or performing listening tests with modern digital recording/reproduction equipment, it is important to have a solid and simple grasp of the basic technology which makes it possible to process the sound signal digitally.

Although this knowledge is nowadays widespread, and people have been used to digital sound and music since childhood thanks to technologies such as CD players, GSM cellular phones and MP3 music players, it is advisable to present here a very quick and basic explanation of the processing.

This chapter also aims to explain the internal workings of some devices which will later be employed during measurements and auralization, such as microphones, analog-to-digital converters, etc. The digital signal processing section contains very basic, but up-to-date, information about the manipulation of digital audio on modern platforms.

1.1 Nature of the sound field

Sound is a complex thermo-fluid-dynamic phenomenon occurring in fluids and solids, which involves motion of the “particles” around their steady position (hence the concept of “particle velocity”) and fluctuation of the density and pressure of the medium (hence the concept of “acoustical pressure”, which is the difference between the absolute pressure and the long-term average pressure of the unperturbed medium).

Usually the human body is immersed in air, and sound is perceived by the human being as an air-transmitted stimulus. Various parts of the human body are sensitive to the acoustic field (ears, skin, chest, stomach, etc.), and the human sensory system can detect both the particle velocity and the acoustical pressure. It must be noted that the acoustical pressure is a scalar quantity, and does not involve any directional information, whilst the particle velocity is a vector, and carries the information of the direction of propagation of the sound.

Although in very simple cases there exist analytical formulas relating particle velocity and acoustical pressure, in most real-world cases no such simple relationship holds. Most acoustics textbooks explore satisfactorily only these very simple cases, and leave the impression that complex, real-world cases can be explained as superpositions of these basic ones. Although this is in general true for the acoustical pressure field, it is not the case for the particle velocity field, and, more importantly, for the relationship between acoustical pressure and particle velocity.

This relationship can be expressed in two ways:

o The product of the acoustical pressure p and the particle velocity v is the sound intensity i:

      i = p · v                                                          (1)

o The ratio of the acoustical pressure p to the particle velocity v is the impedance z:

      z = p / v                                                          (2)

Abandoning the usual limitations encountered in acoustics textbooks (steady-state periodic signals, etc.), both the instantaneous intensity i(τ) and the impedance ratio z(τ) are time-varying signals, and only in very particular cases do these quantities have constant values and simple mathematical expressions. In most textbooks, the time average of the instantaneous intensity, named I, is considered constant, and similarly the average value of the impedance ratio, named Z, is considered constant. For the proper handling of real-world cases, none of the above assumptions will be required here. Instead, we can consider that, in general, acoustical pressure and particle velocity signals are completely unrelated, independent physical quantities, and that faithful recording and reproduction require capturing and recreating both of them independently.
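The time-varying nature of intensity and impedance can be sketched numerically. The following Python fragment is an illustrative aside (the signals and sampling rate are assumptions, not part of the original notes): it contrasts a plane progressive wave, where pressure and velocity are in phase and the time-averaged intensity is positive, with a purely reactive field, where a 90° shift makes the average intensity vanish.

```python
# Sketch (assumed signals): the instantaneous intensity i = p * v is itself
# a signal, and only in special cases is its average a simple constant.
import math

fs = 48000                          # sampling rate in Hz (assumption)
t = [n / fs for n in range(480)]    # 10 ms of signal

# A plane progressive wave: p and v in phase, so z = p / v = rho*c constant.
rho_c = 400.0                       # characteristic impedance of air, ~400 rayl
p_plane = [math.sin(2 * math.pi * 1000 * ti) for ti in t]
v_plane = [p / rho_c for p in p_plane]

# A "reactive" field: velocity shifted 90 degrees with respect to pressure.
v_react = [math.cos(2 * math.pi * 1000 * ti) / rho_c for ti in t]

i_plane = [p * v for p, v in zip(p_plane, v_plane)]   # always >= 0
i_react = [p * v for p, v in zip(p_plane, v_react)]   # oscillates around zero

avg = lambda x: sum(x) / len(x)
print(avg(i_plane), avg(i_react))   # net intensity only in the active field
```

The same two signals give a constant z in the first case and a wildly varying ratio in the second, which is exactly why no single "impedance number" describes a general sound field.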

1.2 From signals to numbers

The conversion from the physical quantities known as sound pressure and particle velocity to a completely numerical description of them is obtained by a chain of successive devices.

1.2.1 Microphones

The first stage is the physical “transducer”: a microphone is a transducer transforming an acoustical quantity into an electrical signal. As already noted, in air the physical quantities are generally two (sound pressure and particle velocity), whilst the electrical signal can be a voltage (volts), a current (amperes) or a charge (coulombs).

Regarding the acoustical quantities, we basically have “pressure microphones”, which transduce the acoustical pressure into a corresponding proportional electrical quantity, and “velocity microphones”, which similarly transduce the particle velocity (or, more precisely, the Cartesian component of the particle velocity along a well-defined axis) into a corresponding proportional electrical signal. Some microphones, however, are “hybrid”, as they react both to acoustical pressure and to particle velocity, with a varying “mix” of sensitivity to these two physical quantities. This usually translates into a different directivity pattern of the mike, as shown in the following table:

Name              Sound Pressure Sensitivity   Particle Velocity Sensitivity
Omnidirectional   100 %                        0 %
Subcardioid       75 %                         25 %
Cardioid          50 %                         50 %
Hypercardioid     25 %                         75 %
Figure-of-Eight   0 %                          100 %

Some microphones are actually built employing a dual-diaphragm assembly: this makes it possible to vary the “mix” of pressure and velocity sensitivity by acting on an electrical control device, usually a knob or a rotary dial, which enables the user to vary the directivity pattern of the microphone. The following figure shows one of these variable-directivity microphones, manufactured by Neumann; similar devices are also built by competitors such as Schoeps, Sennheiser, etc.



Figure 1 – Neumann U89i variable-pattern microphone

The electrical signal (voltage, current or charge) is always conditioned and converted into a true voltage by means of electronic circuitry, nowadays usually embedded inside the microphone body. So, in practice, the electrical signal always leaves the microphone body in the form of a voltage signal, and consequently the absolute sensitivity of the mike can be expressed in mV/Pa (millivolts per pascal) in the case of a purely-pressure microphone, or in mV/(m/s) (millivolts per metre per second) in the case of a purely-velocity microphone. In practice, however, as most people are not trained to think in terms of particle velocity, also for velocity microphones (or hybrid ones) the sensitivity is usually, and improperly, expressed in mV/Pa, assuming the special case of a plane, progressive wave (one of those special, simple cases always employed in textbooks), for which the ratio Z between acoustical pressure and particle velocity is constant and equal to ρ·c, which in air is approximately 400 rayl (the unit of impedance, kg/(m²·s)). So, for example, the sensitivity of the Neumann U89i is declared as 8.0 mV/Pa, even when the unit is operated as a pure figure-of-eight mike, in which case the “real” sensitivity should be expressed as 3200 mV/(m/s).

1.2.2 Cables

Now we have a Voltage signal, which can be manipulated in various ways.

The signal can be amplified, so that an amplitude of a few millivolts is boosted to several volts, making it possible to transmit it over long cables.

However, it is not common to see a significant voltage boost directly at the microphone assembly. Instead, some sort of cabling always interconnects the mike with the preamplifier.

A trick often employed for transmitting a weak signal over long cables without too much noise contamination is to use a “balanced” connection, which means employing a double signal cable surrounded by a ground screen. The signal is sent on both of the internal wires, but on one of them it is “positive” or “hot” (a pressure above the average air pressure is transferred into a positive-voltage electrical signal), whilst on the other wire it is polarity-reversed, also called “negative” or “cold”. At the end of the cable, the preamplifier or the recording device will extract a signal given by the difference between the voltages detected across the pair of wires, and this simple fact rejects any contaminating noise which could have entered along the long cable.

Figure 2 – Balanced audio cables with XLR connectors (3 pins)

Most professional microphones and preamplifiers/recording devices are equipped with balanced connections, whilst low-end consumer systems, or high-end audiophile systems, are usually “unbalanced”, and hence very sensitive to the quality of the cabling. For consumer systems this choice is made to reduce cost (truly balanced input stages can cost 10 to 20 USD per channel, which is considered too expensive for a cheap consumer device). For audiophile devices (which, costing thousands of dollars, should have no problem accommodating the cost of high-quality balanced input stages), the choice of unbalanced connections is made just to make room for “audiophile cables”, which again cost thousands of dollars, and would be absolutely useless if balanced connections were employed for interconnecting all the equipment.

1.2.3 Preamplifier

We are now at the fourth stage of the recording chain: after what happens in the air, after the microphone, and after the cable, we usually find a preamplifier. In theory, the unique goal of a preamplifier is to boost the voltage generated by the microphone, bringing it to a level appropriate for the following equipment.

In practice, however, the preamplifier very often also contains additional processing, which can either be linear (such as frequency-band limitation obtained by high-pass and low-pass filtering sections, often with switchable frequency limits) or non-linear, such as compression, soft-limiting, automatic gain control, squelch, etc.

The presence of a band-limiting function is inherently required by the electrical connection of microphones which require “phantom” power supply, obtained by means of a DC offset of both the “hot” and “cold” wires referred to ground. This DC offset is typically set to 48 V for most microphones, and is employed for powering the electronics embedded inside the microphone body. The DC component needs to be decoupled from the audio signal, and usually a capacitor is employed for this purpose, resulting in a gentle high-pass filter with a -3 dB cutoff typically around 4 to 10 Hz. This is of course of no concern for signals to be listened to by humans, who are substantially deaf to sounds having frequencies below 16 Hz.

Regarding the high-frequency limit, modern equipment is usually configured for a very wide frequency response, well in excess of 40 kHz, even if there is no proof that humans can hear anything above 20 kHz. It is thought, however, that the capability to handle very high frequencies makes the system more “transparent” to sharp attacks or impulsive sounds. The reality, instead, is that the capability to follow such sharp transients is not given by the extension of the pass-band, but by the temporal response of the low-pass filter: the smoother and softer the frequency response of the filter, the shorter its response in the time domain, and consequently the more promptly the system will react, without unwanted “ringing” in the time domain. This is typically obtained with a soft low-pass filter, starting to roll off just above 20 kHz (typically around 24 kHz), but with a gentle slope, falling to -100 dB only above 40 kHz. The optimal shape of such a low-pass filter has been studied theoretically in [1], and is called an “apodising” filter.

Figure 3 – 2-channels microphone preamplifier

The presence of non-linear sections in a microphone preamplifier is usually dictated by the wish not to worry too much about the accurate regulation of the preamplifier gain, making sure that, even if an unexpectedly loud sound arrives, the electronics will not be driven into clipping, causing nasty artifacts in the sound. This is a serious reason for employing such non-linear processors; but they can also be the cause of very dangerous measurement errors when the audio recording chain is employed for room acoustics measurements. It is therefore recommended that, when the preamplifier is employed for measurements (as opposed to recording live music), any non-linear section of the preamplifier be systematically excluded.

The following chapters will discuss in great detail the kinds of artifacts which arise when any part of the measurement chain exhibits non-linear behaviour.

1.2.4 ADC (Analog to Digital Converter)

And finally, the signal can be converted from “analog” to “digital” by means of a specific device (or a chip embedded in the preamplifier) called an ADC (analog-to-digital converter).

The ADC is a “black box” conceptually connected through just two wires.

One wire is the input, bringing in the analog voltage signal, with a maximum allowed range of some Volts (positive and negative). The second wire is the output, carrying out the digital information as a serial digital interface (bit after bit).

The internal operation of an ADC can be quite complex, but from the outside we see basically just two types of converters, differentiated by the format of the output digital signal: PCM (Pulse Code Modulation) converters and Bitstream converters (also called DSD, Direct Stream Digital, or simply single-bit).

The two techniques have some points in common (and, internally, either type of converter can make use of the same components and processing): the two relevant quantities of a multibit (PCM) converter are the sampling rate and the bit depth, whilst for a Bitstream converter the number of bits is fixed to 1 and the sampling frequency can vary, being systematically much higher than the one employed in PCM converters.

Let’s start from the PCM converters, which are by far the more widely employed. A master clock defines with high precision the instants at which the analog signal has to be “sampled”. The operation of a PCM converter is bound by the Shannon sampling theorem, which forces the use of a sampling frequency at least double the highest frequency contained in the analog signal. For audio applications, typical sampling frequencies are 44100 Hz (CD), 48000 Hz (DAT, DVD-Video) and 96000 Hz (DVD-Audio, HD recorders). Even 192000 Hz can be used on DVD-Video (for 2 channels only) and on some soundcards employed in modern computers.

As already explained, the goal of having sampling frequencies much higher than 40 kHz is generally not to allow for an extended pass-band, but to allow for low-pass filters with a more gentle rolloff, ensuring crisp response to transients. However, another possible approach is to record with an extended frequency pass-band, and later apply the “apodising” filters directly in the digital domain.

In any case, it is VERY important that substantial low-pass filtering is applied in the analog domain, before entering the ADC chip. In fact, if we feed the converter with a signal having content exceeding its limiting frequency (called the Nyquist frequency, and equal to half the sampling rate), the numerical representation of the signal is distorted, due to a phenomenon known as “aliasing”, which reflects back into the sub-Nyquist frequency band any signal component having a frequency above it.

If, for example, without proper low-pass filtering, we feed a signal containing a pure tone at 35 kHz (coming, for example, from a CRT monitor) into a system working with a sampling frequency of 48 kHz (hence with a Nyquist frequency of 24 kHz), we will get this tone “aliased” down to 13 kHz (it was 11 kHz above Nyquist, so we find it 11 kHz below it).
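This folding arithmetic can be reproduced directly. The sketch below (an illustrative aside using NumPy; the block length of 0.1 s is an assumption) samples the 35 kHz tone at 48 kHz with no anti-aliasing filter and locates the spectral peak, which lands at 48 - 35 = 13 kHz:

```python
# Aliasing demo: a 35 kHz tone sampled at 48 kHz (Nyquist = 24 kHz)
# folds back to 48000 - 35000 = 13000 Hz.
import numpy as np

fs = 48000
n = np.arange(4800)                          # 0.1 s of signal
tone = np.sin(2 * np.pi * 35000 * n / fs)    # sampled WITHOUT low-pass filtering

spectrum = np.abs(np.fft.rfft(tone))
freqs = np.fft.rfftfreq(len(tone), d=1 / fs)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)   # ~13000 Hz: the tone has folded below Nyquist
```

Nothing in the sampled data distinguishes this aliased 13 kHz component from a genuine one, which is why the filtering must happen in the analog domain, before the converter.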

One technique often used for reducing the aliasing problems is oversampling.

Instead of operating the converter at its nominal sampling frequency, we operate it at double the frequency. This way, the chance that very high-frequency content contaminates the conversion process is reduced. After the conversion is done, the data flow is “decimated” (if the sampling frequency was doubled, every second sample is discarded), of course after applying a proper low-pass filter (aliasing can occur also when decimating in the digital domain, but designing suitable digital low-pass filters is easy and cheap, compared to designing optimal analog filters).

As the most basic low-pass filter is obtained by simply summing pairs of consecutive samples, we see that oversampling not only reduces the aliasing problems, but also extends the amplitude resolution of the converter.
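The "extra bit" claim is pure range arithmetic, sketched below (an illustrative aside with assumed random 16-bit samples): summing each pair of consecutive 16-bit values yields numbers that need a 17-bit signed range, so the crude sum-and-decimate low-pass gains one bit per halving.

```python
# Sketch of the decimation arithmetic for a 16-bit converter run at 2x rate.
import random

random.seed(0)
BITS = 16
FULL_SCALE = 2 ** (BITS - 1) - 1     # +/-32767 for a signed 16-bit ADC

# pairs of consecutive samples from the doubled-rate converter
pairs = [(random.randint(-FULL_SCALE, FULL_SCALE),
          random.randint(-FULL_SCALE, FULL_SCALE)) for _ in range(10000)]

sums = [a + b for a, b in pairs]     # decimated output, before rescaling
max_sum = max(abs(s) for s in sums)
print(max_sum <= 2 * FULL_SCALE)     # the sum always fits 17 signed bits
```

Quadrupling the rate and summing groups of four would likewise gain two bits, which is the mechanism the text returns to when discussing single-bit converters.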

The amplitude resolution is limited by the fact that the “digital” signal has been discretised not only along the time axis (due to sampling), but also along the vertical “voltage” axis, because the converter represents numbers with finite precision, limited by the number of bits employed for expressing the binary number proportional to the input voltage.

In the past, most converters operated at 16 bits. These 16 bits were typically employed as a signed integer (the actual internal format of these numbers is usually called “two's complement”, but this is of no importance here). In practice, the maximum positive voltage allowed (for example +5 V) is mapped to the maximum allowable value for a 16-bit number, which is +32767. Similarly, the maximum negative voltage (-5 V) is mapped to -32767. And if there is no signal, the converter should output a digital 0.

Modern converters have a much better amplitude resolution, typically around 20 bits. As each bit corresponds to a doubling of the signal amplitude, each additional bit extends the dynamic range of the converter by approximately 6 dB. A good 16-bit converter has a dynamic range close to 90 dB; a modern 20-bit converter is around 114 dB.
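The ~6 dB/bit rule comes directly from 20·log10(2) ≈ 6.02 dB per doubling. A quick check (an illustrative aside; the idealized full-scale-to-1-LSB ratio is an assumption that ignores quantization-noise statistics) reproduces the figures quoted above:

```python
# "~6 dB per bit": each extra bit doubles the representable amplitude.
import math

def dynamic_range_db(bits):
    """Ideal amplitude ratio between full scale and 1 LSB, in dB."""
    return 20 * math.log10(2 ** (bits - 1))

for bits in (16, 20, 24):
    print(bits, round(dynamic_range_db(bits), 1))  # 16 90.3 / 20 114.4 / 24 138.5
```

The 24-bit figure of ~138 dB is exactly why the text notes that real converters, stuck near 120 dB, never deliver their last three or four bits as true resolution.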

Figure 4 – a high-end, 2-channel ADC unit (24 bits, 192 kHz, Firewire interface)

Although modern converters are claimed to output a 24-bit PCM stream, it must be noted that no one has so far managed to obtain a dynamic range really exceeding 120 dB, which corresponds, at most, to 21 bits of true resolution. This means that the three or four least-significant bits of a 24-bit converter do not really carry any useful information; they only contain “garbage” or “digital noise”. The presence of this noise, however, can be very useful in subsequent digital processing, as it can mask some nasty digital artifacts, and provide the subjective impression of a wider dynamic range. If there is not enough noise, it can be advantageous to add it.

And, as we are now able to optimize the spectral and temporal characteristics of this artificial noise, we can “shape” it so that it sounds nice (“noise shaping”).

After applying a proper amount of noise, we can even reduce the bit depth (for example, when creating a CD, which is 16-bit, from a professional 24-bit recording), maintaining a perceived wide dynamic range, and keeping audible even very weak signals which would have been completely wiped out had the bit-depth reduction been applied without the addition of this beneficial “dither” noise.
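The rescue of sub-LSB signals by dither can be shown in miniature. In the sketch below (an illustrative aside; the 1-unit quantization step and the 0.3-amplitude tone are assumptions) plain truncation erases a tone smaller than one step, while adding rectangular dither of one LSB before quantizing keeps the tone statistically present in the output:

```python
# Dither demo: a tone below 1 LSB survives quantization only if dithered.
import math
import random

random.seed(1)
N = 48000
weak = [0.3 * math.sin(2 * math.pi * 100 * n / N) for n in range(N)]  # < 1 LSB

plain = [round(x) for x in weak]                          # quantizes to all zeros
dither = [round(x + random.uniform(-0.5, 0.5)) for x in weak]  # dithered first

# correlate each quantized version against the original tone
corr = lambda q: sum(a * b for a, b in zip(q, weak))
print(corr(plain), corr(dither) > 0)   # plain loses the tone; dither keeps it
```

The dithered output is noisier sample by sample, but on average it is an unbiased copy of the weak signal, which the ear (or any averaging process) can recover.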

Let’s come back to the oversampling technique: we have seen that, after doubling the sampling frequency, low-pass filtering and decimating, we obtain a signal correctly sampled at the nominal sample rate, but with extended amplitude resolution (due to the fact that, summing two 16-bit values, we obtain 17 bits). Had we quadrupled the sampling frequency, after averaging and decimation we would get 18 bits, and so on.

So the idea was born to extend the sampling frequency to several megahertz, running a converter with just a small number of bits – the extra bits come out after decimation. The final result has been the creation of the so-called “single-bit” converters, in which the sampling frequency is stretched up to 2.8224 MHz (64 × 44100 Hz), whilst the resolution is just one bit.

If we want to recreate a PCM (multibit) signal from this high-frequency stream of single bits (a Bitstream, as it is called), we need to low-pass filter and decimate, several times over. Each time we halve the sampling frequency, we gain one bit. After halving 6 times, we are back at the standard CD sampling frequency of 44100 Hz, and we have got… just 7 bits!

This means that the single-bit converter is in reality much poorer than the traditional 16-bit converter of the CD era.
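The halving arithmetic is easy to verify (an illustrative aside assuming the DSD-style rate of 64 × 44100 Hz): starting from 1 bit, each halve-and-sum decimation stage trades half the sampling frequency for one bit of amplitude resolution.

```python
# From a 1-bit stream at 64x the CD rate down to the CD rate itself.
fs, bits = 64 * 44100, 1      # 2.8224 MHz single-bit stream

halvings = 0
while fs > 44100:
    fs //= 2                  # decimate by 2 (after low-pass filtering)
    bits += 1                 # summing sample pairs gains one bit
    halvings += 1

print(halvings, fs, bits)     # 6 44100 7
```

Seven bits at 44100 Hz, against sixteen for the CD: exactly the shortfall that noise shaping has to make up for, as the next paragraphs explain.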

Well, through proper noise shaping of high order, it is possible to increase the resolution at moderately high frequencies at the expense of resolution at very high frequencies, but only for static, non-transient signals. Transient signals will have poor resolution in a one-bit system: if the signal does not endure for long enough, the error will not be minimised by the noise shaper of the one-bit system.

At lower frequencies, on the other hand, the resolution of a Bitstream converter becomes better and better. The 16-bit equivalence, however, is only reached at 88 Hz, and only below this frequency do Bitstream converters systematically outperform 16-bit PCM converters. Again, noise shaping can ameliorate this figure, but in any case what you gain in one frequency range causes additional noise at other frequencies.

The most widely employed embodiment of the Bitstream technology is the SACD system (Super Audio CD), co-developed by Sony and Philips. Through proper dithering and noise shaping, the performance in the audible range is optimized, and the great idea was NOT to convert the bitstream to PCM, but to store it directly on the disc, and reproduce it through a very simple Bitstream DAC (digital-to-analog converter). In this way, all those stages of low-pass filtering and decimation are removed, and the data flow is remarkably simple and straightforward. Thanks to the above facts, the SACD method is in practice better than the 16-bit CD, although it obviously cannot rival the extended dynamic range, wide frequency response and dramatic sharpness of transient response which are nowadays available when working with PCM converters at 24 bits / 96 kHz.

It must also be said that working with Bitstream digital data is not as easy as working with PCM samples: computers and software are much more suitable for PCM processing, and even the most basic operations, such as mixing two signals, become a big trouble when operating on single-bit waveforms. Finally, Bitstream hardware is much more expensive, and there are no such things as portable Bitstream recorders or USB/Firewire sound cards.

In conclusion, both for recording music and for performing room acoustical measurements, it is advisable to employ a good 24-bit, 96 kHz PCM converter. Such a solution is nowadays simple and very cheap, even employing a laptop computer, thanks to the availability of small external converter boxes, interfaced to the PC by means of the Firewire or the new USB2 interface. These units are typically equipped with 8 analog inputs and 8 analog outputs, and cost less than half of an entry-level laptop.


Figure 5 – a low cost multichannel USB-2 soundcard, equipped with 2 microphone preamps

1.3 Digital Signal Processing

Once the acoustical signal has been brought into the digital domain, it is possible to manipulate it through algorithms and formulas, as the sound is now represented by a stream of numbers equally spaced in time. It is important to become familiar with the various possible representations of the signal, and with how to display its frequency-domain content versus time.

The most basic way of displaying a sampled waveform is to emulate an oscilloscope. Figure 6 shows a detail of a vowel (“a”) sampled in an anechoic room (so the sound is “dry”, without any echoes or reverberation).


Figure 6 – sampled waveform displayed as amplitude vs time

Looking at the time-domain representation of a waveform is useful for basic editing (cutting and pasting at precise time positions), and for detecting specific events (arrivals of echoes in an echogram, for example). However, humans are often also interested in the frequency content of the waveform, and traditionally the result of a frequency analysis is displayed as a sound spectrum: a Cartesian plot of amplitude versus frequency (normally both the amplitude axis and the frequency axis employ logarithmic scaling).

The mathematical operations required for estimating the spectrum of a given time-domain signal can be very heavy from the computational point of view, depending on the chosen algorithm. For this reason, when the available computers were slow and had little memory, one algorithm outperformed all the others: the FFT (Fast Fourier Transform), which provides very good resolution in the spectral domain while requiring very limited computer resources.

Of course, nowadays we are free to choose alternative algorithms which do not have the intrinsic limitations of the FFT, as the currently available computing power allows for real-time implementation of any other spectral analysis approach, even the most “heavy” ones. However, this possibility is substantially unexploited, also because whole generations of technicians grew up learning that the FFT was the only technique for evaluating the frequency content of a signal.

We need here to emphasize that the FFT algorithm has some very strict drawbacks, that our human hearing mechanism DOES NOT employ the FFT, and that the exclusive usage of this algorithm makes it impossible for a measuring instrument to match the time-frequency resolution of the human ear.


Before discussing such limitations, we provide here a very basic description of what the FFT does.

First of all, the FFT efficiently computes the spectrum of a chunk of time-domain waveform, usually called an “FFT block”. The processing is really efficient only if the number of samples being processed is a power of 2, such as 1024, 4096 or 65536 (three very common lengths for FFT blocks).

The analysis of such a block of N numbers results in exactly the same “number of numbers” as output. But these output numbers no longer represent a time sequence of a real-valued quantity (acoustical pressure or particle velocity): they represent N/2 complex values equally spaced along the frequency axis, starting from DC (0 Hz) up to the Nyquist frequency. Well, actually the number of spectral lines is N/2+1, but the DC and Nyquist values are always real, so the total “number of numbers” is still N. This makes it possible to perform the FFT transformation “in place”, overwriting the same memory locations which initially contained the time-domain waveform.

It must be noted here that when computing the FFT of a real-valued sequence, only the positive part of the frequency axis is obtained. In some branches of science, instead, it is common to apply the FFT to complex-valued sequences, obtaining a “complete” spectrum which extends also to “negative” frequencies, down to minus the Nyquist frequency. This mathematical approach is not only not useful in acoustics, but even dangerous, because after going back from the frequency domain to the time domain by means of the inverse FFT transformation (IFFT), it can easily lead to complex-valued “signals”, which of course are nonexistent in nature and cannot be reproduced (“played”) on existing sound reproduction systems.

It is therefore NOT recommended to employ, for example, the standard Matlab FFT routines, which are designed for managing complex-valued sequences, and can lead to misleading results in acoustics.
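As an illustrative aside (using NumPy, which is not mentioned in the original notes), real-input FFT routines behave exactly as described above: N real samples yield N/2+1 spectral lines, the DC and Nyquist bins are real, and the inverse transform returns a strictly real waveform with no complex residue.

```python
# Real-input FFT: N real samples -> N/2 + 1 complex spectral lines.
import numpy as np

N = 1024                                           # power-of-two FFT block
x = np.random.default_rng(0).standard_normal(N)    # real-valued "waveform"

X = np.fft.rfft(x)        # spectrum from DC up to Nyquist
print(len(X))             # N/2 + 1 = 513 spectral lines

# DC and Nyquist bins carry no imaginary part for a real input
print(X[0].imag, X[-1].imag)

y = np.fft.irfft(X, n=N)  # back to the time domain: strictly real
print(np.allclose(x, y))  # the round trip reproduces the waveform
```

Using the real-input pair (rfft/irfft) rather than the general complex transform makes it structurally impossible to end up with the nonphysical complex-valued "signals" warned about above.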

However, it must be noted that, even if the time-domain signal is real, its spectrum is made of complex numbers. As any complex number can be represented as magnitude and phase, one could wonder what the physical meaning of this frequency-dependent phase information is. The answer is, generally, that it has no meaning: humans are NOT sensitive to the absolute phase of the sound. This absolute phase is dominated by a quite incidental fact, the instant at which the FFT block was started. When analyzing a steady-state signal, one easily finds that, whilst the magnitudes of the spectral lines remain substantially the same, the phases change depending on the instant at which the FFT block was initiated. And humans do not perceive any variation in the signal, albeit its phase changes continuously as time runs…

So it is common to plot just the magnitude of the spectrum versus frequency, as shown in fig. 7 (this is the frequency analysis of the same sound sample already shown in fig. 6).


Figure 7 – Magnitude spectrum of the sound signal shown in figure 6.

But, albeit the absolute phase is meaningless, the RELATIVE phase between TWO signals can be very relevant: let’s consider the case of a binaural dummy head (a special stereo microphone setup, in which two small microphones are located at the entrances of the ears of a head-and-torso mannequin, as shown in fig. 8). The relative phase between the signals captured at the two ears has a very specific physical meaning, and is strongly related to the capability of humans to localize the direction of arrival of the sound.

Another similar example occurs when a pressure-velocity microphone pair is available: sampling simultaneously the two physical quantities (sound pressure and particle velocity) at the very same point of the sound field makes it possible to analyze the phase relationship existing between them, and to derive quantities which describe the nature of the sound field (an “active” sound field has pressure and velocity in phase, a “reactive” sound field typically exhibits a 90° shift between pressure and velocity).
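As an illustrative sketch (with synthetic tones standing in for the two measured channels; all values arbitrary), the relative phase can be read from the angle of the cross-spectrum of the two simultaneously sampled signals; a result near 90° would indicate the purely “reactive” condition described above:

```python
import numpy as np

fs, N, f0 = 8000, 1024, 500.0       # arbitrary rate, block size and tone
t = np.arange(N) / fs

# Synthetic stand-ins: a "pressure" signal and a "velocity" lagging it by 90 deg
p = np.cos(2 * np.pi * f0 * t)
v = np.cos(2 * np.pi * f0 * t - np.pi / 2)

w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))   # Hanning window
P = np.fft.rfft(p * w)
V = np.fft.rfft(v * w)

cross = P * np.conj(V)              # cross-spectrum: its angle is phase(P) - phase(V)
k = int(round(f0 * N / fs))         # FFT bin of the tone (here exactly on a bin)
rel_phase_deg = np.degrees(np.angle(cross[k]))   # close to +90: "pressure" leads
```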

The analysis of the interrelationship between two or more channels is called cross-spectral analysis, and lies far beyond the scope of this text.

What is important in the following, however, is a very special case of dual-channel analysis, in which one has access to the signals both before and after a “system”, as shown in the following scheme:


Figure 8 – Dual-channel analysis of the input and output of a system.

By careful manipulation of the spectra of the signals x(t) and y(t), it is possible to derive an indirect analysis of what is happening inside the system, which turns out to be particularly simple and accurate if the system is perfectly linear and time-invariant.

By “system” we mean here either electronic devices (in which case the arrows feeding the input and exporting the output are really electrical wires) or acoustical systems (in this case, perhaps, the source is a musical instrument and the receiver is a human being). In the latter case, the black circles from which the x and y signals are extracted are suitable transducers, such as microphones.

It is common to employ quasi-stationary signals for performing these analyses, and to average over dozens of FFT blocks so that the cross-function estimate stabilizes to a reproducible magnitude and phase spectrum, expressing in principle the cause-effect relationship among x and y, and formally established as the ratio between the two spectra (Y/X).
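A minimal numerical sketch of this averaging scheme (with a toy FIR “system” and added noise; all values are arbitrary assumptions): the auto- and cross-spectra accumulated over many FFT blocks stabilize the Y/X estimate even in the presence of contaminating noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
h_true = np.zeros(64)
h_true[0], h_true[10] = 1.0, 0.5          # toy impulse response of the "system"
H_true = np.fft.rfft(h_true, N)

Gxx = np.zeros(N // 2 + 1)                # averaged auto-spectrum of the input
Gxy = np.zeros(N // 2 + 1, dtype=complex) # averaged cross-spectrum
for _ in range(50):                       # average over dozens of FFT blocks
    x = rng.standard_normal(N)            # one block of quasi-stationary excitation
    X = np.fft.rfft(x)
    y = np.fft.irfft(X * H_true) + 0.1 * rng.standard_normal(N)  # output + noise
    Gxx += (X * np.conj(X)).real
    Gxy += np.fft.rfft(y) * np.conj(X)

H_est = Gxy / Gxx                         # stabilized estimate of the ratio Y/X
h_est = np.fft.irfft(H_est)[:64]
max_err = np.max(np.abs(h_est - h_true))  # small despite the added noise
```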

However, even when employing stationary signals generated on purpose, the FFT algorithm still has its own strong limitations, which we now quickly review:

 The frequency resolution is constant, whilst human hearing exhibits fine resolution at low frequencies and coarse resolution at high frequencies

 The frequency resolution can only be improved by employing longer FFT blocks, which means “averaging” over long times. If the user wants to see the “instantaneous” spectrum dancing quickly on the screen, the FFT block size (and hence the frequency-domain resolution) must be kept low (and this limit has nothing to do with the available computational power).

 The signal should in principle be perfectly steady and periodic, with a period length exactly equal to the FFT block size (N). The usage of non-periodic signals is possible, but the fact that the end of the FFT block will not “match” with the beginning causes a sudden “jump” in the amplitude of the waveform, which appears as a flat noise floor spread throughout the whole spectrum. For avoiding this “border” effect, it is customary to employ a suitable amplitude reduction at the beginning and end of the FFT block (a sort of fade-in and fade-out of the signal). Various windowing functions have been developed for obtaining this effect, denoted by curious names such as Hanning, Hamming, Blackman, Blackman-Harris, Kaiser, Welch, etc.

 As the signal at the beginning and end of the FFT block is zeroed by the usage of these windows, in practice what happens in those moments is discarded from the analysis. For processing all the data with the same weight, it is therefore necessary to employ partially overlapped FFT blocks: a common solution is a 75% overlap, so, if the FFT block size is 4096 points, the FFT analysis is repeated every 1024 samples, processing the newest 1024 data points together with the previous 3072 samples.
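Why 75% overlap restores a uniform weighting can be checked numerically: a periodic Hanning window, hopped by a quarter of the block size, sums to a constant, so every sample of the signal ends up contributing with the same total weight. A minimal sketch:

```python
import numpy as np

N = 4096                  # FFT block size
hop = N // 4              # 75% overlap: a new analysis every 1024 samples
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))   # periodic Hanning window

# Accumulate the window weight seen by each sample of a long signal
total = np.zeros(4 * N)
for start in range(0, 3 * N + 1, hop):
    total[start:start + N] += w

steady = total[N:3 * N]           # region covered by a full set of overlapping windows
flat = np.allclose(steady, 2.0)   # constant accumulated weight: no sample is privileged
```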

All of these limitations can be overcome only by abandoning the efficient FFT scheme and moving to other, more computationally expensive spectral analysis methods: the DFT (Discrete Fourier Transform) allows for variable frequency resolution (for example, employing logarithmically-spaced frequencies), fractional-octave filter banks produce spectra having constant percentage bandwidth (more similar to the human hearing mechanism), and the Wavelet transform makes it possible to trade off frequency and time resolution differently for each spectral line, enabling the possibility to closely match the human capability of detecting quick pitch variations in short times.

Although these possibilities do exist, most digital signal processing is still done nowadays through FFTs and IFFTs: when one wants to explore the spectrum variation over time, the common approach is to use the STFFT (short-time FFT), which is simply a graphical representation of a series of spectra obtained by analyzing subsequent FFT blocks of limited length. The spectra can be plotted in a 3D waterfall display, or can be organized in a 2D colour plot, in which time is the horizontal axis, frequency is the vertical axis, and the magnitude of the signal is represented by a continuous colour variation (or white-black contrast).

Fig. 9 shows this 2D representation, usually known as Sonograph or Sonogram.


Figure 9 – Sonogram of a piece of speech (word “zero”).
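For reference, the matrix of short-time spectra behind a sonogram such as that of fig. 9 can be computed in a few lines (here with a synthetic swept tone standing in for the speech sample; block size and hop are arbitrary choices):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                          # one second of signal
sig = np.sin(2 * np.pi * (200 + 400 * t) * t)   # synthetic rising tone

N, hop = 512, 128                               # block size and 75%-overlap hop
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))   # Hanning window

frames = [sig[s:s + N] * w for s in range(0, len(sig) - N + 1, hop)]
sonogram = np.abs(np.fft.rfft(np.array(frames), axis=1))
n_frames, n_bins = sonogram.shape               # rows: time, columns: frequency
```

Plotting the transposed matrix as an image (time on the horizontal axis, frequency on the vertical one, magnitude as colour) reproduces the 2D representation described above.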

Finally, it is to be noted that most sound editing software currently available provides dozens of “sound effects”, usually obtained by loading suitable plugins. These effects can be used for changing the sound even heavily, and provide tools for performing filtering (equalization, time-variant spectral modification, pitch-shifting, etc.), de-noising, compression or expansion of the dynamics, soft or hard clipping, crossfades between different sound samples, and controlled distortion (also known as harmonic expansion).

Most of these effects employ complex mathematical formulations, which are far beyond the goals of this work. It is nevertheless useful that the acoustician or the musician be familiar with these possibilities, which allow for completely morphing the sounds while working in the digital domain. And, as modern processing systems perform all of these operations with double-precision floating-point numbers, the quality and cleanliness of the results is significantly better than that obtainable with analog processors, or with digital “outboard” effects. These typically employed limited-precision, fixed-point math, causing the famous nasty artifacts which partially damaged the reputation of digital sound processing.

Nowadays it appears completely obsolete to perform any processing employing fixed-point math, as on a modern computer or DSP system it is possible to work with 64-bit floats, making it feasible to crunch the numbers millions of times without even the weakest truncation noise or artifact appearing.


2 ROOM ACOUSTICS MEASUREMENTS WITH SINE SWEEPS

2.1 Introduction

The current state of the art of audio measurements is represented by two different kinds of measurements: characterisation of the linear transfer function of a system, through measurement of its impulse response, and analysis of the nonlinearities, through measurement of the harmonic distortion at various orders. These two measurements are actually well separated: for the impulse response measurement the most employed techniques are MLS (Maximum Length Sequence) and TDS (Time-Delay Spectrometry). Both these methods are based on the assumption of perfect linearity and time-invariance of the system, and cause problems when these assumptions are not met. In particular MLS is quite delicate: it does not tolerate nonlinearity or time-variance very well, and requires that the excitation signal is tightly synchronised with the digital sampler employed for recording the system's response.

The novel technique employed here was developed while attempting to overcome the MLS limitations through TDS measurements. It was discovered that, employing a sine signal whose frequency varies exponentially with time, it is possible to deconvolve simultaneously the linear impulse response of the system and separate impulse responses for each harmonic distortion order. In practice, after the deconvolution of the sampled response, a sequence of impulse responses appears, clearly separated along the time axis. By FFT-analysing each of them, the linear frequency response and the corresponding spectra of the distortion orders can be displayed. This means that the system is characterised completely with a single, fast and simple measurement, which proved to compare very well with traditional techniques for measuring the linear impulse response and the harmonic distortion. Furthermore, the method proved to be very robust to minor time-variance of the system under test, and to mismatch between the sampling clocks of the signal generation and recording.

This document presents the theoretical background of the measurement method, and attempts to explain physically what happens and how the results are obtained.

Then some experimental results are reported, which demonstrate the capabilities of the new technique in comparison with established measurement methods.

2.2 Theory

We start by taking into account a single-input, single-output system (a black box), in which an input signal x(t) is introduced, causing an output signal y(t) to come out.

Common assumptions for the system are linearity and time-invariance, but we will be able to relax these constraints in the following. Inside the system, some noise may be generated and added to the “deterministic” part of the output signal. Usually this noise is assumed to be white Gaussian noise, completely uncorrelated with the input signal. Fig. 10 shows the flow diagram of such a system.

In practice, the output signal can be written as the sum of the generated noise and a deterministic function of the input signal:

y(t) = n(t) + F[x(t)]



Fig. 10 – A basic input/output system

If the system is linear and time-invariant, the function F assumes the form of the convolution between the input signal and the system’s impulse response h(t):

y(t) = n(t) + x(t) ⊗ h(t)

If we now relax the constraint for the system to be linear, we have a much more complex case, which cannot be studied easily. But often the nonlinearities of the system happen to be at its very beginning, and are substantially memoryless. After this initial distortion, the signal passes through a subsequent linear system, characterized by evident temporal effects (memory). This scenario is typical, for example, of a reverberant space excited through a loudspeaker: the distortion occurs in the electro-mechanical transducer, but as the sound is radiated into the air, it passes through a subsequent linear propagation process, including multiple reflections, echoes and reverberation.


Fig. 11 – A more complex system, in which a non-linear, memoryless device drives a subsequent linear, reverberating system

Fig. 11 shows such a composite system. In practice, we can assume that the input signal first passes through a memoryless non-linear device, characterized by an N-th order Volterra kernel k_N(t), and that the result of such a distortion process (called w(t)) is subsequently reverberated through the linear filter h’(t).

A memoryless harmonic distortion process can be represented by the following equation:


w(t) = x(t) ⊗ k_1(t) + x^2(t) ⊗ k_2(t) + x^3(t) ⊗ k_3(t) + … + x^N(t) ⊗ k_N(t)

As the convolution of w(t) with the following linear process h’(t) possesses the distributive property, we can represent the measured output signal as:

y(t) = n(t) + x(t) ⊗ k_1(t) ⊗ h’(t) + x^2(t) ⊗ k_2(t) ⊗ h’(t) + … + x^N(t) ⊗ k_N(t) ⊗ h’(t)

In practice, it is difficult to separate the linear reverberation from the non-linear distortion, and we can assume that the deterministic part of the transfer function is described by a set of impulse responses, each of them being convolved with a different power of the input signal:

y(t) = n(t) + x(t) ⊗ h_1(t) + x^2(t) ⊗ h_2(t) + x^3(t) ⊗ h_3(t) + … + x^N(t) ⊗ h_N(t)

Other considerations are needed for describing time-variant systems. In such systems, the impulse responses h_N(t) do not always remain the same, but change slowly in time. The variation is usually slow enough to avoid audible effects such as tremolo or other forms of modulation, and in most cases there are no significant differences in the objective acoustical parameters or in the subjective effects connected with different “instantaneous” values of the changing transfer function.

Simply, this continuous variation poses serious problems during the measurements, as it prevents the use of the averaging technique for removing the unwanted extraneous noise n(t): by increasing the number of averages, in fact, not only the contaminating noise n(t), but also the variable part of the transfer function is rejected.

Let us now go back to the most common assumption of a linear, time-invariant system characterised by a single transfer function h(t). A common practice for measuring the unknown transfer function is to apply a known signal to the input x(t), and to measure the system’s response y(t). For this task, the most commonly used excitation signals are wide-band, deterministic and periodic; these include:

 MLS (Maximum-Length-Sequence) pseudo-random white noise

 Sine sweeps and chirps

The signal-to-noise ratio (S/N) is improved by taking multiple synchronous averages of the output signal, usually directly in the time domain, prior to attempting the deconvolution of the system’s impulse response. Let us call ŷ(t) the averaged output signal. As both the input and output signals are periodic, a circular convolution process relates the input and the output. If we suppose that the noise n(t) has been reasonably averaged out thanks to the large number of averages, we can employ FFT and IFFT transforms for deconvolving h(t):

h(t) = IFFT [ FFT(ŷ(t)) / FFT(x(t)) ]
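A toy numerical sketch of this procedure (all values are arbitrary assumptions; a flat-magnitude, random-phase periodic excitation stands in for an MLS-like test signal, and a decaying random sequence for a reverberant impulse response):

```python
import numpy as np

rng = np.random.default_rng(1)
P = 2048                                   # period of the excitation, in samples

# Flat-magnitude, random-phase periodic excitation (in the spirit of MLS)
X = np.sqrt(P) * np.exp(1j * rng.uniform(0, 2 * np.pi, P // 2 + 1))
X[0] = np.sqrt(P)                          # DC and Nyquist bins must be real
X[-1] = np.sqrt(P)
x = np.fft.irfft(X)

# Toy "reverberant" impulse response, shorter than the period (no time aliasing)
h_true = 0.1 * np.exp(-np.arange(P) / 200.0) * rng.standard_normal(P)
h_true[0] = 1.0

y_clean = np.fft.irfft(X * np.fft.rfft(h_true))   # periodic x -> circular convolution

# Synchronous time-domain averaging of 32 noisy periods of the output
y_avg = np.zeros(P)
for _ in range(32):
    y_avg += y_clean + 0.5 * rng.standard_normal(P)
y_avg /= 32

h_est = np.fft.irfft(np.fft.rfft(y_avg) / X)      # h = IFFT[ FFT(y_avg) / FFT(x) ]
max_err = np.max(np.abs(h_est - h_true))          # small: the noise averaged out
```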

Another common approach is to perform the averages directly in the frequency domain (through the so-called auto-spectrum and cross-spectrum), computing the frequency response function known as H2, and then taking the IFFT of the result:

h(t) = IFFT( H2 ) = IFFT [ G_AB / G_AA ]

In both the above approaches, due to the continuous repetition of the test signal and to the fact that a circular deconvolution is performed, there is the risk of the time-aliasing error. This happens if the period of the repeated input signal is shorter than the duration of the system’s impulse response h(t). This means that, with MLS, the order of the shift register employed for the generation of the sequence must be high enough, depending on the reverberation time of the system: modern MLS measurement equipment can produce very high-order MLS signals [1], but older systems easily incurred the time-aliasing problem, which causes the late part of the reverberant tail to fold back at the beginning of the time window containing the deconvolved h(t).

With sine sweeps or chirps, it is common to add a segment of silence after each signal, for avoiding the time-aliasing problem: if the data analysis window is still constrained to be of the same length as the sweep, the late part of the tail can be lost, but it will not come back at the beginning of the deconvolved h(t) (appearing as noise before the arrival of the direct wave). This is a first advantage of the traditional sine-sweep method over MLS.

Fig. 12 – Time aliasing – if the MLS measurement is made employing a period of 65536 samples instead of 131072 samples, the right half of the impulse response becomes overlapped over the first half.

What is not widely known is that non-linear behavior of the system (i.e., harmonic distortion) can also cause time-aliasing artifacts, even if the length of the input signal is properly chosen. In practice, strange peaks appear at various positions of the deconvolved impulse response: looking at these “distortion products” in detail reveals that they resemble scaled-down copies of the principal impulse response. This is clearly evident when making anechoic measurements of a loudspeaker driven with too much voltage: the unwanted, spurious peaks appear after the anechoic linear response, both employing MLS and sine sweeps.

A mathematical explanation of the appearance of the spurious peaks in the MLS case was given in [2]. Fig. 13 shows a typical MLS measurement affected by intolerable distortion, which produces evident spurious peaks.

Fig. 13 – An MLS measurement made in the presence of a strongly non-linear system

Making use of sine sweeps in which the instantaneous frequency varies linearly with time, the appearance of spurious peaks is not very evident: the distortion products simply cause a sort of noise to appear everywhere in the deconvolved h(t).

This “noise” is actually correlated with the input signal, so it does not disappear by averaging. It usually sounds like a decreasing-frequency, low-level multitone.

Instead, if the sine sweep is generated with instantaneous frequency varying exponentially with time (the so-called “logarithmic sweep”), the spurious distortion peaks clearly appear again, with their typical impulsive sound.

This was the starting point of the approach presented here: a method was sought for “pushing out” the unwanted distortion products from the results of the deconvolution process. The most straightforward approach was to substitute the circular deconvolution with a linear deconvolution, directly implemented in the time domain. This is very easy if a proper inverse filter f(t) can be generated, capable of packing the input signal x(t) into a delayed Dirac delta function δ(t):

x(t) ⊗ f(t) = δ(t)

The deconvolution of the system’s impulse response can then be obtained simply by convolving the measured output signal y(t) with the inverse filter f(t):

h(t) = y(t) ⊗ f(t)
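A sketch of this idea for the exponential sweep (all parameter values arbitrary): as described later in this section, the inverse filter is the time-reversed sweep with an amplitude envelope falling 6 dB per octave, and convolving it with the sweep itself packs the energy into a single sharp peak at a delay equal to the filter length.

```python
import numpy as np

fs, T = 8000, 1.0                    # sampling rate and sweep duration (arbitrary)
f1, f2 = 50.0, 3000.0                # start and end frequencies of the sweep
t = np.arange(int(fs * T)) / fs
R = np.log(f2 / f1)

# Exponential ("logarithmic") sine sweep
x = np.sin(2 * np.pi * f1 * T / R * (np.exp(t * R / T) - 1))

# Inverse filter f(t): time-reversed sweep, attenuated by 6 dB/octave
f_inv = x[::-1] * np.exp(-t * R / T)

d = np.convolve(x, f_inv)            # x convolved with f: nearly a delayed Dirac delta
d /= np.max(np.abs(d))
peak = int(np.argmax(np.abs(d)))     # close to len(x) - 1, the inverse-filter length
energy_near_peak = np.sum(d[peak - 50:peak + 50] ** 2) / np.sum(d ** 2)
```

Most of the energy concentrates within a few samples of the peak; convolving a measured y(t) with the same f_inv, instead of x, yields the impulse response h(t).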

Both fast convolution and inverse filter generation are nowadays easy and cheap tasks, thanks to recently developed software [1,3]. With this approach, any distortion products caused by harmonics produce output signals at frequencies higher than the instantaneous input frequency: figs. 14 and 15 show a non-linear system response with a linear and a logarithmic sweep excitation respectively, in the form of a sonograph.

Fig. 14 – Linear sine sweep: excitation signal (above) and system response (below) in the case of a weakly non-linear system exhibiting evident harmonic distortion.


Fig. 15 – Logarithmic sine sweep: excitation signal (above) and system response (below) in the case of a weakly non-linear system exhibiting evident harmonic distortion.

The convolution with the inverse filters causes these sonographs to deform (or to “stretch”) counter-clockwise, so that the linear response becomes a straight vertical line (followed by some sort of tail, if the system is reverberant). The distortion products are pushed to the left of the linear response: in the case of the linearly swept sine they spread along the time axis, whilst in the case of the exponentially swept sine they pack into “distortion peaks” at very precise anticipatory times before the linear response. Figs. 16 and 17 show the inverse filter and the results of the deconvolution process, again in the form of sonographs, for the linear sweep case; figs. 18 and 19 show the inverse filter and the results of the deconvolution process for the log sweep case.

Fig. 16 – sonograph of the inverse filter – linear sweep


Fig. 17 – deconvolution of the system’s impulse response after a linear sweep excitation

Fig. 18 – sonograph of the inverse filter – log sweep


Fig. 19 – Deconvolution of the system’s impulse response after a log sweep excitation

This different behavior can be explained by looking at the structure of the inverse filters (figs. 16 and 18). First of all, in both cases the inverse filter is basically the input signal itself, reversed along the time axis (so that the instantaneous frequency diminishes with time). In the case of the exponentially swept sine, an amplitude modulation is added, for compensating the different energy generated at low and high frequencies.

It can be observed that the inverse filter has the effect of delaying the signal convolved with it by an amount of time which varies with frequency: this causes the deformation of the sonographs, as was clearly demonstrated by M. Poletti [4] for linearly swept sine signals. This delay is linearly proportional to frequency for linear sweeps, and is instead proportional to the logarithm of frequency for the logarithmic sweep. This means that the delay increases, for example, by 1 s each octave.

In practice, if the frequency axis of the sonograph is made linear when displaying measurements made with a linear sweep, and logarithmic when displaying measurements made with a log sweep, the excitation signal, the inverse filters and the system response always appear as straight lines on the sonographs (this was done in figs. 14-19). Furthermore, the harmonic distortions also appear as straight lines: but these are parallel to the linear response in the case of the log sweep, whilst they are of increasing slope in the case of the linear sweep (look at figures 14 and 15). Both inverse filters stretch the sonographs with a constant slope, corresponding to the inverse slope of the linear response: this packs the linear response onto a vertical line (at a precise time delay, which equals the inverse filter length). Obviously, the harmonic distortion orders also pack at very precise times in the case of the log sweep, as all the lines have the same slope (for example 1 octave/s); instead, the harmonic distortion present in a response produced by a linear sweep tends to stretch over the time axis, producing a sort of sweeping-down multi-tone signal which precedes the linear impulse response (fig. 17).

It is clear at this point that the use of the linear deconvolution, instead of the circular one, pushes all the distortion artefacts well in advance of the linear response, and thus enables the measurement of the system’s linear impulse response even if the loudspeaker is working in a non-linear region. This holds both for the linear and the log sweep, meaning that, if the goal of the measurement is simply to estimate the linear response, the only advantage of the log sweep over the linear sweep is that it produces a better S/N ratio at low frequencies.

In conclusion, the complete removal of distortion-induced artefacts is already a very important result compared with the traditional circular deconvolution approach.

But in the case of the log sweep another very important result can be obtained: if the sweep is slow enough, so that each harmonic distortion packs into a separate impulse response, without overlap with the preceding one, it is possible to window out each of them: and each of these impulse responses is strictly related to the diagonals of the Volterra kernels, convolved with the subsequent linear reverberation (if any), and thus to the terms previously named h_1(t), h_2(t) and so on.

For properly designing the excitation signal, and for retrieving each harmonic order’s response, what is needed at this point is a theoretical derivation of the starting time of each order’s distortion.

A varying-frequency sine sweep can be mathematically described as:

x(t) = sin[ f(t) ]

It must be noted that, following general signal processing theory, the instantaneous frequency is given by the time derivative of the argument of the sine function. Thus, of course, if f(t) = ω·t, where ω is constant, the instantaneous frequency is also constant and equal to ω (in rad/s). But if, for example, we assume a linearly varying frequency, starting from ω1 and ending at ω2 in the total time T, we obtain:

d f(t) / dt = ω1 + (ω2 − ω1) · t / T

which is satisfied if we pose:

f(t) = ω1 · t + (ω2 − ω1) · t² / (2T)
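It can be checked numerically that this phase function indeed sweeps the instantaneous frequency linearly from ω1 to ω2 (the parameter values below are arbitrary examples):

```python
import numpy as np

w1 = 2 * np.pi * 100.0        # start frequency, rad/s (example value)
w2 = 2 * np.pi * 2000.0       # end frequency, rad/s
T = 1.0                       # sweep duration, s
fs = 100000
t = np.arange(int(fs * T) + 1) / fs

phase = w1 * t + (w2 - w1) * t**2 / (2 * T)   # f(t) of the linear sweep
inst_w = np.gradient(phase, t)                # instantaneous frequency = d f(t)/dt

start_err = abs(inst_w[0] - w1) / w1          # ~0: the sweep starts at w1
end_err = abs(inst_w[-1] - w2) / w2           # ~0: the sweep ends at w2
```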

Following the same approach, we can find the rule for generating a log sweep, having a starting frequency ω1, an ending frequency ω2, and a total duration of T seconds; we start by writing a generic exponential sweep in the form:
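As a hedged reference sketch (the closed forms below are the commonly reported results for the exponential-sweep method, stated here as assumptions rather than reproduced from the derivation): the sweep whose instantaneous frequency grows exponentially from ω1 to ω2 in T seconds, and the anticipatory time Δt_N = T·ln(N)/ln(ω2/ω1) at which the N-th harmonic order packs before the linear response.

```python
import numpy as np

# Assumed (commonly reported) closed forms for the exponential sweep method:
#   x(t) = sin[ (w1*T/ln(w2/w1)) * (exp((t/T)*ln(w2/w1)) - 1) ]
#   anticipatory time of the N-th harmonic: dt_N = T * ln(N) / ln(w2/w1)
w1 = 2 * np.pi * 50.0        # start frequency, rad/s (example value)
w2 = 2 * np.pi * 5000.0      # end frequency, rad/s
T = 3.0                      # sweep duration, s
fs = 48000
t = np.arange(int(fs * T)) / fs
R = np.log(w2 / w1)

phase = w1 * T / R * (np.exp(t * R / T) - 1.0)
x = np.sin(phase)

inst_w = np.gradient(phase, t)           # check the exponential frequency law
start_err = abs(inst_w[0] - w1) / w1     # ~0: the sweep starts at w1
end_err = abs(inst_w[-1] - w2) / w2      # ~0: the sweep ends at w2

dt2 = T * np.log(2.0) / R                # here the 2nd harmonic packs ~0.45 s early
```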
