ANGELO FARINA, ALBERTO AMENDOLA, LORENZO CHIESI, ANDREA CAPRA, SIMONE CAMPANINI
Dipartimento di Ingegneria Industriale, Università di Parma, Via delle Scienze 181/A 43124 Parma, ITALY
HTTP://pcfarina.eng.unipr.it - mail: angelo.farina@unipr.it
Spatial PCM Sampling: a new method for sound recording and playback
Introduction
• This paper presents the first attempt to use a new recording/processing/playback method, and tells the story of its failure, from which we can all learn something.
• After the failure, a modified approach was refined, and it revealed some significant advantages over traditional High Order Ambisonics (HOA), despite a number of theoretical and practical arguments against the new method.
• A set of listening tests proved the failure of the original approach and the success of the modified approach, which surpassed the performance of HOA, even if it did not reach the excellent results obtained with the simpler approach provided by the 3DVMS technology.
Topics
1. Definition of Spatial PCM Sampling (SPS)
2. Virtual Microphones, an example
3. SPS signals from a microphone array
4. SPS signals from theoretical encoding formulas
5. Processing SPS signals
6. SPS decoding, the “exact” way
7. SPS decoding, the “modified” way
8. Comparison with High Order Ambisonics (HOA) and with 3D Virtual Microphone System (3DVMS)
9. Conclusions
Spatial PCM Sampling
PCM modelling of a waveform and of a spatial balloon
A waveform is represented by a sequence of pulses, a balloon is a “sea urchin” of spikes
Spherical Harmonics vs. Spatial PCM Sampling
Whilst Spherical Harmonics are the “spatial” equivalent of the Fourier analysis of a waveform, the SPS approach is the equivalent of representing a waveform with a sequence of pulses (PCM, pulse code modulation)
Figure: 32 virtual microphones (numbered 1 to 32)
Spatial Fourier Sampling
A waveform can be expanded as the sum of a number of sinusoids (Fourier), exactly as a balloon can be represented by the sum of a number of spherical harmonics (Ambisonics)
Figure: composite spatial balloon
Recording-processing-playback
Both SPS and HOA have the same recording-processing-playback framework:
Microphone Array → Encoding → Intermediate Format (B-format or P-format) → Decoding → Loudspeaker Array
The intermediate format can be manipulated
Direct recording-playback
3DVMS, instead, directly computes one virtual microphone for each loudspeaker and feeds it directly: fewer processing stages, fewer constraints
Microphone Array → Virtual Microphones (3DVMS) → Loudspeaker Array
Virtual microphones for encoding and decoding
In both SPS and HOA, the encoding and decoding stages can also be represented as the synthesis of virtual microphone signals
In Encoding, they have the shapes of the spatial functions employed as intermediate format (spatial Dirac’s deltas for SPS, spherical harmonics for HOA)
In Decoding, they have the shapes defined by the decoding procedure for feeding the corresponding loudspeakers
Decoding virtual microphones from HOA
Spherical Harmonics (H.O.Ambisonics)
The virtual microphones are obtained by linear combination of the B-format intermediate signals by applying proper weights. This limits spatial resolution, dynamic range and frequency range.
Virtual microphones
Decoding virtual microphones from SPS
Spatial PCM Sampling signals
The virtual microphones are obtained by linear combination of the SPS (P-format) intermediate signals by applying proper weights. As the SPS signals come from a larger number of microphones with simpler directivity patterns, they exhibit better spatial resolution, dynamic range and frequency range.
Virtual Microphones
The “total virtual microphone”
The complete encoding-processing-playback procedure can always be represented as a set of virtual microphones feeding the loudspeakers.
Looking at the polar patterns of these “total virtual microphones” provides a visual display of the behaviour of the complete system
Example: the 2nd-order “exact” decoder for 5.0 “surround”
The decoding coefficients were computed by imposing that a 2nd-order microphone, placed in the center of the loudspeaker rig, re-records the same B-format signals that are fed to the decoder
2nd-order B-format signal (5 channels) → Matrix of decoding coefficients → 5.0 “surround” loudspeaker rig → 2nd-order Ambisonics microphone
Computing the total virtual microphones for each loudspeaker shows that this solution is completely wrong…
Another decoder for 5.0 “surround”
Here is how the total virtual microphones of a proper 2nd-order decoder should behave…
The RAI-3DVMS project
GOALS:
• “Virtual” microphones with high directivity, controlled by mouse/joystick in order to follow actors on the stage in real time. They should be able to modify their directivity in a sort of “acoustical zoom”.
• Surround recordings with microphones that can be modified (directivity, angle, gain, etc.) in post-production.
• Get rid of the problems with Spherical Harmonics
We want to synthesize virtual microphones that are highly directive, steerable, and with a variable directivity pattern
VIRTUAL MICROPHONES
Virtual Microphones from arrays of transducers
Array geometries: Linear Array, Planar Array, Cylindrical Array, Spherical Array
Processing Algorithm: a processor with N inputs $x_i$ and M outputs $y_j$
$$y_j(t) = \sum_{i=1}^{N} x_i(t) * h_{i,j}(t)$$
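The processor described here is a matrix of FIR filters: each output is the sum of all inputs, each convolved with its own filter. A minimal sketch in plain Python follows, under the assumption that signals and filters are short lists of samples; the names `convolve` and `filter_and_sum` are illustrative and do not come from the slides.

```python
def convolve(x, h):
    """Full linear convolution of two sample sequences."""
    out = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            out[n + k] += xn * hk
    return out


def filter_and_sum(inputs, filters):
    """inputs: N signals; filters[i][j]: FIR filter from input i to output j.

    Returns the M output signals, each being the sum over i of
    inputs[i] convolved with filters[i][j].
    """
    n_in = len(inputs)
    n_out = len(filters[0])
    outputs = []
    for j in range(n_out):
        length = max(len(inputs[i]) + len(filters[i][j]) - 1 for i in range(n_in))
        y = [0.0] * length
        for i in range(n_in):
            for n, v in enumerate(convolve(inputs[i], filters[i][j])):
                y[n] += v
        outputs.append(y)
    return outputs


# Toy case (made-up numbers): 2 capsules, 1 virtual microphone
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
h = [[[0.5, 0.5]], [[1.0, -1.0]]]
y = filter_and_sum(x, h)
```

A real-time implementation would use FFT-based (partitioned) convolution instead of the direct double loop, which is kept here only for readability.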
Computation of filter coefficients
• No theory is assumed: the filters $h_{i,j}$ are derived directly from a set of impulse response measurements and designed according to a least-squares principle.
• STEP1: a matrix C of impulse responses is measured
• STEP2: the target polar pattern P of the virtual microphone is defined
• STEP3: the processing filters H are found by imposing that $[C] \cdot [H] = [P]$ and inverting the matrix.
• This way, the outputs of the microphone array are maximally close to the prescribed ideal responses
• This method also inherently corrects for transducer deviations and acoustical artifacts (shielding, diffractions, reflections, etc.)
STEP1: anechoic measurements
The microphone array impulse responses $c_{m,d}$ are measured for a number D of incoming directions.
We get a matrix C of measured impulse responses for a large number D of directions
m = 1…M mikes, d = 1…D sources
$$C = \begin{bmatrix}
c_{1,1} & c_{1,2} & \cdots & c_{1,d} & \cdots & c_{1,D} \\
c_{2,1} & c_{2,2} & \cdots & c_{2,d} & \cdots & c_{2,D} \\
\vdots & \vdots & & \vdots & & \vdots \\
c_{m,1} & c_{m,2} & \cdots & c_{m,d} & \cdots & c_{m,D} \\
\vdots & \vdots & & \vdots & & \vdots \\
c_{M,1} & c_{M,2} & \cdots & c_{M,d} & \cdots & c_{M,D}
\end{bmatrix}$$
STEP2: Target Directivity
For SPS, the “virtual” microphone is chosen as a 4th order cardioid:
$$P_n(\vartheta, \varphi) = \left[0.5 + 0.5 \cos(\vartheta) \cos(\varphi)\right]^4$$
STEP3 – solution of linear equation system
m = 1…M microphones, d = 1…D directions
Applying the filter matrix H to the measured impulse responses C, the system should behave as a virtual microphone with prescribed directivity P
Block diagram: the measured responses $c_{1,d}(t), c_{2,d}(t), \ldots, c_{M,d}(t)$ are filtered by $h_1(t), h_2(t), \ldots, h_M(t)$ and summed to give $p_d(t)$; the target function ranges from $P_{1,v}\,\delta(t)$ to $P_{D,v}\,\delta(t)$
$$\sum_{m=1}^{M} c_{m,d} * h_m = p_d, \qquad d = 1 \ldots D$$
But in practice the result of the filtering will never be exactly equal to the prescribed functions $p_d$
We go now to frequency domain, where convolution becomes simple multiplication at every frequency k, by taking an N-point FFT of all those impulse responses:
$$\sum_{m=1}^{M} C_{m,d,k} \cdot H_{m,k} = P_d, \qquad d = 1 \ldots D, \quad k = 0 \ldots N/2$$
We now try to invert this linear equation system at every frequency k, and for every virtual microphone v:
$$[C_k]_{D \times M} \cdot [H_k]_{M \times V} = [P]_{D \times V}$$
This over-determined system doesn't admit an exact solution, but it is possible to find an approximate solution with the Least Squares method
Least-squares solution
We compare the results of the numerical inversion with the theoretical response Q of our target microphones for all the D directions, properly delayed, and sum the squared deviations for defining a total error.
The inversion of this matrix system is now performed adding a regularization parameter β, in such a way to minimize the total error (Nelson/Kirkeby approach):
$$[H_k]_{M \times V} = \frac{[C_k]^*_{M \times D} \cdot [Q]_{D \times V} \cdot e^{-j \pi k}}{[C_k]^*_{M \times D} \cdot [C_k]_{D \times M} + \beta_k \cdot [I]_{M \times M}}$$
It proved advantageous to employ a frequency-dependent regularization parameter $\beta_k$.
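The regularized inversion can be sketched for a single frequency bin. The block assumes the Kirkeby form H = (C^H C + βI)^-1 · C^H · Q · e^(-jπk), which is how the garbled slide formula reads; the helper names `solve` and `kirkeby_bin` and the toy numbers are illustrative, not taken from the slides.

```python
import cmath


def solve(a, b):
    """Solve a @ x = b for a square complex matrix a (Gauss-Jordan, partial pivoting)."""
    n = len(a)
    aug = [list(a[r]) + list(b[r]) for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def kirkeby_bin(C, Q, beta, k):
    """Regularized least-squares filters for one frequency bin k.

    C: D x M measured array responses at this bin (complex),
    Q: D x V target gains, beta: regularization parameter.
    Returns H: M x V filter coefficients at this bin.
    """
    D, M, V = len(C), len(C[0]), len(Q[0])
    # Conjugate transpose of C (M x D)
    Ch = [[C[d][m].conjugate() for d in range(D)] for m in range(M)]
    # Regularized normal matrix: C^H C + beta I (M x M)
    A = [[sum(Ch[i][d] * C[d][j] for d in range(D)) + (beta if i == j else 0.0)
          for j in range(M)] for i in range(M)]
    # The e^{-j pi k} factor delays the target by half the FFT length
    delay = cmath.exp(-1j * cmath.pi * k)
    B = [[sum(Ch[i][d] * Q[d][v] for d in range(D)) * delay for v in range(V)]
         for i in range(M)]
    return solve(A, B)


# Toy case: 3 directions, 2 capsules, 1 virtual microphone
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q = [[1.0], [0.0], [1.0]]
H_plain = kirkeby_bin(C, Q, 0.0, 0)   # no regularization
H_reg = kirkeby_bin(C, Q, 0.1, 0)     # regularized: smaller filter gains
```

In a full design this function would be called for every bin k = 0…N/2, and the resulting spectra would be transformed back to FIR filters with an inverse FFT.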
Spectral shape of the regularization parameter
• At very low and very high frequencies it is advisable to increase the value of β.
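One possible spectral shape for the regularization parameter is sketched below: a small value inside the useful band and a larger one outside, joined by smooth transitions. The band edges, the two β values and the one-octave transition width are all hypothetical defaults chosen for illustration; the slides do not give the actual numbers.

```python
import math


def beta_profile(f, f_low=100.0, f_high=10000.0,
                 beta_in=0.01, beta_out=1.0, ratio=2.0):
    """Frequency-dependent regularization parameter (all defaults are assumptions).

    Returns beta_in inside [f_low, f_high], beta_out beyond a transition band
    whose width is the frequency ratio `ratio` (2.0 = one octave), and a
    geometric interpolation between the two inside the transition band.
    """
    if f <= 0.0:
        return beta_out
    if f_low <= f <= f_high:
        return beta_in
    # t = 0 at the band edge, t = 1 at the far end of the transition band
    if f < f_low:
        t = math.log(f_low / f) / math.log(ratio)
    else:
        t = math.log(f / f_high) / math.log(ratio)
    t = min(t, 1.0)
    # Geometric interpolation: a straight line on log-log axes
    return beta_in * (beta_out / beta_in) ** t


# Regularization values at a few example frequencies (Hz)
profile = [beta_profile(f) for f in (20.0, 70.0, 1000.0, 14000.0, 20000.0)]
```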
Non-uniform spatial sampling with 32 spatial Diracs
Unfortunately, the largest regular polyhedron has 20 faces (icosahedron), so a 32-point sampling is slightly irregular
Creating synthetic SPS signals
Creating the SPS signals from a mono signal means “spatially panning” it across the 32 P-format channels
Figure: virtual source at $(Az_{in}, El_{in})$; m-th virtual microphone at $(Az_m, El_m)$
The gain for each channel is easy to compute: first the angle $\alpha_m$ between the direction of the sound source and the direction of the m-th virtual microphone is found with the Haversine formula:
$$\alpha_m = 2 \arcsin \sqrt{\sin^2\!\left(\frac{El_m - El_{in}}{2}\right) + \cos(El_{in}) \cos(El_m) \sin^2\!\left(\frac{Az_m - Az_{in}}{2}\right)}$$
Then the gain $Q_m$ is found by means of the 4th-order cardioid formula:
$$Q_m = \left[0.5 + 0.5 \cos(\alpha_m)\right]^4$$
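The panning procedure can be sketched in a few lines: the Haversine formula gives the angle between the source and each virtual microphone, and the 4th-order cardioid turns that angle into a gain. Angles are assumed to be in radians; the function names and the example directions are illustrative, and the actual 32 microphone directions are not listed in the slides.

```python
import math


def angle_between(az_in, el_in, az_m, el_m):
    """Angle between two directions (azimuth, elevation), Haversine formula."""
    s_el = math.sin((el_m - el_in) / 2.0) ** 2
    s_az = math.sin((az_m - az_in) / 2.0) ** 2
    a = s_el + math.cos(el_in) * math.cos(el_m) * s_az
    # Clamp to 1.0 to protect asin from rounding errors
    return 2.0 * math.asin(math.sqrt(min(1.0, a)))


def sps_gain(angle, order=4):
    """Cardioid of the given order: [0.5 + 0.5 cos(angle)]^order."""
    return (0.5 + 0.5 * math.cos(angle)) ** order


def pan_mono(sample, az_in, el_in, mic_dirs):
    """Spread one mono sample over all virtual microphone channels."""
    return [sample * sps_gain(angle_between(az_in, el_in, az, el))
            for az, el in mic_dirs]


# Three example directions only (front, left, back)
dirs = [(0.0, 0.0), (math.pi / 2, 0.0), (math.pi, 0.0)]
channels = pan_mono(1.0, 0.0, 0.0, dirs)
```

For a source coinciding with the first microphone, the gains are 1 at the front, 0.0625 at 90° and 0 at the back, which shows how quickly the 4th-order cardioid falls off.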
Processing the SPS signals
Basically, it is important to perform two operations on the whole sound field:
1. Rotate the whole scene around an arbitrary axis
2. “Stretch” the sound field, giving more emphasis to the sound coming from some directions and reducing the sound from other directions
Rotating the SPS signal
As with PCM in time domain, only “discrete” shifts are easy.
“Fractional” rotations require either SPS oversampling or going to spatial frequency domain (HOA)
The only simple “discrete” rotation is based on the permutation of the faces of a dodecahedron
Hence the “unit rotation step” is 72°, and just 6 rotation axes are available
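A discrete rotation is nothing more than a re-ordering of the channels. The sketch below finds the permutation induced by a rotation about the vertical axis and applies it. It uses a hypothetical 7-point layout (two poles plus a ring of five directions spaced 72° apart) instead of the real 32-point layout, whose coordinates are not given in the slides.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3D vector about the z axis by `angle` radians."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def rotation_permutation(dirs, angle, tol=1e-9):
    """perm[i] = index of the direction onto which direction i is rotated.

    Raises ValueError if the rotation does not map the layout onto itself,
    i.e. if it is not one of the "discrete" rotations.
    """
    perm = []
    for v in dirs:
        r = rotate_z(v, angle)
        matches = [j for j, w in enumerate(dirs)
                   if sum((a - b) ** 2 for a, b in zip(r, w)) < tol]
        if len(matches) != 1:
            raise ValueError("rotation does not map the layout onto itself")
        perm.append(matches[0])
    return perm


def rotate_channels(signals, perm):
    """Move the signal of channel i to channel perm[i]."""
    out = [None] * len(signals)
    for i, j in enumerate(perm):
        out[j] = signals[i]
    return out


# Hypothetical layout: north pole, ring of 5 directions, south pole
ring = [(math.cos(math.radians(72 * n)), math.sin(math.radians(72 * n)), 0.0)
        for n in range(5)]
layout = [(0.0, 0.0, 1.0)] + ring + [(0.0, 0.0, -1.0)]
perm = rotation_permutation(layout, math.radians(72))
rotated = rotate_channels(["a", "b", "c", "d", "e", "f", "g"], perm)
```

Any angle that is not a multiple of 72° raises an error here, which mirrors the statement that fractional rotations need oversampling or a detour through the spatial frequency domain.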
Stretching the SPS signal
In SPS it is trivial to boost the sound coming from some directions and reduce the sound from others: it is just a matter of adjusting the gain of the corresponding virtual microphones
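A minimal sketch of such a stretch follows. The gain law is purely an assumption made for illustration: a boost in dB that varies with the cosine of the angle from a chosen focus direction, so that the focus gets `+boost_db` and the opposite direction gets `-boost_db`. The slides do not prescribe any particular law.

```python
def stretch_gains(mic_dirs, focus, boost_db=6.0):
    """Per-channel gains emphasizing the `focus` direction.

    mic_dirs and focus are unit vectors (x, y, z). The dB gain follows the
    cosine of the angle to the focus direction (hypothetical gain law).
    """
    gains = []
    for v in mic_dirs:
        cos_angle = sum(a * b for a, b in zip(v, focus))
        gains.append(10.0 ** (boost_db * cos_angle / 20.0))
    return gains


def apply_gains(frame, gains):
    """Scale one multichannel sample frame, channel by channel."""
    return [g * s for g, s in zip(gains, frame)]


# Three example directions: front, left, back; focus on the front
dirs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
gains = stretch_gains(dirs, (1.0, 0.0, 0.0))
stretched = apply_gains([1.0, 1.0, 1.0], gains)
```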
Decoding the SPS signal
It is possible to derive a number of signals for feeding a loudspeaker rig by means of another “decoding matrix” of FIR filters
No decoding would be required if the rig were made of 32 loudspeakers located in the same directions as the SPS virtual microphones, all at the same distance from the listening spot
From 32 virtual microphones to 16 loudspeakers
Casa della Musica (Parma, ITALY): 16-loudspeakers playback system
Loudspeaker positions:
• Horizontal Ambisonics octagon
• Ambisonics 3D cube
• Standard Stereo
• Frontal Stereo-Dipole
• Rear Stereo Dipole
• Upper Stereo Dipole
Decoding the SPS signals to loudspeakers
Eigenmike™ → Matrix of 32x32 encoding FIR filters → 32 virtual microphone signals y (SPS = P-format) → Matrix of 32x16 decoding FIR filters → 16 speaker feeds s
$$s_r(t) = \sum_{i=1}^{32} y_i(t) * f_{i,r}(t)$$
The transfer functions of the sound system are measured
The Eigenmike™ is placed in the center of the loudspeaker rig, and the transfer functions [k] ($k_1, k_2, \ldots, k_R$) are measured from each loudspeaker to each of the 32 virtual microphones (SPS). The re-recorded virtual microphone signals are:
$$\{y_{out}\} = \{s\} * [k]$$
Then we impose that the resulting signals $\{y_{out}\}$ are identical to the virtual microphone signals $\{y\}$ recorded by the Eigenmike in the original room:
$$\{y\} * [f] * [k] = \{y\}$$
And consequently we solve for the unknown filters [f]
Theoretically “exact” solution
As for the encoding filter matrix, we employ Least-Squares with regularization:
$$[F]_{32 \times 16} = \frac{[K]^*_{32 \times 16} \cdot e^{-j \pi k}}{[K]^*_{32 \times 16} \cdot [K]_{16 \times 32} + \beta \cdot [I]_{32 \times 32}}$$
However in this case the system is NOT over-determined, as the number of measured transfer functions equals the number of filters to compute
Hence the resulting filter matrix has some troubles:
There is signal coming from every virtual microphone to every loudspeaker…
Let’s focus on the first 8 loudspeakers (horizontal ring)
The troubles are even more evident looking at the
“total virtual microphones” for the 8 loudspeakers on the horizontal plane:
“Modified” solution
To avoid these problems, a second set of decoding coefficients was computed, adding the constraint that each speaker feed is obtained from just 1, 2 or at most 3 virtual microphones
Figure: real loudspeakers and SPS virtual microphones
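The constraint of using at most three virtual microphones per loudspeaker can be illustrated with a sparse gain matrix. The slides do not say how the microphones were selected or weighted, so the rule below is an assumption: take the nearest microphones within a maximum angle and weight them equally. Plain gains are also used instead of the FIR filters of the real decoder.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def sparse_decoder(mic_dirs, spk_dirs, max_mics=3, max_angle=math.radians(40)):
    """gains[i][r]: weight of virtual microphone i in speaker feed r.

    Each speaker uses up to `max_mics` nearest microphones lying within
    `max_angle` (hypothetical rule), equally weighted and summing to 1.
    """
    gains = [[0.0] * len(spk_dirs) for _ in mic_dirs]
    for r, spk in enumerate(spk_dirs):
        ranked = sorted(range(len(mic_dirs)),
                        key=lambda i: angular_distance(mic_dirs[i], spk))
        chosen = [i for i in ranked[:max_mics]
                  if angular_distance(mic_dirs[i], spk) <= max_angle]
        if not chosen:
            chosen = ranked[:1]  # always keep at least the nearest microphone
        for i in chosen:
            gains[i][r] = 1.0 / len(chosen)
    return gains


def on_ring(deg):
    """Unit vector on the horizontal plane at the given azimuth in degrees."""
    return (math.cos(math.radians(deg)), math.sin(math.radians(deg)), 0.0)


# Toy layout: 8 virtual microphones every 45 degrees, 2 loudspeakers
mics = [on_ring(45 * n) for n in range(8)]
speakers = [on_ring(0.0), on_ring(22.5)]
gains = sparse_decoder(mics, speakers)
```

In this toy layout the first loudspeaker is fed by a single microphone and the second by the two microphones on either side of it, so every column of the matrix stays sparse.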
The resulting filter matrix is more “sensible”:
Each of the feeds for the first 8 loudspeakers (horizontal ring) get
The correct decoding is evident looking at the “total virtual microphones” for the 8 loudspeakers on the horizontal plane:
Alternative solutions
Two other decoding matrices were used for comparison:
Results of subjective listening tests
• Overall preference score (11 subjects)
• Spectral balance (9 subjects)
• Transient performance score (8 subjects)
• Overall preference score (10 subjects)
Conclusions
The most versatile method currently available for capturing a 3D acoustical scene is to employ a microphone array, and to derive a number of virtual microphones
Three processing methods have been developed and tested:
1. High Order Ambisonics (spherical harmonics functions)
2. Spatial PCM Sampling (spatial Dirac’s Delta functions)
3. Direct synthesis of discrete speaker feeds (3DVMS)
Each of the three methods has some advantages and disadvantages
In all cases it is possible to process and reproduce the same recording over a given playback system (loudspeaker array)
Currently the direct synthesis of speaker feeds (3DVMS) works best, but SPS, when employing a suitable decoding scheme, performed better than traditional in-phase 3rd-order Ambisonics, and came quite close to 3DVMS