FPGA Implementation of a Low Latency, Neural Network-Based Speech Enhancement Algorithm

(1)

Dipartimento di Ingegneria dell’Informazione Corso di Laurea Magistrale in Ingegneria Elettronica

FPGA Implementation of a Low Latency,

Neural Network-Based Speech Enhancement

Algorithm

Relatori:

Prof. Roberto Saletti

Candidato:

Niccol`

o Nicodemo

Sesto Appello

(2)

(3)

A te che leggi, se vorrai continuare Altre parole dovrai prima ascoltare.

I grazie fan presto a scomporsi nel vento Ma resta il ricordo di questo momento

Resta il sudore di chi ha dato tanto e, per questi miei anni, ha riso ed ha pianto

Per voi genitori, che avete imparato A nasconder le rughe e il dolore provato:

Un figlio che parte sa far lacrimare, Ma `e il gioco del mondo, e sapete giocare.

Giusto di giochi volevo parlare. Provetta aviatore, i miei li facevi volare

Gi`u dal balcone oppur dalle scale. Come potrei mai perdonare?

Eppure ogni giorno mi insegni un po’ il mondo, Ne sposti il confine e muovi il contorno

Di quello che so, o credevo sapere. Come luna d’estate dipinge le sere. A te più di tutti, sorella mia, è dedicata (Ma mai una macchinina ti sarà più prestata)

A tutti voi amici: devo a ognuno qualcosa, Ma abbiate pazienza, qui non ci state

(4)

(5)

Acknowledgements

I would like to express my sincere thanks to professor Roberto Saletti who kindly supported me and gave me the chance to work on this thesis and its related works and keep on learning, professionally and as a person, from this experience.

Thanks goes also to professor Tuomas Virtanen and to Konstantinos Drossos and Gaurav Naithani from University of Tampere for their kind assistance in work and in life during the months I spent in Finland and for the interest shown in my work and in its publication. Special thanks goes to my parents and my sister, to whom this work is dedicated, and to my old and new friends who made not only this experience, but the whole life of the past years worth to be lived and remembered.

(6)

(7)

Abstract

As one of the primary human needs, speeches are an ubiquitous presence in human life and their processing is one of the main information technology research topics. For that rea-son, well performing speech enhancement algorithms have been broadly and successfully used in many applications, such as hearing-aid devices or telecommunication systems, to improve intelligibility of degraded speech signals. Nonetheless, latest progresses made in Artificial Intelligence research field made speech enhancement algorithms appealing and useful for many more applications, such as speech recognition, patient monitoring and other health-related task.

Machine learning techniques involving the usage of neural network have been recently ap-plied to Speech Enhancement algorithms achieving state-of-the-art results. Nonetheless, high computational and resource requirements of neural networks hampered their usage on mobile devices, making therefore optimization and low-cost, low-power implementa-tion of those computaimplementa-tional systems an attractive research field.

The work herein presented analyses a speech enhancement algorithm based on a Fully-Connected Feed-Forward Neural Network, and proposes a feasible hardware implementa-tion of it. The considered network, developed and trained by the Audio Research Group of Tampere University, Finland, expressly targets low-latency and real-time applications. After a preliminary study of the algorithm and its theoretical bases, a system design is proposed and discussed thoroughly in all of its parts, with particular emphasis on neural network’s implementation and parameters storage. A specific quantization and encoding model of neural network weights was in fact developed and adopted in order to reduce memory footprint and relax bandwidth requirements. This model, based on a statistical analysis of networks weights and on the usage of companding technique, allowed us to drastically reduce memory consumption without appreciably affecting final results. Furthermore, a specific multiplier module based on integer arithmetic has been designed to

(8)

speed up neural network’s operations and still reducing computational resources needed. In the same spirit, a bit shift-based Coordinate Rotation Digital Computer (CORDIC) algo-rithm has been implemented.

All modules have been tested and validated individually by means of HDL simulations and field testing on a DE2 Development and Education board by Altera, which features a Cyclone II FPGA device.

(9)

List of Figures

1.1 Algorithm block diagram . . . 8

1.3 Example: 16-points FFT . . . 16

1.4 Visual improvements of zero padding . . . 17

1.5 Different windowing and their effects . . . 18

1.6 Overlap-Add method . . . 19

2.1 DE2 Board . . . 24

2.2 Cyclone II Logic Element . . . 25

2.3 Cyclone II EP2C20: Block Diagram . . . 26

3.1 Input buffer structure . . . 28

3.2 Input buffer module . . . 30

3.3 Input buffer simulation . . . 31

3.4 Input buffer simulation . . . 31

4.1 FFT Input and output timings . . . 35

4.2 FFT block diagram . . . 36

4.3 FFT test: input wave and fft’s output . . . 36

4.4 FFT Slowdown module: timing diagram . . . 37

4.5 FFT Slowdown module: address timing diagram . . . 37

4.6 IFFT Slowdown module: timing diagram . . . 38

4.7 IFFT Slowdown module: address timing diagram . . . 38

5.1 Weighted Sum . . . 42

5.2 CORDIC algorithm’s working principle . . . 43

5.3 CORDIC Algorithm . . . 44

(14)

5.5 CORDIC MP ESTIMATOR architecture . . . 46

5.6 Simulation Results . . . 47

5.7 CORDIC CMPLX REC workflow . . . 47

5.8 CORDIC CMPLX REC simulation . . . 48

5.9 CORDIC chain simulation . . . 48

6.1 4-layers FC Neural Network . . . 52

6.2 First layer neuron’s architecture . . . 54

6.3 Second layer neuron’s architecture . . . 55

6.4 Weightener’s architecture . . . 56

6.5 Weightener’s architecture - remainder recovery . . . 58

6.6 Weightener’s relative error . . . 59

6.7 Simulation results . . . 60

6.8 Layer 1 Gatherer’s simulation . . . 62

6.9 Gatherer’s architecture . . . 62

6.10 MMU and NNU connection . . . 64

6.11 Cache reader timing diagram . . . 64

6.12 Cache read simulation . . . 65

6.13 Layer 2’s cache read simulation . . . 65

6.14 Layer sync . . . 66

6.15 Layer sync simulation . . . 67

6.16 Neural network hierarchical structure . . . 69

6.17 M4K simple dual port block and buffer module . . . 69

6.18 M4K simple dual port’s timing . . . 70

6.19 Circular buffer:writing operations . . . 72

6.20 Circular buffer:reading operations . . . 72

6.21 Circular Buffer: buffers’ outputs . . . 74

6.22 Circular Buffer: module outputs, detailed view . . . 75

7.1 Weights on flash memory: timing issues . . . 78

7.2 SDRAM and core clocks . . . 78

7.3 Memory Management Unit . . . 80

(15)

7.5 FLASH READ . . . 81

7.6 FLASH CTRL . . . 82

7.7 Flash Wrapper . . . 82

7.8 SDRAM INIT CTRL . . . 83

7.9 Cross-domain synchronization . . . 84

7.10 SDRAM block diagram . . . 86

7.11 SDRAM CTRL state machines . . . 87

7.12 Excerpts of Flash . . . 89

7.13 External memory and LUT architecture . . . 92

7.14 Weights’ dynamic range partitioning . . . 94

7.15 Weights distribution and relative error plots . . . 98

7.16 maximum absolute errors . . . 99

7.17 STOI performance of quantization schemes: low bit-width . . . 100

7.18 STOI performance of quantization scheme: high bit-width . . . 101

8.1 First layer neuron’s architecture . . . 103

8.2 Working principle of buffer . . . 104

8.3 Working principle of implemented buffer . . . 106

8.4 Buffer simulation . . . 107

8.5 Buffer simulation: detailed view . . . 107

8.6 Output buffer: sinusoidal output . . . 108

8.7 Output buffer module . . . 109

8.8 Hann window and approximation . . . 111

8.9 Hann Window approximation . . . 112

8.10 Complementary solution . . . 113

8.11 Chosen solution . . . 113

8.12 OLA Testbench . . . 115

8.13 Input buffer and Hanning . . . 116

8.14 Overlap-Add output . . . 116

8.15 FFT’s magnitude comparison . . . 116

9.1 Audio Controller . . . 118

(16)

9.3 Audio Codec: Configuration Module . . . 120

10.1 Top View of the system . . . 122

10.2 TIMING CTRL behaviour . . . 122

10.3 FFT SAMPLE DROP . . . 123

10.4 NET OUT BUF SYNC behaviour . . . 124

11.1 FFT/IFFT and cordic testbench . . . 125

11.2 FFT input and output . . . 126

11.3 Cordic evaluations . . . 127

11.4 FFT input and IFFTs output . . . 128

11.5 Neural network evaluations . . . 129

12.1 Analysis and Synthesis Summary . . . 135

12.2 Analysis and Synthesis: Resource utilization . . . 135

12.3 Analysis and Synthesis: State Machine . . . 136

12.4 Fitting Summary report . . . 136

12.5 TimeQuest results for Slow model . . . 139

(17)

List of Tables

3.1 Buffer’s border address . . . 29

3.2 Input Buffer connections . . . 30

4.1 FFT Slowdown connections . . . 38

5.1 CORDIC MP ESTIMATOR Testbench results . . . 50

5.2 Chain testbench results . . . 50

6.1 Weightener: example values . . . 61

6.2 Buffer outputs . . . 73

6.3 CIRC BUF connections . . . 74

7.1 Flash timings . . . 80

7.2 Weights of the first neuron . . . 88

7.3 example of LUT . . . 95

7.4 Results for 2800 samples, SNR ∈{+5dB,0} . . . 99

8.1 Output Buffer ports . . . 109

8.2 approximated Hann function: only bit shifts . . . 114

8.3 approximated Hann function: bit shifts and sums . . . 114

8.4 approximated Hann function: chosen one . . . 114

(18)

(19)

Part I

Speech Enhancement:

general overview and algorithm description

(20)

(21)

Introduction

The presence of background noises in speeches is a problem for both humans and comput-ers since it weakens voices intelligibility for the former listencomput-ers and forces the latter to use more resources (e.g wider dynamic ranges) to record and properly analyze audio sam-ples. Finding new and better algorithms capable of enhancing the quality of speech and reducing the amount of background noise is therefore an important task in signal process-ing. Furthermore, since speaking is one of the more, if not the most, common and essen-tial human activities, the consequences of such improvements spread in several fields and application[2] as hearing aids, mobile phones and Voice-over-IP, teleconferencing systems and automatic speech recognition .

However, even if background noise reduction is the target in all of those aforemen-tioned fields, each one of them may have different quality requirements and standards. As an example, the communication between an airplane’s cockpit and a ground control re-quires a robust noise reduction which suppress everything that is not human voice, while using that same quiet background noise in a hearing aid will lead to unnatural and uncom-fortable results for the patient[12].

In Speech Enhancement, the filtering process have therefore to be adjusted from time to time in order to meet the requirements of the considered task. That’s why neural networks and, more generally, Artificial Intelligence (AI) methods are highly appealing for those tasks. Their ability to learn a certain behaviour and adapt themselves to cope with spe-cific problems can easily fulfill Speech Enhancement’s requests. Nowadays, in fact, neural network have been proved to perform well on those task, achieving also state-of-the art results.

Unfortunately, those devices requires both in the learning phase and in their final inference, expensive technological resources to work: huge memories, fast processors and computa-tional units, high power consumption. That’s why they are barely used on mobile devices

(22)

or embedded systems, which does not have instead such expensive resources and have to run on small battery power.

Nonetheless, the interest in having such high-performance algorithms used on those small devices is really high, pushing the research on optimizing them for low resource and low power usage. Our work walks this path, exploring some techniques for an hardware im-plementation of a neural network-based speech enhancement algorithms.

Using a specifically designed hardware for carrying out a specific task involves a high optimization which removes all software overhead, resulting thus in less power consump-tion, higher throughput and more relaxed resource requirements (e.g. slower memories can be used). Furthermore, using Field Programmable Gate Arrays (FPGAs) leaves an open door for more advantages, like those of an on-demand reconfiguration, already exploited in hardware accelerators.

A FPGA is an integrated circuit capable of being configured after its manufacturing, enabling thus the final user to implement its own digital design on a chip avoiding the full costs of designing and producing a specific integrated circuit for a single application (an Application-Specific Integrated Circuits- ASIC). Furthermore, since configuration in FPGA is often held in RAM memory, this last economic benefit is very appealing and exploited when it comes to prototyping: A design can be tested by configuring it into an FPGA, then modified and tested back by overwriting the previous FPGA configuration without need to produce and pay for another chip.

Such programmable devices are sometimes preferred to ASIC even in production stages be-cause, for lower production volumes, FPGA’s non redundant engineering costs are cheaper than ASIC’s, as shown in figure 1a.

FPGAs consist of an array of programmable logic blocks and a hierarchy of re-configurable interconnections between them. Logic gates and small memory elements are usually in-cluded in those logic blocks, and a proper configuration of them allows the logic block to perform complex combinational or sequential logic. Nowadays, a FPGA chip in a rela-tively inexpensive evaluation board can contain something like 110K of those logic block (e.g. Terasic’s DE10 Standard, with Cyclone V FPGA chip, current market price: 350$). Usually FPGAs, aside from those logic elements, contains embedded memory blocks, which can be configured as random access (RAM) or read only (ROM) memories, phase-locked

(23)

(a) NRE costs comparison, from anysilicon.com (b) Altera’s EP300, Image credit: Altera

loops and reserved path for clock signals and other useful features as analog comparators or analog-to-digital(ADC) and digital-to-analog(DAC) converters. Furthermore, evalua-tion boards surround and connect the FPGA chip with addievalua-tional devices such as Audio and Video CODECs, Flash and SDRAM memories, LCD screens, USB connectors, led, keys and switches which makes those boards really powerful and versatile.

FPGA industry sprouted from the hard-wired programmable logic based on programmable read-only memory (PROM) and programmable logic devices (PLD). First re-programmable logic devices appeared in 1984, under Altera banner: Logic blocks configuration in EP300 was stored on EPROM memory, which could be erased by an ultra-violet lamp lit on the die from a quartz-window in the package (figure 1b).

In 1985 Xilinx developed XC2064, the first commercially viable FPGA with programmable gates and programmable interconnections between them. This device had 64 configurable logic blocks (CLBs) based on two three-input lookup tables (LUTs) used to achieve the logic functions. During the years, some competitors appeared like Actel (now Microsemi), but still nowadays, Xilinx and Altera (which, acquired by Intel in 2015, changed name in Tera-sic) are two leading industries in the market with their FPGAs and the software suites for FPGA configuration (Quartus for Terasic and Vivado for Xilinx).

In this work, Altera’s DE2 Evaluation Board and Quartus II software were used to propose an hardware implementation of a neural-network-based, speech enhancement algorithm developed by the Audio Research Group of Tampere University, Finland. We focused on a feasible implementation of a feed forward neural network by using integer arithmetic and Multiply-Accumulate operations, aiming for low latency and low power applications

(24)

with an optimized resource usage. We explored thus the chance of a low bit encoding for network’s parameter, capable of relaxing memory requirements and allowing the usage of DE2’s Flash and SDRAM devices. Common signal processing topics were discussed and then hardware implementations of a Fast Fourier Transform algorithm and of a Coordinate Rotation Digital Computer algorithm were designed.

The work is structured in three parts: the first part, to which this chapter belongs, contains a general introduction to the work and its topics. Part two describes the proposed system in all its parts and relative simulation results, while part three reports the results of fitting the design in the DE2 board.

(25)

Chapter 1 Speech Enhancement Algorithm:

A study case

1.1 Algorithm description

The considered speech enhancement algorithm takes raw audio data as input and performs its enhancement task with an acceptable latency for real time applications. Workflow can be described as follows: Input audio data are continuously acquired with a sampling rate of 8KHz, split in 128-points frames and a 256-points Short Time Fourier Transform (STFT) is performed on a zero-padded version of the last audio frame acquired, xi. Magnitudes Mi

and phases φi of the STFT output values are calculated and stored. Magnitude values are

then stacked together with magnitudes coming from the previous 7 frames and all those magnitude frames Mi, Mi−1, ..., Mi−7are then fed to a neural network.

The network used is a two layer, feed-forward fully-connected neural network which has been trained to estimate, from the stacked magnitudes of 8 consecutive frames {Mi, ..., Mi−7},

the magnitudes ˜Miof a cleaned-up version of the last acquired STFT frame, where the

in-tensity of noise components from the original audio signal have been reduced. At this points, the magnitudes ˜Miestimated by the neural network are combined with respective

phase values φiretrieved from the former signal and the Inverse STFT is performed on the

resulting spectrum, obtaining a processed version ˜xiof the analyzed audio frame xi.

Finally, audio output is calculated, starting from the so-obtained audio frames by applying the Overlap and Add technique with a 50% frames-overlap.

(26)

Signal-to-Noise Ratio as SNR = Ps

Pn

where Pnis the power of the noise signal and Psthe

power of the clean signal, then:

x(t) = s(t) + n(t) ˜

x(t) = f (x(t)) SN R(˜x) > SN R(x)

(1.1)

Figure 1.1: Algorithm block diagram

The algorithm is thus divided in three different logical parts. Last two of them will be discussed more in details in the following sections:

• Audio Data In/Out and Buffering • FFT and IFFT

• Neural Network

1.2 Neural Network

A neural network (NN) is a computing system capable of perform a specific task and, above all, learn how to perform it by analyzing and studying some carried out examples of it. Such devices take their inspiration and their name from neurons and their structured networks which constitute animal brains. Anyway, this inspiration is not about structure only, but about behaviour too: as human and animal brains are capable of learning by modifying the synapses, the connections between their neurons, neural networks learn too by modifying

(27)

the connections between the artificial neurons that forms the network itself.

Neural networks are in fact based on a collection of connected units, the artificial neu-rons, where each connection carries a real number and where each unit receives several in-puts, outputting some non-linear function of the sum of inputs. Those computing units are arranged in neuron layers.Neurons of a layer connect only to neurons of the immediately preceding and immediately following layers. Several layers can be found in a network, but at least an input layer and an output layer are needed.

For a single neuron, each and every input has a specific weight value, so that, as an ex-ample, first input (x0) will have its own proper weight w0. second input (x1) will have its

weight w1 and so on. Thus, transfer function for a neuron will be the following:

y = fq + N −1 X j=0 xj· wj (1.2) Where f is the non-linear activation function and q a bias.

The learning process of a neural network consist therefore in adjusting the values of its neurons’ weights so that the final output of the whole network matches, for a specific input, the desired output. Backpropagation method is often used to adjust weights and compensate for error found in the final output. Such method calculates the gradient of a properly-chosen cost function with respect to the weights, then update their values via stochastic gradient descent.

There are different connection patterns between neural network’s layers, thus different types of neural network. As an example, a Fully Connected Neural Network has every neuron in one layer connected to every neuron in the next layer. There can be instead some pooling between two layer, so that a group of neurons in one layer connects to a single neuron in the next layer reducing thereby the number of neurons in the second layer.

Feed ForwardNeural Networks are networks which does not allow feedback loops between one or several layers, while Recurrent networks allow connection between neurons in the same or previous layers

There are three major learning paradigms which correspond each one to a particular learning task:

(28)

of paired inputs and desired outputs are used in training the network and the cost function aims to eliminate the incorrect deductions (e.g. a commonly used cost is the mean-squared error)

• Unsupervised learning: aims to find previously unknown patterns in data set without pre-existing labels. Unlabelled input data and a task-dependent cost function are given together to the network.

• Reinforcement learning: aims to teach a network to perform actions that minimize long-term costs.

Despite the often exaggerated and misleading futuristic aura that surrounds, in the common opinion, Artificial Intelligence (AI) and Machine Learning (ML) topics, Neural Networks’s idea dates back to 40s and 50s. The subject was opend by Warren McCulloch and Walter Pitts in 1943, who created a computational model for neural networks [24], and carried on by D.O.Hebb and F.Rosenblatt who created the learning hypothesis known as Hebbian learning and the perceptron algorithm for supervised learning of binary classifiers, respectively.

Today’s technology with its continuous improvements and its ever-growing performances, enabled a deeper research which led to a true boom in interest in those topics. Neural net-works in particular, have been proved to perform well on a huge range of applications such as computer vision, speech recognition, machine translation, social network filtering, video games and medical diagnosis.

In our case of study, Neural Network is a Feed-Forward Fully-connected network with only two layers (input and output layers: 256 and 129 neurons, respectively). Supervised learn-ing with speeches from Wall Street Journal’s WSJ0 dataset and noises from TUT Acoustic Scenes 2016 dataset1_{has been used to train the network with Pytorch python library[31].}

Weights of the network have been quantized and encoded as described in section 7.6.1, in order to allow a proper storage in DE2’s Flash and SDRAM on board memories. Flash was, in fact, used as a non volatile memory to permanently store parameters but, since its speed grade constituted a bottleneck of the system, SDRAM was used to load parameters from Flash on device startup, and then supply them, when needed, to neural network as

1_{TUT Acoustic Scenes dataset[26] consists of sound recordings from 15 real-world environments, (e.g.,}

(29)

explained in chapter 7

1.3 Fourier Transform

When it comes to signal processing, in each and every engineering field, Fourier Transform is one of the most powerful and used mathematical tools. By using it, signals can be bro-ken down in terms of sinusoidal components, which are used in building up a consequent frequency domainrepresentation of the former signal. This decomposition is really useful because it simplifies analysis and processing of signals.

When a sinusoidal input is fed to a Linear-Time-Invariant (LTI) system, output signal will be a sinusoid of the same frequency but of different amplitude and phase. Thus when input signal is a sum of different sinusoids, the output of our LTI system will be the sum of the outputs obtained by feeding the system with those sinusoidal components one at a time. With an input signal x(t) = x1(t) + x2(t)where x1and x2are both sinusoids, a LTI system

T will output the following:

T (x) = T [a1x1(t) + a2x2(t)] = a1T [x1(t)] + a2T [x2(t)] = a1y1(t) + a2y2(t)

In 1672, Isaac Newton used the term spectrum to describe the bands of colours produced by a prism hit by light[17]. As rainbow colours are spawn from sunlight, in the same way individual sinusoidal components are spawn from our former signal.

Newton’s experiment consists of two prism, the second one placed upside down, sequen-tially crossed by a beam of white light. After passing the first one, light is decomposed and rainbow colours can be observed by inserting a split between the prism, but after passing the second prism, colours are mixed again and the original white light is obtained. First prism analyze light while second one mix its component back. Fourier Transform will give us the two prism to travel back and forth between our input signal and its frequency spectrum.

1.3.1 Mathematical Background

Before we begin, a rightful premise is needed.

Listing and properly describing all Fourier Transform properties is not a brief work: A so widely used and well known method requires a rigorous discussion which would un-avoidably fall outside the scope of this work, so we will only and quickly focus on some

(30)

(a) (b)

Figure 1.2: Newton’s experiment representation (1.2a) and its impact in contemporary art and popular culture: (1.2b)Cover art for Pink Floyd’s The Dark Side Of The Moon

key points, useful in better understanding the considered algorithm and its implementa-tion. For a more detailed discussion of Fourier Transform, reference is made to specialized courses and books as Luise,M. and Vitetta, G. Teoria dei segnali[21] (in italian) or Manolakis, D. Digital Signal Processing[17] and Applied digital signal processing[23].

Being said that, we will start by mathematically describing our two ”prisms” as a couple of equations which allow us to break down and then mix up again a continuous-time, pe-riodic signal x(t). Those equations are named, respectively, Fourier Synthesis Equation and Fourier Analysis Equationand are the following:

x(t) = ∞ X k=−∞ ckejkΩ0t ck= 1 T0 Z T0 x(t)e−jkΩ0tdt (1.3)

Where ckare known as the Fourier series coefficients. Both equation uses Euler’s

iden-tity to express sinusoids2_{and the angular frequency variable ω = 2πf. Notice that each}

term in the analysis equation is periodic with T0 =

2π Ω0

. Thus, if the infinite summation converges for all t, then its sum is also periodic with period T0, the fundamental period,

and with angular frequency of Ω0. The k − th addend of the sum is thus periodic of

Ω0

k with angular frequency kΩ0.

2_{Euler’s identity, e}± _{= cos φ ± j sin φ}_{, allows to express a sinusoidal signal in terms of complex}

expo-nentials with the same frequency, as follows: A cos(Ω0t + θ) = A 2e jθ_ejΩ0t₊A 2e −jθ_e−jΩ0t

(31)

Plotting x(t) over time (t) will provide us a description of the signal in the time-domain, while plotting ckas a function of frequency F = kF0 = k

Ω0

2π will describe the signal in the frequency domain. Figure 1.3a decomposes a wave into its sinusoidal components while figure 1.3b plots ck normalized magnitudes as a function of k. Notice how the different

components are clearly distinguishable.

Equations listed above are valid for continuous-time periodic signals and are thus ascribed

0 1 2 3 4 5 6 1 0 1 0 1 2 3 4 5 6 1.0 0.5 0.0 0.5 1.0 (a) (b)

to Continuous-Time Fourier Series (CTFS). For aperiodic signals Continuous-Time Fourier Transformis used, where synthesis and analysis equations become those following:

x(t) = Z +∞ −∞ X(j2πF )ej2πF tdF X(j2πF ) = Z +∞ −∞ x(t)e−j2πF tdt (1.4)

or, more concisely

x(t) = F−1(X(j2πF )) X(j2πF ) = F (x(t))

(1.5)

Where X(j2πF ) is for aperiodic signals what c(kF0)is for periodic signals, but while

the latter is a frequency-discrete function (i.e. is defined only for certain frequency val-ues multiples of F0), the first is defined everywhere on frequency domain and can thus be

properly defined as a spectrum. Periodic signals have indeed discrete spectra, with lines at harmonically related frequencies, while aperiodic signals have continuous spectra since almost all frequencies in a continuous interval are not harmonically related.

There’s of course a condition for the convergence of CTFT and CTFS. In Fourier Trans-form’s case, if x(t) has a finite number of minima, maxima and discontinuities in any finite

(32)

interval, and it is absolutely integrable, i.e Z +∞

−∞

|x(t)|dt < ∞

the signal ˆx(t) = F−_{1[X(j2πF )]} _{converges to x(t) for any t where x(t) is continuous}

while, at a discontinuity ˆx(t) is equal to the average of the values on either side of the discontinuity.

Unfortunately, both CTFS and CTFT are seldom usable in engineering applications, since almost all signal are nowadays digital signals, which means time and discrete-valued. New tools in Fourier’s toolbox are thus needed: Discrete-Time Fourier Series (DTFS) and Transform (DTFT).

We can consider a digital signal as a discrete sequence x[n] of values and describe DTFS fol-lowing the same approach used in CTFS: If x[n] is a linear combination of N harmonically related complex exponentials

x[n] = N −1 X k=0 ckej 2π Nkn

x[n]will be periodic with period of N while Fourier synthesis and Analysis equations will be the following: x[n] = N −1 X k=0 ckej 2π Nkn ck= 1 N N −1 X n=0 x[n]e−j2πNkn (1.6)

Is worth to quickly mention here the Parseval’s relation, which states that, in the case of DTFT, the average power in one period of x[n] can be expressed in terms of the Fourier series coefficients: Pav = 1 N N −1 X n=0 |x[n]|2 ₌ N −1 X n=0 |ck|2

Thus, values of |ck|2 provides the portion of the average power of x[n] contributed by its

k − thharmonic component.

Another important concept to point out is that the analysis and synthesis equations for DTFS, unlike those for CTFS and CTFT, involve the computation of a finite number of items using finite summations. Thus, DTFS can be exactly evaluated by numerical computation, not just approximated.

(33)

then Re(x[n]) = x[n] and Im(x[n]) = 0, then the following property will be held: X∗(ejω) = X(e−jω)

The passage from periodic signal to aperiodic signals, i.e. from DTFS to DTFT, leads to the following equations x[n] = 1 2π Z 2π X(ejω)ejωndω X(ejω) = ∞ X −∞ x[n]e−jωn (1.7)

where angular frequency is defined as ω = 2π n

Notice that if x[n] has finite duration, X(ejω₎_{can be computed at any frequency ω} k, but

the exact computation of the inverse DTFT,that is the synthesis equation, is not possible because it requires integration. However, what just described provides the basis for com-putational Fourier analysis and for defining a new transform, known as Discrete Fourier Transform(DFT).

Given N consecutive samples x[n] of a periodic or aperiodic sequence, analysis and syn-thesis equations for DFT are

X[k] = N −1 X n=0 x[n]W_Nkn x[n] = 1 N N −1 X k=0 X[k]W_N−kn (1.8)

where WN, known as twiddle factor, is defined as

WN

. = e−j2πN

It’s clear that DFT is discrete both in time and frequency and that for a N-points input se-quence it will output a N-points discrete spectrum with ωk= 2π_Nkfrequency points where

0 ≤ k ≤ N − 1. Furthermore, the N values of spectrum can be considered as a frequency domain sampling of the X(ejω₎_{spectrum obtained with DTFT. This is really helpful when}

it comes to understand zero padding technique.

Given a N finite-length sequence x[n] and its DFT X[k] = FDF T(x[n]), 0 ≤ k ≤ N − 1,

as soon as Nyquist condition is held3_{, the former sequence can be retrieved with Inverse}

3_{In Shannon’s original statement[35]:”If a function x(t) contains no frequencies higher than B hertz, it is}

completely determined by giving its ordinates at a series of points spaced _2B1 seconds apart. Thus for a given sample rate fsperfect reconstruction is guaranteed possible for a bandlimit B < f₂s”

(34)

DFT so that x[n] = F−1

DF T(X[k]). However, X[k] plot will not be so informative and

under-standable if N is a low value. Figure1.3 shows X[k] magnitude plot and the reconstructed signal F−1

DF T(X[k]), for a sampled sinusoid with N = 16.

0 2 4 6 8 10 12 14 0.0 0.2 0.4 0.6 0.8 1.0 16 points fft 0 2 4 6 8 10 12 14 1.0 0.5 0.0 0.5 1.0 reconstructed signal

Figure 1.3: Example: 16-points FFT

A solution for getting a better visualization is using zero-padding technique: Zeroes are added in tail and head positions of x[n] so that a M-length sequence is obtained, where M > N, then the M-points frequency spectrum is calculated.

xzp[m] =              0, for m ≤ L x[m − L], for L < m ≤ L + N 0, for m ≥ L + N (1.9)

Figure1.4a shows the former and the zero padded sequences for a sinusoid with N = 16, M = 128, L = 56, while figure1.4b shows their DFTs, where the spectrum obtained by using 128 points uses a finer grid of equispaced frequencies and is therefore clearly more understandable and meaningful to human eye.

Notice that X[k] sequences follows this order: • k = 0 DC frequency component

• 0 < k < M

2 positive frequencies components

• k = M

2 Nyquist frequency component. Common to both positive and negative

(35)

0 2 4 6 8 10 12 14 1.0

0.5 0.0 0.5

1.0 reconstructed signal (16 points)

0 20 40 60 80 100 120 1.0

0.5 0.0 0.5

1.0 reconstructed signal (zero-padded 128 points)

(a) 0 2 4 6 8 10 12 14 0.0 0.2 0.4 0.6 0.8 1.0 16 points fft 0 20 40 60 80 100 120 0.0 0.2 0.4 0.6 0.8 1.0 128 points fft (b)

Figure 1.4: Visual improvements of zero padding

• M

2 < k ≤ M − 1negative frequencies components, inverse order

It is important to point out that, by using zero-padding, no additional information can be exploited by the processing algorithm. Zero padding allows a better representation for display purposes, but spectral resolution is not increased at all.

Furthermore, there’s a zero-padding’s effect that deserves to be shown: Our input signal is a sampled sinusoid and therefore a monochromatic signal which should have only two non-zero X[k] values, one for positive frequencies and one for negatives. While this is somehow proved in 16-points DFT, it’s not true anymore in 128-points DFT, where those non zero values seems to be spread among near frequencies (spectral spreading or smear-ing). Furthermore we can observe that the spectrum is nonzero everywhere, except for those frequencies which belong both to X[k] = FDF T(x[n])and Xzp[k] = FDF T(xzp[n])

spectra. This last phenomenon is known as spectral leakage.

Both spectral leakage and spectral spreading are related to the windowing operation per-formed on the input sequence (in our previous example a rectangular windowing) as can be seen in figures 1.5a and 1.5b. We won’t discuss further about those issues, since our point here is to provide only some basic information about Fourier Transform and zero padding. For a detailed study of those problems we refer to the previously mentioned books [23] [17].

(36)

0 20 40 60 80 100 120 1.0 0.5 0.0 0.5 1.0 rectangular window 0 20 40 60 80 100 120 0.5 0.0 0.5 1.0 Hann windowed (a) 0 20 40 60 80 100 120 0.0 0.2 0.4 0.6 0.8 1.0 rectangular window 0 20 40 60 80 100 120 0.0 0.2 0.4 0.6 0.8 1.0 Hann window (b)

Figure 1.5: Different windowing and their effects

1.3.2 Algorithms

Unfortunately, computing DFT directly from its definition is often too slow to be practi-cal, that’s why Fast Fourier Transform (FFT) algorithms had been developed. An FFT can rapidly perform Fourier analysis with a reduced computing complexity (O(N log N)) in-stead of DFT’s (O(N2₎_).

Seeds of an FFT algorithm can be traced to Gauss’s unpublished works[13], but the first proper algorithm was published in 1965 by Cooley and Tukey[5]. Further algorithms came with time and their usage spread widely in all scientific fields, up to being included in Top 10 Algorithms of 20th Century by the IEEE magazine Computing in Science & Engineer-ing[16]

1.3.3 Short-Time Fourier Transform

Since our application case aims to real-time processing, Fourier Transform can’t be applied at once to the whole input signal. Low latency requirements, in fact, force us to split input signal into a sequence of shorter segments and then apply the Fourier Transform to those local sections. Thus revealing the spectrum of each one of those and how it changes over time, whereas standard Fourier Transform provides spectra’s average over the entire signal. Clearly, the shorter is the segment, the lower is the latency.

(37)

can be expressed as follows: STFT{x[n]}(m, ω)=. ∞ X n=−∞ x[n] · w[n − m]e−jωn

Where x[n] is the signal, w[n] a window function and m input frame’s index.

Input segments, in fact, are properly windowed and often overlap each other in order to reduce artifacts at the boundary.

The most widely accepted way of inverting the STFT is by using the overlap-add (OLA) method, an efficient way to evaluate the discrete convolution of a long signal with a fi-nite impulse response(FIR) filter h[n]. The concept is to divide the problem into multiple convolutions of h[n] with short segments of x[n] and then write y[n] as a sum of those convolutions so that: y[n] = x[n] ∗ h[n]=. ∞ X m=−∞ h[m] · x[n − m] = M X m=1 h[m] · x[n − m]

Where h[m] = 0 for m /∈ [1, M]. Therefore in OLA method, reconstructed signal is ob-tained by summing the IFFTs of the overlapping FFT segments. Figure 1.6 sketches the idea

(38)

(39)

Part II

(40)

(41)

Chapter 2 DE2: Board Overview

DE2 is a Development and Education board produced by Altera (now Terasic) from 2005 up to 2015[11]. It provides several features aimed at multimedia project development such as: 24-bit Audio CODEC (with line-in, line-out and mic-in), VGA DAC with VGA out connector, TV Decoder (NTSC/PAL) and TV in connector, IrDA transceiver and 10/100 Ethernet Controller. Besides this, it provides generic I/O peripherals too such as: 8MB SDRAM, 512KB SRAM, 4MB Flash, PS/2 mouse/keyboard connector, RS-232 Transceiver, Push-buttons and DPDT switches, LEDs and a LCD module. The real core of the board lies in the Altera Cyclone II 2C35 FPGA, which can be configured to proper interconnect and handle the aforementioned devices and IOs. Altera USB Blaster and EPCS16 Serial Config-uration device are used for configuring the FPGA, for programming and user API control. User can thus easily implement any system design which uses board’s features via Quartus II software. Cyclone II 2C35 FPGA is manufactured using 90-nm process and provides

• 33.216 Logic Elements

• 105 M4K RAM blocks (for a total of 483.840 RAM bits) • 35 embedded multipliers

• 4 PLL

• 475 user I/O pins

As stated in Cyclone II handbook[6], a Logic Element (LE) is the smallest unit of logic in Cyclone II architectureand it provides advanced features with efficient logic utilization. In LEs, as shown in 2.2, functions are generated via a four-input Look-Up Table(LUT) and

(42)

Figure 2.1: DE2 Board

synchronous logic enabled via programmable registers, which can be selected or bypassed in each LE by using proper configuration pins. Furthermore, registers can be configured for D, T, JK or SR operations. When registers are bypassed, LUT output can directly drive LE’s output, generating thus combinational, asynchronous logic.

Different routing resources are available on the device and each LE has three independent outputs, allowing thus the LUT to drive one output while the register drives another out-puts. A single LE can use its LUT and its register for two unrelated functions, improving device utilization. This feature is called register packing Another special packing mode al-lows the register output to feed back into the LUT of the same LE, while an additional routing output allows LEs to cascade their registers together.

LE are gathered in groups of 16, forming thus Logic Array Blocks, interconnected and arranged in a matrix structure to form the Cyclone II device.

Embedded memory structures are used to address the on-chip memory needs of FPGA designs. Those structures consists of columns of M4K memory blocks that can be config-ured to provide various memory functions such as RAM, first-in first-out(FIFO) buffers and ROM. Each block features 4.069 memory bits, plus 512 parity bits for a total of 4.608 bits per block. M4K block can operate up to 250MHz and their memory bits can be arranged in several configurations, such as:

(43)

Figure 2.2: Cyclone II Logic Element

• 4K × 1 • 2K × 2 • 1K × 4

• 512 × 8 or 512 × 9 (using parity bits as memory bits too) • 256 × 16 or 256 × 18 (idem)

• 128 × 32 or 128 × 36 (idem)

Power-up condition and register clear affects only the output register of the M4k block, so that at power up and after a reset, output is cleared but not the effective value of the memory cell. The M4K memory blocks support several modes of operations such as Sin-gle port, Simple dual-port, Shift register and so on. Simple Dual-Port memory blocks will be inferred for some buffer inside neural network module, so a proper description of it is deferred to Buffers section in Neural Network chapter (sec. 6.8).

Finally, each Cyclone II device has one to three columns of embedded multipliers that im-plement multiplication functions, and each embedded multiplier can perform both signed and unsigned multiplication and be configured as a single 18bit × 18bit multiplier or two 9bit × 9bitmultiplier. Beside this, Altera provides the user with IP cores, available as Quar-tus II megafunction, to instantiate and configure the multipliers.

(44)

Figure 2.3 shows as example a top view of a Cyclone II EP2C20 device, where its structure and main features are depicted.

DE2 Board, in conclusion, is not anymore a state-of-the-art devices. Nowadays, in fact,

Figure 2.3: Cyclone II EP2C20: Block Diagram

boards with newer FPGA Devices with mode LE can be found for a similar original market price(e.g DE2-115, which features a Cyclone IV FPGA with 115 LEs) and, sometimes, with additional processors (e.g. DE10 Standard with Cyclone V, 110K LE, and a dual-core Cortex A9 on board). Nonetheless its numerous features and the consequent versatility makes it still well usable for designing complex Digital Systems.

(45)

Chapter 3 Input Buffer

3.1 Working Principle

A proper buffer is needed as an interface between the two clock domains: audio clock (8KHz) and core clock (50MHz). Input Buffer module does that by taking the 8KHz in-coming 16-bit audio samples and forward them to the processing core in N-values burst at the core clock speed. Furthermore, considered algorithm [27] requires a 50% overlapping of consecutive frames fed to FFT, their windowing and zero-padding.

Input data are acquired continuously and then sliced into N-values, 50% overlapping frames. Each single frame, once acquired, is then weightened, zero padded and sent to fft.

Weighting is done by using an approximated Hann window so that

ˆ

x = x[n] · ˆwHann[n]

A more detailed explanation of this filtering is presented in section 8.2. Frame overlapping can be achieved by building the module as a three N

2-values-sections

circular buffer, where two sections are read while the other one is being written as depicted in figure 3.1.

A first condition to held is that no write-during-read errors must occurs, that is writing operation takes longer than reading operation.

N

2 · Taudio clk > N · Tcore clk

Since audio data acquisition works without stop, this condition ensures that a whole frame can be forwarded to the processing unit without errors due to new input values overwriting

(46)

the ones being transmitted. Considering N = 128, Taudio clk = 125µsand Tcore clk = 20ns

the inequality is easily proved valid as

8ms > 2.56µs

However, a further condition must be held for the system to work correctly: The real time condition

N

2 · Taudio clk > Tp (3.1)

Where Tpis the time required by the whole system to perform its work. A real time design

must comply with this condition, otherwise the system will be too slow and will partially drop the input data resulting in incorrect output.

Figure 3.1: Input buffer structure

3.2 Implementation

Implementation consists of a main module, inputBuffer, which acquires and stores audio samples every Taudio clock and a submodule, bufferSerializer which, when a new frame is

completely stored, forwards it to the following modules at core clock speed.

Here, as anywhere needed in this work, slower clocks have been achieved by feeding mod-ules with the highest frequency clock (50MHz core clock) and using enable logic to slow down working operations, avoiding clock gating.

inputBuffer continuously stores audio samples in a M = 3 · N

2 buffer implemented using M 4K embedded memory blocks1 and, when storing address reaches a section’s border,

(47)

a start serializer signal is triggered and sent to bufferSerializer. Thus every N

2 new audio

samples, a full frame is produced.

The initial case where border address is reached but not enough data have been stored yet is treated as a special case and no start signal is issued.

ADDR0 0

ADDR1 127 ADDR2 191

Table 3.1: Buffer’s border address

A status variable input writing section is used to teach bufferSerializer about which sec-tions must be outputted during reading operasec-tions. This variable changes, together with start serializer, when a border address is reached in writing operations and is then for-warded to bufferSerializer which samples it and sweep reading addresses according to it. When bufferSerializer receives start signal, an ack signal is sent back to inputBuffer and the zero-padded, windowed frame is output. Transmission works as follows

• fft start signal and first N

4 = 64zeroes are outputted

• windowed N values frame are outputted • last N

4 = 64zeroes are outputted

two different counters, one for zero-padded values and one for frame values, and a status variable zero pad are used to proper synchronize the output. Data are read from buffer and address is retrieved by using a counter updated according to input writing section and zero pad status variables. Windowing is performed on buffer’s output data using a Hann function’s approximation which uses only bit shifts and a sum, as better explained in sec-tion 8.2. Processed data are then output together with their in-frame address2_{. In order}

to comply with M4K blocks’ timing requirements, data are registered and outputted 2Tck

after the corresponding read address had been produced to buffer.

Notice that, with this implementation, the zero padding causes the write-during-read tim-ing condition to change

N

2 · Taudio clk > 2N · Tcore clk

(48)

But it’s still held as

8ms > 5.12µs

In order to reduce this time waste, a further implementation could use an earlier start signal to trigger zero-padding output while the frame is still being acquired.

Figure 3.2 depicts inputBuffer and buffer serializer working flow and table 3.2 reports whole module connections.

Figure 3.2: Input buffer module

dir port description

input clk core module clock

input reset module reset

input enable module enable

input clk audio en sample enable for audio input

input out en output enable

output[7:0] addr out address of current output output[15:0] data out current output data output fft start start signal for fft output output valid output valid

(49)

3.3 Simulation

Simulation results are shown in figures 3.3 and 3.4, which clearly show the different speed rate of input and output data. Audio data are continuously fed to data in port and are output via data out in fast burst ( at Tcore clock) only when a new frame is fully available.

Notice thus the periodicity of data out and fft start in 3.3.

Figure 3.3: Input buffer simulation

Fig 3.4 reports more in details core-speed outputs. Notice on data out the Hann-windowed and zero padded version of raw data carrieddata wire.

(50)

(51)

Chapter 4 FFT and IFFT

4.1 FFT and IFFT modules

FFT and IFFT modules used in this work are based on the 256 points pipelined FFT/IFFT IP by Unicore system as available on opencores.org[20]

FFT256 module performs a 256 complex points FFT using up to 16 bits data and coefficient widths. It features:

• Forward and inverse FFT

• radix-16 FFT algorithm

• Pipelined mode operation: each result is outputted in one clock cycle. Latency equal to 580 or 839 clock cycles (11.6µs or 16.78µs @ 50MHz) according to the output order used.

• Overflow detectors of intermediate and resulting data

• Two normalizing shifter stages

• Input and output data represented by nb ≤ 16and nb+ 4bit two-complement

com-plex integers, respectively.

• Twiddle coefficients are nw−bit wide numbers. nw ≤ 16

(52)

FFT algorithm is based on Divide et Impera principle: the former DFT is split into several smaller DFTs, which are calculated recursively and their solutions combined and re-sorted to build up the overall solution.

This module split former DFT into 16-points DFTs and uses Winograd algorithm to calculate them, obtaining thus a small number of additions and multiplications.

The main error source is the result truncation after multiplication to the twiddle factor, but the maximum result error is keep less than the 1 least significant bit of the input data. Core signals are represented in the following table

signal dir description

CLK input global clock

RST input global reset

START input FFT start

ED input input data and operation enable strobe DR[nb-1:0] input input data real sample

DI[nb-1:0] input input data imaginary sample SHIFT input Shift left code

RDY output Result ready strobe DOR[nb+3:0] output Output data, Real sample DIR[nb+3:0] output Output data, Imaginary sample ADDR[7:0] output Output address

OVF1 output Overflow flag OVF2 output Overflow flag

4.1.1 Workflow

Input data array starts to be inputted after the falling edge of the ST ART signal and are sampled at each rising edge of the clock when ED is high. ED can thus be used to control the throughput of the processor, as will be done in section 4.2. Result samples start to be outputted after the RDY signal, together with their address on ADDR port.

Input and output timings are depicted in figures 4.1a and 4.1b (taken from module’s user manual)

(53)

(a)

(b)

Figure 4.1: FFT Input and output timings

• U BUF1 and U BUF2 are data buffer, 512-complex values - 2 port synchronous RAM • U FFT1 and U FFT2 radix-16 and radix-8, respectively, DFT calculation units

• U NORM1 and U NORM2 normalizing units

• U MPU complex multiplier unit. Uses a twiddle factor ROM

• CNORM units adjust the proper signal bandwidth to prevent bit width increase in FFT computation

4.1.2 Simulation

Module had been tested and simulation results are reported here.

Figure 4.3a shows the input waveform used while figure 4.3b shows the comparison be-tween magnitude of fft obtained by the module under consideration and a the one obtained with a python script using numpy fft function[28]. Both magnitudes were normalized.

(54)

Figure 4.2: FFT block diagram 0 50 100 150 200 250 4000 2000 0 2000 4000 (a) 0 50 100 150 200 250 0.0 0.2 0.4 0.6 0.8 1.0 HW fft Python fft (b)

Figure 4.3: FFT test: input wave and fft’s output

4.2 FFT Slowdown

In order to correctly feed the output to the next module without using a buffering stage between, FFT needs to be slowed down.

CORDIC MP ESTIMATOR, in fact, requires 11 Tck cycles to calculate the magnitude and

the phase of the input complex number, so that a new input can be accepted only after nTck from the previous one, with n > 11. ED pin of FFT module offers the chance to do

that by enabling or disabling the module, so that during input acquisition and fft calculus operations ED can be set high (enable), allowing the module to work at maximum speed. During output operations ED can be set high for a single clock cycle when a new output can be accepted and then set back to zero for n clock cycles. This behaviour is obtained by driving ED with FFT SLOWDOWN module.

In FFT SLOWDOWN a finite state machine and a counter are used in order to identify the different status of FFT module and determine, when needed, ED’s value. Start signal

(55)

is the same signal that triggers FFT module and set FFT SLOWDOWN from IDLE state (state = 0) to a WAITING state (state = 2). When output is ready, FFT module sends an output ready signal to FFT SLOWDOWN (fft rdy). SLOW STATE (state = 3) is set as the new state and the periodic, slowed down enable signal (ED out) is outputted and fed back to FFT module. In this last state, two counter registers are used: cnt which counts the number of samples output by FFT, and delay cnt which counts the clock cycles and, when its value equals DELAY parameter, reset itself and set high ED out for a clock cycle. In our case DELAY = 11 and delay cnt is a 4-bit counter.

FFT output’s addresses are sampled and delayed so that, when CORDIC MP ESTIMATOR output is ready, FFT’s addr out holds the address related of the current magnitude output.

Figure 4.4 shows the timing diagram for the module: input ports ED out and fft start are depicted, together with outputs ED out, start cordic and the internal register contain-ing the current state. Figure 4.5 shows instead the synchronization between FFT SLOWDOWN addr outport and CORDIC MP ESTIMATOR output. Notice that addr in, start cordic, addr out belong to FFT SLOWDOWN module while the other belong to cordic estimator module. Table 4.1 shows connections of the module.

Figure 4.4: FFT Slowdown module: timing diagram

Figure 4.5: FFT Slowdown module: address timing diagram

4.3 IFFT Slowdown

With a very small amount of imagination, IFFT SLOWDOWN acts as an enable driver for IFFT module just as FFT SLOWDOWN does for FFT. Their behaviour is similar but reversed,

(56)

in clk global clock

in reset global reset

in start FFT start signal

in fft rdy FFT output ready signal

out start cordic enable sampling for cordic module out ED out enable for FFT module

in addr in[7:0] FFT output’s address

out addr out[7:0] FFT output’s address forwarding parameter DELAY Counter value for ED generation

Table 4.1: FFT Slowdown connections

since IFFT SLOWDOWN slows down the module in the data acquisition phase in order to correctly receive data from CORDIC CMPLX REC. The computation and output phase is instead performed at full speed with ED constantly set high. The only remarkable differ-ence from FFT SLOWDOWN is that in this case no counter has been used and the correct result was achieved instead by forwarding output ready signal from CORDIC CMPLX REC. Figure 4.6 shows timing diagrams for acquisition phase of IFFT SLOWDOWN and Figure 4.7 depict CORDIC CMPLX REC outputs too. Notice that the address feed to IFFT module (ifft in addr) changes 1Tcklater than ED out in order to allow a proper sampling.

Figure 4.6: IFFT Slowdown module: timing diagram

(57)

in clk global clock

in reset global reset

in start start signal

in cordic done CORDIC output ready signal out start cordic enable sampling for cordic module out ED out enable for IFFT module

in ifft start IFFT start signal out ifft in addr[7:0] IFFT input’s address

(58)

(59)

Chapter 5 Magnitude and phase estimation

Once frequency-domain samples are outputted, their magnitude and phase need to be calculated and fed to the neural network in order to obtain the magnitude spectrum of the cleaned-up signal. Since a straightforward implementation of the magnitude formula1

would have required the usage of DE2’s DSP blocks, different approaches have been evalu-ated in order to save those DSP units and still having a good approximation. In fact, many other speech enhancement algorithms use further masking operations which could easily accomplished with such DSP blocks, thus sparing them in this phase could be worthwhile in future implementation of other algorithms.

The two approaches are the following: • Weighted Sum

• CORDIC Algorithm

5.1 Weighted Sum

The Weighted Sum approach is based on the following formula presented in R.Lyons, Un-derstanding Digital Signal Processing[22]

ka + jbk ≈ α · max(a, b) + β · min(a, b)

Some simulations were made using the following values for alpha and beta, both chosen for being a linear combination of 2 powers, allowing us to perform the weighing operation

(60)

by simply shifting input’s bits and summing the resulting vectors avoiding any use of DSP blocks.

α = 1 = 20, β = 0.3125 = 2−2+ 2−4

Simulations have been run on (4096 × 4096) points, showing us an average error of 3.23% with a peak on small numbers ( a, b < 100, small enough to be considered as noise on the whole input range2_{) and zeroes in the following cases}

• a = 0 or b = 0 • a = 2 · β

1 − β2 · b = 0.6926 · b (b > a)

Beside the aforementioned peak, the maximum error occurs obviously when a = b ka + jbk = √2 · b = 1.414 · b, r0 = 1.3125 · b

thus 1.4142 − 1.3125

1.4142 ≈ 7.2%

(5.1)

(a) (b)

Figure 5.1: Weighted Sum

Unfortunately this method does not offer any information about the phase of the former number, which placed this choice to a second place.

5.2 CORDIC Algorithm

The CORDIC algorithm[40] (acronym for COordinate Rotation DIgital Computer), also known as Volder’s algorithm, allows us to retrieve at the same time both phase and mag-nitude of a complex number, making our choice fell on it despite its major computational

(61)

complexity. The working principle is basically, given a complex number, to rotate it by ever smaller angle until it became, with good approximation, a real number R. At this point, former number’s magnitude is given by the final R value (except for a gain factor), while the phase information is given by the directions of the rotations performed.

(a)

Figure 5.2: CORDIC algorithm’s working principle. First three rotations represented:-45°,

-22.5°,+11.25°

The strength of this algorithm is the fact that rotations can be obtained by multipli-cations and, therefore, by bit shifts if those multiplimultipli-cations involve two-powers. So, for instance, given a complex number a + jb and a rotation on complex plane 1 + jL.

(a + jb)(1 + jL) = (a − L · b) + j(b + L · a)

By assuming L = −sgn(b) · 2i_{, i ≤ 0}, then both L · b and L · a can be done via a right bit

shift of a and b while the sign of L depends only on b.

This algorithm, as stated here, transform a complex number laying on the first quarter ( a ≥ 0, b ≥ 0) into a real, positive number and into a track of the rotations performed. For numbers belonging to other quarters a first operation of quarter detection is needed. After storing that value in the rotations track, the number is rotated of 90°, 180° or 270° to lay on the first quarter. Hence the proper algorithm starts.

Unfortunately each of these multiplications introduces a magnitude gain of k1 + jLk which has to be taken in account. As reported in Volder’s original paper [40], for 24 rota-tions the overall gain is

(62)

23

Y

i=0

k1 + j2−ik = 1.646760255

The following results are obtained by simulating the algorithm on 2048 per 2048 points and with a fixed number (16) of rotations. The average magnitude error (net of CORDIC’s gain) is of 0.68% while the average error on phase is less than 0.01%. Since integer arith-metic’s limitations, CORDIC’s algorithm too shows a peak on small number. Further simu-lations (figures 5.3c and 5.3d) were made to investigate the best trade-off between accuracy and number of rotations.

(a) Magnitude Error % (b) Phase Error (rad)

(c) Magnitude Error(%) vs #rotations (d) Phase Error (%) vs #rotations

Figure 5.3: CORDIC Algorithm

Given its good performance and the chance of a simultaneous calculation of magnitude and phase, the final choice fell on CORDIC algorithm.

5.3 CORDIC Algorithm’s Implementation

As a design choice, given the simulation results presented in the previous section, a fixed number of 8 rotations plus the quarter detecting one has been chosen as a reasonable

(63)

trade-off between throughput and output accuracy of both magnitude-and-phase estima-tion module and cartesian form recovery module. This design choice brings us a CORDIC gain of 1.646743507, approximated as 1.64102564 = [2−1_{+ 2}−4_{+ 2}−5_{+ 2}−6_]−1_.

Hardware design of both modules will be described in the following sections.

5.3.1 Magnitude and Phase Estimator

With reference to figure 5.4, a description of the implemented algorithm’s workflow is the following:

After start signal, CORDIC MP ESTIMATOR samples input data and then, one clock cycle later, complex plane quarter of the incoming complex number is detected and stored in reg-ister rot seq (which holds track of the rotations, therefore phase information of the former number) while the absolute values of both real and imaginary part are fed into the cordic core logic. Combinational logic calculates the proper weighted value of a and b basing on the values stored in a cordic in and b cordic in and on the number of rotations already performed, while sequential logic updates the values of a cordic in and b cordic in.

a cordic in = (a cordic in − L · b cordic in)

b cordic in = (b cordic in + L · a cordic in)

When all 8 rotations have been performed, magnitude value is divided by the cordic gain and then output together with rot seq and out rdy signal. The special case in which a = 0 or b = 0is taken into account by outputting modulus of the non-zero part and raising a jump gainflag in rot seq, which instructs the CORDIC CMPLX REC to not perform cordic gain removal on that sample. No privileged policy was used in the aforementioned special case and the number of clock cycles required is the same of the non-zero case.

The proposed hardware architecture for the core of the algorithm is the one depicted in 5.5.

Start signal and input data are fed together to the module, then quarter and jump gain are detected and stored while the absolute values of a and b are selected by the multiplexer and sent to the shift and sign blocks, which perform the weighted values calculation. enable calc is a status variable as i, which holds the number of rotations performed. a00 _{and b}00 _are

(64)

Figure 5.4: CORDIC implementation’s flow diagram

enable calchas changed. When all rotations are performed, output registers are enabled, results are outputted and enable calc reset.

Figure 5.5: CORDIC MP ESTIMATOR architecture

Fig. 5.6 shows simulation results. Signals a cordic in and b cordic in are what we called a´and b´ in 5.5. The time needed for an operation is 10 clock cycles from the start signal.

(65)

Figure 5.6: Simulation Results

5.3.2 Phase Recovery

CORDIC CMPLX RECmodule was designed to do the inverse operation, obtaining a com-plex number in the cartesian form from its magnitude and phase information. Figure 5.7 describes the workflow of CORDIC CMPLX REC.

Figure 5.7: CORDIC CMPLX REC workflow

As in CORDIC MP ESTIMATOR, input data are sampled when a start signal is issued then, if jump gain flag is not set, the incoming magnitude is preemptively multiplied by

1

cordic gain to prevent rotation’s gain from affecting the final result. Once all rotations are

performed, the signs of real part and imaginary part are restored from rot seq and finally results are output. Simulation is provided in fig. 5.8.

(66)

Figure 5.8: CORDIC CMPLX REC simulation

Since its architecture is quite similar to the one of CORDIC MP ESTIMATOR, the schematic for CORDIC CMPLX REC is not provided here.

5.3.3 Simulation of modules’ chain

To better investigate the behaviour of both modules, a further simulation has been done in which the incoming numbers are first converted into magnitude and phase and then turned back again into cartesian form with the described modules.

Results are shown in fig. 5.9 and in table 5.2

Figure 5.9: CORDIC chain simulation

5.3.4 Results on FFT

As a further investigation, we decided to verify the cordic algorithm by using it to get the magnitude of an audio file’s spectrogram. We used the stft function from Python’s library librosa[25] and an audio file from Wall Street Journal dataset WSJ0. To avoid the noisy zone of Cordic’s algorithm we had to multiply the output of stft to better fit Cordic’s input range. We arbitrarily chose, with good results, a 2048 factor so, approximately remapping ranges (−10, +10) → (−20000, +20000)

(67)

(a) CORDIC algorithm’s results on FFT

FPGA Implementation of a Low Latency, Neural Network-Based Speech Enhancement Algorithm