Neural Networks

Academic year: 2021


(1)

[Read Ch. 4] [Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]

- Threshold units
- Gradient descent
- Multilayer networks
- Backpropagation
- Hidden layer representations
- Example: Face Recognition

(2)

Connectionist Models

Consider humans:

- Neuron switching time ~ .001 second
- Number of neurons ~ 10^10
- Connections per neuron ~ 10^4-5
- Scene recognition time ~ .1 second
- 100 inference steps doesn't seem like enough
  → much parallel computation

Properties of artificial neural nets (ANNs):

- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process

(3)

When to Consider Neural Networks

- Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
- Output is discrete or real valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant

Examples:

- Speech phoneme recognition [Waibel]
- Image classification [Kanade, Baluja, Rowley]
- Financial prediction

(4)

ALVINN drives 70 mph on highways

[Figure: ALVINN network — 30x32 sensor input retina, 4 hidden units, 30 output units spanning steering directions from Sharp Left through Straight Ahead to Sharp Right]

(5)

Perceptron

[Figure: perceptron — inputs x1, x2, ..., xn with weights w1, ..., wn, plus bias input x0 = 1 with weight w0; a summation unit computes Σ_{i=0}^{n} w_i x_i, and the output is 1 if the sum is > 0, −1 otherwise]

o(x1, ..., xn) = { 1 if w0 + w1 x1 + ... + wn xn > 0
                 { −1 otherwise

Sometimes we'll use simpler vector notation:

o(x) = { 1 if w · x > 0
       { −1 otherwise
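The threshold unit can be sketched in a few lines of Python (the AND weights below are one illustrative choice, not from the slides):

```python
def perceptron_output(w, x):
    """Threshold unit: 1 if w . x > 0 (with x[0] = 1 as the bias input), else -1."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else -1

# AND(x1, x2) with w0 = -0.8, w1 = w2 = 0.5:
print(perceptron_output([-0.8, 0.5, 0.5], [1, 1, 1]))  # → 1
print(perceptron_output([-0.8, 0.5, 0.5], [1, 1, 0]))  # → -1
```

Only the [1, 1] input pushes the weighted sum (−0.8 + 0.5 + 0.5 = 0.2) above zero, which is exactly the AND behavior asked about on the next slide.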

(6)

Decision Surface of a Perceptron

[Figure: (a) decision surface of a perceptron in the (x1, x2) plane, a line separating + from − examples; (b) a set of + and − examples that no line can separate]

Represents some useful functions

- What weights represent g(x1, x2) = AND(x1, x2)?

But some functions not representable

- e.g., not linearly separable

(7)

Perceptron training rule

w_i ← w_i + Δw_i

where

Δw_i = η (t − o) x_i

Where:

- t = c(x) is target value
- o is perceptron output
- η is a small constant called the learning rate
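A minimal Python sketch of this rule, assuming {−1, +1} targets and a bias input x[0] = 1 (the AND demo and the stopping criterion are illustrative, not from the slides):

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    examples: list of (x, t) pairs, with x including the bias input x[0] = 1.
    """
    n = len(examples[0][0])
    w = [0.0] * n  # slides say small random values; zeros used here for reproducibility
    for _ in range(epochs):
        converged = True
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:
                converged = False
                for i in range(n):
                    w[i] += eta * (t - o) * x[i]  # update only on mistakes: t - o is 0 otherwise
        if converged:
            break
    return w

# Learn AND with bias x0 = 1 and targets in {-1, +1}:
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = train_perceptron(data)
```

Since AND is linearly separable, the convergence result on the next slide applies and the loop terminates after a handful of epochs.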

(8)

Perceptron training rule

Can prove it will converge

- If training data is linearly separable

(9)

Gradient Descent

To understand, consider simpler linear unit, where

o = w0 + w1 x1 + ... + wn xn

Let's learn w_i's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples.

(10)

Gradient Descent

[Figure: error surface E[w] plotted over the weight space (w0, w1), with gradient descent stepping toward the minimum]

Gradient:

∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]

Training rule:

Δw = −η ∇E[w]

i.e.,

Δw_i = −η ∂E/∂w_i

(11)

Gradient Descent

∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d − o_d)²
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)²
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = Σ_d (t_d − o_d) (−x_{i,d})

(12)

Gradient Descent

Gradient-Descent(training_examples, η)

Each training example is a pair of the form <x, t>, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., .05).

- Initialize each w_i to some small random value
- Until the termination condition is met, Do
  - Initialize each Δw_i to zero.
  - For each <x, t> in training_examples, Do
    - Input the instance x to the unit and compute the output o
    - For each linear unit weight w_i, Do
      Δw_i ← Δw_i + η (t − o) x_i
  - For each linear unit weight w_i, Do
    w_i ← w_i + Δw_i
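The Gradient-Descent procedure above can be sketched in Python (variable names and the demo data are illustrative, not from the slides):

```python
def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x, with x[0] = 1 as bias input.

    training_examples: list of (x, t) pairs; eta is the learning rate.
    """
    n = len(training_examples[0][0])
    w = [0.0] * n  # slides say small random values; zeros used for reproducibility
    for _ in range(epochs):
        delta = [0.0] * n  # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]      # accumulate eta (t - o) x_i
        for i in range(n):
            w[i] += delta[i]                          # w_i <- w_i + Delta w_i
    return w

# Recover t = 1 + 2x from five noiseless points:
data = [([1.0, x], 1.0 + 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w = gradient_descent(data)
```

After 500 epochs the weights are very close to (1, 2), the exact minimizer of the squared error for this data.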

(13)

Summary

Perceptron training rule guaranteed to succeed if

- Training examples are linearly separable
- Sufficiently small learning rate η

Linear unit training rule uses gradient descent

- Guaranteed to converge to hypothesis with minimum squared error
- Given sufficiently small learning rate η
- Even when training data contains noise

(14)

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent: Do until satisfied

1. Compute the gradient ∇E_D[w]
2. w ← w − η ∇E_D[w]

Incremental mode Gradient Descent: Do until satisfied

- For each training example d in D
  1. Compute the gradient ∇E_d[w]
  2. w ← w − η ∇E_d[w]

E_D[w] ≡ (1/2) Σ_{d∈D} (t_d − o_d)²

E_d[w] ≡ (1/2) (t_d − o_d)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
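In incremental mode the only change is where the update happens — inside the per-example loop, following −∇E_d[w]. A sketch (demo data illustrative):

```python
def incremental_gradient_descent(training_examples, eta=0.02, epochs=500):
    """Incremental (stochastic) gradient descent for a linear unit:
    weights are updated after each example, following -grad E_d[w]."""
    n = len(training_examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                w[i] += eta * (t - o) * x[i]  # per-example update, no accumulation
    return w

data = [([1.0, x], 1.0 + 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w = incremental_gradient_descent(data)
```

With a smaller η than the batch version, the per-example steps trace out nearly the same trajectory and reach nearly the same weights — the "arbitrarily closely" claim above.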

(15)

Multilayer Networks of Sigmoid Units

[Figure: network outputs for distinguishing the spoken words "head", "hid", "who'd", "hood", plotted over the formant frequencies F1 and F2]

(16)

Sigmoid Unit

[Figure: sigmoid unit — inputs x1, ..., xn with weights w1, ..., wn and bias input x0 = 1 with weight w0; net = Σ_{i=0}^{n} w_i x_i; output o = σ(net) = 1 / (1 + e^−net)]

σ(x) is the sigmoid function:

σ(x) = 1 / (1 + e^−x)

Nice property:

dσ(x)/dx = σ(x) (1 − σ(x))

We can derive gradient descent rules to train

- One sigmoid unit
- Multilayer networks of sigmoid units → Backpropagation
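The sigmoid and its "nice property" in Python, checked against a numerical derivative:

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """The nice property: d sigma/dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Verify the identity against a central difference at x = 0.7:
h = 1e-6
numeric = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2 * h)
assert abs(numeric - sigmoid_derivative(0.7)) < 1e-8
```

This identity is what makes the backpropagation deltas below cheap to compute: the derivative is available from the already-computed unit output, with no extra exponentials.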

(17)

Error Gradient for a Sigmoid Unit

∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d∈D} (t_d − o_d)²
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)²
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ( −∂o_d/∂w_i )
        = − Σ_d (t_d − o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d)

∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}

So:

∂E/∂w_i = − Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}

(18)

Backpropagation Algorithm

Initialize all weights to small random numbers. Until satisfied, Do

- For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit k
     δ_k ← o_k (1 − o_k) (t_k − o_k)
  3. For each hidden unit h
     δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}
     w_{i,j} ← w_{i,j} + Δw_{i,j}
     where Δw_{i,j} = η δ_j x_{i,j}
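The algorithm above, specialized to one hidden layer, can be sketched as follows (network shape, variable names, and the OR sanity check are illustrative, not from the slides):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(W_h, W_o, x):
    """Forward pass through one hidden layer of sigmoid units."""
    xb = [1.0] + list(x)  # bias input x0 = 1
    h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(wj, xb))) for wj in W_h]
    return [sigmoid(sum(w * hi for w, hi in zip(wk, h))) for wk in W_o]

def backprop_train(examples, n_hidden, eta=0.3, epochs=10000, seed=0):
    """Backpropagation for one hidden layer; each unit's weight list starts with its bias w0."""
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    # initialize all weights to small random numbers
    W_h = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W_o = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xb = [1.0] + list(x)
            h = [sigmoid(sum(w * xi for w, xi in zip(W_h[j], xb))) for j in range(n_hidden)]
            hb = [1.0] + h
            o = [sigmoid(sum(w * hi for w, hi in zip(W_o[k], hb))) for k in range(n_out)]
            # output deltas: delta_k = o_k (1 - o_k) (t_k - o_k)
            d_o = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(n_out)]
            # hidden deltas: delta_h = o_h (1 - o_h) sum_k w_hk delta_k
            d_h = [h[j] * (1 - h[j]) * sum(W_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            # weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            for k in range(n_out):
                for i in range(n_hidden + 1):
                    W_o[k][i] += eta * d_o[k] * hb[i]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    W_h[j][i] += eta * d_h[j] * xb[i]
    return W_h, W_o

# Sanity check on OR (an easy, linearly separable target):
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
W_h, W_o = backprop_train(data, n_hidden=2)
```

Note how step 3's hidden delta reuses the output deltas from step 2 — errors propagate backward through the same weights used in the forward pass.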

(19)

More on Backpropagation

- Gradient descent over entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global error minimum
  - In practice, often works well (can run multiple times)
- Often include weight momentum α:
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)
- Minimizes error over training examples
  - Will it generalize well to subsequent examples?
- Training can take thousands of iterations → slow!
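The momentum term is a one-line change to the weight step. A sketch of its effect (names and constants are illustrative):

```python
def momentum_step(delta_prev, eta, alpha, grad_term):
    """Delta w(n) = eta * grad_term + alpha * Delta w(n-1);
    grad_term stands for delta_j * x_ij in the backprop rule."""
    return eta * grad_term + alpha * delta_prev

# With a constant gradient term, the step builds up toward eta * g / (1 - alpha):
d = 0.0
for _ in range(200):
    d = momentum_step(d, eta=0.1, alpha=0.9, grad_term=1.0)
print(round(d, 3))  # → 1.0, i.e. 0.1 * 1.0 / (1 - 0.9)
```

On long flat stretches of the error surface the gradient keeps the same sign, so momentum effectively multiplies the step size by up to 1/(1 − α), which is one reason it speeds up the slow training noted above.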

(20)

Learning Hidden Layer Representations

[Figure: 8 inputs connected to 8 outputs through a hidden layer]

A target function:

Input      Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

(21)

Learning Hidden Layer Representations

A network: [Figure: 8 inputs, hidden layer, 8 outputs]

Learned hidden layer representation:

Input      Hidden values    Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001

(22)

Training

[Plot: sum of squared errors for each output unit versus training epochs (0 to 2500)]

(23)

Training

[Plot: hidden unit values for one input versus training epochs (0 to 2500)]

(24)

Training

[Plot: weights from inputs to one hidden unit versus training epochs (0 to 2500)]

(25)

Convergence of Backpropagation

Gradient descent to some local minimum

- Perhaps not global minimum...
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights

Nature of convergence

- Initialize weights near zero
- Therefore, initial networks near-linear
- Increasingly non-linear functions possible as training progresses

(26)

Expressive Capabilities of ANNs

Boolean functions:

- Every boolean function can be represented by network with single hidden layer
- but might require exponential (in number of inputs) hidden units

Continuous functions:

- Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

(27)

Overfitting in ANNs

[Plot: error versus number of weight updates (example 1) — training set error and validation set error]

[Plot: error versus number of weight updates (example 2) — training set error and validation set error]

(28)

[Figure: face recognition network — 30x32 inputs feeding hidden units and four outputs (left, strt, rght, up); typical input images shown below the network]

(29)

[Figure: the same 30x32-input network with the learned weights of each hidden unit visualized, alongside typical input images]

(30)

Alternative Error Functions

Penalize large weights:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})² + γ Σ_{i,j} w_{ji}²

Train on target slopes as well as values:

E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} − o_{kd})² + μ Σ_{j∈inputs} ( ∂t_{kd}/∂x_{jd} − ∂o_{kd}/∂x_{jd} )² ]

Tie together weights:
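The weight-decay variant of the error is easy to state in code (function and parameter names are illustrative; the weight matrix is flattened into one list):

```python
def error_with_weight_decay(examples, predict, weights, gamma):
    """E(w) = (1/2) sum_d sum_k (t_kd - o_kd)^2 + gamma * sum_ji w_ji^2."""
    err = 0.0
    for x, t in examples:
        o = predict(x)
        err += 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))
    return err + gamma * sum(w * w for w in weights)

# Two examples, a constant toy predictor, and two weights:
e = error_with_weight_decay(
    [([0.0], [1.0]), ([0.0], [0.0])],  # (x, t) pairs
    lambda x: [0.5],                   # toy predictor
    [1.0, -2.0],                       # flattened weights
    gamma=0.01,
)
# 0.5*0.25 + 0.5*0.25 + 0.01*(1 + 4) = 0.3
```

Because the penalty term's gradient is 2γw, gradient descent on this error simply shrinks each weight a little on every update, discouraging the large weights associated with overfitting.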

(31)

Recurrent Networks

[Figure: (a) feedforward network computing y(t + 1) from x(t); (b) recurrent network with context units c(t) fed back into the hidden layer; (c) the recurrent network unfolded in time over inputs x(t − 2), x(t − 1), x(t)]
