Kohonen’s Self-Organizing Map (SOM): summary
Kohonen SOMs cluster data, i.e., they identify clouds of data that are densely distributed in specific regions, according to which the input space can be partitioned.
• As in the k-means algorithm, each partition can be represented by the centroid of its cloud: in a SOM, the centroid corresponds to the weight vector associated with a neuron of the map. Conversely, each weight vector associated with a neuron of a trained SOM is the centroid of a data cloud.
• It is possible to classify data a posteriori based on the partition of the input space to which they belong. Supposing we have a labeled data set, each partition induced by the SOM can be labeled after the class for which the corresponding neuron is most frequently the winner neuron (stacked generalization).
In practice, to label the neurons of a SOM a posteriori according to a labeled data set X (stacked generalization / majority vote):

label(i) = argmax hist(winner(X, wi), L)

where:
• X is the dataset represented as a matrix (each row xi is a pattern);
• W are the SOM weights represented as a matrix (row wi is the weight vector of the i-th neuron);
• hist(v, L) (see the Matlab function) is the frequency histogram of a vector v whose elements may take L different values (from 1 to L);
• winner(X, wi) returns a vector J having as many elements as the rows of X (i.e., the number of data in the dataset): element Ji is the index of the winning neuron when xi is the input.
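As a concrete illustration, here is a minimal NumPy sketch of this majority-vote labeling; the array names (X, y, W) and both helper functions are illustrative choices, not part of the original formulation.

import numpy as np

def winner(X, W):
    """For each pattern (row of X), return the index of the winning
    neuron, i.e., the row of W closest in Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)  # (N, n_neurons)
    return d.argmin(axis=1)

def label_neurons(X, y, W, n_classes):
    """Label each neuron after the class for which it is most
    frequently the winner (majority vote over the labeled set)."""
    J = winner(X, W)                      # winning neuron for each pattern
    labels = np.zeros(len(W), dtype=int)
    for i in range(len(W)):
        won = y[J == i]                   # class labels of the patterns won by neuron i
        if won.size:                      # hist(...) followed by argmax, as in the formula above
            labels[i] = np.bincount(won, minlength=n_classes).argmax()
    return labels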
• With respect to k-means, a SOM adds an ordering of the centroids which preserves the topological properties of the input space, of which the SOM can be considered a lower-dimensional projection.
• The fact that neighboring neurons respond to similar (neighboring) patterns creates a lattice whose nodes are located where data are actually found and whose arcs never cross, as if a net had been cast over the pattern space and its nodes orderly anchored in the middle of regions where data are densely present.
In a way, a SOM creates a distorted re-projection of the data onto a lower-dimensional space where, however, the data are (not strictly, but surely more) uniformly distributed.
Kohonen’s Self-Organizing Map (SOM): observations
• It can be demonstrated (see Bishop’s book) that the algorithm for training a SOM with a neighborhood radius decreasing with time is equivalent to the k-means algorithm.
• The weight update equation can be derived as the (gradient-based) solution of the ‘vector quantization problem’: given K, find a set of K centroids (a codebook, in telecommunications terms; each centroid is then termed a codebook vector) that minimizes the squared error (actually, it holds for any exponent) made when each sample in a (large) dataset is replaced by the closest centroid.
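To make the objective concrete, here is a small NumPy sketch that measures the error incurred when every sample is replaced by its closest codebook vector (the function and variable names are illustrative):

import numpy as np

def quantization_error(X, codebook, exponent=2):
    """Total error made when each sample in X is replaced by the
    closest codebook vector (squared error for exponent=2; any
    exponent can be used, as noted above)."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return (d.min(axis=1) ** exponent).sum()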
Unsupervised Learning
Until now, we have assumed that our training samples are “labeled” by their category membership. Methods that use labeled samples are said to be supervised.
However, there are problems where the definition of the classes, and even the number of classes, may be unknown. Machine learning methods which deal with such data are said to be unsupervised.
Questions:
Why would one even be interested in learning from unlabeled samples?
Is it even possible in principle to learn anything of value from unlabeled samples?
Why unsupervised learning
There are several reasons (in random order):
1. To limit the often surprisingly high cost of collecting and labeling a large set of sample patterns.
E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time-consuming.
2. To obtain a larger training set than the one available, using semi-supervised learning:
train a classifier on a small set of labeled samples, then tune it up to make it run without supervision on a large, unlabeled set;
or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.
3. To detect a gradual change of patterns over time.
4. To find features that will be useful for categorization.
5. To gain insight into the nature or structure of the data during the early stages of an investigation.
Unsupervised learning: clustering
In practice, unsupervised learning methods implement what is usually referred to as data clustering.
Qualitatively and generally, the problem of data clustering can be defined as:
• Grouping of objects into meaningful categories
• Given a representation of N objects, find k clusters based on a similarity measure.
Data clustering
The problem can be tackled from several points of view:
Statistics: represent the density function of all the data as a mixture of a number of different distributions,

p(y) = Σi wi p(y|ωi)

and fit the set of weights wi and the component densities p(y|ωi) to the given data (see the sketch after this list).
“Geometry/topology”: partition the pattern space such that the data belonging to each partition are highly homogeneous (i.e., they are similar to one another).
More directly related to classification: group (label) the data such that the average intra-group distance is minimized and the average inter-group distance is maximized (yet another optimization problem!).
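As a concrete illustration of the statistical viewpoint, the sketch below fits a two-component Gaussian mixture to artificial data; scikit-learn is assumed to be available, and the data and component count are illustrative choices.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two artificial 2D "clouds"
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2))])

# fit the weights wi and the component densities p(y|wi) by maximum likelihood
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.weights_)      # mixing weights wi
print(gm.means_)        # component means
labels = gm.predict(X)  # grouping induced by the fitted mixture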
Why data clustering?
Natural Classification: degree of similarity among species.
Data exploration: discover underlying structure, generate hypotheses, detect anomalies.
Data Compression: for organizing/indexing/storing/broadcasting data.
Applications: useful to any scientific field that collects data!
Relevance: 2340 papers about data clustering indexed in Scopus in 2014!
Data clustering: examples
[Figure: 800,000 scientific papers clustered into 776 topics, based on how often the papers were cited together by the authors of other papers.]
Given a set of N unlabeled examples D = {x1, x2, …, xN} in a d-dimensional feature space, D is partitioned into a number of disjoint subsets Dj:

D = ∪j=1,k Dj, with Di ∩ Dj = ∅ for i ≠ j

where the points in each subset are similar to each other according to a given criterion.
A partition is denoted by p = (D1, D2, …, Dk), and the problem of data clustering is thus formulated as

p* = argminp f(p)

where f(·) is formulated according to the chosen similarity criterion.
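As a toy illustration of this formulation, for a very small dataset one can enumerate every assignment of the N points to k clusters and take the argmin directly; the dataset and the choice of f (total within-cluster spread) are illustrative.

import numpy as np
from itertools import product

def f(X, assignment, k):
    """Total within-cluster squared distance to the cluster mean."""
    cost = 0.0
    for j in range(k):
        members = X[assignment == j]
        if len(members):
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
k = 2
# p* = argmin_p f(p), by exhaustive search over all k**N assignments
# (feasible only for tiny N; real clustering algorithms avoid this)
best = min(product(range(k), repeat=len(X)),
           key=lambda a: f(X, np.array(a), k))
print(best)  # (0, 0, 1, 1), up to a permutation of the labels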
A general optimization (minimization) algorithm for a classification function J(Y, W) (Y being the dataset and W the ordered set of labels assigned to each sample) can be described as follows:

Choose an initial classification W0
repeat (step i):
    change the classification such that J decreases
until the classification is the same as in the previous step
If the variables were continuous, a gradient method could be used. However, here, we are dealing with discrete variables.
A reasonable algorithm (based on the simplifying assumption that the optimization problem is separable, i.e., that the minimum of an n-dimensional function can be found by minimizing it along each dimension separately) would assign to each sample the label that yields the largest decrease in J.
NB Since the problem is not separable, there is no guarantee that J decreases as the sum of the individual decreases. It may even increase!
A better but slower solution, which guarantees monotonicity, is to change, at each step, only the single label that yields the greatest decrease in J.
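A minimal sketch of this slower, monotone variant, using for concreteness the within-cluster squared error as J (all names are illustrative): every single-label change is tried, and only the one yielding the greatest decrease in J is applied.

import numpy as np

def J(X, labels, k):
    """Within-cluster sum of squared distances to the cluster means."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if np.any(labels == j))

def greedy_cluster(X, k, labels):
    """Repeatedly apply the single label change that decreases J most.
    J strictly decreases at every step, so the loop terminates."""
    current = J(X, labels, k)
    while True:
        best_delta, best_move = 0.0, None
        for i in range(len(X)):
            for j in range(k):
                if j == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = j
                delta = J(X, trial, k) - current
                if delta < best_delta:
                    best_delta, best_move = delta, (i, j)
        if best_move is None:   # no single change decreases J: stop
            return labels
        labels[best_move[0]] = best_move[1]
        current += best_delta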
K-means clustering
K-means clustering is obtained by choosing the Euclidean distance as the similarity criterion and

J = Σk=1,Nc Σy(i)~ωk |y(i) − mk|²

as the function to be optimized, where y(i) is the i-th sample, mk is the centroid of the k-th cluster, and y(i) ~ ωk ranges over all samples y(i) assigned to cluster k.
Then M = {m1, m2, …, mNc} is the set of reference vectors, each of which represents the ‘prototype’ for a class. J is minimized by choosing mk as the sample mean of the data having label ωk.
In practice:
• The algorithm partitions the input space S into a pre-defined number of subspaces, induced by the Euclidean distance.
• Each subspace si of S is defined as:

si = {xj ∈ S | d(xj, mi) = mint d(xj, mt)}

This induces a so-called Voronoi tessellation of the input space (example limited to 2D patterns).
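The tessellation can be made explicit by labeling a grid of 2D points with its nearest centroid, as in this short sketch (the centroids are arbitrary examples):

import numpy as np

m = np.array([[1.0, 1.0], [4.0, 1.0], [2.5, 4.0]])   # example centroids
xs, ys = np.meshgrid(np.linspace(0, 5, 50), np.linspace(0, 5, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# si = {xj in S | d(xj, mi) = mint d(xj, mt)}: the nearest-centroid rule
cell = np.linalg.norm(grid[:, None, :] - m[None, :, :], axis=2).argmin(axis=1)
# 'cell' now maps every grid point to the index of its Voronoi region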
Randomly initialize M = {m1, m2, …, mNc}
repeat
    classify the N samples according to the nearest mi
    recompute each mi as the mean of the patterns assigned to cluster i
until there is no change in M
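For reference, a runnable NumPy version of this loop; initializing the centroids by sampling data points is a common choice, not something the pseudocode above mandates.

import numpy as np

def kmeans(X, n_clusters, seed=0):
    rng = np.random.default_rng(seed)
    # randomly initialize M = {m1, ..., mNc} from the data points
    M = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    while True:
        # classify the N samples according to the nearest mi
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each mi as the mean of the patterns assigned to cluster i
        new_M = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else M[i] for i in range(n_clusters)])
        if np.allclose(new_M, M):   # no change in M: stop
            return M, labels
        M = new_M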