Kohonen’s Self-Organizing Map (SOM): summary
Kohonen SOMs cluster data, i.e., they identify clouds of data that are densely distributed in specific regions, according to which the input space can be partitioned.
• As in the k-means algorithm, each partition can be represented by the centroid of its cloud: in a SOM, the centroid corresponds to the weight vector associated with a neuron of the map. Conversely, each weight vector associated with a neuron of a trained SOM is the centroid of a data cloud.
• It is possible to classify data a posteriori based on the partition of the input space to which they belong. Supposing we have a labeled data set, each partition induced by the SOM can be labeled after the class for which the corresponding neuron is most frequently the winner neuron (stacked generalization).
In practice, to label the neurons of a SOM a posteriori according to a labeled data set X (stacked generalization / majority vote):

label(i) = argmax hist(winner(X, wi), L)

where:
• X is the dataset represented as a matrix (each row xi is a pattern);
• W are the SOM weights represented as a matrix (row wi is the weight vector of the i-th neuron);
• hist(v, L) (see the Matlab function) is the frequency histogram of a vector v whose elements may take L different values (from 1 to L);
• winner(X, wi) returns a vector J having as many elements as the rows of X (i.e., the number of data in the dataset): element Ji is the index of the winning neuron when xi is the input.
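As a concrete illustration, here is a minimal NumPy sketch of this majority-vote labeling; the array names (X, y, W) and both helper functions are illustrative choices, not part of the original formulation.

import numpy as np

def winner(X, W):
    """For each pattern (row of X), return the index of the winning
    neuron, i.e., the row of W closest in Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)  # (N, n_neurons)
    return d.argmin(axis=1)

def label_neurons(X, y, W, n_classes):
    """Label each neuron after the class for which it is most
    frequently the winner (majority vote over the labeled set)."""
    J = winner(X, W)                      # winning neuron for each pattern
    labels = np.zeros(len(W), dtype=int)
    for i in range(len(W)):
        won = y[J == i]                   # class labels of the patterns won by neuron i
        if won.size:                      # hist(...) followed by argmax, as in the formula above
            labels[i] = np.bincount(won, minlength=n_classes).argmax()
    return labels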
• With respect to k-means, a SOM adds an ordering of the centroids which preserves the topological properties of the input space, of which the SOM can be considered a lower-dimensional projection.
• The fact that neighboring neurons respond to similar (neighboring) patterns creates a lattice whose nodes are located where data are actually found and whose arcs never cross, as if a net had been cast over the pattern space and its nodes orderly anchored in the middle of regions where data are densely present.
In a way, a SOM creates a distorted re-projection of the data onto a lower-dimensional space where, however, the data are (not strictly, but surely more) uniformly distributed.
Kohonen’s Self-Organizing Map (SOM): observations
• It can be demonstrated (see Bishop’s book) that the algorithm for training a SOM with a neighborhood radius decreasing with time is equivalent to the k-means algorithm.
• The weight update equation can be derived as the (gradient-based) solution of the ‘vector quantization problem’: given K, find a set of K centroids (a codebook, in telecommunications terms; each centroid is then termed a codebook vector) that minimizes the squared error (actually, it holds for any exponent) made when each sample in a (large) dataset is replaced by the closest centroid.
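To make the objective concrete, here is a small NumPy sketch that measures the error incurred when every sample is replaced by its closest codebook vector (the function and variable names are illustrative):

import numpy as np

def quantization_error(X, codebook, exponent=2):
    """Total error made when each sample in X is replaced by the
    closest codebook vector (squared error for exponent=2; any
    exponent can be used, as noted above)."""
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return (d.min(axis=1) ** exponent).sum()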
Unsupervised Learning
Until now, we have assumed that our training samples are “labeled” by their category membership. Methods that use labeled samples are said to be supervised.
However, there are problems where the definition of the classes, and even the number of classes, may be unknown. Machine learning methods which deal with such data are said to be unsupervised.
Questions:
Why would one even be interested in learning from unlabeled samples?
Is it even possible in principle to learn anything of value from unlabeled samples?
Why unsupervised learning
There are several reasons (in random order):
1. To limit the often surprisingly high cost of collecting and labeling a large set of sample patterns.
E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time-consuming.
2. To obtain a larger training set than the one available, using semi-supervised learning:
train a classifier on a small set of labeled samples, then tune it up to make it run without supervision on a large, unlabeled set;
or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.
3. To detect a gradual change of patterns over time.
4. To find features that will be useful for categorization.
5. To gain insight into the nature or structure of the data during the early stages of an investigation.
Unsupervised learning: clustering
In practice, unsupervised learning methods implement what is usually referred to as data clustering.
Qualitatively and generally, the problem of data clustering can be defined as:
• Grouping of objects into meaningful categories
• Given a representation of N objects, find k clusters based on a similarity measure.
Data clustering
The problem can be tackled from several points of view:
Statistics: represent the density function of all the data as a mixture of a number of different distributions,

p(y) = Σi wi p(y|ωi)

and fit the set of weights wi and the component densities p(y|ωi) to the given data (see the sketch after this list).
“Geometry/topology”: partition the pattern space such that the data belonging to each partition are highly homogeneous (i.e., they are similar to one another).
More directly related to classification: group (label) the data such that the average intra-group distance is minimized and the average inter-group distance is maximized (yet another optimization problem!).
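As a concrete illustration of the statistical viewpoint, the sketch below fits a two-component Gaussian mixture to artificial data; scikit-learn is assumed to be available, and the data and component count are illustrative choices.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two artificial 2D "clouds"
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2))])

# fit the weights wi and the component densities p(y|wi) by maximum likelihood
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.weights_)      # mixing weights wi
print(gm.means_)        # component means
labels = gm.predict(X)  # grouping induced by the fitted mixture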
Why data clustering?
Natural Classification: degree of similarity among species.
Data exploration: discover underlying structure, generate hypotheses, detect anomalies.
Data Compression: for organizing/indexing/storing/broadcasting data.
Applications: useful to any scientific field that collects data!
Relevance: 2340 papers about data clustering indexed in Scopus in 2014!
Data clustering: examples
[Figure: 800,000 scientific papers clustered into 776 topics, based on how often the papers were cited together by the authors of other papers.]
Given a set of N unlabeled examples D = {x1, x2, …, xN} in a d-dimensional feature space, D is partitioned into a number of disjoint subsets Dj:

D = ∪j=1,k Dj, with Di ∩ Dj = ∅ for i ≠ j

where the points in each subset are similar to each other according to a given criterion.
A partition is denoted by p = (D1, D2, …, Dk), and the problem of data clustering is thus formulated as

p* = argminp f(p)

where f(·) is formulated according to the chosen similarity criterion.
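As a toy illustration of this formulation, for a very small dataset one can enumerate every assignment of the N points to k clusters and take the argmin directly; the dataset and the choice of f (total within-cluster spread) are illustrative.

import numpy as np
from itertools import product

def f(X, assignment, k):
    """Total within-cluster squared distance to the cluster mean."""
    cost = 0.0
    for j in range(k):
        members = X[assignment == j]
        if len(members):
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
k = 2
# p* = argmin_p f(p), by exhaustive search over all k**N assignments
# (feasible only for tiny N; real clustering algorithms avoid this)
best = min(product(range(k), repeat=len(X)),
           key=lambda a: f(X, np.array(a), k))
print(best)  # (0, 0, 1, 1), up to a permutation of the labels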
A general optimization (minimization) algorithm for a classification function J(Y, W) (Y being the dataset and W the ordered set of labels assigned to each sample) can be described as follows:

Choose an initial classification W0
repeat (step i):
    change the classification such that J decreases
until the classification is the same as in the previous step
If the variables were continuous, a gradient method could be used. However, here, we are dealing with discrete variables.
A reasonable algorithm (based on the simplifying assumption that the optimization problem is separable, i.e., that the minimum of an n-dimensional function can be found by minimizing it along each dimension separately) would assign to each sample the label that yields the largest decrease in J.
NB Since the problem is not separable, there is no guarantee that J decreases as the sum of the individual decreases. It may even increase!
A better but slower solution, which guarantees monotonicity, is to change, at each step, only the single label that yields the greatest decrease in J.
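A minimal sketch of this slower, monotone variant, using for concreteness the within-cluster squared error as J (all names are illustrative): every single-label change is tried, and only the one yielding the greatest decrease in J is applied.

import numpy as np

def J(X, labels, k):
    """Within-cluster sum of squared distances to the cluster means."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if np.any(labels == j))

def greedy_cluster(X, k, labels):
    """Repeatedly apply the single label change that decreases J most.
    J strictly decreases at every step, so the loop terminates."""
    current = J(X, labels, k)
    while True:
        best_delta, best_move = 0.0, None
        for i in range(len(X)):
            for j in range(k):
                if j == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = j
                delta = J(X, trial, k) - current
                if delta < best_delta:
                    best_delta, best_move = delta, (i, j)
        if best_move is None:   # no single change decreases J: stop
            return labels
        labels[best_move[0]] = best_move[1]
        current += best_delta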
K-means clustering
K-means clustering is obtained by choosing the Euclidean distance as the similarity criterion and

J = Σk=1,Nc Σy(i)~ωk |y(i) − mk|²

as the function to be optimized, where y(i) is the i-th sample, mk is the centroid of the k-th cluster, and y(i) ~ ωk ranges over all samples y(i) assigned to cluster k.
Then M = {m1, m2, …, mNc} is the set of reference vectors, each of which represents the ‘prototype’ for a class. J is minimized by choosing mk as the sample mean of the data having label ωk.
In practice:
• The algorithm partitions the input space S into a pre-defined number of subspaces, induced by the Euclidean distance.
• Each subspace si of S is defined as:

si = {xj ∈ S | d(xj, mi) = mint d(xj, mt)}

This induces a so-called Voronoi tessellation of the input space (example limited to 2D patterns).
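The tessellation can be made explicit by labeling a grid of 2D points with its nearest centroid, as in this short sketch (the centroids are arbitrary examples):

import numpy as np

m = np.array([[1.0, 1.0], [4.0, 1.0], [2.5, 4.0]])   # example centroids
xs, ys = np.meshgrid(np.linspace(0, 5, 50), np.linspace(0, 5, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# si = {xj in S | d(xj, mi) = mint d(xj, mt)}: the nearest-centroid rule
cell = np.linalg.norm(grid[:, None, :] - m[None, :, :], axis=2).argmin(axis=1)
# 'cell' now maps every grid point to the index of its Voronoi region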
Randomly initialize M = {m1, m2, …, mNc}
repeat
    classify the N samples according to the nearest mi
    recompute each mi as the mean of the patterns assigned to cluster i
until there is no change in M
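For reference, a runnable NumPy version of this loop; initializing the centroids by sampling data points is a common choice, not something the pseudocode above mandates.

import numpy as np

def kmeans(X, n_clusters, seed=0):
    rng = np.random.default_rng(seed)
    # randomly initialize M = {m1, ..., mNc} from the data points
    M = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    while True:
        # classify the N samples according to the nearest mi
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each mi as the mean of the patterns assigned to cluster i
        new_M = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                          else M[i] for i in range(n_clusters)])
        if np.allclose(new_M, M):   # no change in M: stop
            return M, labels
        M = new_M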