University of Pisa

Department of Mathematics

Master’s Thesis in Mathematics

A Constrained Maximum Entropy Principle

for Data Reduction via Pattern Selection

Candidate:

Renato Budinich

Advisors:

Prof. Giovanni Punzi

Prof. Maria del Viva

Prof. Vladimir Georgiev


Contents

1 The constrained maximum entropy principle
  1.1 Associative memories in HEP
  1.2 The model and the optimization problem
  1.3 A heuristic and initial hypotheses on the exact solution
  1.4 Pixel patterns

2 Solving the optimization problem: an explicit algorithm
  2.1 Solving Knapsack with Dynamic Programming
  2.2 KP's decision graph algorithm
  2.3 Correctness of decision graph algorithm
  2.4 BPV's decision graph algorithm: H-formulation
  2.5 BPV's decision graph algorithm: W-formulation
  2.6 Pitfalls of the decision graph algorithm
  2.7 Instability with regards to uniqueness
  2.8 An ε-solution

3 Numerical tests
  3.1 Software used
  3.2 Pixel patterns distribution
  3.3 Phenomenological behavior of the solution as a function of N and W
  3.4 Decision graph algorithm number of nodes


Introduction

Perception is a complex emergent phenomenon which, in its various facets and at various levels, modern neuroscience is striving to explain. While we continue to accumulate biological data on many aspects of the human nervous system, this data alone cannot explain more than the low-level interactions happening between the many specialized small parts that eventually add up to form cognitive processes. What is sorely needed is some form of coherent organization of all this data, or better still a mathematical framework based on simple abstract principles and powerful enough to actually describe, as a consequence of these principles, as much of the observed neurological processes as possible.

In the particular case of the vision system, the last 50 years have brought many new advancements, and today there are many hypotheses about the various stages occurring between the moment the photoreceptors on the retina detect a particular pattern of light and the moment we recognize the object that's in front of us. However, we still lack a comprehensive model, detailed enough to make falsifiable predictions regarding how and why information is re-encoded and passed on to the next computational stage.

In this thesis we study a formal mathematical model of a general principle first proposed by D. Benedetti, G. Punzi and M. del Viva in [9], concerning the information reduction that must happen at certain bottlenecks in the nervous complex responsible for vision. The parts of the data that are discarded (or better, those that are preserved) are defined, in our model, as the solution to a combinatorial optimization problem.


Chapter 1

The constrained maximum entropy principle

In [9] the authors propose a new model for pattern selection, inspired both by the workings of high energy physics (HEP) particle detectors using associative memories (AM) and by the neurological eye-brain system of human vision. Both systems consist of a data acquisition phase (done by particle detectors in HEP and photoreceptor cells in the eye) and a later processing phase; in between, a strong reduction of information is known to happen. The authors take as inspiration how this data filtering is done in the electronics implementing AM and postulate that an analogous process happens between the human eye and the visual cortex. Their main original contribution is making a hypothesis about which patterns are selected at this filtering stage in the brain: they claim that a maximum entropy principle is followed, with the important addition of constraints on the storage and bandwidth capacities that both the vision and HEP systems present.

In order to precisely formulate this maximum entropy principle, we must first build a formal framework for the type of systems we want to consider. This model is built upon the AM electronics used in HEP particle detectors and can reasonably be assumed to hold, at least at a functional level, for the human vision neurological system. Because of this, before describing the model, in the next section we'll very briefly summarize how AM electronics work.

A major benefit of this model is that it lets us, for any system that conforms to it, select a small portion of the input data presenting salient features without the need to define, ad hoc for every such system, what a salient feature is; the selection is completely determined by the statistical properties of the input data. For example, we’ll see that many pixel patterns selected by the model are needed for edge detection, which is known to play a relevant role in the reduction of information done by the human vision system. The model is further validated by tests done on human subjects, which show that there’s no statistical difference in the success rate of recognizing images drawn using only the selected patterns (sketches) rather than the original images these sketches are derived from.


Figure 1.1: An event and the patterns it’s composed of

1.1 Associative memories in HEP

In [4] the authors describe the architecture of an associative memory electronic device that receives data readouts from a particle detector, performs a parallel comparison of the data with a number of hard-wired patterns and finally passes on the results of these comparisons to the successive processing stages. This architecture, which thanks to its parallel nature allows for very fast and concurrent detection of multiple patterns, has since been implemented with success in many HEP experiments.

The detector is assumed to be composed of various parallel layers, each layer having a certain number of bins; particles go across all the layers and hit one bin in each layer. This geometrical model is a simplified abstraction meant to represent a multitude of real detectors, which have more complex layer configurations. The left part of figure 1.1 shows two particle trajectories crossing four layers, and the bins on the trajectories registering a hit: this is called an event. The track finding problem, which the associative memory is supposed to solve, consists of deducing all the particles' trajectories from the combination of bins registering a hit.

Figure 1.2: A particle detector with layers disposed concentrically and an event registered by it

An event can be thought of as being composed of multiple patterns, which are trajectories of single particles (right part of figure 1.1). Among all the possible patterns, a subset is chosen and stored in a pattern bank. The role of the associative memory is to read an incoming event and pass on only the patterns that are present in its pattern bank.

An important remark must be made here: the patterns that are currently, in the real world, used in associative memories are chosen based on mainly physical criteria, i.e. patterns corresponding to impossible trajectories aren't stored in the pattern bank, which has limited storage capacity. The constrained maximum entropy model instead chooses a set of patterns determined only by their statistical frequency, calculated offline in a prior data acquisition phase. The claim is that these patterns turn out to be similar to the ones that are already, on physical grounds, deemed interesting.

Figure 1.3: Associative memory architecture

In figure 1.3 the authors detail the architecture of the associative memory: every layer of the detector is connected to its own Data Bus, on which all the addresses of the hits it registered for that layer are sent, in parallel for all the layers. Each row represents a pattern in the pattern bank, coded as a sequence of bin addresses (one for each layer), known as words. Each word is stored in a cell on the corresponding Data Bus, and this cell reads all the bin addresses being transmitted on its Data Bus; if the word coded in the cell matches one of those transmitted on the Data Bus, it activates a flip-flop. If all the flip-flops for a row are active that means that the corresponding pattern was present in the event being analyzed, and thus an identifier for that pattern is sent on the Output Bus, to signal its presence to the successive processing stages.

Thanks to this highly parallel design the associative memory is able to detect patterns in an event very rapidly (fast enough to keep up with the very high rate at which events are registered by the detector), orders of magnitude faster than it would take to check the event sequentially against the patterns in the pattern bank.
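To make the matching logic concrete, the sketch below simulates it in software on a toy event; it is only an illustration of the comparison performed by the AM, done sequentially here rather than in parallel as in the actual electronics, and all names and data in it are made up.

# Software sketch of the associative-memory match: a pattern (one bin address
# per layer) is reported when, for every layer, its word is among that layer's hits.
from typing import List, Set, Tuple

Pattern = Tuple[int, ...]   # one bin address per detector layer

def match_event(event_hits: List[Set[int]], pattern_bank: List[Pattern]) -> List[int]:
    """Return the indexes of the bank patterns fully contained in the event
    (the software analogue of all flip-flops of a row being set)."""
    matched = []
    for idx, pattern in enumerate(pattern_bank):
        if all(word in event_hits[layer] for layer, word in enumerate(pattern)):
            matched.append(idx)
    return matched

# Toy event on 4 layers containing both stored patterns.
bank = [(3, 5, 2, 7), (1, 1, 4, 6)]
event = [{1, 3}, {1, 5}, {2, 4}, {6, 7}]
print(match_event(event, bank))   # -> [0, 1]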


1.2 The model and the optimization problem

By making an abstraction of the detector-AM configuration described in the previous section, we can think of it as conforming to a simple model: we have a finite number of receptors somehow arranged in space, each receptor being able to register one pattern at a time. We'll suppose that, other than an identifying index, there's no difference between these abstract receptors or between the patterns that they receive.

The flow diagram we wish to model is represented in figure 1.4: for every time unit, each receptor registers one from a common set of possible patterns and passes it on to a filter, which either discards it or passes it on unchanged. The patterns that pass the filtering stage are then collected into the sketch, which is thus the lossy compressed version of the data incoming on the receptors. Formally, we’ll define:

• $\mathcal{X} = \{1, \dots, x\} \subset \mathbb{N}$, the set of indexes identifying the receptors;

• $\mathcal{P} = \{1, \dots, n\} \subset \mathbb{N}$, the set of indexes identifying the possible patterns each receptor can register (we suppose this set is the same for all receptors);

• $S \subseteq \mathcal{P}$, the set representing the pattern bank or, equivalently, a single filter: we wish to discard all patterns that are not in this set;

• $F_S = S^x$, the filter bank, which is simply $x$ copies of the filter $S$, one for each $i \in \mathcal{X}$. We are thus making the assumption that the output of every receptor gets filtered in the same way;

• the sketch $\mathcal{Y}$, a set of address-pattern couples of the form $(i, k) \in \mathcal{X} \times \mathcal{P}$; if a couple $(i, k)$ is present in the sketch, this means that pattern $k$ was detected on receptor $i$ and not discarded by the filter.

In order to properly define the sketch, the output of this system, we need to introduce a probability space and random variables representing the data flow.

Let $(\Omega, \Sigma, P)$ be the probability space of the experiments' outcomes; then for each $i \in \mathcal{X}$ we can define the discrete random variables $X_i$ and $Y_i^S$, which represent the pattern being registered on receptor $i$ and the output of the associated filter respectively. Formally:
\[
\forall i \in \mathcal{X} \quad X_i : \Omega \to \mathcal{P}, \qquad Y_i^S : \Omega \to \{\mathrm{null}\} \cup S, \qquad Y_i^S(\omega) \stackrel{\text{def}}{=} X_i(\omega)\,\mathbf{1}_{\{X_i \in S\}}(\omega) + \mathrm{null}\,\mathbf{1}_{\{X_i \notin S\}}(\omega) , \tag{1.2.1}
\]
where null is a special symbol used to signal that the pattern processed by the filter does not belong to $S$. We can then define the sketch as

\[
\mathcal{Y} : \Omega \to 2^{\mathcal{X} \times S}, \qquad \mathcal{Y}(\omega) \stackrel{\text{def}}{=} \{(i, Y_i^S(\omega)) \ \text{s.t.}\ Y_i^S(\omega) \neq \mathrm{null}\} ,
\]
and its size $l_S$ as
\[
l_S : \Omega \to \mathbb{N}, \qquad l_S(\omega) \stackrel{\text{def}}{=} |\mathcal{Y}(\omega)| ,
\]
where by $|\cdot|$ we denote the cardinality of a set.

Figure 1.4: Outline of the model

We will suppose that all the $X_i$ are independent and identically distributed (and thus also the $Y_i^S$), denoting with $X$ and $Y^S$ respectively two representatives of these random variables (i.e. $X \sim X_i$ and $Y^S \sim Y_i^S$); furthermore, for every $k \in \mathcal{P}$, we will define $p_k \stackrel{\text{def}}{=} P(X = k)$. While this is certainly a coarse approximation, since in reality receptors that are close together will be strongly correlated, it greatly simplifies successive calculations; with this assumption we will be able to explicitly calculate an exact solution to an otherwise much harder optimization problem.
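As a quick illustration of the model and of the i.i.d. assumption, the following small simulation (illustrative code with made-up numbers, not part of the thesis) draws one pattern per receptor, applies a filter S and builds the resulting sketch.

# Minimal simulation of the receptors-filter-sketch model described above.
import numpy as np

rng = np.random.default_rng(0)

x = 100                                      # number of receptors
p = np.array([0.4, 0.3, 0.2, 0.07, 0.03])    # pattern probabilities p_1..p_n
S = {2, 3, 4}                                # pattern bank (indexes in 1..n)

def sample_sketch():
    """Draw one pattern per receptor and return the sketch Y and its size l_S."""
    patterns = rng.choice(len(p), size=x, p=p) + 1          # patterns labelled 1..n
    sketch = [(i, k) for i, k in enumerate(patterns, start=1) if k in S]
    return sketch, len(sketch)

sketch, l_S = sample_sketch()
print(l_S, "vs expected", x * sum(p[k - 1] for k in S))     # E[l_S] = x * sum_{k in S} p_k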

Now, our main question is: with which criteria do we choose the filter $S \subseteq \mathcal{P}$? Surely we want it to be chosen in some kind of optimal way: since we would like the data coming out of the filter to be most meaningful and descriptive of the data being registered by the sensors, it makes sense to ask for the Shannon entropy of the output of the filter bank to be maximum. The Shannon entropy of a discrete random variable $Z : \Omega \to \mathcal{Z}$ with probability mass function $\rho(z)$ is defined as
\[
H(Z) = -\sum_{z \in \mathcal{Z}} \rho(z) \log \rho(z) ,
\]
where the logarithm is usually taken in base 2. The random variable is thought of as a source of information and $H(Z)$ measures its uncertainty or randomness. It can also be interpreted as the expected value of the random variable $-\log \rho(Z)$, i.e.
\[
H(Z) = \mathbb{E}[-\log \rho(Z)] .
\]
This last formula gives a possible intuitive interpretation of entropy: if our source of information outputs a particular element $z \in \mathcal{Z}$, we can think that this gives us more information the less "predictable" $z$ is, i.e. the bigger $-\log \rho(z)$ is. $H(Z)$ can therefore be seen as the average information contained in a message received from the source $Z$.


However, the real reason why $H(Z)$ is a meaningful and relevant quantity is that important theorems involving it can be proved, above all Shannon's source coding theorem, which, loosely speaking, states that the entropy is a lower bound on the average word length of any code used to describe $Z$: it gives a theoretical limit on the achievable lossless compression factor for the information source $Z$. For a thorough exposition of these concepts see for example [1] or the more modern [3].
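A small helper illustrating the definition above may be useful: the Shannon entropy (in bits) of a finite probability mass function, given simply as an array of probabilities. This is an illustrative sketch, not code from the thesis.

import numpy as np

def shannon_entropy(p):
    """H = -sum_z p(z) * log2 p(z), with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))   # ~0.469 bits: a more predictable source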

Returning to our model, we’ll ask for the filter output to be most meaningful by asking that its entropy is maximum; we thus wish to maximize

\[
H_{F_S} \stackrel{\text{def}}{=} H\big((Y_i^S)_{i=1}^x\big) = \sum_{y_1, \dots, y_x \in \{\mathrm{null}\} \cup S} P(Y_1^S = y_1, \dots, Y_x^S = y_x) \log \frac{1}{P(Y_1^S = y_1, \dots, Y_x^S = y_x)} ,
\]
which, thanks to the random variables being independent and identically distributed (so that their joint entropy is the sum of the individual entropies), can be written as
\[
H_{F_S} = x \sum_{y \in \{\mathrm{null}\} \cup S} P(Y^S = y) \log \frac{1}{P(Y^S = y)} .
\]

However, in order for all this to make sense, we must pose some restrictions on which subsets $S \subseteq \mathcal{P}$ we may choose; otherwise, by the following lemma, $F_{\mathcal{P}}$ itself would be the filter bank maximizing the entropy (which makes intuitive sense, since choosing $\mathcal{P}$ as the filter means we're not discarding any pattern and thus preserving all the information).

Lemma 1.2.2.
\[
\forall S \subset \mathcal{P} \quad H_{F_S} \leq H_{F_{\mathcal{P}}}
\]

Proof. From definition (1.2.1), we have
\[
k \in S \ \Rightarrow\ \{Y^S = k\} = \{X = k\} \qquad \text{and} \qquad \{Y^S = \mathrm{null}\} = \{X \notin S\} .
\]
Thus we can write:
\[
H_{F_S} = x \left( \sum_{k \in S} P(Y^S = k) \log \frac{1}{P(Y^S = k)} + P(Y^S = \mathrm{null}) \log \frac{1}{P(Y^S = \mathrm{null})} \right) = x \left( \sum_{k \in S} p_k \log \frac{1}{p_k} + P(X \notin S) \log \frac{1}{P(X \notin S)} \right) .
\]
Since
\[
P(X \notin S) \log \frac{1}{P(X \notin S)} = \sum_{k \in \mathcal{P} \setminus S} p_k \log \frac{1}{\sum_{j \in \mathcal{P} \setminus S} p_j} \leq \sum_{k \in \mathcal{P} \setminus S} p_k \log \frac{1}{p_k} ,
\]
we can conclude
\[
H_{F_S} \leq x \sum_{k \in \mathcal{P}} p_k \log \frac{1}{p_k} = H_{F_{\mathcal{P}}} . \qquad \square
\]

Surely the most natural constraint is to impose an upper bound on the cardinality of $S$, since any real-world storage device that could be used to store patterns has limited capacity. We will thus introduce $N \in \mathbb{N}$ as a model parameter and request
\[
|S| \leq N . \tag{1.2.3}
\]

We'll generally be in the situation of having a great number of patterns possibly registered on the receptors and wishing to select only a few of them, thus $N \ll n$.

Secondly, we must remember that the system we're modeling is performing data reduction at an early stage, i.e. the selected patterns must be transmitted to successive processing units (for example the visual cortex in the human brain). The channel they're transmitted on has limited bandwidth, and this must somehow affect the optimal set of patterns we want to store in the pattern bank: if we choose to store the ones with greatest probability $p_k$, for example, they will cost us a lot in terms of bandwidth, since they will appear very frequently in the sketch, which has to be transmitted on the channel. The idea of introducing a bound on the channel capacity to see how it affects pattern selection is considered by the authors of [9] to be the main theoretical innovation of their model. Formally, we'll introduce a second model parameter $W \in \mathbb{R}$ and ask that, on average, no more than $W$ patterns can be output by our system. In other words, the expected length of the sketch must be smaller than $W$:
\[
\mathbb{E}[l_S] \leq W . \tag{1.2.4}
\]

The requirement that this condition has to be met only on average stems from the HEP detector case: there, before transmitting the sketch on the channel, there's always a FIFO (first in, first out) buffering phase, which allows the channel capacity to be temporarily exceeded, as long as the accumulated delay is later made up for.

Thanks to the following lemma we can rewrite this second constraint without reference to $l_S$ or the sketch.

Lemma 1.2.5.
\[
\mathbb{E}[l_S] = x \sum_{k \in S} p_k . \tag{1.2.6}
\]

Proof. If we define the Bernoulli random variables
\[
\forall i \in \mathcal{X} \quad Z_i^S : \Omega \to \{0, 1\}, \qquad Z_i^S(\omega) = \mathbf{1}_{\{Y_i^S \neq \mathrm{null}\}}(\omega) ,
\]
we can then write $l_S(\omega) = \sum_{i=1}^x Z_i^S(\omega)$. Thus
\[
\mathbb{E}[l_S] = \sum_{i=1}^x \mathbb{E}[Z_i^S] = \sum_{i=1}^x P(Y_i^S \neq \mathrm{null}) = \sum_{i=1}^x P(X_i \in S) = \sum_{i=1}^x \sum_{k \in S} p_k = x \sum_{k \in S} p_k . \qquad \square
\]

Since the number of receptors $x$ is considered a constant of the model, we can rescale $W$ by a factor $\frac{1}{x}$ and, using the Lemma, rewrite (1.2.4) as
\[
\sum_{k \in S} p_k \leq W . \tag{1.2.7}
\]

We can finally formulate the constrained maximum entropy principle: among the sets $S \subset \mathcal{P}$ that satisfy (1.2.3) and (1.2.7), we wish to find the ones that maximize the entropy of the filter bank $F_S$. We thus have the optimization problem:
\[
\max \Big\{ H_{F_S} \ \text{where}\ S \subset \mathcal{P},\ |S| \leq N,\ \sum_{k \in S} p_k \leq W \Big\} . \tag{1.2.8}
\]

Remark 1. The real-world bottleneck on channel bandwidth, represented in our model by (1.2.7), is usually very relevant and must surely be taken into account when choosing which patterns to store in the pattern bank. We can thus suppose that this constraint rules out a large number of subsets $S \subset \mathcal{P}$, namely those composed of the most frequent patterns. We will therefore suppose that the total probability of the patterns not in $S$ far outweighs that of the patterns in $S$, i.e.:
\[
P(Y^S = \mathrm{null}) = \sum_{i \in \mathcal{P} \setminus S} p_i \cong 1 .
\]
This means that we can write
\[
H_{F_S} = \sum_{k \in S} p_k \log \frac{1}{p_k} + P(Y^S = \mathrm{null}) \log \frac{1}{P(Y^S = \mathrm{null})} \cong \sum_{k \in S} p_k \log \frac{1}{p_k} .
\]


For convenience we'll define, for any subset of patterns $S \subset \mathcal{P}$, its entropy, rate and cardinality:
\[
H(S) = \sum_{k \in S} p_k \log \frac{1}{p_k} , \qquad W(S) = \sum_{k \in S} p_k , \qquad \kappa(S) = |S| ; \tag{1.2.9}
\]
we'll also use $H_{\mathrm{tot}}$ to refer to the unfiltered total entropy:
\[
H_{\mathrm{tot}} = \sum_{i=1}^n p_i \log \frac{1}{p_i} . \tag{1.2.10}
\]
Finally, thanks to Remark 1, we can consider, in lieu of problem (1.2.8), the optimization problem:
\[
\max \{ H(S) \ \text{where}\ S \subset \mathcal{P},\ \kappa(S) \leq N \ \text{and}\ W(S) \leq W \} . \tag{1.2.11}
\]
Since we're thinking of the pattern set $\mathcal{P}$ as an initial segment of $\mathbb{N}$, we can identify every $S \subset \mathcal{P}$ with the binary string representing its characteristic function $\mathbf{1}_S : \mathcal{P} \to \{0, 1\}$. Formally, we're saying that the function $\Phi : 2^{\mathcal{P}} \to \{0, 1\}^n$ defined by
\[
\Phi(S) = (\mathbf{1}_S(i))_{i=1}^n
\]
is a bijection; further, if we define the entropy, rate and cardinality of a binary string $(x_i)_{i=1}^n \in \{0, 1\}^n$ by
\[
H((x_i)_{i=1}^n) = \sum_{k=1}^n x_k p_k \log \frac{1}{p_k} , \qquad W((x_i)_{i=1}^n) = \sum_{k=1}^n x_k p_k , \qquad \kappa((x_i)_{i=1}^n) = \sum_{k=1}^n x_k , \tag{1.2.12}
\]
we'll have that $\Phi$ preserves these quantities, i.e.
\[
H(\Phi(S)) = H(S) , \qquad W(\Phi(S)) = W(S) , \qquad \kappa(\Phi(S)) = \kappa(S) .
\]
We can therefore equivalently rewrite (1.2.11) as
\[
\max \{ H((x_i)_{i=1}^n) \ \text{s.t.}\ (x_i)_{i=1}^n \in \{0, 1\}^n \ \text{and}\ \kappa((x_i)_{i=1}^n) \leq N ,\ W((x_i)_{i=1}^n) \leq W \} , \tag{1.2.13}
\]
or, more explicitly, in what is known as the standard 0-1 integer programming form:
\[
\max \Big\{ \sum_{k=1}^n x_k p_k \log \frac{1}{p_k} \ \text{s.t.}\ \forall k\ x_k \in \{0, 1\} \ \text{and}\ \sum_{k=1}^n x_k \leq N ,\ \sum_{k=1}^n x_k p_k \leq W \Big\} . \tag{1.2.14}
\]

The two formulations (1.2.11) and (1.2.14) are of course completely equivalent, but they allow us to use two different terminologies: from now on, we'll interchangeably refer to a candidate solution as either a subset $S \subset \mathcal{P}$ or its corresponding binary string $x \in \{0, 1\}^n$, depending on what is more convenient.
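Since (1.2.14) is a standard 0-1 integer program, it can also be handed directly to a general-purpose solver; the following sketch does so with the PuLP Python module that will be mentioned again in chapter 2 (the probability vector and the bounds are toy values chosen for illustration, not data from the thesis).

# A possible PuLP formulation of problem (1.2.14), with made-up toy values.
import numpy as np
import pulp

p = [0.30, 0.25, 0.20, 0.15, 0.06, 0.04]   # pattern probabilities p_1..p_n
N, W = 3, 0.5                               # cardinality and rate bounds

prob = pulp.LpProblem("BPV", pulp.LpMaximize)
xs = [pulp.LpVariable(f"x{k+1}", cat="Binary") for k in range(len(p))]

# Objective: sum_k x_k p_k log(1/p_k), the entropy of the selected patterns.
prob += pulp.lpSum(x * float(pk * np.log2(1.0 / pk)) for x, pk in zip(xs, p))
# Constraints (1.2.3) and (1.2.7): at most N patterns, total rate at most W.
prob += pulp.lpSum(xs) <= N
prob += pulp.lpSum(x * pk for x, pk in zip(xs, p)) <= W

prob.solve(pulp.PULP_CBC_CMD(msg=False))
S = [k + 1 for k, x in enumerate(xs) if x.value() == 1]
print("selected patterns:", S)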

1.3 A heuristic and initial hypotheses on the exact solution

In [9], the authors propose a heuristic to calculate an approximate solution to this problem: by associating a storage cost $\frac{1}{N}$ and a bandwidth cost $\frac{p_i}{W}$ to each pattern $i \in \mathcal{P}$, we can simply define the maximum of these two as the cost of the pattern. The heuristic is then simply the greedy approach of taking the patterns with the best value/cost ratio, i.e. their proposed solution $S_{\mathrm{heur}}$ is
\[
S_{\mathrm{heur}} = \{f(p_i) > c\} = \{i \in \mathcal{P} \ \text{s.t.}\ f(p_i) > c\} , \tag{1.3.1}
\]
where
\[
f(p) \stackrel{\text{def}}{=} \frac{p \log \frac{1}{p}}{\max\{\frac{1}{N}, \frac{p}{W}\}} ,
\]
and where $c$ is determined numerically so as to be the smallest possible without making $S_{\mathrm{heur}}$ unfeasible (in an actual implementation there's no need to calculate $c$, since it suffices to calculate the $f_i$ and iteratively add the index with the greatest value of $f_i$ until one of the constraints is violated).
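In code, this greedy procedure can be sketched as follows (an illustrative implementation of the description above, not the authors' own software); it ranks the patterns by $f(p_i)$ and adds them until one of the two constraints would be violated.

# Sketch of the greedy heuristic giving S_heur.
import numpy as np

def greedy_heuristic(p, N, W):
    """Return the set of pattern indexes (1-based) chosen by the heuristic."""
    p = np.asarray(p, dtype=float)
    f = p * np.log2(1.0 / p) / np.maximum(1.0 / N, p / W)   # value/cost ratios f(p_i)
    order = np.argsort(-f)                                   # best ratio first
    chosen, rate = [], 0.0
    for i in order:
        if len(chosen) + 1 > N or rate + p[i] > W:           # a constraint would be violated
            break
        chosen.append(int(i) + 1)
        rate += p[i]
    return set(chosen)

print(greedy_heuristic([0.30, 0.25, 0.20, 0.15, 0.06, 0.04], N=3, W=0.5))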

Figure 1.5: Plot of f for various values of N and W.

Since $f$ can also be written as
\[
f(p) = \begin{cases} N\, p \log \frac{1}{p} & \text{if } p \leq \frac{W}{N} \\[2pt] W \log \frac{1}{p} & \text{if } p \geq \frac{W}{N} , \end{cases}
\]


we can say that if $\frac{W}{N} \leq \frac{1}{e}$, the point at which $p \log \frac{1}{p}$ attains its maximum on $[0, 1]$ (which in practice is always the case), then $f$ is an increasing function for $p \leq \frac{W}{N}$ and decreasing for greater values of $p$. Therefore $f$ assumes its maximum value at $\frac{W}{N}$, and $S_{\mathrm{heur}}$ will always be in the proximity of this value; it might actually happen that $\frac{W}{N}$ is not at all near the center of $S_{\mathrm{heur}}$ (see figure 1.6), but we can always say that
\[
\sup_{j \in S_{\mathrm{heur}}} \Big| \frac{W}{N} - j \Big| \sim |S_{\mathrm{heur}}| .
\]

Figure 1.6: $S_{\mathrm{heur}}$ regions for various values of $W$, where the probability distribution $(p_i)_{i=1}^n$ is the pixel pattern distribution, and the pattern indexes $\mathcal{P}$ are reordered so that $(p_i)_{i=1}^n$ is an increasing sequence. The segmented lines are plots of the values $W \log \frac{1}{p_i}$, and thus the plot of $f$ would be the minimum between these and the solid black line, which is the plot of the values $N p_i \log \frac{1}{p_i}$.

This proposed heuristic led us to formulate some initial hypotheses on the optimal solution $S$ to the problem:

1. if we apply the appropriate permutation to the indexes $\mathcal{P}$ so that $(p_i)_{i=1}^n$ is a monotone sequence, then $S$ is an interval, in the sense that it consists of consecutive indexes;

2. $S$ is unique.

Both these assumptions turned out to be false. In chapter 3 we'll discuss some numerical trials, and we'll see that the exact solution is always quite close to the interval given by $S_{\mathrm{heur}}$ (which in typical conditions commits an error of about 5%), but is never an interval itself. In chapter 2 instead, by studying a possible way to treat problem (1.2.14), we'll see that the optimal solution corresponds to an optimal path in the decision graph, and equivalent paths give rise to equivalent solutions. This will make it easy to construct pathological probability distributions $(p_k)_{k=1}^n$ that give rise to non-unique solutions. However, it will prove impossible, with this method, to certify whether a real-world instance of our model has one or multiple solutions; this is due to a form of numerical instability intrinsic to the decision graph algorithm and to the inevitable statistical errors in the estimation of the $p_k$.

1.4 Pixel patterns

In order to test if this model can be successfully applied to human vision, the authors in [9] conducted two tests with human subjects, of which we’ll describe one. The outline of the procedure they employed is this:

Figure 1.7: Five images from the database in their full-color, digitized and sketched versions

1. Digitization of images from an image database: each image from a database of photographs of natural subjects is projected onto a 1-bit space, where each pixel can be either black or white. For every image in the database, the luminosity of each pixel is calculated: if it's above the average luminosity of the image, the corresponding pixel in the digitized image is set to 1 (white), otherwise it's 0 (black). Here the luminosity of a pixel is defined as the arithmetic mean of the values of its three channels (red, green and blue); see the second row of figure 1.7.

2. Sampling of 3 × 3 pixel patterns: patterns are defined as the $2^9$ possible 3 × 3 black-and-white pixel squares. For each image in the database, every square of 3 × 3 contiguous pixels in the image is considered, including overlapping squares, and the frequency $p_k$ of every pattern $k$ in the database is calculated (a code sketch of this step is given after this list).

3. Selection of patterns using the heuristic: $S_{\mathrm{heur}}$ is calculated; see figure 1.8.

4. Creation of sketches: from every digitized image a sketch is created, obtained by starting with a fully white image and considering again all the 3 × 3 squares in it, with overlap: if the corresponding pattern in the fully digitized image belongs to Sheur, then its black pixels, if not already present due to a matched pattern on an overlapping square, are added to the sketch; see last row of figure 1.7.

5. Subject testing: to probe the early stages of human vision, a sketch is presented to a subject on a screen for 20 ms, and subsequently, for 700 ms, the full digitized version from which the sketch was obtained is shown alongside a distractor, which is simply another image from the database. The subject then has to press one of two computer keys to indicate which image they think was the original digitized image.
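The sketch below illustrates steps 1 and 2 on a single image (illustrative code: the random stand-in image, the helper names and the 9-bit encoding of a pattern are assumptions consistent with the description above, not the authors' implementation).

# Sketch of image digitization and 3x3 pattern counting, as in steps 1-2.
import numpy as np
from collections import Counter

def digitize(rgb):
    """Binarize an RGB image: 1 where a pixel's luminosity exceeds the image average."""
    lum = rgb.mean(axis=2)                     # arithmetic mean of the three channels
    return (lum > lum.mean()).astype(np.uint8)

def count_patterns(binary, counter):
    """Add every (overlapping) 3x3 square of `binary` to `counter`, encoded as 9 bits."""
    h, w = binary.shape
    for r in range(h - 2):
        for c in range(w - 2):
            bits = binary[r:r + 3, c:c + 3].flatten()
            code = int("".join(map(str, bits)), 2)   # one of the 2**9 = 512 patterns
            counter[code] += 1
    return counter

counts = Counter()
image = np.random.randint(0, 256, size=(32, 32, 3))   # stand-in for a database image
count_patterns(digitize(image), counts)
total = sum(counts.values())
p = {k: v / total for k, v in counts.items()}          # empirical frequencies p_k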

In figure 1.9 there is a sample of the four types of images that were presented to the four subjects in the experiment, together with the percentage of correct answers given by each subject for each of the four sets of images; the striped blue bars will be explained below. The color code for the image sets is:

Red: 256-grey-level conversion of the original colored photos, used as a control set

Green: sketches for filter $S_g = S_{\mathrm{heur}}$ with W = 0.05 and N = 50

Blue: sketches for filter $S_b = S_{\mathrm{heur}}$ with W = 0.05 and N = 15

Yellow: sketches for filter $S_y$ obtained by recursively adding the least frequent pattern until $H(S_y) = H(S_b)$

The bar graph shows that the reduction of information due to using sketches from $S_g$ and $S_b$ has negligible effects on the subjects' capacity to recognize images. $S_y$ instead, while having the same information content as $S_b$, causes subjects to respond correctly much less frequently. This would seem to suggest that the filtering of patterns done by the human vision system is quite similar to that done by the heuristic solution to our model.

A natural objection to this conclusion would be that the difference in the percentage of correct responses is simply due to the fact that sketches obtained from $S_y$ have very few black pixels, since they are composed of the least frequent patterns. It is in fact clear from figure 1.10 that the average number of black pixels in sketches obtained from $S_y$ is lower than that of sketches obtained from $S_b$.

To refute this objection, the authors made another test: they artificially assigned weights to the set of sketches obtained from $S_b$ so that their distribution of the number of black pixels matched the one for the sketches obtained from $S_y$, and presented the sketches obtained from $S_b$ to subjects with frequency given by the inverse of these weights, rather than with uniform weights as was otherwise done for the other sets of sketches. In other words, they presented subjects with sketches of the $S_b$ family favoring, in the random selection done to choose the next sketch to present to the subject, sketches with fewer black pixels. The percentage of correct answers in this test is represented by the striped blue bars in figure 1.9, and it shows no difference from the blue bars, i.e. the percentage of correct answers given by subjects when presented with sketches drawn uniformly from the $S_b$ family.

Figure 1.8: Patterns selected by the authors of [9] using the heuristic solution with N = 50 and W = 0.05

Furthermore, by organizing the acquired data into appropriate classes, the authors plotted (figure 1.11) the percentage of correct answers as a function of the number of black pixels rather than of the individual subjects. If the increased discrimination capacity were really due to the number of black pixels, these would be plots of increasing functions, or at the very least the percentage of correct answers for images obtained from $S_y$ with many black pixels would be greater than that for images obtained from $S_b$ with few black pixels; this is clearly not the case.

Future tests will include presenting to subjects patterns obtained using an optimal filter solving problem (1.2.11); as we will see in the second part of this thesis, this solution has a high computational cost, and this leads the authors to postulate that the limited neurological resources in our brain couldn't possibly calculate it. On this basis, the heuristic solution might actually select a set of patterns more similar to the ones supposedly selected by the human vision system.

Figure 1.9: See text


Figure 1.10: Distribution of the number of black pixels in sketches obtained from Sb and Sy

Figure 1.11: Percentage of correct answers for sketches obtained from Sb (in blue) and Sy (in yellow)


Chapter 2

Solving the optimization problem: an explicit algorithm

We have seen that the optimization problem we wish to solve can be written in standard integer programming form as
\[
\max \Big\{ \sum_{i=1}^n x_i p_i \log \frac{1}{p_i} \ \text{s.t.}\ \forall i\ x_i \in \{0, 1\} \ \text{and}\ \sum_{i=1}^n x_i p_i \leq W ,\ \sum_{i=1}^n x_i \leq N \Big\} . \tag{BPV}
\]
This is a classic combinatorial optimization problem, and there are many software libraries to solve it numerically, such as the PuLP Python module (https://pypi.python.org/pypi/PuLP/1.6.0), which is a convenient wrapper for other specialized solvers. However, we wanted to try and develop an algorithm specific to BPV, in the hope of gaining some performance and some insight into the problem. Taking inspiration from the most basic way of solving the Knapsack problem, we developed an algorithm which, while correct, performs very poorly and is of little practical use (the PuLP library solves BPV much, much faster). We do believe, however, that the detailed study of this algorithm and its shortcomings has some theoretical relevance: first of all, thanks to the framework we'll develop, it will be clear how it's possible to have multiple optimal solutions; secondly, the fact that performance is so poor even after some non-trivial optimizations suggests that BPV is not an easy problem, as often happens with combinatorial optimization problems.

Since the method we developed is closely inspired by the classical dynamic programming approach used for the Knapsack problem, specifically as developed in [7], we will begin this second part of the thesis by reviewing it. The primary purpose of this discussion is to build intuition about the method in this slightly simpler case (simpler to understand, not to solve), since much of this intuitive understanding carries over to the later sections, where we'll work in a more complex framework.

For any $W \in \mathbb{R}$, $W \geq 0$, the Knapsack problem with integer weights and values can be stated as
\[
\max \Big\{ \sum_{i=1}^n x_i v_i \ \text{s.t.}\ \forall i\ x_i \in \{0, 1\} \ \text{and}\ \sum_{i=1}^n x_i w_i \leq W ,\ \text{with}\ \forall i\ v_i, w_i \in \mathbb{N} \Big\} , \tag{KP}
\]


where $v_i, w_i$ are respectively the values and weights of some given objects. (KP) models the problem of trying to stock a container of fixed capacity $W$ with the most valuable combination of objects; each object has a weight and the total weight of the chosen objects can't exceed $W$. Observe that we can always suppose $0 \neq w_i < W$ and $0 \neq v_i$ for every $i \in \{1, \dots, n\}$: otherwise we'd have an object that would always either belong or not belong to any optimal solution, and as such we could eliminate its corresponding variable.

It's apparent that if we think of patterns as objects, each with entropy yield $p_i \log \frac{1}{p_i}$ and bandwidth weight $p_i$, (BPV) can be thought of as a Knapsack problem with an additional constraint on the cardinality of the solution.

There are however two more important differences: firstly, while in the Knapsack problem the weights and values of the objects are chosen independently, the entropy yields and the weights of the patterns are both direct functions of the same probability vector. Secondly, the weights and values in (KP) are natural numbers, while in (BPV) rates and probabilities are real numbers. The algorithm we’ll describe works without modification for both cases, but having real numbers substantially worsens the time complexity, as we’ll see in detail in section 2.6.

The Knapsack problem is NP-hard, which implies that, if P ≠ NP, there is no algorithm which is substantially better than just trying all feasible combinations (i.e. those that respect the constraint) and choosing one that gives maximum value. Here by "substantially better" I mean an algorithm that has time complexity polynomial in the size of the input data (whereas the naive approach of trying all possible combinations is clearly exponential in n).

Whether (BPV) is NP-hard or not is, to our knowledge, unproven: one would need to study the literature and see whether the proofs of (KP)'s NP-hardness can be adapted to (BPV). The only thing that would seem to set our problem apart from the $D = 2$ case of the $D$-dimensional Knapsack problem, which can be stated as
\[
\max \Bigg\{ \sum_{j=1}^n x_j v_j \ \text{s.t.}\ \forall j\ x_j \in \{0, 1\} \ \text{and}\
\begin{pmatrix} w_{11} & \dots & w_{1n} \\ w_{21} & \dots & w_{2n} \\ \vdots & & \vdots \\ w_{D1} & \dots & w_{Dn} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\leq
\begin{pmatrix} W_1 \\ W_2 \\ \vdots \\ W_D \end{pmatrix}
\quad \text{with}\ \forall i = 1, \dots, D\ \forall j = 1, \dots, n,\ w_{ij} \in \mathbb{N},\ W_i \in \mathbb{R},\ 0 < w_{ij} \leq W_i \Bigg\} ,
\]
is the already mentioned fact that in BPV weights and values are not chosen independently. However, since we weren't able to effectively leverage the dependency between entropy yields and bandwidth costs at any stage of our study, we suspect that BPV is also NP-hard.

In the next sections we will see how our problem can be reformulated (in two different ways) on the decision graph: each subset $S$ of $\{1, \dots, k\}$ (for some $k \leq n$) is represented by a path in the graph, the last node of this path having two children; adding one child or the other to the path means either taking or not taking the $(k+1)$-th object, i.e. considering the sets $S_1 = S$ or $S_2 = S \cup \{k+1\}$ respectively. By finding an optimal path on the graph we can find an optimal set that solves our problem.

In section 2.2 we will discuss the decision graph formulation for the Knapsack problem, which is a reinterpretation of the classic dynamic programming approach. This will lead us to state and prove the correctness, in section 2.3, of a general algorithm to find an optimal path on the type of graphs we're interested in. We'll then be able to reformulate BPV as a decision graph problem in two slightly different ways, in sections 2.4 and 2.5 respectively, which we can solve by adapting the general algorithm for optimal paths. Finally, we'll discuss the shortcomings of our approach in section 2.6, and propose a variation of the algorithm that improves on these, albeit giving only an approximation of the optimal solution.

2.1 Solving Knapsack with Dynamic Programming

While, if P ≠ NP, there is no polynomial-time solution to the Knapsack problem, one can still hope to find an algorithm for solving it which is, if not substantially faster, at least more telling about the nature of the solution than the naive algorithm of trying all possible combinations. Such an algorithm is given by the dynamic programming paradigm, which in its generality "is a method for solving a complex problem by breaking it down into a collection of simpler subproblems".

In practice, for $k \leq n$ and $\mu \leq W$, one can define $V_{k,\mu}$ as the value of a subproblem $KP_{k,\mu}$, as follows:
\[
V_{k,\mu} \stackrel{\text{def}}{=} \max \Big\{ \sum_{i=1}^k x_i v_i \ \text{s.t.}\ \forall i\ x_i \in \{0, 1\} \ \text{and}\ \sum_{i=1}^k x_i w_i = \mu \Big\} . \tag{2.1.1}
\]
I.e. $KP_{k,\mu}$ is the problem of finding, among the sets of objects in $\{1, \dots, k\}$ that have compound weight $\mu$, the one with the best total value. Let us also define $S_{k,\mu}$ as any set of indexes that achieves the optimal value for $KP_{k,\mu}$; we then have $V_{k,\mu} = \sum_{i \in S_{k,\mu}} v_i$.

We can find a recursive rule which will let us calculate all the $V_{k,\mu}$ by reasoning like this: given an optimal solution $S_{k,\mu}$ for $KP_{k,\mu}$, the $k$-th object can either be part of this solution or not. If $k \notin S_{k,\mu}$, then $S_{k,\mu} \subseteq \{1, \dots, k-1\}$, which implies it is feasible and optimal for $KP_{k-1,\mu}$ (since a better solution for $KP_{k-1,\mu}$ would also be feasible for $KP_{k,\mu}$, violating $S_{k,\mu}$'s optimality). Therefore:
\[
V_{k,\mu} = V_{k-1,\mu} .
\]
If instead $k \in S_{k,\mu}$, then solving $KP_{k,\mu}$ can be thought of as taking $k$, setting it apart and, to find out which of the previous indexes are in $S_{k,\mu}$, solving $KP_{k-1,\mu-w_k}$. Thus in this case we will have:
\[
V_{k,\mu} = v_k + V_{k-1,\mu-w_k} .
\]
Since we don't know a priori whether $k$ will belong to $S_{k,\mu}$ or not, we'll synthesize the above relations by taking the maximum (if it is at all possible to take the $k$-th object without violating the constraint):
\[
V_{k,\mu} = \begin{cases} V_{k-1,\mu} & \text{if } w_k > \mu \ \text{(i.e. we can't take } k\text{)} \\ \max\{V_{k-1,\mu},\ v_k + V_{k-1,\mu-w_k}\} & \text{otherwise} \end{cases} \tag{2.1.2}
\]
The computation now amounts to filling in an $n \times W$ table, since what we are interested in is the greatest among the $V_{n,\mu}$, for $\mu \in \{1, \dots, W\}$. We need to initialize the first column and first row like this:
\[
V_{k,0} = 0 \quad \forall k \in \{0, \dots, n\} , \qquad V_{0,\mu} = -\infty \quad \forall \mu \in \{1, \dots, W\} . \tag{2.1.3}
\]


Not only can we now iteratively compute the optimal value of (KP), but, if we keep track of which choice is made in the max in (2.1.2) at each step of the iterative process, we'll also know all the sets of indexes that give us this optimal value. In fact, for any $k, \mu$, if $V_{k-1,\mu}$ wins in the maximum, that means that the $k$-th object was not taken; if instead $v_k + V_{k-1,\mu-w_k}$ is bigger, the $k$-th object is indeed part of the solution. If the two choices are equivalent, there will be at least two sets of indexes that yield the value $V_{k,\mu}$ (one with and one without $k$), and these cases are precisely what gives rise to the non-uniqueness of the solution.

Since for every cell we simply evaluate (2.1.2), the time complexity is given by the size of the table, and is thus $O(nW)$.

To better explain how this can be implemented and gain some intuition on the method, we will detail a simple example.

Example 1. Suppose we have 4 objects with the weights and values in table 2.1, and suppose the capacity is $W = 5$. To calculate all the $V_{k,\mu}$ we can think of filling in a table such that the cell $(k, \mu)$ holds the value $V_{k,\mu}$. The first row and the first column of the table are initialized according to (2.1.3), and (2.1.2) is applied iteratively from left to right and from top to bottom (see figure 2.1).

object        1   2   3   4
weight w_i    2   1   3   2
value  v_i    3   1   4   3

Table 2.1: weights and values of the objects

Figure 2.1: The fully computed table of the $V_{k,\mu}$ values. Every cell $(k, \mu)$ has an incoming arrow indicating the choice made in (2.1.2) for the computation of $V_{k,\mu}$; cells with two incoming arrows are those for which either choice in the maximum is equivalent, and they are colored in red.

Let's see for example how cell (2, 3) is calculated. The weight of object 2 is 1, which is not greater than $\mu = 3$, so in (2.1.2) we have to evaluate the maximum. We have to compare the cell right above with the cell on that same row but farther to the left by the object's weight (in this case 1); to the latter we have to add the object's value, in this case 1. Here the left cell wins, thus we write its value plus the value of the object in cell (2, 3). The algorithm would then move on to calculate cell (2, 4).

In this case the solution is not unique: objects 1 and 4 are interchangeable, therefore if there is a solution containing only one of them, there will be at least another equivalent solution. However, we can explicitly find all sets of objects that achieve the optimal value 7: $S_1 = \{1, 3\}$, $S_2 = \{3, 4\}$ and $S_3 = \{1, 2, 4\}$ (incidentally, here they all have maximum weight, but it's not true in general that different solutions achieving the optimal value will have the same weight). To do this we first have to find the highest-valued cell (this will always be in the last row, though not necessarily in the last column), and from there, having reversed all the arrows, we simply "walk back" until we reach the first column. For every path found like this we have a solution, composed of the first indexes of the cells with an outgoing oblique arrow. In other words, we're adding to the solution those $k$ for which the second argument won the maximum in (2.1.2); see figure 2.2.

Every red cell (those for which the maximum in (2.1.2) was achieved by both the values being compared) now has (if at all) two arrows coming out of it, and thus we can conclude that, if the highest valued cell is red, there are 2 + ξ equivalent solutions, where ξ is the number of other red cells encountered when going backwards from the highest valued cell.

Figure 2.2: Solutions S1={1, 3}, S2={3, 4} and S3={1, 2, 4} as paths
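The table-filling procedure of example 1 translates almost directly into code. The sketch below (an illustrative implementation, not the thesis's own software) fills the $V_{k,\mu}$ table using (2.1.2) and (2.1.3) and then walks back through the recorded choices to recover one optimal set; on the data of example 1 it returns the optimal value 7 with the set {1, 3}.

# Dynamic programming for (KP) with integer weights, following (2.1.2)-(2.1.3).
NEG_INF = float("-inf")

def knapsack_table(weights, values, W):
    """Fill V[k][mu] = best value of a subset of objects 1..k with weight exactly mu."""
    n = len(weights)
    V = [[NEG_INF] * (W + 1) for _ in range(n + 1)]
    for k in range(n + 1):
        V[k][0] = 0.0                             # first column: the empty set
    for k in range(1, n + 1):
        w, v = weights[k - 1], values[k - 1]
        for mu in range(1, W + 1):
            V[k][mu] = V[k - 1][mu]               # don't take object k
            if w <= mu and V[k - 1][mu - w] + v > V[k][mu]:
                V[k][mu] = V[k - 1][mu - w] + v   # take object k
    return V

def best_solution(weights, V):
    """Walk back from the highest-valued cell of the last row to recover one optimal set."""
    n, W = len(weights), len(V[0]) - 1
    mu = max(range(W + 1), key=lambda m: V[n][m])
    best, chosen = V[n][mu], []
    for k in range(n, 0, -1):
        if V[k][mu] != V[k - 1][mu]:              # the "take object k" branch won
            chosen.append(k)
            mu -= weights[k - 1]
    return best, sorted(chosen)

# Example 1: weights (2, 1, 3, 2), values (3, 1, 4, 3), capacity 5.
print(best_solution([2, 1, 3, 2], knapsack_table([2, 1, 3, 2], [3, 1, 4, 3], 5)))
# -> (7.0, [1, 3])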

Remark 2. The table method we used in example 1 also works if the values $v_k$ are real numbers, but the weights $w_k$ need to be natural numbers: in fact, we must be able to add weights to a column index and obtain another column index. In order to use real numbers for weights, we could for example multiply all weights by $10^M$, where $M$ is the number of significant figures used to represent them, and change the capacity from $W$ to $W 10^M$; with this trick the weights become integers but the solutions don't change, so the approach of example 1 can be used. The problem is that the table now has $W 10^M$ columns. In the next section we'll reformulate the problem as a maximum-value path on a graph, and this will work without modification if any of $v_i$, $w_i$ and $W$ are real numbers.

Remark 3. From example 1 it's apparent that all cells with $-\infty$ don't lead anywhere, in the sense that they will never win a comparison against a finite value. Thus in an actual implementation only cell (0, 0) of the first row needs to be considered.

Remark 4. The heuristic approximation $S_{\mathrm{heur}}$ used for (1.2.11) translates here to ordering the objects by decreasing ratio $\frac{v_i}{w_i}$ and adding them to the candidate solution in this order, until the weight constraint is violated. In example 1, this would give us the set of objects {1, 4}, which has value 6 and is thus not optimal.

Remark 5. There are other ways to define subproblems that, when solved, give the solution to the original problem. For instance, instead of using (2.1.1), one could define $W_{k,\varphi}$ as the minimum weight of a solution consisting of objects in $\{1, \dots, k\}$ and having value $\varphi$, i.e.:
\[
W_{k,\varphi} \stackrel{\text{def}}{=} \min \Big\{ \sum_{i=1}^k y_i w_i \ \text{s.t.}\ \forall i\ y_i \in \{0, 1\} \ \text{and}\ \sum_{i=1}^k y_i v_i = \varphi \Big\} . \tag{2.1.4}
\]
Reasoning on whether the $k$-th object is in the solution or not, exactly as for (2.1.2), we obtain a similar recursion rule:
\[
W_{k,\varphi} = \begin{cases} W_{k-1,\varphi} & \text{if } v_k > \varphi \ \text{(i.e. we can't take } k\text{)} \\ \min\{W_{k-1,\varphi},\ w_k + W_{k-1,\varphi-v_k}\} & \text{otherwise} \end{cases} \tag{2.1.5}
\]
which can be used to compile a table, exactly as in example 1. The optimal value of (KP) is then
\[
\max \{\varphi \mid W_{n,\varphi} \leq W\} ,
\]
and can thus be retrieved from the last row of the table.

Remark 2 translates here into the necessity for the values, not the weights, to be natural numbers. Again, one could multiply the values by $10^M$ for $M$ big enough to make them all integers, at the cost of the number of columns now being $V 10^M$ instead of $V$ (where $V \stackrel{\text{def}}{=} \sum_{i=1}^n v_i$).

2.2 KP's decision graph algorithm

In this section we want to introduce informally the concept of the decision graph and the proposed algorithm to solve an optimal value path problem; in the next section we’ll formalize these concepts and prove the correctness of the algorithm.

From example 1 one can already guess how the problem can actually be formulated as a maximum-value path problem on a graph. Instead of a table we draw a plane with $k$ on one axis and $\mu$ on the other, and a node at point $(k, \mu)$ for every subproblem $KP_{k,\mu}$. Each node $a = (k, \mu)$ is given a label $\alpha_a$, at all times equal to the best known value of a path leading to it: initially it is not defined (or initialized to $-\infty$), and it equals $V_{k,\mu}$ at the end of the algorithm. If the node is not in the last column (i.e. if $k < n$), it has two outgoing edges, going to nodes $(k+1, \mu)$ and $(k+1, \mu + w_{k+1})$ respectively. The first edge is horizontal and corresponds to not taking the $(k+1)$-th object; it has label 0 and we'll call these type 0 edges. The second one is not horizontal and corresponds to taking the $(k+1)$-th object; it has label $v_{k+1}$ and we'll call these type 1 edges. If there is an edge going from node $a$ to node $b$, we'll call its label $\beta_{a,b}$; it can either be 0 if the edge is of type 0, or $v_{k+1}$ if it is of type 1; see figure 2.3.

We won't take type 1 edges that would reach a node with second coordinate greater than $W$, since we don't need to solve subproblems $KP_{k,\mu}$ with $\mu > W$; this is a pruning technique which allows us to explore a smaller portion of the decision graph.

Remark 6. The edge labels are constant throughout the algorithm, while the node labels get updated as the graph is explored - as said, they’ll finally be equal to Vk,µ.

This is called the decision graph because a path from the root node to any node $a$ represents a set of objects, precisely all objects $k$ for which the $k-1 \to k$ edge in the path was of type 1 (the correspondence "path ↔ set of objects" is the same as that in figure 2.2). At every node $(k, \mu)$ we can decide whether to take the outgoing type 1 edge or the type 0 one, i.e. we decide whether to take the $(k+1)$-th object or not; the edge label represents how much the value of the corresponding solution increases by taking that edge. The label $\alpha_a$ of a node $a$ is the maximum value over all the paths from (0, 0) to $a$, where the value of a path is the sum of the labels of the edges that compose it. Thus, since only type 1 edges have a positive label, $\alpha_a$ is the sum of the values of the objects in the set corresponding to $a$.

The idea is then to dynamically construct the decision graph by starting from root node (0, 0) and taking all possible paths, computing node coordinates and their labels on the fly, in order to find the node with the highest label and thus the set of objects that compose an optimal solution for (KP). We will do this using a slight adaption of the classic shortest path tree algorithm for directed acyclic graphs (DAG).

First of all, note that the graph is indeed acyclic, since all edges go from a node with $k$ as first coordinate to one with $k + 1$. Like any DAG it can thus be topologically sorted: one can assign a number $\delta_{k,\mu} \in \mathbb{N}$ to any node $(k, \mu)$ in such a way that, if an edge from $(k_1, \mu_1)$ to $(k_2, \mu_2)$ exists, then $\delta_{k_1,\mu_1} < \delta_{k_2,\mu_2}$ must hold. To establish such an order, it suffices to choose an ordering of the nodes that gives precedence to the first coordinate, i.e. for every $k_1 < k_2$, $\delta_{k_1,\mu_1}$ must be strictly smaller than $\delta_{k_2,\mu_2}$, for any choice of $\mu_1$ and $\mu_2$. Visiting the nodes in such an order guarantees that we'll always have examined all the incoming edges of a node before visiting the node itself.

The procedure of the algorithm is as follows: starting from the root node (0, 0), we iterate on the nodes following the topological sorting. For every node $\tau$ we consider its two sons, and for each of them we consider the node label of $\tau$ plus the label of the edge connecting it to $\tau$: this number represents the value of the path from the root node to the son composed of the optimal path from the root node to $\tau$ plus the last edge, connecting $\tau$ and the son. If it is greater than the current node label of the son (which is initialized to $-\infty$ but could already have been updated by this same procedure, since the son could be a son of more than one node), then the son's node label is updated, and an array predecessor is updated to reflect that the best known predecessor of the son is $\tau$. More formally, for every node $\tau = (k, \mu)$ and for each $i = 1, 2$, $\sigma_i$ being its two sons $(k+1, \mu)$ and $(k+1, \mu + w_{k+1})$, this piece of code is executed:

if α_τ + β_{τ,σ_i} > α_{σ_i} then
    α_{σ_i} = α_τ + β_{τ,σ_i}
    pred[σ_i] = τ
end if

The condition in the first line is known as the Bellman condition, and it corresponds to evaluating the max in the second line of equation (2.1.2). If the two quantities being compared are actually equal, this means that τ will give its son a value equal to the one it already had; in other words there are equally valued paths leading from the root node to the son.

Every time a node label $\alpha_\sigma$ is updated, its value is compared to a global variable best_value, initialized to $-\infty$; if $\alpha_\sigma$ is the greater of the two, then best_value is set to $\alpha_\sigma$ and best_node, another global variable initialized to None, is set to $\sigma$. In this way, at the end of the algorithm, recursively using the predecessor array and starting from best_node, we can rebuild the highest-valued path in the graph.

Figure 2.3: Fully computed decision graph of example 1. The small black numbers on the graph are the node and edge labels (only labels of type 1 edges are shown, since type 0 edges always have label 0). The red nodes are those for which both incoming edges give the same value. Node (4, 4) has two incoming edges, but the type 0 edge would give it a smaller label than the type 1 edge, so it's discarded. The colored arrows represent the solutions, just like in figure 2.2.

Remark 7. In an actual implementation one could avoid visiting some nodes; for example, when visiting node $(k, \mu)$, we will consider the edge going to $(k+1, \mu + w_{k+1})$ only if $\mu + w_{k+1} \leq W$. In section 2.6 we'll examine in detail some of these pruning strategies for (1.2.11).

Remark 8. Since the number of operations is constant for every iteration, the time complexity is simply given by the number of nodes. For each $k$ there are at most $W$ nodes with first coordinate $k$, thus the time complexity is, just like for the table approach, $O(nW)$. This is a pseudo-polynomial time complexity, which means that while it's polynomial in the numerical value $W$, it's exponential in the number of bits needed to represent $W$ in memory: if we call $\theta = \log_2 W$ the approximate length of $W$'s binary representation, the time complexity can be written as $O(n 2^\theta)$. This observation doesn't apply to $n$ because it's not an input of the algorithm: the inputs are $W$ and the arrays $(v_i)_{i=1}^n$ and $(w_i)_{i=1}^n$, whose size in memory doesn't change based on how we represent $n$.

Remark 9. The construction of the decision graph (whether using $V_{k,\mu}$ or $W_{k,\varphi}$) is exactly the same if any of $v_i$, $w_i$ or $W$ are real numbers instead of integers.

Remark 10. One could have alternatively stated (2.1.1) as
\[
V_{k,\mu} \stackrel{\text{def}}{=} \max \Big\{ \sum_{i=1}^k x_i v_i \ \text{s.t.}\ \forall i\ x_i \in \{0, 1\} \ \text{and}\ \sum_{i=1}^k x_i w_i \leq \mu \Big\} , \tag{2.2.1}
\]
i.e. with a $\leq$ instead of $=$ in the constraint (the first row of the table should then be initialized to 0 instead of $-\infty$). While this could seem more intuitive, the equivalent graph problem would not be as clean as the decision graph, since there would be $W - w_1$ root nodes from which to start the graph visit (to see this, think of the second row of the table method: if the first row is set to $-\infty$, only one element in the second row will be non-zero, namely $(1, w_1)$, while if the first row is entirely 0, all cells $(1, \mu)$ with $\mu \geq w_1$ will be equal to $v_1$). This difficulty is easily circumvented, but it can be avoided altogether (and make for a cleaner graph) by using the definition of the subproblems with the $=$ sign in the constraints.

Remark 11. If we wish to find an optimal solution to KP using the subproblems of remark 5, we can still construct a decision graph and find an optimal solution by exploring it. There will be a node $(k, \varphi)$ for every subproblem, the labels on the nodes will now be $W_{k,\varphi}$ instead of $V_{k,\mu}$, and the labels on the edges will be the weights $w_k$ instead of the values $v_k$. We'll still visit the graph by keeping two lists, one for the nodes with first coordinate $k$ and one for their sons, and for every visited node we check the Bellman condition and update the predecessor array and the best_node variable. The number of operations per iteration is again constant, so, given that for fixed $k$ there are at most $V$ nodes $(k, \varphi)$, where $V = \sum_{i=1}^n v_i$, the time complexity is $O(nV)$. The same observation as in remark 8 holds here: this time complexity is pseudo-polynomial.

2.3 Correctness of decision graph algorithm

The decision graph algorithm is a special case of an optimal path algorithm for DAGs. In this section we want to formally state this more general problem, illustrate an algorithm and prove its correctness. As a consequence we’ll thus have that the decision graph algorithm, both as detailed in the previous section for the Knapsack problem and in the next one for (BPV), is indeed correct and gives us an optimal solution. Please note the notations used in this section have no relation to the other sections.

Let $G = (N, E)$ be a graph, $N = \{1, \dots, n\}$ its set of nodes and $E \subset N \times N$ its set of edges. For every node $k \in N$ and for every edge $e \in E$, we have a node and an edge label, which are simply numbers $\alpha_k, \beta_e \in \mathbb{R}$; the edge labels are considered immutable and an input of our problem, while the node labels are an auxiliary quantity we introduce and which we'll reassign multiple times during the execution of the algorithm. We suppose the graph is directed and acyclic (a DAG), which means we can define a topological sorting

\[
\delta : N \to \mathbb{N}
\]
such that
\[
(a, b) \in E \ \Rightarrow\ \delta(a) < \delta(b) .
\]
In order to simplify notation, we'll suppose we have reindexed the nodes according to the order provided by $\delta$; this way the previous property can be restated as
\[
(a, b) \in E \ \Rightarrow\ a < b .
\]

We'll call node 1 the root node, and we'll suppose that all nodes are reachable starting from the root node. A path is a subset $S \subset N$,
\[
S = \{s_1, \dots, s_k\} \quad \text{such that} \quad \forall j < k \ (s_j, s_{j+1}) \in E ,
\]
i.e. it's a succession of adjacent edges. We'll be interested in paths that start at the root node, and we'll denote the set of these paths with $\mathcal{S}_1$; $\mathcal{S}_{1,k}$ is the set of paths starting at the root node and terminating in $k \in N$. We'll call the value of the path $S$ the quantity
\[
V(S) = \sum_{j=1}^{k-1} \beta_{(s_j, s_{j+1})} , \quad \text{where } S = \{s_1, \dots, s_k\} .
\]

For every node $k$ we'll define its forward star as
\[
\gamma(k) = E \cap \bigcup_{j \in N} (k, j) ,
\]
i.e. $\gamma(k)$ is the set of outgoing edges from $k$; with a slight abuse of notation we'll identify it with the set of successors of $k$, writing $j \in \gamma(k)$ when $(k, j) \in E$. The fact that the graph is a topologically sorted DAG implies
\[
\forall k \in N \quad \gamma(k) \subset \{k+1, \dots, n\} ,
\]
and the condition that every node is reachable from the root node may then be written as
\[
\forall k \in N,\ k \neq 1 \quad \exists h < k \ \text{s.t.}\ k \in \gamma(h) . \tag{2.3.1}
\]
The problem we want to solve is finding the path starting at the root node with the greatest value; formally:

\[
\max_{S \in \mathcal{S}_1} V(S) . \tag{2.3.2}
\]

Algorithm 1 Optimal path algorithm for DAGs

 1: ∀k ∈ N  α_k = −∞
 2: α_1 = 0
 3: predecessor[1] = 1
 4: best_node = None
 5: best_value = −∞
 6: for k = 1, . . . , n do
 7:     for j ∈ γ(k) do
 8:         if α_k + β_{k,j} > α_j then
 9:             α_j = α_k + β_{k,j}
10:             predecessor[j] = k
11:             if α_j > best_value then
12:                 best_node = j
13:                 best_value = α_j
14:             end if
15:         end if
16:     end for
17: end for
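A direct Python transcription of Algorithm 1 might look as follows (an illustrative sketch, not the thesis's implementation; the representation of the graph as a dictionary of successors with edge labels is an assumption). Nodes are assumed to be numbered 1, ..., n in topological order, as above.

# Sketch of Algorithm 1: maximum-value path from the root of a topologically
# sorted DAG. succ[k] maps node k to a dict {j: beta_kj} of successors and edge labels.
def optimal_path(n, succ):
    alpha = {k: float("-inf") for k in range(1, n + 1)}   # node labels (line 1)
    alpha[1] = 0.0                                        # root label (line 2)
    predecessor = {1: 1}                                  # line 3
    best_node, best_value = None, float("-inf")           # lines 4-5

    for k in range(1, n + 1):                             # line 6
        for j, beta in succ.get(k, {}).items():           # line 7
            if alpha[k] + beta > alpha[j]:                # Bellman condition (line 8)
                alpha[j] = alpha[k] + beta                # line 9
                predecessor[j] = k                        # line 10
                if alpha[j] > best_value:                 # line 11
                    best_node, best_value = j, alpha[j]   # lines 12-13

    # Rebuild the optimal path by walking the predecessor array back to the root.
    path, node = [best_node], best_node
    while node != 1:
        node = predecessor[node]
        path.append(node)
    return best_value, path[::-1]

# Toy DAG with four nodes already in topological order.
succ = {1: {2: 3.0, 3: 0.0}, 2: {4: 4.0}, 3: {4: 1.0}}
print(optimal_path(4, succ))   # -> (7.0, [1, 2, 4])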

Theorem 2.3.3. At the end of Algorithm 1's execution, the path $\pi(\mathrm{best\_node})$ is a solution to problem (2.3.2), where
\[
\forall k \in N \quad \pi(k) = \bigcup_{j \in \mathbb{N}} \pi_j(k) \qquad \text{and} \qquad \begin{cases} \pi_1(k) = k \\ \pi_{j+1}(k) = \mathrm{predecessor}[\pi_j(k)] \end{cases}
\]

Proof. Consider, for $\delta = 1, \dots, n$, the following proposition:

P($\delta$): at the end of iteration $\delta$, for every $k \leq \min(n, \delta + 1)$, $\pi(k)$ is a solution to $\max_{S \in \mathcal{S}_{1,k}} V(S)$ and its value is $\alpha_k$.

It's clear that if P($n$) is true then we're finished: best_node will be the node with the best label (since every time we update a label we check whether it's better than the previous best one), and thus searching for the maximum of $V(S)$ among $S$ in $\mathcal{S}_{1,\mathrm{best\_node}}$ or in $\mathcal{S}_1$ is equivalent.

We'll prove P($\delta$) by induction on $\delta = 1, \dots, n$. For the base case P(1), consider that during the first iteration all nodes in $\gamma(1)$ are visited, and among them must be node 2 (remember we're supposing all nodes are reachable from the root, and node 2 wouldn't be reachable otherwise because of the DAG property). When node 2 is visited, $\alpha_2$ and predecessor[2] are updated to $0 + \beta_{1,2} = \beta_{1,2}$ and 1 respectively. Thus $\pi(2) = \{1, 2\}$ is the only element of $\mathcal{S}_{1,2}$ and has value $V(\pi(2)) = \beta_{1,2} = \alpha_2$, while $\pi(1) = \{1\}$ with value $0 = \alpha_1$ is the only element of $\mathcal{S}_{1,1}$, so P(1) is true.

Supposing P($\delta$) to be true, we now want to prove P($\delta + 1$). If $k \leq \delta + 1$ then, by the inductive hypothesis, $\pi(k)$ is a solution to $\max_{S \in \mathcal{S}_{1,k}} V(S)$ and $\alpha_k$ is its value. Then let $k = \delta + 2$; the idea is that at the end of iteration $\delta + 1 = k - 1$ the node giving the best possible value to $k$ will be stored in predecessor[$k$]. Formally, we're saying that, because of (2.3.1),

{h ∈ {1, . . . , δ + 1} s.t. k ∈ γ(h)} 6= ∅ holds, and thus we can choose among the best such indexes:

¯ h

h∈{1,...,δ+1}, k∈γ(h)argmax

αh+ βh,k . (2.3.4)

Because of line 9 and 10 respectively we’ll then have αk= α¯h+ β¯h,k

and

¯

h = predecessor[k] . By definition of π we’ll have

π(k) = k∪ π(¯h)

={k, ¯h, predecessor[¯h], predecessor[predecessor[¯h]], . . . , 1} and, by definition of V and inductive hypotheses,

V (π(k)) = V (π(¯h)) + β¯h,k

= α¯h+ β¯h,k

= αk .

We finally need to prove that π(k) is a solution toS∈S1,kmax V (S); suppose ρ = {ρ1, . . . , ρl} ∈ S1,k to be

a better path to k, i.e. V (ρ) > V (π(k)). If we define ρ0 =

{ρ1, . . . , ρl−1}, by inductive hypotheses we have V (ρ) = V (ρ0) + βρl−1,k = αρl−1+ βρl−1,k > αk = α¯h+ β¯h,k

which is absurd for the definition of ¯h, since ρl−1∈ {1, . . . , δ + 1} and k = ρl∈ γ(ρl−1). 

Remark 12. There is a non-unique solution to (2.3.2) when there are two different paths S^1, S^2 ∈ S_1 such that V(S^1) = V(S^2) = max_{S ∈ S_1} V(S). This can happen if, during the graph visit done by algorithm 1, one of these two situations occurs:

1. a choice has to be made in (2.3.4), or equivalently the two sides of the inequality at line 8, known as the Bellman condition, are actually equal. This happens when there are two paths with the same value ending in the same node.

2. the two sides of the inequality at line 11 are actually equal. This happens when there are two paths with the same value, which is the best value known up to that point.


If in those two lines we substitute the strict inequality > with a ≥, the algorithm is still correct - it would simply, in case of multiple solutions, output the last optimal path found instead of the first.

In the actual python implementation of algorithm 2, which is an adaptation of algorithm 1 to solve (1.2.11) and which will be discussed in detail in the following section, predecessor is implemented not as an array but as a dictionary, and best_node is a list. The keys of the predecessor dictionary are nodes, and the values are lists of nodes that can equivalently be taken as predecessors; the items of the best_node list are all the nodes achieving, through the optimal path leading to them from the root node, the value best_value. At the end of the algorithm, if best_node has more than one item, or if it has a single item whose predecessor list contains more than one element, the algorithm reports that there are multiple optimal solutions.
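To make the idea concrete, the following is a minimal Python sketch (it is not the code of the BPV class, and the name relax is chosen here only for illustration) of how the relaxation step of algorithm 1 can record ties, so that multiple optimal paths are detected; exact floating-point equality is used for simplicity.

def relax(k, j, alpha, beta, predecessor):
    # one relaxation of edge (k, j), with predecessor[j] kept as a list of equivalent predecessors
    candidate = alpha[k] + beta[(k, j)]
    if candidate > alpha[j]:           # strictly better: start a new predecessor list
        alpha[j] = candidate
        predecessor[j] = [k]
    elif candidate == alpha[j]:        # equality in the Bellman condition: record the tie
        predecessor[j].append(k)

# tiny usage example
alpha = {1: 0.0, 2: -float("inf")}
beta = {(1, 2): 3.0}
predecessor = {}
relax(1, 2, alpha, beta, predecessor)  # now predecessor[2] == [1] and alpha[2] == 3.0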

Remark 13. If we wish to solve

min_{S ∈ S_1} V(S)    (2.3.5)

instead, we can do so with a very simple adaptation of algorithm 1: it suffices to initialize best_value and the node labels to +∞ instead of −∞ and reverse the inequalities on lines 8 and 11 (keeping them strict).

Remark 14. Algorithm 1 examines each edge exactly once, and for each edge it does a constant number of operations, thus time complexity is O(|E|). In the Knapsack and in (1.2.11)’s decision graph, every node has at most two outgoing edges, thus time complexity is linear in the number of nodes, i.e., with the notations of this section, it’s O(|N|).
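To close the section, here is a minimal Python sketch of algorithm 1 (it is not the thesis code; the names longest_path, gamma and beta are chosen here for illustration). It assumes the nodes 1, . . . , n are already topologically sorted, takes the forward stars and edge labels as dictionaries, and returns the best value together with the path π(best_node) reconstructed from the predecessor array; the only deviation from algorithm 1 is that best_node and best_value are initialized to the root and 0, which amounts to accepting the trivial path {1}.

import math

def longest_path(n, gamma, beta):
    # gamma[k] lists the successors of node k (its forward star),
    # beta[(k, j)] is the label of edge (k, j); nodes 1, ..., n are topologically sorted
    alpha = {k: -math.inf for k in range(1, n + 1)}   # node labels
    alpha[1] = 0.0                                    # the root node
    predecessor = {1: 1}
    best_node, best_value = 1, 0.0                    # start from the trivial path {1}
    for k in range(1, n + 1):
        for j in gamma.get(k, []):
            if alpha[k] + beta[(k, j)] > alpha[j]:    # Bellman condition (line 8)
                alpha[j] = alpha[k] + beta[(k, j)]
                predecessor[j] = k
                if alpha[j] > best_value:
                    best_node, best_value = j, alpha[j]
    path, node = [best_node], best_node               # reconstruct pi(best_node)
    while node != 1:
        node = predecessor[node]
        path.append(node)
    return best_value, list(reversed(path))

# toy usage: the two root paths 1->2->4 and 1->3->4 have values 1+5 and 2+1
gamma = {1: [2, 3], 2: [4], 3: [4]}
beta = {(1, 2): 1.0, (1, 3): 2.0, (2, 4): 5.0, (3, 4): 1.0}
print(longest_path(4, gamma, beta))                   # (6.0, [1, 2, 4])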

2.4 BPV's decision graph algorithm: H-formulation

We can finally extend the ideas from the previous sections to (1.2.11): just as for the Knapsack problem, we can define subproblems of the main problem, and among the solutions of these auxiliary problems there will be a solution to the main problem. The optimal values of the subproblems satisfy a recursion rule which we can use to iteratively calculate the solution to one problem by using the previously calculated ones for other problems. All this will have a clear and intuitive interpretation once we have reformulated the problem as an optimal path problem on the decision graph.

Drawing a parallel to (2.1.1), we wish to consider the problem of finding the optimal entropy set of pattern indexes, restricting ourselves to solutions contained in {1, . . . , k}, with cardinality ν and compound rate µ, i.e.:

∀ k ∈ {1, . . . , n}, ∀ ν ∈ ℕ, ν ≤ N, ∀ µ ∈ ℝ, 0 ≤ µ ≤ W,

H_{k,µ,ν} = max { Σ_{i=1}^{k} x_i p_i log(1/p_i)  s.t.  ∀ i  x_i ∈ {0, 1},  Σ_{i=1}^{k} x_i p_i = µ,  Σ_{i=1}^{k} x_i = ν } .    (2.4.1)

With an abuse of language, we'll use H_{k,µ,ν} to refer both to this subproblem and to its optimal value - it should be clear from context which meaning we're using. When we want to refer to the whole set of these subproblems for varying values of k, µ and ν, or simply to this approach to solving the problem, we'll talk about the H-formulation.


Remark 15. Since Σ_{i=1}^{k} x_i p_i can take at most 2^k different values in [0, W] (fewer if there are different sets of patterns with the same rate), the majority of µ ∈ [0, W] (in fact all but a set of Lebesgue measure 0) make the subproblem H_{k,µ,ν}, for any k and ν, have an empty feasible set, i.e. there are no binary strings satisfying its constraints.

Remark 16. It's easy to see how knowing the solution to all the auxiliary problems (2.4.1) gives us the solution to (1.2.11): the optimal value of the main problem is

max_{µ ≤ W, ν ≤ N} H_{n,µ,ν}

and, if µ̄, ν̄ realize this maximum, then the binary string x̄ ∈ {0, 1}^n that is a solution to subproblem H_{n,µ̄,ν̄} is also a solution to (1.2.11).

We can reason, just like for the Knapsack K_{k,µ} subproblems, on whether the k-th object will be part or not of the solution to H_{k,µ,ν}, and obtain the following recursion rule:

H_{k,µ,ν} = { H_{k−1,µ,ν}                                                  if p_k > µ (i.e. we can't take k)
            { max { H_{k−1,µ,ν} ,  p_k log(1/p_k) + H_{k−1, µ−p_k, ν−1} }  otherwise.                          (2.4.2)
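For a toy illustration of rule (2.4.2) (the rates here are made up, purely for illustration): take n = 2, p_1 = 0.1, p_2 = 0.2 and consider H_{2, 0.3, 2}. Since p_2 = 0.2 ≤ µ = 0.3 we are in the second branch, and

H_{2, 0.3, 2} = max { H_{1, 0.3, 2} ,  p_2 log(1/p_2) + H_{1, 0.1, 1} } = 0.2 log(1/0.2) + 0.1 log(1/0.1) ,

because H_{1, 0.3, 2} has an empty feasible set (a single pattern cannot give cardinality ν = 2), so its value can be taken to be −∞, consistently with the −∞ initialization used for the Knapsack table, while H_{1, 0.1, 1} = 0.1 log(1/0.1) is realized by taking pattern 1 alone.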

Using this relationship between the values H_{k,µ,ν}, we can define the decision graph problem, which we will show is equivalent to (1.2.11). This graph problem conforms to the optimal path problem described in section 2.3, and can thus be solved using algorithm 1. We'll give an iterative definition of the nodes and edges that constitute the graph, which will let us visit the graph starting from the root node (0, 0, 0) without prior knowledge of the set of nodes and edges; if we call M^H the set of nodes and E ⊂ M^H × M^H the set of edges, we'll have:

               (0, 0, 0) ∈ MH (k, µ, ν) ∈ MH ⇒            σ0(k, µ, ν)∈ MH and ((k, µ, ν), σ0(k, µ, ν))∈ E if k + 1 ≤ n σ1(k, µ, ν)∈ MH and ((k, µ, ν), σ1(k, µ, ν))∈ E if k + 1 ≤ n, µ + pk+1≤ W and ν + 1 ≤ N , (2.4.3)

where σ0 and σ1 denote the two sons a node can have:

σ0, σ1: MH→ MH

σ0(k, µ, ν) = (k + 1, µ, ν)

σ1(k, µ, ν) = (k + 1, µ + pk+1, ν + 1) .

An explicit but less useful definition of MH and E would be:

MH={(k, µ, ν) s.t. Hk,µ,ν has a non-empty feasible set}

E ={(m1, m2)s.t. m1, m2∈ MH and m2= σ0(m1)∨ m2= σ1(m1)} .

For convenience we’ll call the two type of edges type 0 and type 1 edges, formally defining an application τ assigning to every edge its type:

τ : E→ {0, 1} τ (m1, m2) =



0 if m2= σ0(m1)

1 if m2= σ1(m1) .

The labels of the decision graph are defined as follows:

• for every (k, µ, ν) ∈ M^H, the node label α_{k,µ,ν} ∈ ℝ is at all times (during the graph visit) equal to the best known entropy among the feasible solutions of H_{k,µ,ν},

• for every type 0 edge (m_1, σ_0(m_1)) the edge label will be β_{m_1, σ_0(m_1)} = 0,

• for every type 1 edge (m_1, σ_1(m_1)), if m_1 = (k, µ, ν), the edge label will be β_{m_1, σ_1(m_1)} = p_{k+1} log(1/p_{k+1}).

Remark 17. Just as for the Knapsack, this graph is a DAG, though now it's three dimensional and not planar, since the first coordinate can increase independently of the other two. To have a topological sorting it suffices to order nodes by their first coordinate, and it doesn't matter how we order nodes that have the same first coordinate: in fact nodes with the same first coordinate have no edges between them.

The optimal path problem we want to solve on this graph is

max_{S ∈ S_r} H(S) ,    (2.4.4)

where S_r is the set of paths in the graph starting from the root node r = (0, 0, 0), and, if S = {s_0, . . . , s_j}, j ≤ n, is such a path, then

H(S) = Σ_{k=0}^{j−1} τ(s_k, s_{k+1}) p_{k+1} log(1/p_{k+1}) .

For an intuitive understanding of the decision graph, we must understand how every path in it corresponds to a choice of pattern indexes. If we think of starting a graph visit at the root node (0, 0, 0), at each step taking one of the two outgoing edges means either taking a pattern index or not: if we are in node (k, µ, ν), choosing the type 1 edge means adding pattern k + 1 to our solution, and thus gaining its contribution p_{k+1} log(1/p_{k+1}) to the solution entropy, which is exactly the label of the edge we've chosen. Choosing the type 0 edge instead means we ignore pattern k + 1, and that is why the label of type 0 edges is 0.

In order to prove the equivalence of (2.4.4) with (1.2.11), we need to formalize this correspondence between paths in the graph and binary strings (x_i)_{i=1}^n. For every path S ∈ S_r, S = {s_0, . . . , s_n}⁸, we'll define the values

W(S) = Σ_{k=0}^{n−1} τ(s_k, s_{k+1}) p_{k+1} ,
κ(S) = Σ_{k=0}^{n−1} τ(s_k, s_{k+1}) .

We can then define the feasible sets of the two optimization problems (2.4.4) and (1.2.11) as, respectively⁹:

F = {S ∈ S_r s.t. W(S) ≤ W, κ(S) ≤ N} ,
G = {(x_i)_{i=1}^n ∈ {0, 1}^n s.t. W((x_i)_{i=1}^n) ≤ W, κ((x_i)_{i=1}^n) ≤ N} .

⁸ From now on we'll always suppose a path S to have exactly n + 1 nodes s_0, . . . , s_n (that is, n edges); if it has fewer, we can iteratively append type 0 sons to it without affecting the values of H(S), W(S) and κ(S).

We then have:

Theorem 2.4.5. The function F : F → G defined on S = {s_0, . . . , s_n} ∈ F as

F(S) = (τ(s_{i−1}, s_i))_{i=1}^n ,

is a bijection and it preserves the entropy, rate and cardinality of solutions, i.e.:

H(F(S)) = H(S) ,
W(F(S)) = W(S) ,    (2.4.6)
κ(F(S)) = κ(S) .

Proof. Relationships (2.4.6) follow immediately from the definitions of H(·), W(·), κ(·) on F and G and from the definition of F.

To prove injectivity, let S^1 = {s^1_0, . . . , s^1_n} and S^2 = {s^2_0, . . . , s^2_n} be two different feasible paths in F, and let

i = min {j = 1, . . . , n s.t. s^1_j ≠ s^2_j} .

Since s^1_0 = s^2_0 = r, i ≥ 1 must hold and we can say s^1_{i−1} = s^2_{i−1}. Now, both s^1_i and s^2_i must be sons of s^1_{i−1}, because a path is a succession of nodes each one son of the preceding one; since a node has at most one type 0 son and one type 1 son, the two different sons are reached through edges of different type, so τ(s^1_{i−1}, s^1_i) and τ(s^2_{i−1}, s^2_i) are different, and these are exactly the i-th components of F(S^1) and F(S^2), which therefore must be different binary strings.

To prove surjectivity, let (x_i)_{i=1}^n ∈ G; we'll define

s_0 = (0, 0, 0) ,
s_{k+1} = σ_0(s_k) if x_{k+1} = 0 ,
s_{k+1} = σ_1(s_k) if x_{k+1} = 1 .

This implies that

τ(s_k, s_{k+1}) = x_{k+1} ,    (2.4.7)

and thus S = {s_0, . . . , s_n} is indeed an element of F, because

W(S) = Σ_{k=0}^{n−1} τ(s_k, s_{k+1}) p_{k+1} = Σ_{k=0}^{n−1} x_{k+1} p_{k+1} = Σ_{k=1}^{n} x_k p_k = W((x_i)_{i=1}^n) ≤ W ,

and similarly κ(S) = κ((x_i)_{i=1}^n) ≤ N. Finally, it follows trivially from (2.4.7) that F(S) = (x_i)_{i=1}^n, so F is surjective. □
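The correspondence of theorem 2.4.5 is straightforward to compute; the following is a minimal Python sketch of it (not taken from the thesis code; the names path_to_string and string_to_path are made up for this example), using the fact that a type 1 edge is exactly one along which the third coordinate ν increases.

def path_to_string(path):
    # F(S): read off the edge types tau(s_{i-1}, s_i); nu grows by 1 exactly on type 1 edges
    return [b[2] - a[2] for a, b in zip(path, path[1:])]

def string_to_path(x, p):
    # inverse construction used in the surjectivity part of the proof: follow sigma_0 or sigma_1
    path = [(0, 0.0, 0)]                        # the root node
    for k, xk in enumerate(x, start=1):
        _, mu, nu = path[-1]
        path.append((k, mu + p[k] * xk, nu + xk))
    return path

p = [None, 0.1, 0.2, 0.3]                       # hypothetical rates, 1-indexed as in the text
x = [1, 0, 1]                                   # the binary string selecting patterns 1 and 3
S = string_to_path(x, p)                        # [(0, 0, 0), (1, 0.1, 1), (2, 0.1, 1), (3, 0.4, 2)]
assert path_to_string(S) == x                   # F(S) = (x_i), as in (2.4.7)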


Algorithm 2 BPV decision graph algorithm in the H_{k,µ,ν} formulation, with pruning. The add_node function takes care of updating the alpha and predecessor dictionaries, the best_entropy and best_node variables that keep track of the best known node, and finally of adding, if not already present, the children to next_visitlist.

 1: visitlist = list((−1, 0, 0))
 2: next_visitlist = list()
 3: while visitlist.notEmpty() do
 4:   (k, µ, ν) = visitlist.pop()
 5:   σ_0 = (k + 1, µ, ν)
 6:   σ_1 = (k + 1, µ + p_{k+1}, ν + 1)
 7:   if k + 1 < n and µ + p_{k+1} ≤ W and ν + 1 ≤ N and α[cur] + η[k + 1] ≥ best_entropy then
 8:     add_node(σ_0)
 9:     add_node(σ_1)
10:   end if
11:   if visitlist.empty() then
12:     visitlist = next_visitlist
13:     next_visitlist = list()
14:   end if
15: end while

To solve (2.4.4) we'll use algorithm 2, which is essentially algorithm 1 with a few tweaks for this specific case; the structure is slightly different and is quite close to the actual python implementation of the decgraph_solver in the BPV class (a simplified sketch of this level-by-level visit is given after the following list). A few explanations are in order:

• We're supposing the patterns to be indexed in such a way that (p_k)_{k=0}^{n−1} is an increasing sequence.

• Since arrays in python start with index 0, the first pattern in the code will be 0, and thus the root node is (−1, 0, 0).

• In order to apply the main for loop structure of algorithm 1, we would have to first explicitly build a topological sorting to number the nodes accordingly. We would then need to explore the whole graph twice: once to find the nodes and edges and build the topological sorting, and once to actually solve the optimal path problem. However, we can comply with the structure of algorithm 1, and thus be sure of finding an optimal solution thanks to theorem 2.3.3, by visiting the graph in any topological order, we don't care which. Thanks to remark 17, we can do this simply by maintaining a visitlist consisting only of nodes with the same first coordinate, and a next_visitlist where we add their sons. When we have visited all nodes in visitlist, we copy next_visitlist onto it and initialize next_visitlist to a new list.

• Line 7 consists of various checks that allow us to not visit the sons of the current node under certain conditions, which effectively limits the portion of graph we explore; for now we'll make some brief remarks, and in section 2.6 we'll analyze these conditions in more detail.
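The following is a minimal, self-contained Python sketch of this level-by-level visit (it is not the decgraph_solver of the BPV class: names and data structures are chosen freely here, only a single predecessor per node is kept, and the pruning check involving η on line 7 is omitted, leaving only the plain feasibility conditions on W and N; moreover the type 0 son is always added when k + 1 < n, while the type 1 son is added only when it keeps the constraints satisfied). Floating-point sums of rates are used directly as node coordinates, which is enough for a sketch.

import math

def decision_graph_solver(p, N, W):
    # p: pattern rates (assumed sorted increasingly, as in the text), 0-indexed;
    # returns (best entropy, chosen pattern indexes)
    n = len(p)
    entropy_term = [q * math.log(1.0 / q) for q in p]  # per-pattern contributions (any log base works)
    root = (-1, 0.0, 0)
    alpha = {root: 0.0}                                 # best known entropy for each node
    predecessor = {root: None}
    best_node, best_entropy = root, 0.0
    visitlist = [root]
    while visitlist:
        next_visitlist = set()                          # the next level: same first coordinate
        for node in visitlist:
            k, mu, nu = node
            if k + 1 >= n:
                continue
            children = [((k + 1, mu, nu), 0.0)]         # type 0 edge: pattern k+1 is skipped
            if mu + p[k + 1] <= W and nu + 1 <= N:      # type 1 edge: pattern k+1 is taken
                children.append(((k + 1, mu + p[k + 1], nu + 1), entropy_term[k + 1]))
            for child, gain in children:
                value = alpha[node] + gain
                if value > alpha.get(child, -math.inf):     # Bellman condition
                    alpha[child] = value
                    predecessor[child] = node
                    if value > best_entropy:
                        best_node, best_entropy = child, value
                next_visitlist.add(child)
        visitlist = list(next_visitlist)
    chosen, node = [], best_node                        # walk the predecessor chain back to the root
    while predecessor[node] is not None:
        parent = predecessor[node]
        if node[2] > parent[2]:                         # nu increased: a type 1 edge, pattern taken
            chosen.append(node[0])
        node = parent
    return best_entropy, sorted(chosen)

# toy usage with made-up rates: the best choice is patterns 1 and 2 (0-indexed)
print(decision_graph_solver([0.05, 0.1, 0.2, 0.3], N=2, W=0.35))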
