
ECONOMICS AND MANAGEMENT DEPARTMENT

MASTER OF SCIENCE IN ECONOMICS

MASTER DEGREE THESIS

COMBINING ASYMPTOTIC RESULTS WITH GOOD FINITE-HORIZON PERFORMANCE: AN EMPIRICAL TEST OF A NON-PARAMETRIC KERNEL-BASED PORTFOLIO STRATEGY

Candidate: Lenzi Pietro
Supervisor: Bottazzi Giulio


INDEX

SECTION 1 – INTRODUCTION
SECTION 2 – THEORETICAL FRAMEWORK
2.1 BACKGROUND AND SET UP
2.2 STRATEGY DESCRIPTION
2.3 LINK WITH MACHINE LEARNING
SECTION 3 – IMPLEMENTATION
3.1 TECHNICAL ISSUES
3.2 DATA DESCRIPTION
3.3 METHODOLOGY
SECTION 4 – ANALYSIS AND DISCUSSION OF RESULTS
SECTION 5 – FURTHER DISCUSSION ON THE METHODOLOGY
SECTION 6 – CONCLUSIONS
REFERENCES
APPENDIX
1 DATASET COMPOSITION
2 SCRIPT


SECTION 1 – INTRODUCTION

Portfolio optimization has proven to be a puzzling problem ever since its introduction with the pioneering work by Markowitz (1952). The issue consists of determining how to distribute a given capital among the available assets at the beginning of the trading period. However, an infinite array of combinations can be attempted, and the goals can vary depending on individual preferences and needs.

The initial allocation by Markowitz was aimed at achieving the optimal trade-off between return and risk, hence choosing the portfolio, among a set of efficient ones, which maximizes the return relative to the risk borne by the individual, a measure later identified as the Sharpe (1966) ratio. The setting remains valid when a utility function is introduced (Levy and Markowitz, 1979), which makes it more theoretically grounded. Despite this theoretical soundness, the mean-variance model relies on the assumption that the statistical distribution of returns is known, which is not the case in practical implementations with real data. Hence the necessity of estimating parameters. An immediate approach is the "plug-in" estimation, consisting of two steps: the moments $(\hat{\mu}, \hat{\Sigma})$ are estimated by MLE on the historical data and then substituted in the solution $w^* = \frac{1}{\gamma}\Sigma^{-1}\mu$ of the utility maximization as proxies of the true parameters. However, this turns out to be subject to severe estimation errors: Jobson and Korkie (1981) showed that even the naïve 1/N portfolio can easily outperform mean variance in practical applications. The reason for the poor performance does not lie in the utility function adopted, since for similar Arrow-Pratt risk aversion different specifications of the function result in similar portfolio allocations (Kallberg and Ziemba, 1984). It is rather the estimation of the moments that generates problems: in particular, Stein (1956) proved the sample mean to be an inadmissible estimator, since there exist other estimators dominating it for a given risk or loss function. In practice, when studying cash-equivalent losses due to errors in input estimation, Chopra and Ziemba (1993) showed that errors in means are ten times more important than errors in variances, which in turn are twice as important as errors in covariances. Indeed, this estimation tends to overweight assets with higher estimated returns and lower variance, which are, intuitively, those carrying a higher probability of error. Michaud (1989) calls this the "error maximization effect", which is further amplified by the interaction with the constraints often used in the models (Ceria and Stubbs, 2006).
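As a minimal illustration of the plug-in procedure just described, the sketch below estimates the moments from a hypothetical return matrix and substitutes them into the closed-form solution; the data, the risk-aversion value and the variable names are placeholders, not taken from the thesis.

```matlab
% Plug-in mean-variance weights (illustrative sketch, not the thesis code).
% R is a hypothetical T-by-d matrix of asset returns; gamma is an assumed
% risk-aversion coefficient.
R     = 0.0003 + 0.01*randn(250, 5);   % placeholder historical returns
gamma = 3;

mu    = mean(R)';                      % step 1: estimate the moments by MLE
Sigma = cov(R, 1);                     % (the flag 1 normalizes by T, i.e. MLE)

w = (1/gamma) * (Sigma \ mu);          % step 2: plug into w* = (1/gamma)*Sigma^{-1}*mu
                                       % (unconstrained solution, as in the formula above)
```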

To address the puzzle, a wide array of different techniques has been developed over time. An intuitive starting point is the minimum variance portfolio, which avoids estimating the sample mean altogether; this can then be combined with short-sale constraints on each asset as in Jagannathan and Ma (2003) or with a limit on the total short sales in the portfolio as in DeMiguel et al. (2009a). Another problem is the sensitivity of the final results to the input parameters (Ceria and Stubbs, 2006), which generates uncertainty. This uncertainty, however, can be incorporated into the optimization process: one defines an uncertainty set and then solves a modified optimization problem such that the allocation found is optimal even for the worst possible value of the expected returns in the set. This means assets with less accurate mean return estimates will be penalized and hence receive a lower weight in the final allocation. This procedure, called robust optimization (Fabozzi, 2007), also entails a more efficient use of turnover.

Yet many of these techniques have been claimed not to be able to consistently outperform the naïve 1/N strategy, casting serious doubts on the usefulness of the theory. The extensive work by DeMiguel et al. (2009b) shows how 1/N is significantly outperformed only for unrealistically large sample sizes. However, it is also worth noticing that the datasets used have been criticized by Kirby and Ostdiek (2012) for magnifying estimation risk and thus making mean variance less effective. More recent solutions include the combination of different portfolios, operated for instance by Tu and Zhou (2011), who combine 1/N with four different portfolios from the literature, exploiting individual preferences for mixed solutions and the bias-variance trade-off. Jiang, Du and An (2018) instead combine 1/N with the minimum variance portfolio. Both these ideas can be included in the broader field of shrinkage estimators. An example is provided by Han (2018), who generalizes and outperforms the model of Tu and Zhou (2011) by shrinking their portfolio toward 1/N for a turnover-averse agent, hence penalizing deviations from the target portfolio. Finally, the application of neural networks and machine learning (for instance in Fu et al., 2018) has enlarged the set of available methodologies even further, but this has not prevented the discussion on optimal allocation from continuing.

However, not all the literature focuses on risk-adjusted measures. Many relevant authors over time have proposed to shift the attention to the long-run growth of capital, a much more practical goal which is pursued most of the time by fund managers and private investors. Building on the Kelly (1956) criterion, the related log-optimal portfolio has emerged as the strategy achieving the maximum growth of the capital invested over time (Breiman, 1960, 1961). This portfolio requires the probability distribution of returns to be known, which is not the case in reality. The fundamental contribution was then the introduction of the concept of universal portfolios (Cover, 1991), that is, strategies able to achieve asymptotically the same rate of growth as the best portfolio in a certain class without knowing the return distribution. But the methodology was quite complex; indeed, later authors managed to propose simpler methods (Györfi and Schäfer, 2003) as well as to prove the existence of universal portfolios in larger classes (Algoet, 1992). Still, the issue is that the results proven by the previous literature are asymptotic, and this is of little help for successful practical application. Section 2 provides a detailed summary of the contributions on log-optimal portfolios over time.


Focusing on this literature, with a practice-oriented approach, this work will present a strategy by Györfi, Lugosi and Udina (2006) which adds to log-optimal portfolios ideas from kernel density estimation and machine learning, in order to achieve not only the asymptotic results of universal portfolios but also a good performance over the finite horizon, where previous models have often failed. The idea is to come up with a sequential investment strategy able to estimate the optimal portfolio weights for the available assets at each trading day and then to update them as new data on the returns become available. The authors try to achieve a very general result by avoiding imposing many assumptions on the market returns; the only restrictions are stationarity and ergodicity of the process. The strategy is thus non-parametric and data-driven.

Grounded on previous theoretical results, this work, after the necessary description of the strategy, will not focus much on theorems but rather on providing an empirical application of the methodology, to assess whether the strong results claimed by the authors are replicable with different datasets. Providing a new application of a strategy which has received little attention in the later literature is the main novelty of this study. Moreover, it will also compare the outcomes with previous results, provide some insights on the performance and point out some possible fields for further research. Specific attention will be devoted to the use of machine learning, due to its current popularity.

To achieve these goals, the choice of data and methodologies has been particularly relevant. Daily CRSP data on stock returns for US equity markets, spanning from 1989 to the end of 2018, have been downloaded and used to build two main datasets: one with the biggest companies by capitalization in the last decade and one with those with the highest volume, that is, with the most liquid stocks. Even if a few companies appear in both, there are enough differences to make the distinction meaningful. Finally, a few more results have been obtained from sectoral datasets in order to see whether the algorithm is able to perform better for certain kinds of firms. All the datasets have been built from raw data using MATLAB and are not already present in the literature. The study, whose methodology will be fully clarified in section 3, is made more robust by the comparison of the out-of-sample performance with a set of reference portfolios, including the naïve equal-weight portfolios (both buy&hold and constantly rebalanced), minimum variance, mean variance and the best constantly rebalanced portfolio in hindsight. The performance measures considered, in line with the aim of the strategy, will be the capital achieved at the end of the period and the annualized average yield (AAY), to allow comparisons across different lengths of the investment horizon. The computational time required to run the algorithm will be taken into account as well.

This work aims to contribute to the vast empirical literature on universal and log-optimal portfolios by providing results from a practical test on the newest data available. Previous experiments have focused on quite old data, so this is a way to extend them and check whether similar results can be achieved with rather different data.

Moreover, the use of different combinations of the various parameters involved in the strategy (the number of assets and the lengths of the estimation and evaluation windows, among others) allows assessing whether there exist direct relations with the results that can be exploited to maximize the performance. At the same time, the different data used, especially with the presence of sectoral datasets, can help to find the most suitable setting for the strategy. These aspects have received less attention in the original paper, where there were only two main large datasets from previous literature with a fixed number of assets. The comparison is made with different portfolios as well.

The literature on the fields related to this strategy will also receive a contribution, showing how kernel density estimation and machine learning can be combined in portfolio optimization and generate good results even over a very short time horizon, when fewer data are available. Many other fields can be explored using the same methodology, also unrelated to financial economics; some examples will be provided at the end of section 5.

The main finding of the work is the ability of the kernel-based strategy to perform well even over the finite horizon, and specifically over the very short period of a few years. The annual performance then improves as the starting date is set earlier in time; that is, if one starts to invest earlier with this algorithm, not only is the final capital higher in absolute value but the growth is also faster on a yearly basis.

It is also worth reporting that the results shown are much lower than in previous studies. However, this might be due to a series of differences which are described in section 4. The main similarity is in the short-term results, even over a few years only, which was the main goal of carrying out this study. The most important difference instead arises when the evaluation period becomes very large: while Györfi et al. (2006) observe a sudden boost in performance at around 4000 trading days, which then delivers exponential growth, this study displays only a steady growth.

The sectoral datasets do not provide significant results: the performance is good in only one sector and still generally lower than in the larger datasets. This suggests restricting attention to a specific sector only when expecting it to perform well in the immediate future, and accepting a higher chance of being outperformed even by naïve strategies.

Finally, the portfolio is able to easily outperform the other strategies proposed, often doubling their performance. It fails to be the best only in a couple of cases against mean variance, when the estimation window is very long. As the estimation window increases and the starting date is set further in the past, the strategy approaches the performance of the BCRP in hindsight, which is a remarkable result. Looking at the trend, by setting the starting date even slightly earlier in the past the BCRP would probably be outperformed.

The main conclusion is then that this methodology is able to provide a relevant annual capital growth also over the finite horizon and the short term, even if less markedly than in previous experiments. In addition, the comparison with the original paper allows understanding the main weak points of this work as well as the main improvements from which the methodology can still benefit, as will be explained in later sections.

The next section provides the theoretical framework before introducing the algorithm in detail, including a specific focus on the link with machine learning. Section 3 follows with the description of the practical application of the strategy, the datasets adopted and the methodology used. The analysis and discussion of results then take place in section 4. Section 5 presents some issues on which further research may focus, as well as different uses of this methodology, and finally section 6 concludes.


SECTION 2 – THEORETICAL FRAMEWORK

2.1 BACKGROUND AND SET UP

The works by popular scholars such as Markowitz (1952), Tobin (1958), Sharpe (1964), and Lintner (1975) made the maximization of return for a given level of risk (possibly via expected utility) the mainstream goal in portfolio allocation. Nevertheless, for an investor, be it a professional fund manager or a private portfolio owner, Sharpe ratio maximization might be less relevant. Instead, maximum growth of capital over time should be, even intuitively, an operational criterion for portfolio selection (Latane, 1959). Thus investors should seek a growth-optimal portfolio, that is, the portfolio with the highest expected rate of increase in value. Breiman (1960) showed this is achieved by maximizing the expected logarithm of wealth (or, equivalently, maximizing the geometric-mean return). This implies an expected logarithmic utility (Hakansson, 1971), which in addition offers some other useful properties in the models, such as decreasing marginal utility of wealth, decreasing absolute risk aversion and constant relative risk aversion. Indeed Rubinstein (1977), among others, claimed that logarithmic utility functions should describe the preferences of all rational investors. While mean variance may be optimal from an expected utility viewpoint, it may on the other hand lead to ultimate ruin in an infinite sequence of reinvestments. The growth-optimal portfolios instead maximize the probability of exceeding a given level of wealth within a fixed time.

The log-optimum portfolio stems from the Kelly (1956) criterion, a betting strategy prescribing to invest, when odds are fair, a fraction of total wealth equal to the probability of winning in a series of repeated bets, in order to maximize the growth of capital over time. Although log utility of wealth dates back to Bernoulli in 1738, it was Breiman (1960, 1961) who rigorously proved the equivalence between the Kelly criterion and the maximization of the one-period (hence myopic) expected logarithm of wealth. However, the Kelly criterion also has some drawbacks: in particular, one needs to know in advance the distribution of the outcomes, which exposes the investor to estimation risk. Much effort has been dedicated to the log-optimum portfolio; the main framework is presented below, since it works as a base for the main methodology discussed in this work.

Given $d$ available assets, let $x^{(j)}$ denote the return of asset $j$, defined as the factor by which capital invested in the $j$-th asset grows during the trading period (hence it is nonnegative), and let $b^{(j)}$ be the proportion of the investor's capital invested in asset $j$. A market vector is then denoted by $x = (x^{(1)},\dots,x^{(d)}) \in \mathbb{R}_+^d$, while a portfolio vector is $b = (b^{(1)},\dots,b^{(d)})$, which the investor is allowed to choose


at the beginning of each trading period. The components of $b$ are assumed to be nonnegative, that is $b^{(j)} \ge 0$ for all $j$, which means short selling and buying stocks on margin are not permitted. Moreover, the assumption $\sum_{j=1}^{d} b^{(j)} = 1$ holds, meaning that the strategy is self-financing at each period, without consumption of capital. These two features can be summed up by saying that $b$ belongs to $\Delta_d$, the simplex of all portfolio vectors in $\mathbb{R}_+^d$.

Denoting the investor's initial capital by $S_0$ and the inner product by $\langle \cdot, \cdot \rangle$, at the end of the trading period the investor's wealth becomes $S_1 = S_0 \sum_{j=1}^{d} b^{(j)} x^{(j)} = S_0 \langle b, x \rangle$. For any trading period $i$, a vector $x_i$ is generated, so that the component $x_i^{(j)}$ is the amount obtained after investing a unit capital in the $j$-th asset on the $i$-th trading period. For $i \le n$, $(x_i, \dots, x_n)$ is the array of market vectors for the trading periods from $i$ to $n$ and it is summarized by the notation $x_i^n$. An investment strategy $B$ is then a sequence of functions

$$b_i : \left(\mathbb{R}_+^d\right)^{i-1} \to \Delta_d, \qquad i = 1, 2, \dots$$

so that $b_i(x_1^{i-1})$ denotes the portfolio vector chosen by the investor on the $i$-th trading period, upon observing the behavior of the market up to the previous period $i-1$. For ease of notation, $b(x_1^{i-1}) = b_i(x_1^{i-1})$ will be used.

Starting with an initial wealth S0, after n trading periods, the sequential investment strategy B achieves the wealth:

$$S_n = S_0 \prod_{i=1}^{n} \langle b(x_1^{i-1}), x_i \rangle = S_0 \, e^{\sum_{i=1}^{n} \log \langle b(x_1^{i-1}), x_i \rangle} = S_0 \, e^{n W_n(B)},$$

where $W_n(B) = \frac{1}{n} \sum_{i=1}^{n} \log \langle b(x_1^{i-1}), x_i \rangle$ is the average growth rate of the initial capital $S_0$. Maximizing $S_n = S_n(B)$ is of course the same as maximizing $W_n(B)$.

Then the next step is to make assumptions on the behavior of the market. A first approach is that of Cover (1991), who avoids imposing any kind of model on the return-generating process, thus allowing arbitrary values for the market vector. The main limit of the log-optimum portfolio is that it requires a known probability distribution of returns to work, which is not the case in practical applications. Cover therefore introduced the idea of universal portfolios, that is, strategies which perform almost as well as a class of reference portfolios in hindsight without knowing the distribution. He considers the class of constantly rebalanced portfolios (CRP). In a nutshell, the performance of the best constantly rebalanced portfolio (BCRP) in hindsight is of course not achievable ex ante; however, the log-optimal universal portfolio is the best performing strategy and has the appealing property of growing asymptotically less than the BCRP in hindsight, but with a loss which is less than linear.

Cover suggests investing in each asset $j$ a share of capital equal to the weighted mean of the shares invested by all possible constantly rebalanced portfolios, where the weights are determined by each portfolio's past performance. The reason is that under suitable conditions the average of exponentials has the same rate of growth as the maximum, hence one achieves almost as much wealth as the BCRP.

Even if theoretically important, this methodology has several practical limitations: it gives an asymptotic result, so it requires a lot of data, and it is exponential in the number of assets, that is, its computational complexity rises quickly, making it unsuitable for large portfolio managers. Györfi et al. (2006) also point out its inadequacy when there are dependencies among market vectors of different periods, as seems to happen in reality.

To overcome some of these issues, a second approach has been proposed which describes the market according to a statistical model, so as to take into account the dependencies of market vectors over time. The approach is still non-parametric, since no specific distribution is assumed for the market vector (even if it is the realization of a random process) or for the time dependencies. The only assumption is to take $x_1, x_2, \dots$ as realizations of the random vectors $X_1, X_2, \dots$ drawn from a vector-valued stationary and ergodic process $\{X_n\}_{-\infty}^{\infty}$. Under these conditions Algoet (1992, 1994) and Algoet and Cover (1988) established the fundamental limits, often cited in the later literature, proving that the log-optimum portfolio $B^* = \{b^*(\cdot)\}$ is the best possible choice. This is due to the fact that under stationarity and ergodicity the asymptotic rate of growth has a well-defined maximum, which can be achieved with full knowledge of the distribution of the whole process. On trading period $n$ let $b^*(\cdot)$ be such that:

$$E\{\log \langle b^*(X_1^{n-1}), X_n \rangle \mid X_1^{n-1}\} = \max_{b(\cdot)} E\{\log \langle b(X_1^{n-1}), X_n \rangle \mid X_1^{n-1}\}$$

If $S_n^* = S_n(B^*)$ denotes the capital achieved by a log-optimum portfolio strategy $B^*$ after $n$ trading periods, then for any other investment strategy $B$ with capital $S_n = S_n(B)$ and for any stationary and ergodic process $\{X_n\}_{-\infty}^{\infty}$,

$$\limsup_{n\to\infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \le 0 \quad \text{almost surely}$$

and

$$\lim_{n\to\infty} \frac{1}{n} \log S_n^* = W^* \quad \text{almost surely,}$$

where

$$W^* = E\left\{ \max_{b(\cdot)} E\{\log \langle b(X_{-\infty}^{-1}), X_0 \rangle \mid X_{-\infty}^{-1}\} \right\}$$

is the maximal possible growth rate of any investment strategy.

Therefore, almost surely no investment strategy can have a faster rate of growth than a log-optimum portfolio. As said before, determining a log-optimum portfolio requires full knowledge of the distribution of the process. Here the idea of universal portfolios comes to help again: strategies achieving the same rate of growth without knowing the distribution. A strategy achieving capital $S_n$ is universal if:

$$\lim_{n\to\infty} \frac{1}{n} \log S_n = W^* \quad \text{almost surely.}$$

Even within the class of stationary and ergodic processes the existence of a universal portfolio has been proven, thanks to the fundamental contribution by Algoet (1992). His results are theoretically very useful but quite complex. Györfi and Schäfer (2003) managed to prove the universality of a simpler strategy. Various strategies have been developed afterwards, for instance histogram-based ones (Györfi and Schäfer, 2003), or those using kernel (Györfi et al., 2006) or nearest-neighbor estimation (Györfi et al., 2008).

However, there is a main problem left: the works by Cover and Algoet, despite their theoretical soundness, achieve important results only asymptotically, requiring extremely large sample sizes. The strategy by Györfi et al. (2006) analyzed in this work, adding kernel density estimation and machine learning to the empirical log-optimum portfolio models, claims to offer not only the same important asymptotic results but also a very good finite-horizon performance in practice, and this is what will be tested hereafter.

As a final remark, kernel density estimation (KDE) is a non-parametric technique to estimate a density when a sample is available but the probability distribution is unknown. Being non-parametric, no rigid assumption is necessary: it is completely data-driven. KDE is in some sense a generalization of histogram density estimation, offering an outcome which is perhaps even easier to interpret. To get the intuition in the univariate case, first define a kernel function $K(x)$, which by definition integrates to one, and a certain bandwidth $h$; then a kernel is placed at each data point $x_i$ of the sample. The estimated value of the density at any point $x$ on the axis is determined by the sum of the bumps given by the values of the kernels of all the $x_i$ that are closer to $x$ than the bandwidth, that is, all the data points whose kernel includes $x$ contribute to the density at $x$. The result is then normalized, dividing by the bandwidth and the number of data points, so that the final area remains equal to 1, as required for a density. It turns out that the choice of the bandwidth is fundamental, but this is outside the purposes of this work. Mathematically:

$$\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$
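As a small illustration of the formula above, the sketch below evaluates a univariate KDE on a grid; the sample, the Gaussian kernel and the bandwidth are arbitrary placeholder choices, not those used in the thesis.

```matlab
% Univariate kernel density estimation (illustrative sketch).
xi = randn(1, 200);                      % hypothetical sample of n data points
h  = 0.3;                                % bandwidth (its choice is crucial)
K  = @(u) exp(-0.5*u.^2) / sqrt(2*pi);   % Gaussian kernel, integrates to one

x    = linspace(-4, 4, 201);             % points where the density is estimated
fhat = zeros(size(x));
for j = 1:numel(x)
    % sum of the "bumps" centred at each data point, normalized by n*h
    fhat(j) = sum(K((x(j) - xi) / h)) / (numel(xi) * h);
end
plot(x, fhat);                           % estimated density, area approximately 1
```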

A comprehensive explanation of this topic can be found in Silverman (1986). Machine learning (ML) instead will be considered separately in section 2.3.


2.2 STRATEGY DESCRIPTION

Previous ideas constitute the background to which Györfi et al. (2006) added KDE and ML to propose the methodology analyzed in this work. Its formal description follows.

The main purpose is to create a sequential investment strategy as general as possible, that is, a way to distribute the current capital among the available assets at the beginning of each period of time without knowing the underlying distribution generating the stock returns. To do so, the strategy is allowed to use the only information that does not need estimation: data collected from the past market returns. As stated before, short selling and leverage are not allowed, hence the weights will all be greater than or equal to zero and will sum to 1. The only assumptions made on the market vectors are stationarity and ergodicity. The market vectors are described by a statistical model, but this model can take any distribution: it is a completely non-parametric view of the issue. Moreover, some practical assumptions are necessary to simplify and make the analysis feasible: the assets are arbitrarily divisible and available in any quantity at the current price at any given trading period, there are no transaction costs and, finally, the behavior of the market is not affected by the actions of the investor using this strategy.

The strategy can be described as an algorithm divided in two phases. First a set of reference predictors, also called experts, is computed, and then their weights are determined to finally allow the combination. The key point of the first phase follows from the goal of finding the hidden dependencies by using kernel density estimation. Define an infinite array of experts indicated by $B^{(k,l)} = \{b^{(k,l)}(\cdot)\}$, where the indexes $k, l$ are positive integers. For fixed positive integers $k, l$, choose a radius $r_{k,l} > 0$ such that, for any fixed $k$:

$$\lim_{l \to \infty} r_{k,l} = 0.$$

Then define a kernel function $k_k : \mathbb{R}_+^{kd} \to \mathbb{R}_+$, which for simplicity is taken uniform by the authors:

$$k_k(x) = I_{\{\|x\| \le c\}}, \qquad x \in \mathbb{R}_+^{kd}.$$

In this way, all the $x_i$ falling within the bandwidth $c$ of the kernel centered at $x$ receive the same weight. This translates into the concept of a multidimensional radius defined through norms. For instance, Györfi et al. (2006) use $r = c/l$, where $c$ is a constant. Finally, each expert is determined as:

$$b^{(k,l)}(x_1^{n-1}) = \arg\max_{b \in \Delta_d} \sum_{\{k < i < n \, : \, \|x_{i-k}^{i-1} - x_{n-k}^{n-1}\| \le r_{k,l}\}} \log \langle b, x_i \rangle,$$

unless no match is found, in which case the portfolio for that day is simply the uniform one, $b = \left(\tfrac{1}{d}, \dots, \tfrac{1}{d}\right)$.


To get the intuition, the idea is to search along the series for sets of trading days $(x_{i-k}^{i-1})$ in which the asset returns have been "quite similar" to those of the last $k$ days prior to $n$, i.e. $(x_{n-k}^{n-1})$. As a similarity measure the radius is used, whose size depends on $l$: if the Euclidean norm of the difference between a generic part of the series and the last part is smaller than $r_{k,l}$, a matching part of the series has been found. For each matching part, the next-day returns, labelled by the index $i$, are recorded. An expert is then defined as the log-optimal portfolio computed using only the $i$-labelled days along the series. Since all the $i$ are in the past, those returns are available and the log-optimal portfolio is computable. The result is the expert for that given $(k,l)$ pair. Therefore each pair $(k,l)$, acting on the length of the window and on the width of the radius, generates one expert of the family $B^{(k,l)} = \{b^{(k,l)}(\cdot)\}$; with $k$ and $l$ ranging over finite grids of sizes $K$ and $L$, the family contains $K \cdot L$ experts. It is important that the radius tends to 0 as $l$ rises: if for any $k$ the radius increased with $l$, there would be no point in searching for more experts, since the later matches would be too many and thus less relevant. Hence for larger $l$ the radius must become smaller and smaller, so as to spot only the really relevant matches.
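A minimal sketch of how a single expert could be computed for a given pair $(k,l)$ is shown below, assuming a hypothetical matrix X of past price relatives (gross returns, one row per day), the uniform kernel, and, for concreteness, the radius value adopted later in section 3.1; the variable names are illustrative and fmincon from the Optimization Toolbox is assumed to be available.

```matlab
% One expert b^(k,l) with the uniform kernel (illustrative sketch, not the
% thesis script). X is a hypothetical (n-1)-by-d matrix of past gross returns.
X = 1 + 0.01*randn(500, 6);                % placeholder data
k = 3;  l = 2;  d = size(X, 2);
r = sqrt(0.0001 * d * k * l);              % radius r_{k,l} (value used in section 3.1)
n = size(X, 1) + 1;                        % the portfolio is chosen for day n
last = reshape(X(n-k:n-1, :)', 1, []);     % last k days, flattened

J = [];                                    % days following a matching window
for i = k+1 : n-1
    win = reshape(X(i-k:i-1, :)', 1, []);  % candidate k-day window ending at i-1
    if norm(win - last) <= r
        J(end+1) = i;                      %#ok<AGROW> record the day after the match
    end
end

if isempty(J)
    b = ones(d, 1) / d;                    % no match: fall back to the uniform portfolio
else
    XJ   = X(J, :);                        % next-day returns of the matched windows
    negW = @(b) -sum(log(max(XJ * b, 1e-12)));   % negative log-wealth over the matches
    opts = optimoptions('fmincon', 'Display', 'off');
    b = fmincon(negW, ones(d,1)/d, [], [], ones(1,d), 1, ...
                zeros(d,1), ones(d,1), [], opts);
end
```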

It would also be possible to generalize, using a broader class of kernel investment strategies still based on a sequence of kernel functions $k_k : \mathbb{R}_+^{kd} \to \mathbb{R}_+$. The resulting portfolio vector would be:

$$b^{(k,l)}(x_1^{n-1}) = \arg\max_{b \in \Delta_d} \prod_{k < i < n} \langle b, x_i \rangle^{\frac{w_i^{(k,l)}}{\sum_{k<j<n} w_j^{(k,l)}}},$$

where the kernel is no longer uniform: instead of being an indicator function, it assigns weights $w_i^{(k,l)}$ whose magnitude depends on the distance of $x_{i-k}^{i-1}$ from $x_{n-k}^{n-1}$, that is, $w_i^{(k,l)} = k_k\big(l\,(x_{i-k}^{i-1} - x_{n-k}^{n-1})\big)$, with $0/0$ taken as $0$.

Once the experts are defined, the second phase starts with the decision on which expert to trust or, alternatively, on an aggregation rule using them all to compute the final weights at each period. This issue is linked to the choice of the values for $k$ and $l$ as well. The solution is found using machine learning to combine all the experts depending on their past performance, following the idea of Cesa-Bianchi and Lugosi (2006) discussed in more detail in section 2.3.

First, define a probability distribution $\{q_{k,l}\}$ over the set of all pairs of positive integers $(k,l)$ such that each pair has a corresponding probability $q_{k,l} > 0$. Each expert $B^{(k,l)}$ receives a weight depending on its performance to date, as if it had been applied from the start:

$$w_{n,k,l} = q_{k,l} \, S_{n-1}\big(B^{(k,l)}\big),$$

$$v_{n,k,l} = \frac{w_{n,k,l}}{\sum_{i,j} w_{n,i,j}}.$$

Finally, the aggregation is simply a linear combination of all the weighted experts (this way all the experts are used at every period; one can see it as a sort of bet placed on each single expert depending on its past performance):

$$b_n(x_1^{n-1}) = \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} v_{n,k,l} \, b_n^{(k,l)}(x_1^{n-1}).$$

The collection of the vectors $b_n(x_1^{n-1})$ over the trading periods constitutes the final portfolio scheme $B_K$, which represents the final output of the algorithm.
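A minimal sketch of this aggregation step for a single day is given below; the expert portfolios and their past wealths are placeholder inputs, and the uniform prior of section 3.1 is assumed.

```matlab
% Aggregation of the experts into the day-n portfolio (illustrative sketch).
K = 5;  L = 5;  d = 8;
Bexp  = rand(d, K*L);  Bexp = Bexp ./ sum(Bexp, 1);  % placeholder expert portfolios (columns)
Sprev = ones(1, K*L);                                % placeholder past wealths S_{n-1}(b^(k,l))

q = ones(1, K*L) / (K*L);      % uniform prior q_{k,l}
w = q .* Sprev;                % w_{n,k,l} = q_{k,l} * S_{n-1}(b^(k,l))
v = w / sum(w);                % normalized weights v_{n,k,l}

b_n = Bexp * v';               % final portfolio: sum over (k,l) of v_{n,k,l} * b_n^(k,l)
```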

Györfi et al. (2006) proved the universality of this strategy $B_K$ with respect to the class of ergodic and stationary processes. The ultimate claim of this strategy, which represents the advance over previous sequential investment strategies, is to be able to find the hidden complex dependencies in the past data and effectively exploit them to produce a rapid growth of capital.



2.3 LINK WITH MACHINE LEARNING

Introduced in the 1950s, machine learning (ML) is a subfield of artificial intelligence (AI). As outlined in Thagard (1990), AI seeks to enable computers to do tasks that require intelligence when performed by people. Machine learning, in its broader definition, is the area of AI concerned with making computers increase their 'knowledge' and improve their performance. In other words, ML algorithms iteratively learn from data to improve and to predict outcomes. As the training data used by the algorithms increase, it is possible to produce more precise models based on those data. ML is grounded on statistical learning, widely presented in James et al. (2013), that is, a set of tools aimed at understanding and modelling data, in particular at estimating the function f representing the systematic information that a set of predictors X provides about an output Y. Often, as in stock series, such a relation lies in hidden dependencies which are very hard to spot with traditional estimation techniques (Györfi et al., 2006).

In portfolio allocation the task is to guess the next element of an unknown sequence given some knowledge about the past elements and possibly other available information. Since in this setting the distributions of the underlying price processes are unknown, one has to "learn" the optimal portfolio from past data, and effective empirical strategies can then be derived using methods from nonparametric statistical smoothing and machine learning. Among the wide array of ML methods and classifications, due to the nature of the data-generating process this task belongs to the online machine learning literature. These are methods used when data become available in a sequential order and the predictor for future data is therefore updated at each step, as opposed to techniques which generate the predictor using the whole training dataset at once (Cesa-Bianchi and Lugosi, 2006).

In particular, the main idea here is inherited from the literature on "prediction with expert advice", originally introduced as a model of online learning by De Santis et al. (1998). Basically, it refers to a prediction problem where an algorithm has to make sequential predictions, disposing each time of a set of reference forecasters, called experts, which it combines to make its own prediction. No assumption on the quality or independence of those experts' predictions is made, apart from relying on each expert's past performance; therefore the logical aim for the algorithm is to make sure that at each period of time it does not perform much worse than the best expert up to that time (Blum, 1998; Abernethy et al., 2007).

As Györfi et al. (2006) remark, the idea of linking their algorithm to machine learning comes up when a choice on the values of $k$ and $l$ has to be made. There are infinite combinations but a trade-off: with both $k$ and $l$ small the estimate will have a large bias, since few matches are obtained, while when they are both large there will be too many matches, which induces high variance. However, ML allows viewing the problem from a broader perspective: instead of directly choosing a certain pair $(k,l)$, many (in theory infinitely many) pairs can all be taken into account to generate the predicted value of the optimal weights for the next period. Each pair $(k,l)$ generates an expert, the reference forecaster; the algorithm then combines all the experts via exponential weighting, using the performance to date of each expert as a weight. The weights are then normalized and the final prediction comes from the weighted sum of the various experts. This key idea of combining experts follows straightforwardly from the prediction with expert advice described above. More technically, each pair $(k,l)$ generates an expert $b^{(k,l)}$ with a historical performance of $S_{n-1}(b^{(k,l)})$. The weights are the result of a two-stage procedure: first, by using exponential weighting, the technique with the best theoretical and practical properties according to Györfi et al. (2006), the past performance is considered together with a learning parameter $\eta$ and a probability distribution:

$$w_{k,l} = q_{k,l} \cdot e^{\eta \log\left(S_{n-1}(b^{(k,l)})\right)}$$

In practice the distribution is taken uniform and $\eta$ is set to 1, so the formula simplifies to $w_{k,l} = q_{k,l} \cdot S_{n-1}(b^{(k,l)})$, the same weighting already seen in section 2.2.

Then these values are normalized and the weights summing to 1 are obtained (the $v$'s). The experts with the highest past performance will receive higher weights, that is, they will have a larger "voting power", and the final prediction represents in some sense a "consensus" among all the different predictors (experts) (De Santis et al., 1998).

Györfi proposes a further clear interpretation of this mechanism. The performance $S_n(B)$ of a portfolio allocated with this algorithm can be seen, at each day $n$, as the result of sharing one dollar in the morning among the various experts according to their past performance, leaving them to trade, and then summing the money each one has obtained at the end of the day:

$$S_n(B) = \sum_{k,l} q_{k,l} \, S_n\big(b^{(k,l)}\big)$$

This equation is easily derivable from the others presented in section 2.2 and this completes the framework of related literature to introduce the methodology. In the next section there will be an assessment with real data to check whether it keeps its promises even in practical application where other strategies have been affected by several issues.


SECTION 3 – IMPLEMENTATION

3.1 TECHNICAL ISSUES

Prior to the empirical test of the kernel-based strategy, some remarks have to be made on its feasibility, together with the adjustments needed to move from a theoretical model to a sound practical implementation.

In machine learning, as well as in information theory, the computational cost of prediction is also of interest, and in this sense the algorithm can turn out to be quite heavy. In practical applications this translates into a longer time needed to process the information and produce an output, as well as a relevant energy consumption and the need for good-quality hardware. The complexity arises from the fact that many computations are required: in particular, every single trading day requires solving an optimization for each expert, that is, $K \cdot L$ times. This complexity rises quickly as the estimation and evaluation windows increase, the number of assets rises and more experts are used.

To reduce this complexity, the practical application of the algorithm will be performed on a less computationally intense version. The idea is to use the semi-log-optimal strategy proposed by Vajda (2006) and then analyzed in Györfi, Urbán and Vajda (2007). The intuition is to replace the logarithmic objective function (and the resulting convex problem) with its Taylor approximation, which happens to yield a quadratic programming task. This requires additional knowledge of conditional first and second moments. Györfi et al. (2007) showed that, since the second-order Taylor expansion of $\log(z)$ at $z=1$ can be written as

$$h(z) = (z - 1) - \frac{1}{2}(z - 1)^2,$$

then the semi-log-optimal strategy can be defined as

$$b(X_1^{n-1}) = \arg\max_{b(\cdot)} E\{h(\langle b(X_1^{n-1}), X_n \rangle) \mid X_1^{n-1}\}.$$

Vajda (2006) proved that the semi-log strategy has an almost optimal growth rate of capital under the assumption of stationary and ergodic markets, compared to the kernel-based log-optimal strategy of Györfi et al. (2006). That is why it is possible to integrate the two methodologies, as done in Györfi et al. (2007). Basically, provided that the $i$-matches are determined in the same way as before, the difference is in the maximization, using $h(z)$ instead of $\log(z)$:

$$b^{(k,l)}(x_1^{n-1}) = \arg\max_{b \in \Delta_d} \sum_{\{k < i < n \, : \, \|x_{i-k}^{i-1} - x_{n-k}^{n-1}\| \le r_{k,l}\}} h(\langle b, x_i \rangle).$$

If the set of all $i$-matches is called $J_n$ and the function $h$ is the second-order Taylor expansion shown above, then one needs to determine the vector $b$ which maximizes:

$$\sum_{i \in J_n} h(\langle b, x_i \rangle) = \sum_{i \in J_n} (\langle b, x_i \rangle - 1) - \frac{1}{2} \sum_{i \in J_n} (\langle b, x_i \rangle - 1)^2 = \langle b, m \rangle - \langle b, C b \rangle,$$

with $m = \sum_{i \in J_n} (x_i - \mathbf{1})$ and $C = \frac{1}{2} \sum_{i \in J_n} (x_i - \mathbf{1})(x_i - \mathbf{1})^T$, where $\mathbf{1}$ denotes a vector of ones. The normal procedure has a computational complexity at each step proportional to the number of matches, that is, the number of elements in $J_n$. Instead, if $m$ and $C$ are computed in advance, with the semi-log strategy the complexity no longer depends on the matches and the result is computed much faster; the main result is proven by Theorem 4.4 in Györfi et al. (2007), which derives it from Theorem 3.1 in Vajda (2006).

As Györfi outlines in a lecture on the topic (http://videolectures.net/mlss07_gyorfi_mlaf/#), this approach is not so far from Markowitz mean variance, but it is more direct and has fewer estimation problems. Some similarities between mean variance and the semi-log-optimal portfolio are analyzed in Ottucsak and Vajda (2007). In short, the vector $m$ and the matrix $C$ are computed in a way similar to the mean and covariance matrix, but for the two approaches to coincide the risk aversion $\lambda$ needs to take a specific value which depends on the unknown distribution and changes over time in a multiperiod setting such as this one; as a result, the estimation via the equivalent Markowitz version is much more problematic and less accurate. The semi-log approach is more direct and works better, since $\log(z)$ is very well approximated by $h(z)$, which has a well-defined coefficient of $\frac{1}{2}$ pre-multiplying $C$.
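Under the assumptions above, the expert maximization reduces to a quadratic program; the sketch below shows one such step for a hypothetical matrix XJ of matched next-day gross returns (quadprog from the Optimization Toolbox is assumed; names and data are illustrative).

```matlab
% Semi-log expert step: quadratic programming instead of the log maximization
% (illustrative sketch). XJ is a hypothetical J-by-d matrix of the gross
% returns recorded on the days following each match.
XJ = 1 + 0.01*randn(40, 6);          % placeholder matched returns
d  = size(XJ, 2);

Z = XJ - 1;                          % x_i - 1 (1 acting as a vector of ones)
m = sum(Z, 1)';                      % m = sum_i (x_i - 1)
C = 0.5 * (Z' * Z);                  % C = 1/2 * sum_i (x_i - 1)(x_i - 1)'

% maximize <b,m> - <b,Cb>  <=>  minimize 0.5*b'*(2C)*b - m'*b on the simplex
opts = optimoptions('quadprog', 'Display', 'off');
b = quadprog(2*C, -m, [], [], ones(1,d), 1, zeros(d,1), ones(d,1), [], opts);
```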

Another practical issue arising in the implementation is the number of experts to use, since they are infinite in theory but this is not feasible in practice. The choice partially follows that of the authors, using both $k$ and $l$ as integers from 1 to 5 for a total of 25 possible $(k,l)$ combinations at each period, which translates into 25 experts each time. In some works even $5 \times 10$ combinations are proposed, but in this specific case adding 25 extra experts does not improve the performance while the running time rises considerably, hence the choice of not using more than 25 experts.

The limited number of experts slightly loosens the constraint on the choice of the radius. The same radius as in previous works has been used for comparability purposes, that is:



$$r_{k,l}^2 = 0.0001 \cdot d \cdot k \cdot l.$$

Finally, an assumption has to be made on the distribution of the probabilities $q_{k,l}$ that are employed, together with past performances, to determine the weights assigned to each expert in the final aggregation. The starting point is quite a simple one, that is, a uniform probability distribution, the same choice made by the authors:

$$q_{k,l} = \frac{1}{K \cdot L},$$

where $K \cdot L$ is the size of the finite array including all the available pairs $(k,l)$, hence the number of experts used.

The implementation is then performed on the databases described in section 3.2, according to the methodology presented in section 3.3. The algorithm has been written in MATLAB, with a script made of two parts: first the setup of the various parameters (number of assets, estimation window length, number of experts, starting date) and then a three-layer loop to determine the portfolio weights at each period. The script can be seen in the appendix.
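The sketch below mirrors that two-part structure (parameter setup followed by a three-layer loop over days, k and l); it is only a structural skeleton with placeholder data and a placeholder expert step, not the script reported in the appendix.

```matlab
% Structural skeleton of the two-part script (illustrative, not the thesis code).
X = 1 + 0.01*randn(600, 6);       % placeholder matrix of daily gross returns

% --- part 1: parameter setup ----------------------------------------------
d    = size(X, 2);                % number of assets
K    = 5;  L = 5;                 % expert grid, 25 experts in total
est  = 200;                       % estimation window length (days of history used)
t0   = 400;                       % starting date (index of the first traded day)
T    = size(X, 1);                % end of the evaluation period
q    = 1 / (K*L);                 % uniform prior over the experts
Sexp = ones(K, L);                % running wealth of each expert
S    = 1;                         % running wealth of the aggregated strategy Bk

% --- part 2: three-layer loop (days x k x l) -------------------------------
for n = t0 : T-1
    Bexp = zeros(d, K, L);
    for k = 1:K
        for l = 1:L
            % placeholder expert: here the kernel matching on X(n-est+1:n, :)
            % and the (semi-)log-optimal maximization of sections 2.2/3.1 would go
            Bexp(:, k, l) = ones(d, 1) / d;
        end
    end
    w   = q * Sexp;                       % weights w_{n,k,l}
    v   = w / sum(w(:));                  % normalized weights v_{n,k,l}
    b_n = reshape(Bexp, d, []) * v(:);    % aggregated portfolio for day n+1

    r_next = X(n+1, :)';                  % realized gross returns of day n+1
    S      = S * (b_n' * r_next);         % update the strategy wealth
    for k = 1:K
        for l = 1:L
            Sexp(k, l) = Sexp(k, l) * (Bexp(:, k, l)' * r_next);
        end
    end
end
```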


3.2 DATA DESCRIPTION

The test is performed using quite different data from the original paper. The reason is twofold. First, Györfi et al. (2006) use rather old data: a portfolio of 36 NYSE stocks spanning 22 years up to 1985. It is more interesting to analyze and compare their results with very recent data. Second, this helps to expand the literature on the topic, since there are not many similar works, as highlighted in the introduction.

The data analyzed are from the Center for Research in Security Prices (CRSP), accessed thanks to the Sant'Anna School, and they are the most recent possible, since a 30-year span from 1989 until the end of 2018 is covered. Not all the data are fully used, but it was necessary to dispose of series long enough to allow different starting points and estimation window lengths. Two main datasets have been created with various sorting operations in MATLAB. Both include daily data on stock returns where dividends are immediately reinvested. Returns are computed by dividing the adjusted closing prices of two subsequent trading days, which is why they are also called relative prices in the literature; since the unit has been subtracted from this ratio, a value above or below 0 indicates an increase or a decrease. The scope of interest is North American companies listed on NYSE, NASDAQ and NYSE MKT (previously known as AMEX).

The former dataset has been created by sorting the stocks based on capitalization, first focusing on the top 30 stocks by average capitalization over the last 10 years and then shrinking the number of assets to 22 due to data availability. This dataset, named "CAPdata", has been chosen since it contains some of the largest companies in the world, hence trading is quite accessible, so that the strategies can respect some of the practical assumptions made in section 2.2 and therefore generate reliable results. The latter, named "VOLdata", contains assets selected with a similar criterion, this time sorting by the largest average volume over the last 10 years. High volume means the stocks are very liquid, so it is feasible to buy and sell even large amounts every day. As one can intuitively guess, some of the companies appear in both datasets, but there are enough differences to make the distinction meaningful. A total of 7559 trading days for each asset is included in each dataset. However, not all the trading days and assets will be used in each single experiment; the numbers will be specified each time.

In addition to this, some sectoral datasets have been created using the NAICS codes, a North American system to classify companies by sector by assigning each one a six-digit code according to a specific library. An introduction to this classification system is provided by Boettcher (1999). This sector-based approach is another novelty of this work and is aimed at seeing whether there is any specific field in which the algorithm is particularly effective in finding and profiting from the dependencies among stocks. In particular, the sectors investigated, for which a specific dataset has been created, are listed below together with the first two of the six digits identifying the sector according to the NAICS classification:

- Utilities '22..';
- Finance and insurance '52..';
- Real estate '53..'.

To keep things simple, only a small number of companies has been selected: after sorting by capitalization, only the biggest have been considered, excluding those for which the whole 30-year series was not available.

The following table contains a summary of all the datasets presented together with their features.

Dataset   Source   N    Time Period              Frequency   Criteria
CAPdata   CRSP     22   1989/01/03-2018/12/31    Daily       Capitalization
VOLdata   CRSP     22   1989/01/03-2018/12/31    Daily       Volume
Data22    CRSP     22   1989/01/03-2018/12/31    Daily       Utilities
Data52    CRSP     14   1989/01/03-2018/12/31    Daily       Finance and Insurance
Data53    CRSP      7   1989/01/03-2018/12/31    Daily       Real Estate


3.3 METHODOLOGY

The empirical test of the algorithm is made on past data, that is, taking into account part of the 7559 trading days available. After determining a certain starting point, the following days until the end of 2018 constitute the holdout sample, which also works as the evaluation period. It is used to assess the out-of-sample performance of the strategy by updating the data with the realized values each time an estimation is made, to be used for the next period. The selected method is the "rolling estimation window", a widely used technique in the literature. It consists of keeping an estimation period of fixed length, adding the latest trading day as its data become available and dropping at the same time the oldest observation. To get the bigger picture, defining the starting date as t0 and an estimation window of t days, data in the interval [t0 - t, t0] are used to estimate the portfolio at t0 + 1. After that, the realized returns at t0 + 1 from the holdout sample are included and the data of the oldest day t0 - t are dropped, so that it is possible to estimate the portfolio at t0 + 2, and so on until T, which is always the end of 2018. A total of T - t0 estimation days has to be covered; this is the length of the evaluation period. The time frame is thus split as follows: the estimation window of length t covers [t0 - t, t0], and the evaluation period runs from the starting date t0 to T (2018/12/31).

To make the conclusions more robust, different combinations will be investigated: the number of assets d, the starting date t0 and the length of the estimation window t will vary, to check how these factors affect the results. The number of experts will instead be kept fixed at 25 in the next section, since raising the number of experts did not provide any relevant improvement or change in previous attempts, apart from requiring an almost doubled computation time, hence it will not be reported. The semi-log-optimal kernel portfolio proposed is compared in terms of performance with a sample of other strategies from the literature. This makes the study more comprehensive and allows a better assessment of the performance of the algorithm, since it provides some useful benchmarks. To be consistent, short selling is never allowed and neither is leverage, hence for any strategy at any time $b_i \in \Delta_d$, using the previous notation. The first benchmark portfolio is the out-of-sample mean-variance model (MV), originally from Markowitz (1952), linked to a utility function of an agent whose preferences are fully expressed by mean and variance. The problem, at each period of time, is the following:


$$\max_{b \in \Delta_d} \; b'\mu - \frac{\gamma}{2}\, b'\Sigma b,$$

where the true parameters $(\mu, \Sigma)$ of the return moments are replaced by the sample mean and sample covariance. The risk aversion $\gamma$ will be set to one following Bottazzi and Santi (2017), since this makes the approach much closer to the semi-log kernel strategy, as shown by Ottucsak and Vajda (2007).

A second strategy considered is the minimum variance portfolio (MINV) introduced to overcome the huge estimation problems given by the mean returns, an issue clarified in the introduction. The setting is the following:

$$\min_{b \in \Delta_d} \; b'\Sigma b,$$

where the notation is the same as before and the goal is to minimize the variance ignoring mean returns which should lead to a performance improvement as claimed by Jagannathan and Ma (2003).
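A minimal sketch of how these two benchmarks could be computed on a single estimation window is given below; the data are placeholders, gamma is set to 1 as in the text, and quadprog from the Optimization Toolbox is assumed.

```matlab
% MV and MINV benchmark portfolios on one estimation window (illustrative sketch).
Rw    = 0.0003 + 0.01*randn(400, 8);   % placeholder t-by-d window of net returns
d     = size(Rw, 2);
mu    = mean(Rw)';
Sigma = cov(Rw);
gamma = 1;                             % risk aversion, as in the text
opts  = optimoptions('quadprog', 'Display', 'off');

% MV:   max b'mu - (gamma/2) b'Sigma b   <=>  min 0.5*b'*(gamma*Sigma)*b - mu'*b
b_mv   = quadprog(gamma*Sigma, -mu, [], [], ones(1,d), 1, zeros(d,1), ones(d,1), [], opts);

% MINV: min b'Sigma b, ignoring the mean returns altogether
b_minv = quadprog(2*Sigma, zeros(d,1), [], [], ones(1,d), 1, zeros(d,1), ones(d,1), [], opts);
```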

Then a couple of naïve strategies are used, namely two versions of the equal-weight portfolio. The former is a buy-and-hold strategy (EWbh), which simply consists of buying the same share of each available asset in the first period and then keeping the portfolio unchanged until the end, with the advantage of generating the least turnover and hence very few transaction costs. The latter is a constantly rebalanced version (EWcr), achieved by buying/selling assets at each period so that a constant share 1/d of the total portfolio value is kept in each of the d assets at all times. Even if these strategies may seem less powerful and a-theoretical, they are still very popular in practice (see for instance the case of participants in contribution plans in Benartzi and Thaler, 2001), and even their performance can be very relevant, as DeMiguel et al. (2009) showed, claiming that more sophisticated strategies never fully outperform the naïve ones in a number of experiments on different datasets.

Finally, the last benchmark is an a-posteriori strategy, which works as an even more useful reference than the others. It is the best constantly rebalanced portfolio in hindsight, that is, the best portfolio among those which keep the same distribution of wealth across the stocks from day to day. It is computed in the following way:

$$\arg\max_{b \in \Delta_d} \prod_{i=1}^{n} \langle b, x_i \rangle.$$


Since the logarithm is an increasing transformation, the maximizer stays the same, but the product can be rewritten as a sum, making the computation easier:

$$\arg\max_{b \in \Delta_d} \sum_{i=1}^{n} \log \langle b, x_i \rangle.$$

This strategy does not need any estimation and should perform better than all the other benchmarks, since it is computed with all the realizations of the returns already available. However, since the kernel-based strategy considered here is universal not only with respect to constantly rebalanced portfolios but for a broader class, that is, all stationary and ergodic processes, it is even possible to outperform it, especially when the estimation and evaluation windows become larger and the strategy can achieve its full potential. In Györfi et al. (2006) the ex-post BCRP often performs worse than the kernel log-optimal strategy.
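For completeness, a sketch of the ex-post BCRP computation over an evaluation window is shown below (placeholder data; fmincon from the Optimization Toolbox is assumed).

```matlab
% Best constantly rebalanced portfolio in hindsight (illustrative sketch).
X    = 1 + 0.01*randn(1000, 8);             % placeholder n-by-d matrix of gross returns
d    = size(X, 2);
negW = @(b) -sum(log(max(X*b, 1e-12)));     % negative sum of log portfolio returns
opts = optimoptions('fmincon', 'Display', 'off');
b_bcrp = fmincon(negW, ones(d,1)/d, [], [], ones(1,d), 1, ...
                 zeros(d,1), ones(d,1), [], opts);
```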

The table below summarizes all the strategies considered:

Up to now the word performance has been used in a broad sense; however, it is necessary to specify how performance will be assessed in this study. This links back to section 2: the main purpose of the work is to study a methodology whose literature aims to provide a more practical approach than mean variance, which might be suitable even for investors and portfolio managers. For this reason a risk-adjusted measure such as the Sharpe ratio will not be considered. The main indicator will be the wealth generated by each portfolio at the end of the period, starting from an initial capital $S_0 = 1$, which can be defined as:

$$S_n = \prod_{i=1}^{n} \langle b_i, \, \mathbf{1} + x_i \rangle,$$

where $x_i$ represents the vector of asset returns at time $i$ and $b_i$ the portfolio considered in the same period.

Symbol   Strategy                                         Role
BK       Kernel-based semi-log-optimal portfolio          tested strategy
EWcr     Equal-weight portfolio, constantly rebalanced    benchmark
EWbh     Equal-weight portfolio, buy&hold                 benchmark
MV       Out-of-sample mean-variance portfolio            benchmark
MINV     Minimum variance portfolio                       benchmark


Moreover, to make meaningful comparisons among different lengths of the evaluation period, the whole capital growth will be annualized, so that there is a common ground to immediately display the goodness of a specific setting of the algorithm variables. The measure used is the average annual yield (AAY), computed as:

$$AAY = \frac{\log S_n}{n},$$

where in this case n denotes the number of years in the evaluation period.
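A small sketch of how these two measures can be computed from a sequence of portfolios and realized returns is shown below; the inputs and the 252 trading days per year are assumptions made for the example, not values from the thesis.

```matlab
% Final wealth S_n and average annual yield (illustrative sketch).
d = 6;  ndays = 2520;
R = 0.01*randn(d, ndays);            % placeholder net returns, one column per day
B = ones(d, ndays) / d;              % placeholder portfolios (here: EWcr)

per   = sum(B .* (1 + R), 1);        % daily portfolio factors <b_i, 1 + x_i>
Sn    = prod(per);                   % final wealth for S0 = 1
years = ndays / 252;                 % assumed number of trading days per year
AAY   = log(Sn) / years;             % average annual yield
```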

Also the computation time will be considered for the kernel based strategy, just to give an idea of how the different settings affect the complexity of the algorithm, even for such a small experiment.


SECTION 4 – ANALYSIS AND DISCUSSION OF RESULTS

This section aims to show the empirical results obtained following the methodology previously explained, trying to provide some insights into the reasons behind the observed performances. A comparison with previous results from the literature follows, as well as a summary of the weak points of the work.

As already stated, the scope of the analysis can turn out to be extremely wide since many variables affect the performance of the algorithm, in particular one needs to consider:

- the starting date (t0): it affects the length of the evaluation period. Results from Györfi et al. (2006) displayed a large difference in performance as the starting date goes back in time;

- the estimation window length (t): one may expect a larger window to increase the number of training data available for the kernel to find similar periods in the series, hence giving higher and more precise results;

- the number of assets (d): a higher number of assets should make it more likely to find predictable patterns and hence raise the performance, even if this is not guaranteed;

- the number of experts (K, L): as before, more experts mean more reference predictors to combine in the Bk strategy, hence the performance should improve.

Throughout the analysis all these relations will be validated or rejected according to the results; moreover, the impact on computational time will be considered when switching to different settings of the parameters. Due to the large number of possible combinations, only the most relevant results are shown.

The first dataset considered is CAPdata, which includes the most capitalized North American companies over the last years. Table 1 summarizes the performance, showing both the final capital achieved (Sn) and the annualized average yield (AAY), grouped by starting date.


The setting with which Bk achieves the highest annual result uses an 800-day estimation window with 8 assets starting from 2002: in that case, investing $1 in 2002 in a portfolio following the algorithm would have resulted in $105.80 at the end of 2018. However, other configurations starting both from 2002 and from 2009 achieve almost the same result, that is, an average annual growth of around 11.50%. The lowest performances are in general those under the 2014 column; for earlier starting dates only one combination grows by less than 9% per year, which means that when starting further in the past the results are no longer dramatically affected by the parameter settings.

Considering the effects of changes in a single parameter, raising the estimation window length (t) in general improves the performance, given the same number of assets. However, the gain is significant only when the window is small, as happens when moving from a 100-day window to 200 days both starting from 2014 and from 2009, with a performance improvement of more than 1% per year. Instead, moving from 1000 to 1500 days basically gives the same results. When the starting point is further in the past, the annual performance is barely affected by t, even if, taking into account the longer span of application of the algorithm, a small difference can be relevant, for instance making an investor obtain $115 per dollar instead of $90 when starting from 2002. The number of assets has a different pattern, and its effect depends on the starting point: when the evaluation period is short (as from 2014), counterintuitively, a larger number of assets reduces the performance, while with a longer evaluation period (as from 2009 and 2006) more assets translate into higher growth. Nevertheless, even the number of assets does not affect the performance too much. The most relevant factor is instead the length of the evaluation period, that is, the further in the past the starting point is placed, the better the average performance. To this aim, one can easily notice in Table 1 how the results from 2009 are generally better than those from 2014, and even starting from 2006 and 2002 brings better results. The only exception might seem to be the performances from 2006, which are lower than those from 2009; however, this can be explained by the non-negligible effect of the financial crisis.

Table 1 – Sn and AAY for Bk using CAPdata dataset

Starting dates 2014 and 2009:

  t      d    Sn (2014)   AAY (2014)   Sn (2009)   AAY (2009)
  100    6    2.19        6.81%        6.055       7.82%
  200    6    2.595       8.28%        9.112       9.60%
  200    15   2.456       7.80%        9.77        9.90%
  400    6    2.326       7.33%        11.344      10.55%
  400    8    1.924       5.68%        11.172      10.48%
  600    4    2.386       7.55%        13.413      11.28%
  600    6    2.013       6.08%        13.858      11.42%
  1000   8    2.088       6.39%        14.056      11.48%
  1500   8    2.115       6.51%        14.572      11.64%

Starting dates 2006 and 2002:

  t      d    Sn (2006)   AAY (2006)   Sn (2002)   AAY (2002)
  400    8    16.976      9.46%        90.469      11.51%
  800    8    17.298      9.52%        105.797     11.91%
  1500   8    17.145      9.49%        -           -

Final wealth (Sn) achieved for S0 = 1$ and annual average yield (AAY) of the Bk strategy under various settings. All results use K = L = 5, hence 25 experts, and run until 31/12/2018; t = estimation window length, d = number of assets considered. *The smaller number of results for 2006 and 2002 is also due to the computational time needed.


Even if these are big companies, the period 2007-2009 has been negative for all of them, significantly reducing the overall performance of the algorithm when starting from 2006. Indeed, even the annual growth from 2002 is only slightly higher than that from 2009.

The standalone results, however, cannot tell the whole story; for this reason a comparison is presented between the performance of the semi-log kernel-based strategy and that of the other portfolios from the literature introduced in section 3.3. Table 2 is formatted so that cells are colored based on the value they contain: lower values in red, higher values in green. For simplicity only the AAY are reported, again taking into account the starting date (t0), the estimation window (t) and the number of assets (d).
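If, as the reference to Gyorfi et al (2007) suggests, the semi-log variant replaces the logarithm in the experts' objective with its second-order Taylor expansion around 1, each expert solves a quadratic problem instead of a log-utility one, which reduces computational time. The lines below are only an illustrative drop-in replacement for the objective in the earlier sketch, not the exact formulation of the appendix script.

# Hedged sketch: semi-log approximation h(z) of log(z), i.e. a second-order
# Taylor expansion around z = 1, used in place of the mean log-return.
import numpy as np

def semi_log(z):
    # h(z) = (z - 1) - 0.5 * (z - 1)^2, approximating log(z) near z = 1
    return (z - 1.0) - 0.5 * (z - 1.0) ** 2

def semi_log_objective(b, X):
    # Quantity a semi-log-optimal expert maximizes over the matched return relatives X
    return np.mean(semi_log(X @ b))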

The highest performance on average is almost always achieved by the best constantly rebalanced portfolio (BCRP), which is constructed already knowing the realized returns of every single period. Hence, when an ex-ante portfolio approaches its performance, the result is a remarkable one. The kernel-based portfolio Bk is always slightly below the BCRP level; however, as the starting date moves backward it gets closer and closer, eventually reaching an almost equal average growth, especially when large estimation windows are used. Compared to the other portfolios, instead, even at a glance from the green color, Bk is by far the best strategy. The minimum variance and the constantly rebalanced equal-weights portfolios perform really poorly in relative terms, and their pattern is also quite different from that of Bk, since their results worsen when the strategy starts earlier. In particular, in the case of MINV a larger estimation window even has a negative effect on the AAY. Also a higher number of assets, which should bring the benefits of diversification, ends up penalizing these portfolios.

  Start   t      d    Bk       MV       EWbh     EWcr     MINV     BCRP
  2014    100    6    6.81%    5.10%    5.02%    4.70%    4.28%    9.72%
          200    6    8.28%    4.78%    5.02%    4.70%    4.11%
          200    15   7.80%    3.80%    3.47%    3.09%    3.39%
          400    6    4.72%    4.43%    5.02%    4.70%    3.78%
          400    8    5.68%    3.73%    3.78%    2.80%    2.42%
          600    6    6.08%    1.79%    5.02%    4.70%    3.42%
          1500   8    6.51%    7.09%    3.78%    2.80%    2.73%
  2009    100    6    7.82%    5.96%    7.18%    6.52%    5.42%    11.66%
          200    6    9.60%    6.11%    7.18%    6.52%    5.04%
          200    15   9.90%    6.26%    5.37%    5.04%    4.48%
          400    6    10.55%   6.40%    7.18%    6.52%    4.92%
          400    8    10.48%   5.77%    6.30%    5.31%    4.03%
          600    6    11.42%   6.95%    7.18%    6.52%    4.55%
          1500   8    11.64%   11.78%   6.30%    5.31%    4.04%
  2006    400    8    9.46%    5.00%    5.02%    4.38%    3.54%    9.54%
          800    8    9.52%    8.11%    5.02%    4.38%    3.24%
          1500   8    9.49%    9.68%    5.02%    4.38%    3.44%
  2002    400    8    11.51%   7.59%    7.09%    4.43%    3.49%    11.94%
          800    8    11.91%   9.80%    7.09%    4.43%    3.06%

Table 2 – Color map expressing the AAY of Bk and of the other strategies from the literature; higher values are shaded green, lower values red.


The situation changes as one moves to the other two portfolios. EWbh generally has an intermediate performance and does not fall far behind the algorithm when the estimation window is short, especially when implemented in the later years (2009, 2014).

The most interesting result is that of the out-of-sample mean-variance portfolio, whose performance is strongly affected by the estimation window length: when t is small it is outperformed even by the naïve portfolios, but it then catches up until it becomes the only portfolio able to do better than Bk when the window is widened to 1500 trading days. Moreover, in that case it even does slightly better than the BCRP. The key point here is the convergence of the performances of mean-variance, the kernel-based strategy and BCRP when the estimation and evaluation windows are very wide. However, Bk performs in general far better than the other ex-ante portfolios, often achieving more than double their yearly growth.
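For completeness, the sketch below illustrates how the simpler benchmarks of Table 2 can be computed on the return relatives of the evaluation window. It is a schematic illustration under assumed names and conventions, not the code actually used; the estimation details of the MV and MINV portfolios are not reproduced here.

# Hedged sketch of three of the benchmarks in Table 2, assuming X is an (n x d)
# array of daily return relatives over the evaluation window.
import numpy as np
from scipy.optimize import minimize

def wealth_ew_buy_and_hold(X, s0=1.0):
    # EWbh: split s0 equally across the d assets at the start and never rebalance
    d = X.shape[1]
    return (s0 / d) * np.sum(np.prod(X, axis=0))

def wealth_ew_rebalanced(X, s0=1.0):
    # EWcr: rebalance back to equal weights every day
    b = np.full(X.shape[1], 1.0 / X.shape[1])
    return s0 * np.prod(X @ b)

def bcrp_weights(X):
    # BCRP: the constantly rebalanced portfolio maximizing realized final wealth,
    # hence computed with full hindsight on the evaluation window
    d = X.shape[1]
    res = minimize(lambda b: -np.sum(np.log(X @ b)), np.full(d, 1.0 / d),
                   bounds=[(0.0, 1.0)] * d,
                   constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},))
    return res.x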

The performance of the algorithm has been very good so far; however, these results can be better understood if compared to previous works. It turns out that in Gyorfi et al (2006) and Gyorfi et al (2007) Bk performs in quite a different way. At a first glance the performance they report is spectacular: one gets roughly 500$ per dollar invested when the window considered is comparable to setting the starting point in 2006 (roughly 3200 trading days); moreover, the comparable measure for the highest final capital here, which is 105 times the initial investment (when starting in 2002), is in their work the exponential figure of 5.500e+5. The authors show further impressive results, but with much longer evaluation windows. In general they can guarantee performances of around 20-30% per year. However, there are a few considerations which can partially explain this difference. First of all, the authors use a slightly different technique from the one used here, letting the estimation window increase over time, and this may account for the different speed of growth they found: when the window is short the growth is slower, and once the window is large enough capital increases much faster. Moreover, they use many assets (36 and 19 in the two datasets) while here usually 6 or 8 are adopted, and one has also to consider the different period covered by the datasets, which span from the sixties to the eighties, while here it is much more recent and involves the 2007-2008 crisis, which negatively affects the performance. Even from a computational point of view they probably had access to much better hardware, which allows the use of more assets and wider windows; this aspect will be considered again later. There is also a technical argument: when Gyorfi et al (2006) analyze their results in depth by showing the performance of pairs of stocks, they find that an extremely large benefit is given by the inclusion in the dataset of a certain asset12 which has somewhat more predictable dynamics, an opportunity the algorithm can effectively exploit. Therefore the choice of the assets can be fundamental too.


Finally, there is a more general issue to focus on, namely the different perspective of the two works: while the authors introduce a new methodology and therefore need to show that it can offer results both over a finite horizon and asymptotically, here, relying on the theory for the asymptotic part, the goal is to take a much more practically oriented view and assess whether this methodology can be a useful tool; that is, the focus is on the finite horizon with shorter estimation and evaluation windows. From this perspective the performance achieved in the results shown is not far from the comparable findings of the original paper, and this is a very encouraging finding.

In any case, the most interesting aspect of Gyorfi et al (2006) is not related to the single performances but to their dynamics. As the authors state, their tables show that Bk has a specific pattern: first an initial "learning phase" where the performance is already quite good, and then the exponential growth starts at a certain point, that is, after at least 4000-4500 trading days13. Here, instead, this kind of pattern does not appear: the results become better as the starting point is set further in the past, but there is a trend growth and not a specific point where things change dramatically. As already said, the main reasons can be found in the shorter period investigated as well as in the slightly different practical technique adopted.

Moving to the analysis of the volume-based dataset, the main results are listed in table 3.

Table 3 – Sn and AAY for Bk using VOLdata dataset

Starting dates 2014 and 2009:

  t      d    Sn (2014)   AAY (2014)   Sn (2009)   AAY (2009)
  200    6    3.937       11.90%       9.826       9.92%
  200    15   2.154       6.66%        4.869       6.87%
  400    6    2.234       6.98%        12.66       11.02%
  400    8    1.977       5.92%        13.052      11.16%
  1500   8    1.769       4.95%        14.531      11.62%

Starting dates 2006 and 2002:

  t      d    Sn (2006)   AAY (2006)   Sn (2002)   AAY (2002)
  400    8    16.47       9.36%        89.103      11.47%
  800    8    16.05       9.27%        -           -

Final wealth (Sn) achieved for S0 = 1$ and annual average yield (AAY) of the Bk strategy under various settings. All results use K = L = 5, hence 25 experts, and run until 31/12/2018; t = estimation window length, d = number of assets considered. The smaller number of results for 2006 and 2002 is due to the computational time needed.

The performances reported don’t show much differences compared to these of the other dataset (see table 1). The pattern is very similar in terms of the results for different starting periods. Instead the effects of rising the estimation window or the number of assets are more pronounced, the performance is reduced, often remarkably. This might be due to the dataset composition at a first glance, anyway the counterintuitive effects of the number of assets and estimation window length seems to partially invalidate the relations written at the begin of this section. The most striking result is that the shortest estimation window together with the shorter evaluation period, offer the best annual performance (11.90% yearly). This is just an exception to the general pattern but marks a significant difference with the other dataset.
