A Combinatorial-Bandit Algorithm for the Online Joint Bid/Budget Optimization of Pay-per-Click Advertising Campaigns

Alessandro Nuara, Francesco Trovò, Nicola Gatti, Marcello Restelli

Dipartimento di Elettronica, Informazione e Bioingegneria

Politecnico di Milano, Milano, Italy

{alessandro.nuara, francesco1.trovo, nicola.gatti, marcello.restelli}@polimi.it

Abstract

Pay-per-click advertising includes various formats (e.g., search, contextual, and social) with a total investment of more than 140 billion USD per year. An advertising campaign is composed of some subcampaigns—each with a different ad—and a cumulative daily budget. The allocation of the ads is governed by auction mechanisms. In this paper, we propose, for the first time to the best of our knowledge, an algorithm for the online joint bid/budget optimization of pay-per-click multi-channel advertising campaigns. We formulate the optimization problem as a combinatorial bandit problem, in which we use Gaussian Processes to estimate stochastic functions, Bayesian bandit techniques to address the exploration/exploitation problem, and a dynamic programming technique to solve a variation of the Multiple-Choice Knapsack problem. We experimentally evaluate our algorithm both in simulation—using a synthetic setting generated from a Yahoo! dataset—and in a real-world application for two months.

Introduction

Online advertising has received wide attention from the scientific world as well as from industry. In 2016, more than 72 billion USD was spent in search advertising (IAB 2016), which represents about 50% of the total market. The development of automatic techniques is crucial both for the publishers and the advertisers, and artificial intelligence can play a prominent role in this context. In this paper, we focus on pay-per-click advertising—including different formats, e.g., search, contextual, and social—in which an advertiser pays only once a user has clicked her ad.

An advertising campaign is characterized by a set of subcampaigns—each with a potentially different ad/targeting pair—and by a cumulative daily budget. Remarkably, a campaign may include subcampaigns on different channels, e.g., Google, Bing, Facebook. In pay-per-click advertising, to get an ad impressed, the advertisers take part in an auction, specifying a bid and a daily budget for each subcampaign (Qin, Chen, and Liu 2015). The advertisers' goal is to select these variables to maximize the expected revenue they get from the advertising campaign. The Generalized Second Price auction (GSP) is used for search advertising (King, Atkins, and Schwarz 2007). This auction is not truthful—i.e., the best bid for an advertiser may differ from her actual value per click—and learning algorithms are commonly used to learn the optimal bids. The Vickrey-Clarke-Groves mechanism (VCG) is instead used for contextual and social advertising (Varian and Harris 2014; Gatti et al. 2015). This auction is truthful only in the unrealistic case in which the daily budget is unlimited.¹ Thus, each advertiser needs to find the best bid and budget values in an online fashion, and this problem is currently open in the literature. In the present paper, we provide a novel algorithm, based on combinatorial bandit techniques (Chen, Wang, and Yuan 2013), capable of automating this task.

Related works. Only a few works in the algorithmic economic literature tackle the campaign advertising problem by combining learning and optimization techniques. More specifically, Zhang et al. (2012) study a scenario similar to the one we analyze in this paper. The authors take into account the problem of the joint bid/budget optimization of subcampaigns in an offline fashion. However, this algorithm suffers from some drawbacks. More precisely, the model of each subcampaign requires a huge number of parameters s.t. even achieving rough estimates requires a considerable amount of data, usually available only after the system has been running for several months. Furthermore, some parameters (e.g., the position of the ad for each impression and click) cannot be observed by an advertiser, not allowing the employment of the model in practice. Finally, the optimization problem formulated therein is nonlinear, and finding a good-quality approximate solution requires a long computational time. Thomaidou, Liakopoulos, and Vazirgiannis (2014) separate the optimization of the bid from that of the budget, using a genetic algorithm to optimize the budget and subsequently applying some bidding strategies. Markakis and Telelis (2010) study the convergence of some bidding strategies in a single-subcampaign scenario.

Online learning results are known only for the restricted cases with a single subcampaign and a budget constraint over the whole length of the campaign without temporal deadlines.² Ding et al. (2013) and Xia et al. (2015) work on a finite number of bid values and exploit a multi-armed bandit approach. Trovò et al. (2016) work on a continuous space of bids and show that assuring worst-case guarantees leads to a worsening of the average-case performance.

¹ Notably, the value per click is unknown at the setup of the campaign, not allowing an advertiser to bid truthfully.

² This last assumption rarely holds in real-world applications where, instead, results are expected by a given deadline.

Less related works concern daily budget optimization (Xu et al. 2015; Italia et al. 2017), bidding strategies in display advertising (Wang, Zhang, and Yuan 2016; Weinan et al. 2016; Zhang, Yuan, and Wang 2014; Lee, Jalali, and Dasdan 2013) and video advertising (Geyik et al. 2016). Finally, some works deal with the attribution problem of the conversions in display advertising (Geyik, A-Saxena, and Dasdan 2014; Kireyev, Pauwels, and Gupta 2016).

Original contributions. We formulate the optimization problem as a combinatorial-bandit problem (Chen, Wang, and Yuan 2013), where the different arms are the bid/budget pairs. In the standard multi-armed bandit problem, we are given a set of options called arms, at each turn we choose a single arm, and we observe only its reward. Differently, in a combinatorial-bandit problem we are given a set of superarms, i.e., elements of the power set of the arms—here corresponding to a combination of bid/budget pairs, one for each subcampaign—whose elements satisfy some set of combinatorial constraints—in our case, knapsack-like. At each round, we simultaneously play all the arms contained in the selected superarm and, subsequently, observe their rewards. We use Gaussian Process (GP) regression models (Rasmussen and Williams 2006) to estimate, for each subcampaign, the expected daily number of clicks for each bid/budget pair and the value per click. We estimate the value per click of a subcampaign separately from the number of clicks since, while the number of daily clicks and, thus, of the observed samples is usually large, allowing one to obtain accurate estimates in a short time, the acquisitions are usually much more sporadic and the estimation of the value per click may require a longer time. As a result, at the very beginning of the learning phase our algorithm maximizes the number of clicks, whereas, subsequently, the objective function is gradually tuned by more accurate estimates of the values per click of each subcampaign. This rationale is the same one currently followed by human experts. We design two Bayesian bandit techniques to balance exploration and exploitation in the learning process, which return samples of the stochastic variables estimated by the GPs. Finally, we discretize the bid/budget space, and we formulate the optimization problem as a special case of the Multiple-Choice Knapsack problem (Sinha and Zoltners 1979) in which we use the samples returned by the bandit algorithms, and we solve it in polynomial time by dynamic programming, in a fashion similar to the approximation scheme for the knapsack problem. The optimization is repeated every day.

We experimentally evaluate, using a realistic simulator based on the Yahoo! Webscope A3 dataset, the convergence of our algorithm to the optimal (clairvoyant) solution and its regret as the size of the problem varies. Furthermore, we evaluate our algorithm in a real-world campaign with several subcampaigns in Google AdWords for two consecutive months, obtaining the same number of acquisitions obtained by human experts, but halving the cost per acquisition.

Problem Formulation

We are given an advertising campaign $\mathcal{C} = \{C_1, \ldots, C_N\}$, with $N \in \mathbb{N}$, where $C_j$ is the $j$-th subcampaign, a finite time horizon of $T \in \mathbb{N}$ days, and a spending plan $B = \{\overline{y}_1, \ldots, \overline{y}_T\}$, where $\overline{y}_t \in \mathbb{R}^+$ is the cumulative budget one is willing to spend at day $t \in \{1, \ldots, T\}$.³,⁴ While the proposed definition is general w.r.t. the channel we target, in the specific case of search engines (where the subcampaigns are commonly targeted to multiple keywords), we assume that each subcampaign has been set s.t. its keywords behave similarly. As a consequence, a single decision for each group of keywords is required. For day $t \in \{1, \ldots, T\}$ and for every subcampaign $C_j$, the advertiser needs to specify the bid $x_{j,t} \in [\underline{x}_{j,t}, \overline{x}_{j,t}]$, where $\underline{x}_{j,t}, \overline{x}_{j,t} \in \mathbb{R}^+$ are the minimum and the maximum bid we can set, respectively, and the budget $y_{j,t} \in [\underline{y}_{j,t}, \overline{y}_{j,t}]$, where $\underline{y}_{j,t}, \overline{y}_{j,t} \in \mathbb{R}^+$ are the minimum and the maximum budget we can set, respectively. The goal is, for every day $t \in \{1, \ldots, T\}$, to find the best values of bids and budgets, maximizing the subcampaigns' cumulative expected revenue. These values can be found by solving the following optimization problem:

$$\max_{x_{j,t},\, y_{j,t}} \ \sum_{j=1}^{N} v_j\, n_j(x_{j,t}, y_{j,t}) \qquad \text{(1a)}$$
$$\text{s.t.} \quad \sum_{j=1}^{N} y_{j,t} \le \overline{y}_t \qquad \text{(1b)}$$
$$\underline{x}_{j,t} \le x_{j,t} \le \overline{x}_{j,t} \quad \forall j \qquad \text{(1c)}$$
$$\underline{y}_{j,t} \le y_{j,t} \le \overline{y}_{j,t} \quad \forall j \qquad \text{(1d)}$$

where $n_j(x_{j,t}, y_{j,t})$ is the expected number of clicks given the bid $x_{j,t}$ and the budget $y_{j,t}$ for subcampaign $C_j$, and $v_j$ is the value per click for the subcampaign $C_j$. Basically, this is a special case of the Multiple-Choice Knapsack problem (Kellerer, Pferschy, and Pisinger 2004) in which the objective function (1a)—corresponding to the value provided by the knapsack—is the weighted sum of the expected numbers of clicks of all the subcampaigns, where the weights are the subcampaigns' values per click. Constraint (1b) is a budget constraint, forcing one not to spend more than the budget limit, while constraints (1c) and (1d) define the ranges of the variables. Similarly to the knapsack problem, here we have items—i.e., the subcampaigns—each of which is characterized by a value and requires a portion of the budget. The differences are: the occupancy of an item is not a constant, being controllable by the assigned budget; we can decide on a further parameter, the bid, that does not have a corresponding parameter in the knapsack problem; the value of each item is not constant, but it depends on the decisions taken.
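To make the combinatorial structure concrete, the following minimal Python sketch solves a tiny instance of problem (1a)–(1d) by brute-force enumeration over discretized bid/budget pairs. The function and variable names are ours; the paper's actual solver, described later, uses dynamic programming instead.

```python
import itertools

def brute_force_solve(n, v, bids, budgets, y_max):
    """Exhaustive search of problem (1a)-(1d) over discretized values.

    n[j](x, y): expected clicks of subcampaign j at bid x and budget y;
    v[j]: value per click; bids[j], budgets[j]: feasible discrete values.
    Only viable for tiny instances; it illustrates the knapsack-like
    structure, not the paper's dynamic-programming solver.
    """
    N = len(v)
    per_campaign = [list(itertools.product(bids[j], budgets[j]))
                    for j in range(N)]
    best_value, best = float("-inf"), None
    # A choice of one (bid, budget) pair per subcampaign is a superarm.
    for choice in itertools.product(*per_campaign):
        if sum(y for _, y in choice) > y_max:   # constraint (1b)
            continue
        # Objective (1a): weighted sum of expected clicks.
        value = sum(v[j] * n[j](x, y) for j, (x, y) in enumerate(choice))
        if value > best_value:
            best_value, best = value, choice
    return best_value, best
```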

Since the function of the number of clicks $n_j(\cdot, \cdot)$ and the parameter specifying the value per click $v_j$ need to be estimated online, not being a priori known, the optimization problem can be naturally formulated in a sequential decision learning fashion (Cesa-Bianchi and Lugosi 2006) or, more precisely, as a combinatorial bandit problem (Chen, Wang, and Yuan 2013).⁵ Here, we would like to gather as much information as possible about the stochastic functions during the operational life of the system and, at the same time, we do not want to lose too much revenue in doing so (a.k.a. the exploration/exploitation dilemma). More precisely, the available options (a.k.a. arms) are all the values of bid $x_{j,t}$ and budget $y_{j,t}$ satisfying the combinatorial constraints of the optimization problem, while $n_j(\cdot, \cdot)$ and $v_j$ are stochastic functions defined on the feasible region of the variables that we need to estimate during the time horizon $T$. A policy $\mathcal{U}$ solving such a problem is an algorithm returning, for each day $t$ and subcampaign $C_j$, a bid/budget pair $(\hat{x}_{j,t}, \hat{y}_{j,t})$. Given a policy $\mathcal{U}$, we define the pseudo regret as:

$$R_T(\mathcal{U}) := T\, G^* - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{j=1}^{N} v_j\, n_j(\hat{x}_{j,t}, \hat{y}_{j,t})\right],$$

where $G^* := \sum_{j=1}^{N} v_j\, n_j(x_j^*, y_j^*)$ is the expected value provided by a clairvoyant algorithm, the set of bid/budget pairs $\{(x_j^*, y_j^*)\}_{j=1}^{N}$ is the optimal clairvoyant solution to the problem in Equations (1a)–(1d), and the expectation $\mathbb{E}[\cdot]$ is taken w.r.t. the stochasticity of the policy $\mathcal{U}$. Our goal is the design of algorithms minimizing the pseudo regret $R_T(\mathcal{U})$.

³ We assume that the campaign $\mathcal{C}$ and the spending plan $B$ are given.

⁴ For the sake of presentation, from now on we set the day as the unitary temporal step of our algorithm. Shorter time units can be used to tackle situations in which the users' behaviour is non-stationary over the day; the application of the proposed algorithm to different time units is straightforward.

⁵ Another approach to solving the problem is to use a multistage method, e.g., backward induction, but it would require a huge computational effort that makes the problem intractable.

Proposed Method

Initially, we provide an overview of our algorithm, named AdComB (Advertising Combinatorial Bandit algorithm), and, subsequently, we describe in detail the phases composing the algorithm.

The Main Algorithm

Algorithm 1 reports the high-level pseudocode of our method. For the sake of presentation, we distinguish three phases that are repeated each day $t$ (see Fig. 1). For each subcampaign $C_j$, the parameters of the algorithm are: a finite set $X_j$ of feasible bid values, a finite set $Y_j$ of feasible budget values, a model $\mathcal{M}_j^{(0)}$ capturing prior knowledge about the function $n_j(\cdot, \cdot)$ and the parameter $v_j$, a spending plan $B$, and a time horizon $T$.

In the first phase (Lines 4–8), named Estimation in Fig. 1, the algorithm learns, from the observations of days $\{1, \ldots, t-1\}$, the model $\mathcal{M}_j$ of the user behavior for each subcampaign $C_j$. More precisely, the model $\mathcal{M}_j$ provides a probability distribution over the number of clicks $n_j(x, y)$ as the bid $x$ and the budget $y$ vary, and over the value per click $v_j$. The first day the algorithm is executed, no observation is available, and thus the model $\mathcal{M}_j$ is based on the prior $\mathcal{M}_j^{(0)}$ (Line 5). Conversely, on the subsequent days, for each subcampaign $C_j$, the algorithm also gets the observations corresponding to day $t-1$ (Line 7), including: the actual number of clicks $\tilde{n}_{j,t-1}$, the actual total cost of the subcampaign $\tilde{c}_{j,t-1}$, the time $\tilde{r}_{j,t-1}$ at which the daily budget finished, if so, and the actual value per click $\tilde{v}_{j,t-1}$. Subsequently, the model $\mathcal{M}_j$ of each subcampaign is updated using those observations (Line 8).

Algorithm 1 AdComB

1: Parameters: sets $\{X_j\}_{j=1}^{N}$ of bid values, sets $\{Y_j\}_{j=1}^{N}$ of budget values, prior models $\{\mathcal{M}_j^{(0)}\}_{j=1}^{N}$, spending plan $B$, time horizon $T$
2: for $t \in \{1, \ldots, T\}$ do
3:   for $j \in \{1, \ldots, N\}$ do
4:     if $t = 1$ then
5:       $\mathcal{M}_j \leftarrow \mathcal{M}_j^{(0)}$
6:     else
7:       Get $(\tilde{n}_{j,t-1}, \tilde{c}_{j,t-1}, \tilde{r}_{j,t-1}, \tilde{v}_{j,t-1})$
8:       $\mathcal{M}_j \leftarrow$ Update$(\mathcal{M}_j, (\hat{x}_{j,t-1}, \hat{y}_{j,t-1}, \tilde{n}_{j,t-1}, \tilde{c}_{j,t-1}, \tilde{r}_{j,t-1}, \tilde{v}_{j,t-1}))$
9:     $X_{j,t} \leftarrow X_j \cap [\underline{x}_{j,t}, \overline{x}_{j,t}]$
10:    $Y_{j,t} \leftarrow Y_j \cap [\underline{y}_{j,t}, \overline{y}_{j,t}]$
11:    $(n_j(\cdot, \cdot), v_j) \leftarrow$ Sampling$(\mathcal{M}_j, X_{j,t}, Y_{j,t})$
12:  $\{(\hat{x}_{j,t}, \hat{y}_{j,t})\}_{j} \leftarrow$ Optimize$(\{n_j(\cdot, \cdot), v_j, X_{j,t}, Y_{j,t}\}_{j}, \overline{y}_t)$
13:  Set$(\{(\hat{x}_{j,t}, \hat{y}_{j,t})\}_{j})$

Figure 1: The information flow in the AdComB algorithm along the three phases (Estimation, Bandit Choice, and Optimization), from the spending plan to the Search, Social, and Contextual channels.


In the second phase (Lines 9–11), named Bandit Choice in Fig. 1, the algorithm chooses the values for the function $n_j(\cdot, \cdot)$ and the parameter $v_j$ using the model $\mathcal{M}_j$ just updated. More precisely, for each subcampaign $C_j$, the algorithm initially selects the bids $X_{j,t} := X_j \cap [\underline{x}_{j,t}, \overline{x}_{j,t}]$ and budgets $Y_{j,t} := Y_j \cap [\underline{y}_{j,t}, \overline{y}_{j,t}]$ that are feasible according to the given ranges (Lines 9–10). Subsequently, the algorithm draws, according to the probability distributions of $\mathcal{M}_j$, the samples of the function $n_j(\cdot, \cdot)$ for the feasible values of bid and budget in $X_{j,t}$ and $Y_{j,t}$, and of $v_j$ (Line 11).

In the third phase (Lines 12–13), named Optimization in Fig. 1, the algorithm uses the values of $n_j(\cdot, \cdot)$ and of $v_j$ as parameters of the problem in Equations (1a)–(1d) and solves this problem, returning the bid/budget pairs to be set for the current day $t$ (Line 13) in each different channel (denoted by Search, Social, and Contextual in Fig. 1).⁶
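The daily loop can be summarized with the following Python skeleton, a sketch under our own naming assumptions: the model objects, the `platform` interface, and the subroutine signatures are hypothetical placeholders mirroring Lines 1–13 of Algorithm 1.

```python
def adcomb(X, Y, prior_models, bid_limits, budget_limits,
           spending_plan, T, sampling, optimize, platform):
    """High-level daily loop of Algorithm 1 (AdComB), as a sketch.

    X[j], Y[j]: finite sets of bid/budget values for subcampaign j;
    prior_models[j]: prior model M_j^(0);
    bid_limits(j, t), budget_limits(j, t): feasible ranges (lo, hi);
    spending_plan: cumulative daily budgets, indexed by day t;
    sampling, optimize: the Sampling and Optimize subroutines;
    platform: hypothetical interface to the advertising channels.
    """
    N = len(X)
    models = list(prior_models)            # M_j <- M_j^(0) (Line 5)
    for t in range(1, T + 1):
        instances, feas_X, feas_Y = [], [], []
        for j in range(N):
            if t > 1:
                # Day t-1 observations: clicks, total cost, time at
                # which the budget finished, value per click (Lines 7-8).
                obs = platform.get_observations(j, t - 1)
                models[j] = models[j].update(obs)
            lo_x, hi_x = bid_limits(j, t)   # Lines 9-10: feasible arms
            lo_y, hi_y = budget_limits(j, t)
            feas_X.append([x for x in X[j] if lo_x <= x <= hi_x])
            feas_Y.append([y for y in Y[j] if lo_y <= y <= hi_y])
            # Line 11: draw one instance (n_j(., .), v_j) from M_j.
            instances.append(sampling(models[j], feas_X[j], feas_Y[j]))
        # Lines 12-13: solve the knapsack-like problem, set the pairs.
        pairs = optimize(instances, feas_X, feas_Y, spending_plan[t])
        platform.set_pairs(pairs, t)
```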

In what follows, we provide a detailed description of the model $\mathcal{M}_j$ and of the subroutines Update$(\cdot)$, Sampling$(\cdot)$, and Optimize$(\cdot)$ used in Algorithm 1.

⁶ Notice that, since the bid and the budget can assume a finite set of values, the problem in Equations (1a)–(1d) can be easily formulated as a Mixed Integer Linear Program (see the Supplemental Material for the mathematical programming formulation).

Model and Update Subroutine

As mentioned before, the goal of this subroutine is the estimation of the function $n_j(\cdot, \cdot)$ and the parameter $v_j$. The crucial issue, in this case, concerns the employment of a practical estimation model, providing a good tradeoff between accuracy and the time needed for the learning process. For instance, let us observe that a straightforward approach employing independent estimates of $n_j(\cdot, \cdot)$ for every bid/budget pair $(x, y)$ is not practical, since it would require a huge amount of observations and, thus, too many days to obtain accurate estimates. Suppose, for instance, to use 10 bid values and 10 budget values, for a total of 100 bid/budget pairs. Such a discretization would require a period of 100 days only to have a single observation per estimate, and years to have accurate estimates, thus making the algorithm useless in practice. Most of the methods for combinatorial bandits available in the state of the art (Chen, Wang, and Yuan 2013; Chen et al. 2016; Gai, Krishnamachari, and Jain 2010; Ontañón 2017) suffer from the same issue, not exploiting any correlation among the random variables corresponding to the arms' rewards.

To address this issue, we assume that the function $n_j(\cdot, \cdot)$ presents some regularities and, in particular, that the values of the function at different points in the bid/budget space are correlated. We capture this regularity by resorting to GPs (Rasmussen and Williams 2006). These models, developed in the statistical learning field, express the correlation of nearby points in the input space by exploiting a kernel function. Moreover, they provide a probability distribution over the output space—in our case, the number of clicks—for each point of the input space—in our case, the space of bids and budgets—thus giving information both on the expected values of the quantities to estimate and on their uncertainty.

In particular, we propose two approaches for $n_j(\cdot, \cdot)$. A straightforward approach, which will be used as a baseline in our experimental activity, employs a single GP defined on a 2-dimensional input space (details are provided in the Supplemental Material). Even if this method provides a flexible way of modeling the advertising phenomenon, it requires, due to the curse of dimensionality, a long initial phase before being effective. This issue is addressed by our second approach, which exploits an assumption on the structure of the problem. More precisely, the dependency of $n_j(x, y)$ on the bid $x$ and the budget $y$ is modeled by two 1-dimensional GPs combined in a nonlinear fashion. Formally, we assume:

$$n_j(x, y) := n_j^{sat}(x)\, \min\left\{1, \frac{y}{c_j^{sat}(x)}\right\}, \qquad (2)$$

where the two GPs employed for each subcampaign $C_j$ model:

– the maximum number of clicks $n_j^{sat}: X_j \to \mathbb{R}^+$ that can be obtained with a given bid $x$ without any budget constraint (or, equivalently, if we let $y \to +\infty$);

– the maximum cost incurred $c_j^{sat}: X_j \to \mathbb{R}^+$ with a given bid $x$ without any budget constraint, as above;

where the bid space is defined as $X_j := \cup_{t=1}^{T} [\underline{x}_{j,t}, \overline{x}_{j,t}]$. The rationale behind this decoupled model is that, given a bid $x$, the number of clicks increases linearly in the budget $y$, with a coefficient equal to the reciprocal of the average cost per click given $x$ (i.e., $n_j^{sat}(x)/c_j^{sat}(x)$), until the maximum number of obtainable clicks is achieved. Notice that the values of $n_j^{sat}(x)$ and $c_j^{sat}(x)$ depend on the average position in which the ad is displayed when bid $x$ is used and on the daily number of auctions. The larger $x$, the larger $n_j^{sat}(x)$ and $c_j^{sat}(x)$.
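As an illustration, Equation (2) translates directly into code once per-bid values of $n_j^{sat}$ and $c_j^{sat}$ are available (a minimal sketch; the function names are ours):

```python
def expected_clicks(n_sat, c_sat, x, y):
    """Decoupled click model of Equation (2): clicks grow linearly in
    the budget y, at rate n_sat(x)/c_sat(x) clicks per unit of budget,
    until the saturation value n_sat(x) is reached."""
    return n_sat(x) * min(1.0, y / c_sat(x))
```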

Now we focus on the modeling of the maximum number of clicks $n_j^{sat}(\cdot)$ with a GP regression model; the application of the same techniques to estimate the maximum cost $c_j^{sat}(\cdot)$ is analogous. We model $n_j^{sat}(\cdot)$ in a subcampaign $C_j$ with a GP over $X_j$, i.e., we use a collection of random variables s.t. any finite subset has a joint Gaussian distribution. Following the definition provided in (Rasmussen and Williams 2006), a GP is completely specified by its mean $m: X_j \to \mathbb{R}$ and covariance $k: X_j \times X_j \to \mathbb{R}$ functions. Hence, we denote the GP that models the maximum number of clicks in $C_j$ as follows:

$$n_j^{sat}(x) := GP(m(x), k(x, \cdot)), \quad \forall x \in X_j.$$

If we have a priori information about the process, we can use it to design a function $m(x)$ over the input space $X_j$ that specifies the initial mean value; e.g., if we have information about the maximum number of clicks $\theta$ we might reach for any bid, one might consider a linearly increasing function over the bid space such as $m(x) = \frac{\theta}{\max_t \overline{x}_{j,t}}\, x$. If no a priori information is available, we set $m(x) = 0$, $\forall x \in X_j$.

At the beginning of the optimization procedure ($t = 1$), we have the same predictive distribution at each point of the input space, i.e., $n_j^{sat}(x) \sim \mathcal{N}(m(x), k(x, x))$, where we denote with $\mathcal{N}(\mu, \sigma^2)$ the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. For each day $t$, we obtain a value for the maximum number of clicks by relying on:

$$\tilde{n}_{j,t}^{sat} := d(\tilde{r}_{j,t}, \tilde{n}_{j,t}),$$

where $d(\cdot, \cdot)$ is a function specifying the distribution of the clicks over the day. The function $d(\cdot, \cdot)$ can be estimated from historical data coming from past advertising campaigns of products belonging to the same category (e.g., toys, insurances, beauty products). The vector of the bids set so far $\hat{\mathbf{x}}_{j,t-1} := (\hat{x}_{j,1}, \ldots, \hat{x}_{j,t-1})^T$ and the vector of maximum numbers of clicks $\tilde{\mathbf{n}}_{j,t-1}^{sat} := (\tilde{n}_{j,1}^{sat}, \ldots, \tilde{n}_{j,t-1}^{sat})^T$ are used by the algorithm to refine the initial prior model $\mathcal{M}_j^{(0)}$. From the definition of GP, its restriction over a finite number of points is a multivariate Gaussian random variable, which can be used, for every $x \in X_j$, to predict the expected value $\mu_{j,t-1}(x)$ and the variance $\sigma_{j,t-1}^2(x)$ of the maximum number of clicks in the following way:

$$\mu_{j,t-1}(x) = m(x) + K(x, \hat{\mathbf{x}}_{j,t-1})\, \Phi^{-1} \left[\tilde{\mathbf{n}}_{j,t-1}^{sat} - (m(\hat{x}_{j,1}), \ldots, m(\hat{x}_{j,t-1}))^T\right],$$
$$\sigma_{j,t-1}^2(x) = k(x, x) - K(x, \hat{\mathbf{x}}_{j,t-1})\, \Phi^{-1} K(x, \hat{\mathbf{x}}_{j,t-1})^T,$$

where we define $\Phi := K(\hat{\mathbf{x}}_{j,t-1}, \hat{\mathbf{x}}_{j,t-1}) + \sigma_n^2 I$, $I$ is the identity matrix of order $t-1$, $\sigma_n^2$ is the noise variance, and the $(i, h)$-element of the Gram matrix $K(\mathbf{x}, \mathbf{x}')$ is $k(x_i, x'_h)$, where $x_i$ is the $i$-th element of the generic vector $\mathbf{x}$ and $x'_h$ the $h$-th element of the generic vector $\mathbf{x}'$. Hence, the maximum number of clicks for the bid $x$ is distributed as the Gaussian $\mathcal{N}(\mu_{j,t-1}(x), \sigma_{j,t-1}^2(x))$.
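These predictive equations are standard GP regression; a minimal numpy sketch, assuming a zero prior mean $m(x) = 0$ and the squared exponential kernel adopted later in the experiments (kernel parameters are illustrative), is:

```python
import numpy as np

def se_kernel(a, b, sigma_f=1.0, ell=0.2):
    """Squared exponential kernel k(z, z') = sigma_f^2 exp(-(z-z')^2 / l)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return sigma_f**2 * np.exp(-np.subtract.outer(a, b)**2 / ell)

def gp_posterior(x_hist, n_sat_hist, x_star, sigma_n=1.0):
    """Posterior mean/variance of n_sat at the bids x_star, given the
    bids x_hist set so far and the transformed observations n_sat_hist
    (zero prior mean is assumed)."""
    Phi = se_kernel(x_hist, x_hist) + sigma_n**2 * np.eye(len(x_hist))
    K_star = se_kernel(x_star, x_hist)          # K(x, x_hat_{j,t-1})
    mu = K_star @ np.linalg.solve(Phi, np.asarray(n_sat_hist, float))
    reduction = np.einsum('ij,ji->i', K_star,
                          np.linalg.solve(Phi, K_star.T))
    var = se_kernel(x_star, x_star).diagonal() - reduction
    return mu, var
```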

Regarding the value per click $v_j$, at day $t$ we estimate it by exploiting the data $\tilde{v}_{j,h}$ recorded during the days $h < t$. Resorting to the Central Limit Theorem, we have that the mean value per click $v_j$ is asymptotically Gaussian distributed; thus, at each day $t$, it is sufficient to estimate its mean $\nu_{j,t}$ and variance $\phi_{j,t}^2$ as follows:

$$\nu_{j,t-1} := \frac{\sum_{h=1}^{t-1} \tilde{v}_{j,h}}{t-1}, \qquad \phi_{j,t-1}^2 := \frac{\sum_{h=1}^{t-1} (\tilde{v}_{j,h} - \nu_{j,t-1})^2}{(t-1)(t-2)}.$$

Overall, the model $\mathcal{M}_j$ corresponding to a subcampaign $C_j$ at a day $t$ consists of the following vectors: the values per click $\tilde{\mathbf{v}}_{j,t-1} := (\tilde{v}_{j,1}, \ldots, \tilde{v}_{j,t-1})^T$, the selected bids $\hat{\mathbf{x}}_{j,t-1}$, the maximum numbers of clicks $\tilde{\mathbf{n}}_{j,t-1}^{sat}$, and the maximum costs $\tilde{\mathbf{c}}_{j,t-1}^{sat}$ (defined similarly to $\tilde{\mathbf{n}}_{j,t-1}^{sat}$). Therefore, the Update subroutine of Algorithm 1 includes the incoming data $(\tilde{n}_{j,t-1}, \tilde{c}_{j,t-1}, \tilde{r}_{j,t-1}, \tilde{v}_{j,t-1})$, properly transformed, in the aforementioned vectors.⁷

⁷ The computational cost of the proposed solution can be dramatically reduced by using an alternative, but much more involved, solution where the inverse of the Gram matrix $K(\hat{\mathbf{x}}_{j,t-1}, \hat{\mathbf{x}}_{j,t-1})^{-1}$ is stored and updated iteratively each day; see (Bishop 2006).
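For the value per click, the Update step amounts to maintaining the sample mean and the variance of the mean; a sketch (assuming at least two past observations, so the denominator is nonzero):

```python
import numpy as np

def value_per_click_posterior(v_hist):
    """Mean nu and variance phi^2 of the (asymptotically Gaussian)
    mean value per click, from the values observed on days h < t."""
    v = np.asarray(v_hist, float)
    m = len(v)                       # m = t - 1 past observations
    nu = v.mean()
    phi2 = np.sum((v - nu)**2) / (m * (m - 1))
    return nu, phi2
```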

Sampling Subroutine

The models $\mathcal{M}_j$ we estimate for each subcampaign provide a probability distribution over the function $n_j(\cdot, \cdot)$ and the value $v_j$ and, therefore, over the possible instances of the optimization problem in Equations (1a)–(1d). The Sampling subroutine generates, from $\mathcal{M}_j$, a single instance of the optimization problem, assigning a value to $n_j(x, y)$ for every $x \in X_j$, $y \in Y_j$, and a value to $v_j$. In the present paper, we propose two novel Bayesian approaches, namely AdComB-TS and AdComB-BUCB, taking inspiration from the Thompson Sampling (TS) algorithm (Thompson 1933) and the BayesUCB algorithm (Kaufmann, Cappé, and Garivier 2012), respectively. We resort to the Bayesian approach since, in most of the bandit scenarios, it leads to better performance than that of the frequentist counterparts; see, e.g., (Chapelle and Li 2011; Granmo 2010; May et al. 2012; Paladino et al. 2017).

The AdComB-TS algorithm generates the values for $n_j^{sat}(x)$ and $c_j^{sat}(x)$ by drawing samples from the posterior distributions provided by the GPs. More formally, at a given day $t$, for each bid $x \in X_{j,t}$, we draw a sample for $n_j^{sat}(x)$ and a sample for $c_j^{sat}(x)$ as follows:

$$n_j^{sat}(x) \sim \mathcal{N}(\mu_{j,t-1}(x), \sigma_{j,t-1}^2(x)),$$
$$c_j^{sat}(x) \sim \mathcal{N}(\eta_{j,t-1}(x), s_{j,t-1}^2(x)),$$

where $\eta_{j,t-1}(x)$ and $s_{j,t-1}^2(x)$ are the mean and the variance for bid $x$, respectively, estimated by the GP modeling $c_j^{sat}(x)$ (we recall that the definitions of $\mu_{j,t-1}(x)$ and $\sigma_{j,t-1}^2(x)$ are provided in the previous section). Similarly, for the value per click we draw a new sample:

$$\hat{v}_j \sim \mathcal{N}(\nu_{j,t-1}, \phi_{j,t-1}^2).$$
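In code, these draws are plain Gaussian samples from the posteriors (a sketch; the per-bid posterior means and variances are assumed precomputed, e.g., with `gp_posterior` above):

```python
import numpy as np

rng = np.random.default_rng()

def ts_sample(mu_n, var_n, mu_c, var_c, nu, phi2):
    """AdComB-TS draws: one Gaussian sample of n_sat and c_sat per
    feasible bid (mu_*/var_* are arrays over X_{j,t}) plus one sample
    of the value per click."""
    n_sat = rng.normal(mu_n, np.sqrt(var_n))
    c_sat = rng.normal(mu_c, np.sqrt(var_c))
    v_hat = rng.normal(nu, np.sqrt(phi2))
    return n_sat, c_sat, v_hat
```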

Conversely, the AdComB-BUCB algorithm generates samples for $n_j^{sat}(x)$ and $c_j^{sat}(x)$ exploiting different time-varying quantiles of the posterior distributions. More precisely, we use a high quantile (of order $1 - \frac{1}{t}$) for $n_j^{sat}(x)$ and $v_j$, and a low quantile (of order $\frac{1}{t}$) for $c_j^{sat}(x)$. This assures us to generate optimistic bounds, which are necessary for the convergence of the algorithm to the optimal solution. Let us denote with $q(\mu, \sigma^2, p)$ the quantile of order $p$ of a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. At day $t$, for each bid $x \in X_{j,t}$, we generate samples for $n_j^{sat}(x)$ and $c_j^{sat}(x)$ as:

$$n_j^{sat}(x) = q\left(\mu_{j,t-1}(x), \sigma_{j,t-1}^2(x), 1 - \frac{1}{t}\right),$$
$$c_j^{sat}(x) = q\left(\eta_{j,t-1}(x), s_{j,t-1}^2(x), \frac{1}{t}\right),$$

assuring that $n_j^{sat}(x)$ is a high-probability upper bound for the maximum number of clicks and $c_j^{sat}(x)$ is a high-probability lower bound for the maximum cost. Similarly, for the value per click $v_j$ we assign:

$$\hat{v}_j = q\left(\nu_{j,t-1}, \phi_{j,t-1}^2, 1 - \frac{1}{t}\right).$$
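The same sampling interface can be realized with deterministic Gaussian quantiles via scipy, e.g. (a sketch, valid for $t \ge 2$):

```python
import numpy as np
from scipy.stats import norm

def bucb_sample(mu_n, var_n, mu_c, var_c, nu, phi2, t):
    """AdComB-BUCB bounds at day t (t >= 2): quantile of order 1 - 1/t
    for n_sat and the value per click, of order 1/t for c_sat."""
    p = 1.0 - 1.0 / t
    n_sat = norm.ppf(p, loc=mu_n, scale=np.sqrt(var_n))
    c_sat = norm.ppf(1.0 - p, loc=mu_c, scale=np.sqrt(var_c))
    v_hat = norm.ppf(p, loc=nu, scale=np.sqrt(phi2))
    return n_sat, c_sat, v_hat
```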

Finally, given the values for $n_j^{sat}(x)$ and $c_j^{sat}(x)$ generated by one of the two aforementioned methods, we compute $n_j(x, y)$ as prescribed by Equation (2) for each $x \in X_{j,t}$ and each $y \in Y_{j,t}$.

Optimize Subroutine

Finally, we need to decide a single bid/budget pair to set at day $t$ for each subcampaign $C_j$. We resort to a modified version of the algorithm in (Kellerer, Pferschy, and Pisinger 2004) used for the solution of the knapsack problem. For the sake of simplicity, let us assume we set an evenly spaced discretization $Y$ of the daily cumulative budget $\overline{y}_t$ and that the feasible values for the budget are a subset of such a discretization, i.e., $Y_{j,t} \subseteq Y$, $\forall j, t$. At first, for each value of budget $y \in Y_{j,t}$, we define $z_j(y) \in X_{j,t}$ as the bid maximizing the number of clicks, formally:

$$z_j(y) := \arg\max_{x \in X_{j,t}} n_j(x, y).$$

The value $z_j(y)$ is easily found by enumeration. Then, for each value of budget $y \in Y$, we define $w_j(y)$ as the value we expect to receive by setting the budget of subcampaign $C_j$ equal to $y$ and the bid equal to $z_j(y)$, formally:

$$w_j(y) := \begin{cases} \hat{v}_j\, n_j(z_j(y), y) & \underline{y}_{j,t} \le y \le \overline{y}_{j,t} \\ 0 & y < \underline{y}_{j,t} \vee y > \overline{y}_{j,t} \end{cases}.$$

This allows one to discard $x$ from the set of variables of the optimization problem defined in Equations (1a)–(1d), leaving the variables $y$ as the only variables to deal with.


Finally, the optimization problem is solved in a dynamic programming fashion. We use a matrix $M(j, y)$ with $j \in \{1, \ldots, N\}$ and $y \in Y$, and we fill the matrix iteratively as follows. Each row is initialized as $M(j, y) = 0$ for every $j$ and $y \in Y$. For $j = 1$, we set $M(1, y) = w_1(y)$ for every $y \in Y$, corresponding to the best budget assignment for every value of $y$ if subcampaign $C_1$ were the only subcampaign in the problem. For $j > 1$, we set, for every $y \in Y$:

$$M(j, y) = \max_{y' \in Y,\, y' \le y} \left\{ M(j-1, y') + w_j(y - y') \right\}.$$

That is, the value in each cell $M(j, y)$ is found by scanning all the elements $M(j-1, y')$ for $y' \le y$, taking the corresponding value, adding the value given by assigning a budget of $y - y'$ to subcampaign $C_j$, and, finally, taking the maximum among all these combinations. At the end of the recursion, the optimal value of the optimization problem can be found in the cell corresponding to $\max_{y \in Y} M(N, y)$. To find the optimal assignment of the budget, it is sufficient to also store the partial assignments of the budget corresponding to the optimal value. The complexity of the aforementioned algorithm is $O(N H^2)$, i.e., it is linear in the number of subcampaigns $N$ and quadratic in the number of different values of the budget $H := |Y|$, where $|\cdot|$ is the cardinality operator. (Let us observe that, although the complexity is polynomial in $H$, it is pseudopolynomial in $\overline{y}_{j,t}$.) When $H$ is huge, the algorithm could require a long time. In that case, it is sufficient to reduce $H$ by rounding the values of the budget as in the FPTAS for the knapsack problem. This produces a $(1-\varepsilon)$-approximation of the optimal solution.
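A compact implementation of this dynamic program is sketched below, assuming, as in the text, an evenly spaced budget grid starting at 0 and per-subcampaign tables `w[j]` already maximized over bids:

```python
def optimize_budgets(w, budget_grid):
    """Dynamic program of the Optimize subroutine.

    budget_grid: evenly spaced values [0, d, 2*d, ...] (H entries);
    w[j][i]: value w_j(budget_grid[i]) of subcampaign j, already
             maximized over bids (i.e., with x = z_j(y)).
    Returns the optimal cumulative value and one optimal assignment.
    Complexity O(N * H^2).
    """
    N, H = len(w), len(budget_grid)
    M = [[0.0] * H for _ in range(N)]       # M(j, y) table
    pick = [[0] * H for _ in range(N)]      # budget index given to j
    M[0], pick[0] = list(w[0]), list(range(H))
    for j in range(1, N):
        for i in range(H):                  # cumulative budget index
            best, arg = float("-inf"), 0
            for k in range(i + 1):          # budget left to campaigns < j
                val = M[j - 1][k] + w[j][i - k]
                if val > best:
                    best, arg = val, i - k
            M[j][i], pick[j][i] = best, arg
    # Optimal value, then backtracking of the partial assignments.
    i = max(range(H), key=lambda h: M[N - 1][h])
    best_value, assignment = M[N - 1][i], [0] * N
    for j in range(N - 1, -1, -1):
        assignment[j] = pick[j][i]
        i -= pick[j][i]
    return best_value, [budget_grid[a] for a in assignment]
```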

Experimental Evaluation

We experimentally evaluate our algorithm both in a synthetic setting, necessary to evaluate the convergence and the regret of our algorithm, and in a real-world setting, necessary to assess its effectiveness when compared with the performance of human experts.

Evaluation in the Synthetic Setting

The synthetic setting we use has been generated according to data of the Yahoo! Webscope A3 dataset using the simulator developed by Farina and Gatti (2017).

Experiment 1. We consider $N = 4$ subcampaigns, each with a variable number of daily auctions drawn according to $z \sim \mathcal{N}(1000, 10)$. Each auction presents 5 slots and 10 advertisers. Each subcampaign is associated with a different truncated Gaussian distribution that is used at every day $t$ to draw the bid of each ad while, in the same way, the ads' click probability is drawn from a Beta distribution; see (Farina and Gatti 2017) for details. We set a constant cumulative budget per day $\overline{y}_t = 100$ over a time horizon of $T = 100$ days, with limits $\underline{y}_{j,t} = 0$, $\overline{y}_{j,t} = 100$ for every $t, j$, and we set bid limits to $\underline{x}_{j,t} = 0$ and $\overline{x}_{j,t} = 1$ for every $t, j$. Furthermore, we use an evenly spaced discretization of $|X_j| = 5$ bids and $|Y_j| = 10$ budgets over the aforementioned intervals. The values per click of each subcampaign are constant—thus letting the clicks be the only source of randomness—and drawn uniformly from $[0, 1]$. We assume a uniform distribution of the clicks over the day; thus, the function $d(\cdot, \cdot)$ has the following expression ($\tilde{r}_{j,h}$ is expressed in hours):

$$d(\tilde{r}_{j,h}, \tilde{n}_{j,h}) := \frac{24}{\tilde{r}_{j,h}}\, \tilde{n}_{j,h}.$$

We compare the AdComB-TS and AdComB-BUCB algorithms with:⁸

• AdComB-2D-TS, a version of AdComB using a single GP over the two-dimensional bid/budget space to estimate the number of clicks (see the Supplemental Material);

• AdComB-Mean, a version of AdComB selecting, at each day, the average values $\mu_{j,t-1}(x)$ and $\eta_{j,t-1}(x)$ for bid $x$, and $\nu_{j,t-1}$, to be used in the optimization procedure.

For the GPs used in the algorithms, we adopt a squared exponential kernel of the form:

$$k(z, z') := \sigma_f^2 \exp\left(-\frac{(z - z')^2}{l}\right),$$

where $\sigma_f, l \in \mathbb{R}^+$ are kernel parameters, whose values are chosen as prescribed by the GP literature; see (Rasmussen and Williams 2006) for details.

In addition to the pseudo regret $R_t(\mathcal{U})$, we also evaluate the instantaneous reward, which is defined as:

$$P_t(\mathcal{U}) := \sum_{j=1}^{N} v_j\, n_j(\hat{x}_{j,t}, \hat{y}_{j,t}).$$

We average the results over 100 independent runs.

In Fig. 2a, we report $P_t(\mathcal{U})$ of the 4 algorithms while, in Fig. 2b, we report their average $R_t(\mathcal{U})$. By inspecting the instantaneous reward, we can see that all the algorithms but AdComB-Mean present a slightly varying reward even at the end of the time horizon, since they incorporate the variance of the GP as information to select the budget over time. This variance is larger at the beginning of the process, thus incentivising exploration, and it fades as the number of observations increases, allowing the algorithm to reach the optimum asymptotically. Moving to Fig. 2b, we observe that the 2 Bayesian algorithms, namely AdComB-TS and AdComB-BUCB, present essentially the same performance, providing the best regret for every $t \le T$. For $t \le 20$, the regret of AdComB-Mean is essentially the same as that of the 2 Bayesian algorithms. Instead, for larger values of $t$, its relative performance worsens, reaching, at $t = 100$, a regret about 35% larger than the one of the AdComB-BUCB algorithm. This is because the exploration performed by AdComB-Mean is not sufficient. As a result, in all the runs, AdComB-Mean does not change the policy for, approximately, $t \ge 40$ and, in some of the 100 independent runs, it gets stuck in a suboptimal solution, as we can observe in Fig. 2a. Conversely, the 2 Bayesian algorithms, thanks to a wider exploration, converge to the optimal solution asymptotically in all the runs (the phenomenon is more evident on a longer time horizon; see the Supplemental Material). Finally, we observe that AdComB-2D-TS suffers from a larger regret than that of the other 3 algorithms—more than 100% compared with the regret of AdComB-TS and AdComB-BUCB at $t = 100$—and this is mainly accumulated over the first half of the time horizon.

⁸ The comparison with the algorithm by Chen, Wang, and Yuan (2013)—in a version accounting for Gaussian distributions—is unfair. Indeed, this algorithm requires 50 days to have a single sample per random variable and, for all $t \le T$, it purely explores the space of arms without any form of exploitation.

Figure 2: Results for the synthetic setting: instantaneous reward of Experiment 1 (a), pseudo regret over time for Experiment 1 (b), pseudo regret at the end of the time horizon for Experiment 2 (c) and Experiment 3 (d). The optimum value of the instantaneous reward $G^*$ is represented with a dashed line in (a). [Panels compare AdComB-TS, AdComB-Mean, AdComB-BUCB, and AdComB-2D-TS: $P_t(\mathcal{U})$ vs. $t$, $R_t(\mathcal{U})$ vs. $t$, $R_T(\mathcal{U})$ vs. $|X_j|$, and $R_T(\mathcal{U})$ vs. $N$.]

Experiment 2. The experimental setting is the same used above, except that we use $|X_j| \in \{5, 10, 20\}$, s.t. the set of bids $X_j$ with $|X_j| = 20$ includes $X_j$ with $|X_j| = 10$, which, in its turn, includes $X_j$ with $|X_j| = 5$. In Fig. 2c, we report the average $R_T(\mathcal{U})$ of the 4 algorithms. Even if increasing the number of bid values may allow one to increase the value of the optimal solution, we observe that the regret slightly increases as $|X_j|$ increases for the AdComB-TS and AdComB-2D-TS algorithms. This is because the algorithms pay a larger exploration cost. Remarkably, the extra cost from $|X_j| = 10$ to $|X_j| = 20$ is small. This result shows that the performance of the algorithms is robust to an increase in the number of possible bids.

Experiment 3. The experimental setting is the same used in Experiment 1, except that $N \in \{3, 4, 5, 6\}$. In Fig. 2d, we report the average regret $R_T(\mathcal{U})$ of the 4 algorithms. All the algorithms (except AdComB-2D-TS) do not suffer from a significant increase in the regret as the number of subcampaigns grows, showing that they scale well with the number of subcampaigns.

Evaluation in a Real-world Setting

We advertised a campaign for a loan product of a large Italian finance company with the AdComB-TS algorithm (the names of the product and the company, and other details omitted below, are not provided due to reasons of industrial secrecy). The advertising campaign was composed of $N = 13$ subcampaigns. The campaign was advertised for $T = 120$ days (4 months) in 2017, during which no further advertising campaigns (e.g., video, radio, television) were conducted, to avoid mutual effects between the campaigns. The 4 months during which the experiments have been conducted were chosen in such a way that, according to past observations, the click behaviour of the users was as homogeneous over time as possible. During the first 2 months, the bid/budget optimization was performed by human experts, leading to an average of 350 acquisitions per month with an average cost per acquisition of about 83 €. After the first 2 months, the optimization was performed by the AdComB-TS algorithm in a completely automated fashion. The goal of the company was to reduce the cost per acquisition, given that the cost of 104 € was considered excessively large, while keeping the same number of acquisitions per month obtained during the first 2 months. We used a discretization of 5 € for the budget values and 0.10 € for the bid values. The algorithm, implemented in Python 2.7.12 and executed on Ubuntu 16.04.1 LTS with an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, was used at midnight of each day to decide the bid/budget pairs for the next day. The maximum computation time of the algorithm during the 2 months was less than 1 minute. During the 2 months AdComB-TS was executed, it obtained 353 conversions with an average cost per acquisition of about 56 €. More precisely, in the first month the algorithm was used, the cost per acquisition was about 62 €, while, in the second one, it was about 50 €. Thus, the reduction of the cost per acquisition during the first month of execution of the algorithm was about 25%, while during the second one it was about 40%.

Conclusions and Future Works

In the current paper, we present AdComB, an algorithm capable of automatically deciding the values of the bid and the budget to set in an advertising campaign to maximize, in an online fashion, the value of the campaign given a spending plan. The algorithm exploits Gaussian Processes to estimate the users' model, combinatorial bandit techniques to address the exploration/exploitation dilemma, and optimization techniques to solve a knapsack-like problem. We propose two flavours of the algorithm, namely AdComB-TS and AdComB-BUCB, differing in the criterion used for the bandit choice. Experiments on both a realistic synthetic setting and a real-world setting show that our algorithms tackle the problem properly, outperforming other naive algorithms based on existing solutions and the human experts.

As future work, we plan to study the theoretical properties of the pseudo regret of our algorithm, as well as techniques to provide a proper setup of the subcampaigns. While in the present work we assume that the environment, including the users and the other advertisers, is stationary over time, we will investigate non-stationary environments, e.g., including in the model the option that there exists periodicity in the user behaviour, as well as some sudden changes due to the modification of the competitors' marketing policies. Moreover, another interesting line of research is to design methods to set up the subcampaigns and possibly modify their targeting over time, based on their performance.

Acknowledgments. This research has been funded by the Mediamatic company, part of the MMM group. We sincerely thank Roberto Coronel Da Silva, Enrico Dellavalle, and Paola Corbani for their valuable support.

References

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.

Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.

Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. In NIPS, 2249–2257.

Chen, W.; Wang, Y.; Yuan, Y.; and Wang, Q. 2016. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. J MACH LEARN RES 17(1):1746–1778.

Chen, W.; Wang, Y.; and Yuan, Y. 2013. Combinatorial multi-armed bandit: General framework and applications. In ICML, 151–159.

Ding, W.; Qin, T.; Zhang, X.-D.; and Liu, T. 2013. Multi-armed bandit with budget constraint and variable costs. In AAAI, 232–238.

Farina, G., and Gatti, N. 2017. Adopting the cascade model in ad auctions: Efficiency bounds and truthful algorithmic mechanisms. J ARTIF INTELL RES 59:265–310.

Gai, Y.; Krishnamachari, B.; and Jain, R. 2010. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In DySPAN, 1–9. IEEE.

Gatti, N.; Lazaric, A.; Rocco, M.; and Trovò, F. 2015. Truthful learning mechanisms for multi-slot sponsored search auctions with externalities. ARTIF INTELL 227:93–139.

Geyik, S. C.; A-Saxena; and Dasdan, A. 2014. Multi-touch attribution based budget allocation in online advertising. In ADKDD, 1–9.

Geyik, S. C.; Faleev, S.; Shen, J.; O'Donnell, S.; and Kolay, S. 2016. Joint optimization of multiple performance metrics in online video advertising. In SIGKDD, 471–480.

Granmo, O.-C. 2010. Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. IJICC 3(2):207–234.

IAB. 2016. IAB internet advertising revenue report 2016, full year results. https://www.iab.com. Online; accessed 21 July 2017.

Italia, E. M.; Nuara, A.; Trovò, F.; Restelli, M.; Gatti, N.; and Dellavalle, E. 2017. Internet advertising for non-stationary environments. In AMEC, 1–15.

Kaufmann, E.; Cappé, O.; and Garivier, A. 2012. On Bayesian upper confidence bounds for bandit problems. In AISTATS, 592–600.

Kellerer, H.; Pferschy, U.; and Pisinger, D. 2004. The Multiple-Choice Knapsack Problem. Springer. 317–347.

King, M.; Atkins, J.; and Schwarz, M. 2007. Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords. AM ECON REV 97(1):242–259.

Kireyev, P.; Pauwels, K.; and Gupta, S. 2016. Do display ads influence search? Attribution and dynamics in online advertising. INT J RES MARK 33(3):475–490.

Lee, K.-C.; Jalali, A.; and Dasdan, A. 2013. Real time bid optimization with smooth budget delivery in online advertising. In ADKDD, 1–9.

Markakis, E., and Telelis, O. 2010. Discrete strategies in keyword auctions and their inefficiency for locally aware bidders. In WINE, 523–530.

May, B. C.; Korda, N.; Lee, A.; and Leslie, D. S. 2012. Optimistic Bayesian sampling in contextual-bandit problems. J MACH LEARN RES 13(Jun):2069–2106.

Ontañón, S. 2017. Combinatorial multi-armed bandits for real-time strategy games. J ARTIF INTELL RES 58:665–702.

Paladino, S.; Trovò, F.; Restelli, M.; and Gatti, N. 2017. Unimodal Thompson sampling for graph-structured arms. In AAAI.

Qin, T.; Chen, W.; and Liu, T.-Y. 2015. Sponsored search auctions: Recent advances and future directions. ACM T INTEL SYST TEC 5(4):60:1–60:34.

Rasmussen, C. E., and Williams, C. K. 2006. Gaussian Processes for Machine Learning, volume 1. MIT Press.

Sinha, P., and Zoltners, A. A. 1979. The multiple-choice knapsack problem. OPER RES 27(3):503–515.

Thomaidou, S.; Liakopoulos, K.; and Vazirgiannis, M. 2014. Toward an integrated framework for automated development and optimization of online advertising campaigns. INTELL DATA ANAL 18(6):1199–1227.

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. BIOMETRIKA 25(3/4):285–294.

Trovò, F.; Paladino, S.; Restelli, M.; and Gatti, N. 2016. Budgeted multi-armed bandit in continuous action space. In ECAI, 560–568.

Varian, H. R., and Harris, C. 2014. The VCG auction in theory and practice. AM ECON REV 104(5):442–445.

Wang, J.; Zhang, W.; and Yuan, S. 2016. Display advertising with real-time bidding (RTB) and behavioural targeting. CoRR abs/1610.03013.

Weinan, W.; Rong, Y.; Wang, J.; Zhu, T.; and Wang, X. 2016. Feedback control of real-time display advertising. In WSDM, 407–416.

Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015. Thompson sampling for budgeted multi-armed bandits. In IJCAI, 3960–3966.

Xu, J.; Lee, K.-C.; Li, W.; Qi, H.; and Lu, Q. 2015. Smart pacing for effective online ad campaign optimization. In SIGKDD, 2217–2226.

Zhang, W.; Zhang, Y.; Gao, B.; Yu, Y.; Yuan, X.; and Liu, T.-Y. 2012. Joint optimization of bid and budget allocation in sponsored search. In SIGKDD, 1177–1185.

Zhang, W.; Yuan, S.; and Wang, J. 2014. Optimal real-time bidding for display advertising. In SIGKDD, 1077–1086.
