# Regret bound for mixture method

Summary: Derivation of an upper bound on the regret of the mixture method (the Krichevsky–Trofimov (KT) scheme) for individual sequence prediction over finite alphabets.

Suppose \mathcal{X} = \{1, \ldots, m\} for some integer m \geq 2, and let \Theta = \Delta_m denote the (m-1)-dimensional probability simplex. For any \theta \in \Theta, we will use p_\theta to denote the corresponding distribution on \mathcal{X}.

Consider the following game.

For t=1,2, \ldots, n:

• Player predicts a distribution q_n(\cdot|x^{t-1}) on \mathcal{X}
• Nature reveals an arbitrary element x_t \in \mathcal{X}
• Player incurs a loss of -\log(q_n(x_t|x^{t-1})).

The goal of the player is to select a distribution q_n whose cumulative loss is small when compared to the best distribution (in hindsight) among the family \{p_{\theta}^n: \theta \in \Theta\}. More formally, the goal is to sequentially select a distribution q_n on \mathcal{X}^n with small worst-case regret: \begin{aligned} r_n(q_n) &= \max_{x^n \in \mathcal{X}^n} \;\max_{\theta \in \Theta} \; \log \left( \frac{p_{\theta}^n(x^n)}{q_n(x^n)} \right ) \\ & = \max_{x^n \in \mathcal{X}^n} \; \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{q_n(x^n)} \right ), \end{aligned}

where \hat{\theta} is the maximum likelihood estimator and x^n = (x_1, \ldots, x_n).

The optimal worst-case regret, or the minimax regret, can then be defined as r_n = \min_{q_n} \;\max_{x^n \in \mathcal{X}^n} \; \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{q_n(x^n)} \right ).

## Mixture Method

In this note, we will analyze the regret of the mixture distribution that assigns probability \int_{\Theta} p_{\theta}^n(x^n) w(\theta) d\theta to x^n, for some prior distribution w on \Theta. In particular, we will consider the mixture distribution with \texttt{Dirichlet}(1/2, \ldots, 1/2) prior, and denote it by m_J(\cdot). Here the subscript indicates that w(\theta) is the Jeffreys prior in this case.

With this choice of prior, both the mixture distribution and its conditional distributions can be written in closed form. For any t \geq 1 and x^t \in \mathcal{X}^t, let T_{i,t} = \sum_{j=1}^t 1_{x_j=i} denote the number of occurrences of i in x^t. Then we can write m_{J}(x^n) as follows: \begin{aligned} &m_J(x^n) =\frac{D_m \left( T_{1,n} +1/2, \ldots, T_{m,n} + 1/2 \right) }{ D_m\left(1/2, \ldots, 1/2 \right)}, \qquad \text{where} \\ &D_m(\alpha_1, \ldots, \alpha_m) = \frac{\prod_{i=1}^m \Gamma(\alpha_i)}{\Gamma\left(\sum_{i'=1}^m \alpha_{i'}\right)}, \end{aligned} and \Gamma(\cdot) is the Gamma function. This gives the following closed-form expression for the conditional distribution: m_{J}(x_{t+1} = i|x^{t}) = \frac{T_{i,t} + 1/2}{t+m/2}, which can be used to sequentially implement m_J(x^n) using the identity m_J(x^n) = \prod_{t=1}^n m_J(x_t|x^{t-1}).
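As a sanity check, the sequential rule and the closed form can be compared numerically. Below is a minimal Python sketch (the function names are my own; only the standard library's `math.lgamma` and `collections.Counter` are used) that computes \log m_J(x^n) both ways:

```python
import math
from collections import Counter

def kt_sequential_logprob(xs, m):
    """log m_J(x^n) computed sequentially via the conditionals
    m_J(x_{t+1} = i | x^t) = (T_{i,t} + 1/2) / (t + m/2)."""
    counts = Counter()
    logp = 0.0
    for t, x in enumerate(xs):
        logp += math.log((counts[x] + 0.5) / (t + m / 2))
        counts[x] += 1
    return logp

def log_Dm(alphas):
    """log D_m(alpha_1, ..., alpha_m) via log-Gamma functions."""
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))

def kt_closed_form_logprob(xs, m):
    """log m_J(x^n) via the ratio of Dirichlet normalizers."""
    counts = Counter(xs)
    T = [counts[i] for i in range(1, m + 1)]
    return log_Dm([Ti + 0.5 for Ti in T]) - log_Dm([0.5] * m)
```

The two computations agree up to floating-point error, reflecting the identity m_J(x^n) = \prod_{t=1}^n m_J(x_t|x^{t-1}).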

We can now state the main result, which gives us a bound on the regret of m_J for any sequence x^n.

Theorem: Let C_m = \log (D_m(1/2, \ldots, 1/2)) and T_{\min} = \min_{i} T_{i,n}. Then, for any x^n \in \mathcal{X}^n, we have the following bound: \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{m_J(x^n)} \right) \leq \frac{m-1}{2} \log \left( \frac{n}{2\pi} \right) + C_m + \frac{m}{4 T_{\min} + 2} + o(1).

In particular, the above bound implies that if the relative frequencies of the symbols in x^n lie strictly inside the simplex (i.e., \lim_{n \to \infty} T_{\min} = \infty), then the regret incurred by m_J is (m-1)\log(n/2\pi)/2 + C_m + o(1). As proved in Theorem 2 of Xie and Barron (2000), this matches the minimax regret r_n asymptotically.
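To make the theorem concrete, one can evaluate both sides on sample sequences. The sketch below is illustrative code of my own, in which the o(1) term is instantiated with the explicit remainder m^2/(4n) + 1/(12n+6) that falls out of the proof given later in this note:

```python
import math
from collections import Counter

def log_Dm(alphas):
    """log D_m(alpha_1, ..., alpha_m) via log-Gamma functions."""
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))

def kt_regret(xs, m):
    """Exact regret log(p_hat(x^n) / m_J(x^n)) for a sequence over {1,...,m}."""
    n = len(xs)
    counts = Counter(xs)
    T = [counts[i] for i in range(1, m + 1)]
    log_phat = sum(Ti * math.log(Ti / n) for Ti in T if Ti > 0)
    log_mJ = log_Dm([Ti + 0.5 for Ti in T]) - log_Dm([0.5] * m)
    return log_phat - log_mJ

def kt_regret_bound(xs, m):
    """Theorem's bound with the o(1) term written out explicitly."""
    n = len(xs)
    counts = Counter(xs)
    T = [counts[i] for i in range(1, m + 1)]
    Cm = log_Dm([0.5] * m)
    return ((m - 1) / 2 * math.log(n / (2 * math.pi)) + Cm
            + m / (4 * min(T) + 2) + m ** 2 / (4 * n) + 1 / (12 * n + 6))
```

On random sequences the exact regret stays below the bound, and the gap is small when the counts are nearly balanced.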

Before proceeding to the proof, we first discuss the implication of the above result for the problem of betting on horse races. Here, horse races model betting games in which exactly one of m possible outcomes occurs in every round (i.e., only one out of m horses wins the race).

### Connections to betting

Consider n rounds of horse races, each with m horses numbered \\{1, \ldots, m \\}. Let X^n = (X_1, \ldots, X_n) denote the indices of the winning horses in these n rounds. Suppose a gambler begins with an initial wealth of 1 dollar, and bets a q(i|x^{t-1}) proportion of his wealth on horse i to win race t at odds of O(i|x^{t-1})-for-1. (Odds of a-for-1 on horse i mean the following: the gambler pays 1 dollar before the race to bet on horse i; if horse i wins, he gets back a dollars, while if horse i loses, he receives nothing.) Then his total wealth after n races is S_n(q, X^n) = \prod_{t=1}^n q(X_t | X^{t-1}) O(X_t|X^{t-1}) = q(X^n)O(X^n).

Now, if the winning indices X_1, \ldots, X_n were generated i.i.d. according to a distribution p_{\theta}, then the betting strategy that maximizes the expected growth rate of the wealth process is q^* = p_{\theta}; this is known as the Kelly (or proportional) betting strategy. To see this, note that \begin{aligned} q^* & \in \text{argmax}_{q} \; \mathbb{E}_{p_{\theta}} \left[ \log \left( O(X) q(X) \right) \right] \\ & = \text{argmax}_{q} \; \mathbb{E}_{p_{\theta}} \left[ \log\left( \frac{q(X)}{p_{\theta}(X)} \right) + \log \left( p_{\theta}(X) O(X) \right) \right] \\ & = \text{argmax}_{q} \; \left\{ \mathbb{E}_{p_{\theta}} \left[ \log \left( p_{\theta}(X) O(X) \right) \right] - D_{KL}(p_{\theta} \| q) \right\} \\ & = \text{argmin}_{q} \; D_{KL}(p_{\theta} \| q), \end{aligned} which implies that q^* = p_{\theta} due to the non-negativity of the KL divergence.
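A quick numerical illustration of Kelly betting, with made-up win probabilities and odds: for any alternative proportions q, the expected log-growth \mathbb{E}_{p_\theta}[\log(O(X)q(X))] never exceeds that of q = p_\theta, and the gap is exactly D_{KL}(p_\theta \| q).

```python
import math

def expected_log_growth(q, p, odds):
    """E_p[log(O(X) q(X))]: expected one-round log-growth of wealth
    when betting proportions q under win probabilities p and odds O."""
    return sum(pi * math.log(oi * qi) for pi, qi, oi in zip(p, q, odds))

p = [0.5, 0.3, 0.2]      # true win probabilities (illustrative)
odds = [2.0, 3.0, 6.0]   # O(i)-for-1 odds (illustrative)
kelly = expected_log_growth(p, p, odds)  # growth of proportional betting

# No alternative bet achieves a higher expected log-growth.
for q in ([1/3, 1/3, 1/3], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]):
    assert expected_log_growth(q, p, odds) <= kelly
```

The assertion is just the KL decomposition from the display above: the shortfall of any q relative to p_\theta equals D_{KL}(p_\theta \| q) \geq 0.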

Now, suppose we don’t make any probabilistic assumptions on the process generating X^n. We can still define the best constant betting strategy (in hindsight) for a sequence x^n as the one which maximizes the wealth: \max_{\theta} p_{\theta}^n(x^n) O(x^n) = p_{\hat{\theta}}^n(x^n) O(x^n).

If the gambler follows some strategy q_n for betting, then the ratio of the wealth gained by the two policies (i.e., p_{\hat{\theta}}^n and q_n) is given by \frac{S_n(p_{\hat{\theta}}^n, x^n)}{S_n(q_n, x^n)} = \frac{p_{\hat{\theta}}^n(x^n)}{q_n(x^n)}, independent of the odds offered. Hence, the problem of finding the distribution q_n that minimizes this ratio in the worst case (i.e., for the worst sequence x^n \in \mathcal{X}^n) reduces to the sequential prediction problem considered above. In particular, the regret bound for m_J stated earlier implies the following:

\log \left( \frac{S_n(p_{\hat{\theta}}^n, x^n)}{S_n(m_J, x^n)} \right) \leq \frac{m-1}{2} \log(n/2\pi) + C_m + o(1), assuming that the relative frequencies of x^n lie in the interior of the simplex. This further gives us the following bound: S_n(m_J, x^n) \geq S_n \left( p_{\hat{\theta}}^n, x^n \right) \left( \frac{2\pi}{n} \right)^{(m-1)/2} e^{-C_m - o(1)}.

### Proof of the Regret Bound

Throughout this section, we will use T_i instead of T_{i,n}. To prove the result, we first note the following:

\begin{aligned} \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{m_J(x^n)} \right) & = \log( p_{\hat{\theta}}^n(x^n) ) + C_m - \log \left( D_m\left(T_1+\frac{1}{2}, \ldots, T_m+\frac{1}{2}\right) \right) \qquad (1) \end{aligned}

Since \hat{\theta} = \left( (T_1/n), \ldots, (T_m/n) \right), we have \log(p_{\hat{\theta}}^n(x^n)) = \sum_{i=1}^m T_i \log(T_i/n) = -n \log n + \sum_{i}T_i \log(T_i). \qquad (2)

For the remaining term, we use Stirling’s approximation \Gamma(x) = x^{x-1/2}e^{-x} \sqrt{2\pi} e^{s_x} with s_x \in (0, 1/(12x)). Plugging this into the expression for A = D_m(T_1+1/2, \ldots, T_m+1/2), we get

\begin{aligned} A &= \frac{ \prod_{i=1}^m \Gamma(T_i + 1/2) }{\Gamma(n+m/2)} = \frac{ \prod_{i} \left( (T_i+1/2)^{T_i} e^{-(T_i+1/2)} \sqrt{2\pi} e^{s_i} \right) }{ (n+m/2)^{n+(m-1)/2} e^{-(n+m/2)} \sqrt{2\pi}e^{s_n} }, \end{aligned} where s_i \in (0, 1/(12(T_i+1/2))) and s_n \in (0, 1/(12(n+m/2))). Since \sum_i (T_i + 1/2) = n + m/2, the exponential factors cancel, and simplifying further we get \begin{aligned} A &= \frac{ \left( \prod_{i} (T_i+1/2)^{T_i} \right) (2\pi)^{(m-1)/2} }{(n+m/2)^{n+(m-1)/2} e^{s_n - \sum_{i}s_i} }, \end{aligned} which implies

\begin{aligned} \log(1/A) &= \left( n + \frac{m-1}{2} \right) \log(n+m/2) - \frac{m-1}{2} \log(2\pi) \\ &\ \qquad + (s_n - \sum_i s_i) - \sum_i T_i \log(T_i + 1/2). \qquad (3) \end{aligned}
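The form of Stirling's approximation used above, and the stated range of s_x, can be spot-checked against the standard library's `math.lgamma` (a sanity check, not part of the proof):

```python
import math

# Stirling's approximation: Gamma(x) = x^{x-1/2} e^{-x} sqrt(2*pi) e^{s_x}
# with s_x in (0, 1/(12x)).  Check the residual s_x numerically.
for x in [0.5, 1.0, 2.5, 10.0, 100.0]:
    s = math.lgamma(x) - ((x - 0.5) * math.log(x) - x + 0.5 * math.log(2 * math.pi))
    assert 0 < s < 1 / (12 * x)
```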

Plugging equations (2) and (3) into (1), we get the following:

\begin{aligned} \log \left( \frac{p_{\hat{\theta}}^n(x^n)}{m_J(x^n)} \right) &= \frac{m-1}{2} \log(n/2\pi) + C_m + \texttt{Rem}_n, \end{aligned}

where the remainder term is defined as

\begin{aligned} \texttt{Rem}_n &= -\sum_{i=1}^{m} T_i \log \left( 1 + \frac{1}{2T_i} \right) + n \log\left(1+\frac{m}{2n}\right) \\ & \quad + \frac{m-1}{2} \log \left( 1 + \frac{m}{2n} \right) + \left(s_n - \sum_{i=1}^m s_i\right). \qquad (4) \end{aligned}

To complete the proof, we will obtain upper bounds on the terms in the RHS of (4). First, we observe that s_n - \sum_{i=1}^m s_i < s_n < \frac{1}{12(n+m/2)} < \frac{1}{12n + 6}.

Next, we use the following inequality, valid for any x > 0: \frac{1}{2} - \frac{1}{4(x+1/2)} \leq \quad x \log \left(1 + \frac{1}{2x} \right) \quad \leq \frac{1}{2}, to bound the remaining terms in (4) as follows:

\begin{aligned} n \log\left( 1 + \frac{m}{2n} \right) & \leq \frac{m}{2} \\ \frac{m-1}{2} \log\left( 1 + \frac{m}{2n} \right) & \leq \frac{m(m-1)}{4n} \leq \frac{m^2}{4n} \\ -\sum_{i=1}^{m} T_i \log \left( 1 + \frac{1}{2T_i}\right) & \leq -\frac{m}{2} + \frac{m}{4(T_{\min} + 1/2)}. \end{aligned}
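The elementary inequality \frac{1}{2} - \frac{1}{4(x+1/2)} \leq x \log(1 + \frac{1}{2x}) \leq \frac{1}{2} driving these bounds can be spot-checked numerically over a wide range of x:

```python
import math

# check: 1/2 - 1/(4(x + 1/2)) <= x*log(1 + 1/(2x)) <= 1/2 for x > 0
for x in [0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 1e3, 1e6]:
    mid = x * math.log1p(1 / (2 * x))  # log1p for accuracy at large x
    assert 0.5 - 1 / (4 * (x + 0.5)) <= mid <= 0.5
```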

Combining the inequalities in the above display, we get \begin{aligned} \texttt{Rem}_n &\leq \frac{1}{12n + 6} + \frac{m^2}{4n} + \frac{m}{4T_{\min} + 2} \\ & = \frac{m}{4T_{\min} + 2} + o(1). \end{aligned} This completes the proof.