The KL divergence is an important tool for studying the distance between two probability distributions. Formally, given two distributions $p$ and $q$, the KL divergence is defined as

$KL(p || q) := \int p(x) \log(p(x)/q(x)) dx$

Note that $KL(p || q) \neq KL(q || p)$. Intuitively, a small $KL(p || q)$ means that there are few points that $p$ assigns high probability to but that $q$ does not. We can also think of $KL(p || q)$ as the number of bits of information needed to update from the distribution $q$ to the distribution $p$.

Suppose that p and q are both mixtures of other distributions: $p(x) = \sum_i \alpha_i F_i(x)$ and $q(x) = \sum_i \beta_i G_i(x)$. Can we bound $KL(p || q)$ in terms of the $KL(F_i || G_i)$? In some sense this is asking to upper bound the KL divergence in terms of some more local KL divergence. It turns out this can be done:

**Theorem:** If $\sum_i \alpha_i = \sum_i \beta_i = 1$ and $F_i$ and $G_i$ are all probability distributions, then

$KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right)$.

**Proof:** If we expand the definition, then we are trying to prove that

$\int \left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \alpha_i F_i(x)}{\sum \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_iF_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx$

We will in fact show that this is true for every value of $x$, so that it is certainly true for the integral. Using $\log(x/y) = -\log(y/x)$, re-write the condition for a given value of $x$ as

$\left(\sum \alpha_i F_i(x)\right) \log\left(\frac{\sum \beta_i G_i(x)}{\sum \alpha_i F_i(x)}\right) \geq \sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)$

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows by using Jensen's inequality on the $\log$ function:

$\sum_i \alpha_iF_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_iF_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)$

This proves the inequality and therefore the theorem. $\square$

**Remark:** Intuitively, if we want to describe $\sum \alpha_i F_i$ in terms of $\sum \beta_i G_i$, it is enough to first locate the $i$th term in the sum and then to describe $F_i$ in terms of $G_i$. The theorem is a formalization of this intuition. In the case that $F_i = G_i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights.