Local KL Divergence

The KL divergence is an important tool for studying how far apart two probability distributions are. Formally, given two distributions with densities $p$ and $q$, the KL divergence is defined as

$KL(p || q) := \int p(x) \log(p(x)/q(x)) dx$

Note that in general $KL(p || q) \neq KL(q || p)$, so the KL divergence is not a true distance. Intuitively, a small $KL(p || q)$ means that there are few points that $p$ assigns high probability to but that $q$ does not. We can also think of $KL(p || q)$ as the amount of information needed to update from the distribution $q$ to the distribution $p$ (measured in bits if the logarithm is taken base 2, and in nats if it is the natural logarithm).
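To make the asymmetry concrete, here is a minimal sketch for discrete distributions, where the integral becomes a sum. The helper name `kl` and the example vectors are assumptions chosen for illustration, and the natural logarithm is used, so the values are in nats.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as 1-D arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two illustrative distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.9, 0.05, 0.05])

kl_pq = kl(p, q)  # penalizes outcomes that p favors but q does not
kl_qp = kl(q, p)  # the reverse direction, generally a different number

print(kl_pq, kl_qp)
print(kl_pq / np.log(2))  # the same divergence expressed in bits
```

Here $p$ puts weight 0.4 on an outcome to which $q$ assigns only 0.05, which is why $KL(p || q)$ comes out larger than $KL(q || p)$ for this pair.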

Suppose that $p$ and $q$ are both mixtures of other distributions over the same index set: $p(x) = \sum_i \alpha_i F_i(x)$ and $q(x) = \sum_i \beta_i G_i(x)$. Can we bound $KL(p || q)$ in terms of the $KL(F_i || G_i)$? In some sense, this asks for an upper bound on the global KL divergence in terms of more local, component-wise KL divergences. It turns out this can be done:

Theorem: If $\alpha_i, \beta_i \geq 0$ with $\sum_i \alpha_i = \sum_i \beta_i = 1$, and $F_i$ and $G_i$ are all probability distributions, then

$KL\left(\sum_i \alpha_i F_i || \sum_i \beta_i G_i\right) \leq \sum_i \alpha_i \left(\log(\alpha_i/\beta_i) + KL(F_i || G_i)\right)$.
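As a sanity check rather than a proof, the bound can be tested numerically on randomly drawn discrete mixtures. This is a sketch under assumptions: the helper `kl`, the sizes `k` and `n`, and the use of Dirichlet samples to generate the weights and components are all illustrative choices, not part of the theorem.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as 1-D arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
k, n = 4, 6  # k mixture components over n outcomes (arbitrary sizes)

alpha = rng.dirichlet(np.ones(k))      # mixture weights for p
beta = rng.dirichlet(np.ones(k))       # mixture weights for q
F = rng.dirichlet(np.ones(n), size=k)  # row i is the component F_i
G = rng.dirichlet(np.ones(n), size=k)  # row i is the component G_i

# Left-hand side: KL between the two mixtures.
lhs = kl(alpha @ F, beta @ G)

# Right-hand side: weight term plus component-wise KL, averaged under alpha.
rhs = sum(alpha[i] * (np.log(alpha[i] / beta[i]) + kl(F[i], G[i]))
          for i in range(k))

print(lhs, rhs, lhs <= rhs)
```

Changing the seed or the sizes should leave `lhs <= rhs` true, up to floating-point error.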

Proof: If we expand the definition, and use $\int F_i(x) dx = 1$ to write $\alpha_i \log(\alpha_i/\beta_i) = \int \alpha_i F_i(x) \log(\alpha_i/\beta_i) dx$, then we are trying to prove that

$\int \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \alpha_i F_i(x)}{\sum_i \beta_i G_i(x)}\right) dx \leq \int \left(\sum_i \alpha_i F_i(x) \log\left(\frac{\alpha_i F_i(x)}{\beta_i G_i(x)}\right)\right) dx$

We will in fact show that the inequality between the two integrands holds for every value of $x$, so that it certainly holds for the integrals. Using $\log(a/b) = -\log(b/a)$, re-write the condition for a given value of $x$ as

$\left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right) \geq \sum_i \alpha_i F_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right)$

(Note that the sign of the inequality flipped because we replaced the two expressions with their negatives.) Now, this follows from Jensen's inequality applied to the concave function $\log$, with the $i$th point $\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}$ given weight $\frac{\alpha_i F_i(x)}{\sum_j \alpha_j F_j(x)}$:

$\sum_i \alpha_i F_i(x) \log\left(\frac{\beta_i G_i(x)}{\alpha_i F_i(x)}\right) \leq \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \frac{\beta_i G_i(x)}{\alpha_i F_i(x)} \alpha_i F_i(x)}{\sum_i \alpha_i F_i(x)}\right) = \left(\sum_i \alpha_i F_i(x)\right) \log\left(\frac{\sum_i \beta_i G_i(x)}{\sum_i \alpha_i F_i(x)}\right)$

This proves the inequality and therefore the theorem. $\square$
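The pointwise step in the proof only uses that the quantities $a_i = \alpha_i F_i(x)$ and $b_i = \beta_i G_i(x)$ are positive numbers, so it can be checked directly on random positive arrays. The sketch below is an illustration, not part of the proof; the names `a`, `b`, `gap` and the sampling range are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
trials, k = 1000, 5

# Each row plays the role of (alpha_i F_i(x))_i and (beta_i G_i(x))_i
# at one fixed point x; only positivity matters for this step.
a = rng.uniform(0.01, 1.0, size=(trials, k))
b = rng.uniform(0.01, 1.0, size=(trials, k))

# (sum_i a_i) log(sum_i a_i / sum_i b_i)  <=  sum_i a_i log(a_i / b_i)
lhs = a.sum(axis=1) * np.log(a.sum(axis=1) / b.sum(axis=1))
rhs = np.sum(a * np.log(a / b), axis=1)

gap = rhs - lhs  # should be nonnegative in every trial
holds = bool(np.all(gap >= -1e-12))
print(holds, gap.min())
```

Note that the rows of `a` and `b` are not normalized: the pointwise inequality needs no normalization, which is why the proof can work inside the integral.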

Remark: Intuitively, if we want to describe $\sum_i \alpha_i F_i$ in terms of $\sum_i \beta_i G_i$, it is enough to first locate the $i$th term in the sum and then to describe $F_i$ in terms of $G_i$. The theorem is a formalization of this intuition. In the case that $F_i = G_i$ for all $i$, it also says that the KL divergence between two different mixtures of the same set of distributions is at most the KL divergence between the mixture weights, $\sum_i \alpha_i \log(\alpha_i/\beta_i)$.
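The special case of shared components can be illustrated with a small sketch. The helper `kl` and the random Dirichlet components are assumptions for illustration. The last lines use point masses as components, where the two mixtures are exactly the weight vectors themselves, so the bound holds with equality.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as 1-D arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(2)
k, n = 3, 5

alpha = rng.dirichlet(np.ones(k))
beta = rng.dirichlet(np.ones(k))
F = rng.dirichlet(np.ones(n), size=k)  # shared components: G_i = F_i

mixture_kl = kl(alpha @ F, beta @ F)  # KL between the two mixtures
weights_kl = kl(alpha, beta)          # KL between the mixture weights
print(mixture_kl, weights_kl, mixture_kl <= weights_kl)

# Point-mass components: row i of the identity puts all mass on outcome i,
# so alpha @ E is just alpha and the bound is tight.
E = np.eye(k)
point_mass_kl = kl(alpha @ E, beta @ E)
print(np.isclose(point_mass_kl, weights_kl))
```

Overlapping components blur the difference between the two weight vectors, which is one way to read why the mixture divergence can only be smaller than the divergence between the weights.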

Jacob Steinhardt
