
Exponential Families

In my last post I discussed log-linear models. In this post I'd like to take another perspective on log-linear models, by thinking of them as members of an exponential family. There are many reasons to take this perspective: exponential families give us efficient representations of log-linear models, which is important for continuous domains; they always have conjugate priors, which provide an analytically tractable regularization method; finally, they can be viewed as maximum-entropy models for a given set of sufficient statistics. Don't worry if these terms are unfamiliar; I will explain all of them by the end of this post. Also note that most of this material is available on the Wikipedia page on exponential families, which I used quite liberally in preparing the below exposition.

1. Exponential Families

An exponential family is a family of probability distributions, parameterized by $\theta \in \mathbb{R}^n$, of the form

$$p(x \mid \theta) \propto h(x)\exp\left(\theta^T\phi(x)\right). \qquad (1)$$

Notice the similarity to the definition of a log-linear model, which is

$$p(x \mid \theta) \propto \exp\left(\theta^T\phi(x)\right). \qquad (2)$$

So, a log-linear model is simply an exponential family model with $h(x) = 1$. Note that we can re-write the right-hand side of (1) as $\exp\left(\theta^T\phi(x) + \log h(x)\right)$, so an exponential family is really just a log-linear model with one extra coordinate of $\theta$ (the one multiplying the feature $\log h(x)$) constrained to equal $1$. Also note that the normalization constant in (1) is a function of $\theta$ (since $\theta$ fully specifies the distribution over $x$), so we can express (1) more explicitly as

$$p(x \mid \theta) = h(x)\exp\left(\theta^T\phi(x) - A(\theta)\right), \qquad (3)$$

where

$$A(\theta) = \log\left(\int h(x)\exp\left(\theta^T\phi(x)\right)\,dx\right). \qquad (4)$$
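
To make equations (3) and (4) concrete, here is a minimal numerical sketch (my own illustration, not part of the original derivation) for a finite domain, where the integral in (4) becomes a sum. The feature matrix `phi`, base measure `h`, and parameters `theta` below are made up purely for illustration:

```python
import numpy as np
from scipy.special import logsumexp

# Made-up finite domain with four outcomes and two features per outcome.
phi = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [1.0, 1.0],
                [2.0, 1.0]])          # phi(x) for each x, shape (|X|, n)
h = np.array([1.0, 1.0, 0.5, 0.25])   # base measure h(x), chosen arbitrarily
theta = np.array([0.3, -0.7])         # natural parameters

# Equation (4) with the integral replaced by a sum over the finite domain:
# A(theta) = log sum_x h(x) exp(theta^T phi(x)), computed stably via logsumexp.
A = logsumexp(phi @ theta + np.log(h))

# Equation (3): p(x | theta) = h(x) exp(theta^T phi(x) - A(theta)).
p = h * np.exp(phi @ theta - A)
print(p, p.sum())  # the probabilities sum to 1
```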

Exponential families are capable of capturing almost all of the common distributions you are familiar with. There is an extensive table on Wikipedia; I've also included some of the most common below:

  1. Gaussian distributions. Let $\phi(x) = \left[x \ \ x^2\right]$. Then $p(x \mid \theta) \propto \exp\left(\theta_1 x + \theta_2 x^2\right)$. If we let $\theta = \left[\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right]$, then $p(x \mid \theta) \propto \exp\left(\frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$. We therefore see that Gaussian distributions are an exponential family for $\phi(x) = \left[x \ \ x^2\right]$ (a quick numerical check of this parameter mapping appears after this list).
  2. Poisson distributions. Let $\phi(x) = [x]$ and $h(x) = \begin{cases} \frac{1}{x!} & : x \in \{0,1,2,\ldots\} \\ 0 & : \text{else} \end{cases}$. Then $p(k \mid \theta) \propto \frac{1}{k!}\exp(\theta_1 k)$. If we let $\theta_1 = \log(\lambda)$ then we get $p(k \mid \theta) \propto \frac{\lambda^k}{k!}$; we thus see that Poisson distributions are also an exponential family.
  3. Multinomial distributions. Suppose that $\mathcal{X} = \{1,2,\ldots,n\}$. Let $\phi(k)$ be an $n$-dimensional vector whose $k$th element is $1$ and where all other elements are zero. Then $p(k \mid \theta) \propto \exp(\theta_k)$, i.e. $p(k \mid \theta) = \frac{\exp(\theta_k)}{\sum_{k'=1}^n \exp(\theta_{k'})}$. If $\theta_k = \log P(x = k)$, then we obtain an arbitrary multinomial distribution. Therefore, multinomial distributions are also an exponential family.
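
As a sanity check on the Gaussian example (item 1 above), the following sketch compares the exponential-family density with natural parameters $\left[\mu/\sigma^2, -1/2\sigma^2\right]$ against the usual $\mathcal{N}(\mu, \sigma^2)$ density on a grid; the particular values of `mu` and `sigma` are arbitrary:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 0.8
theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])  # natural parameters

# Unnormalized density exp(theta_1 x + theta_2 x^2) on a grid, normalized
# numerically, then compared against the standard Gaussian pdf.
x = np.linspace(-4.0, 7.0, 2001)
unnorm = np.exp(theta[0] * x + theta[1] * x**2)
p_expfam = unnorm / (unnorm.sum() * (x[1] - x[0]))

print(np.max(np.abs(p_expfam - norm.pdf(x, loc=mu, scale=sigma))))  # tiny discretization error
```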

2. Sufficient Statistics

A statistic of a random variable $X$ is any deterministic function of that variable. For instance, if $X = [X_1, \ldots, X_n]^T$ is a vector of Gaussian random variables, then the sample mean $\hat{\mu} := (X_1 + \cdots + X_n)/n$ and sample variance $\hat{\sigma}^2 := (X_1^2 + \cdots + X_n^2)/n - (X_1 + \cdots + X_n)^2/n^2$ are both statistics.

Let $\mathcal{F}$ be a family of distributions parameterized by $\theta$, and let $X$ be a random variable with distribution given by some unknown $\theta_0$. Then a vector $T(X)$ of statistics is called a sufficient statistic for $\theta_0$ if it contains all possible information about $\theta_0$; that is, for any function $f$, we have

$$\mathbb{E}\left[f(X) \mid T(X) = T_0, \theta = \theta_0\right] = S(f, T_0), \qquad (5)$$

for some function $S$ that has no dependence on $\theta_0$.

For instance, let $X$ be a vector of $n$ independent Gaussian random variables $X_1, \ldots, X_n$ with unknown mean $\mu$ and variance $\sigma^2$. It turns out that $T(X) := [\hat{\mu}, \hat{\sigma}^2]$ is a sufficient statistic for $\mu$ and $\sigma^2$. This is not immediately obvious; a very useful tool for determining whether statistics are sufficient is the Fisher-Neyman factorization theorem:

Theorem 1 (Fisher-Neyman) Suppose that $X$ has a probability density function $p(X \mid \theta)$. Then the statistics $T(X)$ are sufficient for $\theta$ if and only if $p(X \mid \theta)$ can be written in the form

$$p(X \mid \theta) = h(X)g_{\theta}(T(X)). \qquad (6)$$

In other words, the probability of $X$ can be factored into a part that does not depend on $\theta$, and a part that depends on $\theta$ only via $T(X)$.

What is going on here, intuitively? If $p(X \mid \theta)$ depended only on $T(X)$, then $T(X)$ would definitely be a sufficient statistic. But that isn't the only way for $T(X)$ to be a sufficient statistic: $p(X \mid \theta)$ could also just not depend on $\theta$ at all, in which case $T(X)$ would trivially be a sufficient statistic (as would anything else). The Fisher-Neyman theorem essentially says that the only way for $T(X)$ to be a sufficient statistic is if the density is a product of these two cases.

Proof: If (6) holds, then we can check that (5) is satisfied:

$$\begin{aligned}
\mathbb{E}\left[f(X) \mid T(X) = T_0, \theta = \theta_0\right] &= \frac{\int_{T(X) = T_0} f(X)\,dp(X \mid \theta = \theta_0)}{\int_{T(X) = T_0} dp(X \mid \theta = \theta_0)} \\
&= \frac{\int_{T(X) = T_0} f(X)h(X)g_{\theta_0}(T_0)\,dX}{\int_{T(X) = T_0} h(X)g_{\theta_0}(T_0)\,dX} \\
&= \frac{\int_{T(X) = T_0} f(X)h(X)\,dX}{\int_{T(X) = T_0} h(X)\,dX},
\end{aligned}$$

where the right-hand side has no dependence on $\theta$.

On the other hand, if we compute $\mathbb{E}\left[f(X) \mid T(X) = T_0, \theta = \theta_0\right]$ for an arbitrary density $p(X)$, we get

$$\mathbb{E}\left[f(X) \mid T(X) = T_0, \theta = \theta_0\right] = \int_{T(X) = T_0} f(X)\frac{p(X \mid \theta = \theta_0)}{\int_{T(X) = T_0} p(X \mid \theta = \theta_0)\,dX}\,dX.$$

If the right-hand side cannot depend on $\theta$ for any choice of $f$, then the term that we multiply $f$ by must not depend on $\theta$; that is, $\frac{p(X \mid \theta = \theta_0)}{\int_{T(X) = T_0} p(X \mid \theta = \theta_0)\,dX}$ must be some function $h_0(X, T_0)$ that depends only on $X$ and $T_0$ and not on $\theta$. On the other hand, the denominator $\int_{T(X) = T_0} p(X \mid \theta = \theta_0)\,dX$ depends only on $\theta_0$ and $T_0$; call this dependence $g_{\theta_0}(T_0)$. Finally, note that $T_0$ is a deterministic function of $X$, so let $h(X) := h_0(X, T(X))$. We then see that $p(X \mid \theta = \theta_0) = h_0(X, T_0)g_{\theta_0}(T_0) = h(X)g_{\theta_0}(T(X))$, which is the same form as (6), thus completing the proof of the theorem.

Now, let us apply the Fisher-Neyman theorem to exponential families. By definition, the density for an exponential family factors as

$$p(x \mid \theta) = h(x)\exp\left(\theta^T\phi(x) - A(\theta)\right).$$

If we let $T(x) = \phi(x)$ and $g_{\theta}(\phi(x)) = \exp\left(\theta^T\phi(x) - A(\theta)\right)$, then the Fisher-Neyman condition is met; therefore, $\phi(x)$ is a vector of sufficient statistics for the exponential family. In fact, we can go further:

Theorem 2 Let $X_1, \ldots, X_n$ be drawn independently from an exponential family distribution with fixed parameter $\theta$. Then the empirical expectation $\hat{\phi} := \frac{1}{n}\sum_{i=1}^n \phi(X_i)$ is a sufficient statistic for $\theta$.

Proof: The density for $X_1, \ldots, X_n$ given $\theta$ is

$$\begin{aligned}
p(X_1, \ldots, X_n \mid \theta) &= h(X_1)\cdots h(X_n)\exp\left(\theta^T\sum_{i=1}^n \phi(X_i) - nA(\theta)\right) \\
&= h(X_1)\cdots h(X_n)\exp\left(n\left[\theta^T\hat{\phi} - A(\theta)\right]\right).
\end{aligned}$$

Letting $h(X_1, \ldots, X_n) = h(X_1)\cdots h(X_n)$ and $g_{\theta}(\hat{\phi}) = \exp\left(n\left[\theta^T\hat{\phi} - A(\theta)\right]\right)$, we see that the Fisher-Neyman conditions are satisfied, so that $\hat{\phi}$ is indeed a sufficient statistic.

Finally, we note (without proof) the same relationship as in the log-linear case between the model parameters and the gradient and Hessian of the log-likelihood $\log p(X_1, \ldots, X_n \mid \theta)$:

Theorem 3 Again let $X_1, \ldots, X_n$ be drawn from an exponential family distribution with parameter $\theta$. Then the gradient of $\log p(X_1, \ldots, X_n \mid \theta)$ with respect to $\theta$ is

$$n \times \left(\hat{\phi} - \mathbb{E}[\phi \mid \theta]\right)$$

and the Hessian is

$$n \times \left(\mathbb{E}[\phi \mid \theta]\mathbb{E}[\phi \mid \theta]^T - \mathbb{E}\left[\phi\phi^T \mid \theta\right]\right).$$

This theorem provides an efficient algorithm for fitting the parameters of an exponential family distribution (for details on the algorithm, see the part near the end of the log-linear models post on parameter estimation).
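
As an illustration (not the exact procedure from the log-linear post), here is a minimal sketch of such a fitting loop on a made-up finite domain, where $\mathbb{E}[\phi \mid \theta]$ can be computed exactly by enumeration and the per-sample gradient from Theorem 3, $\hat{\phi} - \mathbb{E}[\phi \mid \theta]$, drives a simple gradient ascent:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Hypothetical finite domain with hand-picked features; h(x) = 1 (the log-linear case).
phi = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])  # shape (|X|, n)
theta_true = np.array([0.8, -1.2])

def probs(theta):
    # p(x | theta) = exp(theta^T phi(x) - A(theta)) over the finite domain.
    logits = phi @ theta
    return np.exp(logits - logsumexp(logits))

# Draw samples from the true distribution and compute the empirical statistics phi_hat.
samples = rng.choice(len(phi), size=5000, p=probs(theta_true))
phi_hat = phi[samples].mean(axis=0)

# Gradient ascent on the log-likelihood: the per-sample gradient is phi_hat - E[phi | theta].
theta = np.zeros(2)
for _ in range(2000):
    expected_phi = probs(theta) @ phi        # E[phi | theta]
    theta += 0.5 * (phi_hat - expected_phi)

print(theta, theta_true)  # the recovered parameters should be close to theta_true
```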

3. Moments of an Exponential Family

If $X$ is a real-valued random variable, then the $p$th moment of $X$ is $\mathbb{E}[X^p]$. In general, if $X = [X_1, \ldots, X_n]^T$ is a random variable on $\mathbb{R}^n$, then for every sequence $p_1, \ldots, p_n$ of non-negative integers, there is a corresponding moment $M_{p_1,\ldots,p_n} := \mathbb{E}\left[X_1^{p_1}\cdots X_n^{p_n}\right]$.

In exponential families there is a very nice relationship between the normalization constant $A(\theta)$ and the moments of $X$. Before we establish this relationship, let us define the moment generating function of a random variable $X$ as $f(\lambda) = \mathbb{E}\left[\exp(\lambda^T X)\right]$.

Lemma 4 The moment generating function for a random variable $X$ is equal to

$$\sum_{p_1,\ldots,p_n=0}^{\infty} M_{p_1,\ldots,p_n}\frac{\lambda_1^{p_1}\cdots\lambda_n^{p_n}}{p_1!\cdots p_n!}.$$

The proof of Lemma 4 is a straightforward application of Taylor's theorem, together with linearity of expectation (note that in one dimension, the expression in Lemma 4 would just be $\sum_{p=0}^{\infty}\mathbb{E}[X^p]\frac{\lambda^p}{p!}$).

We now see why $f(\lambda)$ is called the moment generating function: it is the exponential generating function for the moments of $X$. The moment generating function for the sufficient statistics of an exponential family is particularly easy to compute:

Lemma 5 If $p(x \mid \theta) = h(x)\exp\left(\theta^T\phi(x) - A(\theta)\right)$, then $\mathbb{E}\left[\exp\left(\lambda^T\phi(x)\right)\right] = \exp\left(A(\theta + \lambda) - A(\theta)\right)$.

Proof:

$$\begin{aligned}
\mathbb{E}\left[\exp\left(\lambda^T\phi(x)\right)\right] &= \int \exp\left(\lambda^T\phi(x)\right)p(x \mid \theta)\,dx \\
&= \int \exp\left(\lambda^T\phi(x)\right)h(x)\exp\left(\theta^T\phi(x) - A(\theta)\right)\,dx \\
&= \int h(x)\exp\left((\theta+\lambda)^T\phi(x) - A(\theta)\right)\,dx \\
&= \int h(x)\exp\left((\theta+\lambda)^T\phi(x) - A(\theta+\lambda)\right)\,dx \times \exp\left(A(\theta+\lambda) - A(\theta)\right) \\
&= \int p(x \mid \theta + \lambda)\,dx \times \exp\left(A(\theta+\lambda) - A(\theta)\right) \\
&= \exp\left(A(\theta+\lambda) - A(\theta)\right),
\end{aligned}$$

where the last step uses the fact that $p(x \mid \theta + \lambda)$ is a probability density and hence $\int p(x \mid \theta + \lambda)\,dx = 1$.
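
As a quick numerical illustration of Lemma 5 (my own check, not from the original post), take the Poisson family with $\phi(k) = k$, $h(k) = 1/k!$, $\theta = \log\lambda$, and $A(\theta) = e^{\theta}$; the variable `t` below plays the role of $\lambda$ in the lemma (renamed to avoid clashing with the Poisson rate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Poisson with rate `lam`; natural parameter theta = log(lam), log-normalizer A(theta) = exp(theta).
lam = 2.0
theta = np.log(lam)
A = np.exp

# Monte Carlo estimate of E[exp(t * phi(k))] with phi(k) = k, versus Lemma 5.
t = 0.3
samples = rng.poisson(lam, size=200_000)
mc_estimate = np.mean(np.exp(t * samples))
closed_form = np.exp(A(theta + t) - A(theta))
print(mc_estimate, closed_form)  # should agree to a couple of decimal places
```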

Now, by Lemma 4, $M_{p_1,\ldots,p_n}$ is just the $(p_1,\ldots,p_n)$ coefficient in the Taylor series for the moment generating function $f(\lambda)$, and hence we can compute $M_{p_1,\ldots,p_n}$ as $\frac{\partial^{p_1+\cdots+p_n} f(\lambda)}{\partial^{p_1}\lambda_1\cdots\partial^{p_n}\lambda_n}\Big|_{\lambda=0}$. Combining this with Lemma 5 gives us a closed-form expression for $M_{p_1,\ldots,p_n}$ in terms of the normalization constant $A(\theta)$:

Lemma 6 The moments of an exponential family can be computed as

$$M_{p_1,\ldots,p_n} = \frac{\partial^{p_1+\cdots+p_n}\exp\left(A(\theta+\lambda) - A(\theta)\right)}{\partial^{p_1}\lambda_1\cdots\partial^{p_n}\lambda_n}\Bigg|_{\lambda=0}.$$

For those who prefer cumulants to moments, I will note that there is a version of Lemma 6 for cumulants with an even simpler formula.

Exercise: Use Lemma 6 to compute $\mathbb{E}[X^6]$, where $X$ is a Gaussian with mean $\mu$ and variance $\sigma^2$.
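
For comparison, here is a symbolic sketch of the exercise using sympy (my own addition, assuming you are happy to let a computer do the differentiation): with $\phi(x) = [x \ \ x^2]$ the log-normalizer is $A(\theta) = -\theta_1^2/(4\theta_2) + \frac{1}{2}\log\left(\pi/(-\theta_2)\right)$, and perturbing only the first coordinate gives the moments of $X$ itself.

```python
import sympy as sp

mu, sigma = sp.symbols('mu sigma', positive=True)
lam = sp.Symbol('lambda')

# Natural parameters of N(mu, sigma^2) with phi(x) = [x, x^2].
theta1 = mu / sigma**2
theta2 = -1 / (2 * sigma**2)

def A(t1, t2):
    # Log-normalizer: integral of exp(t1*x + t2*x^2) dx = sqrt(pi/(-t2)) * exp(-t1^2/(4*t2)).
    return -t1**2 / (4 * t2) + sp.Rational(1, 2) * sp.log(sp.pi / (-t2))

# Lemma 6 with lambda perturbing only the first natural parameter:
# the moment generating function of X is exp(A(theta + lam*e1) - A(theta)).
mgf = sp.exp(A(theta1 + lam, theta2) - A(theta1, theta2))

# Sixth moment: differentiate six times in lam and evaluate at lam = 0.
sixth_moment = sp.expand(sp.diff(mgf, lam, 6).subs(lam, 0))
print(sixth_moment)  # mu**6 + 15*mu**4*sigma**2 + 45*mu**2*sigma**4 + 15*sigma**6
```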

4. Conjugate Priors

Given a family of distributions $p(X \mid \theta)$, a conjugate prior family $p(\theta \mid \alpha)$ is a family that has the property that

$$p(\theta \mid X, \alpha) = p(\theta \mid \alpha')$$

for some $\alpha'$ depending on $\alpha$ and $X$. In other words, if the prior over $\theta$ lies in the conjugate family, and we observe $X$, then the posterior over $\theta$ also lies in the conjugate family. This is very useful algebraically, as it means that we can get our posterior simply by updating the parameters of the prior. The following are examples of conjugate families:

Gaussian-Gaussian

Let $p(X \mid \mu) \propto \exp\left(-(X-\mu)^2/2\right)$, and let $p(\mu \mid \mu_0, \sigma_0) \propto \exp\left(-(\mu - \mu_0)^2/2\sigma_0^2\right)$. Then, by Bayes' rule,

$$\begin{aligned}
p(\mu \mid X = x, \mu_0, \sigma_0) &\propto \exp\left(-(x-\mu)^2/2\right)\exp\left(-(\mu-\mu_0)^2/2\sigma_0^2\right) \\
&= \exp\left(-\frac{(\mu-\mu_0)^2 + \sigma_0^2(\mu - x)^2}{2\sigma_0^2}\right) \\
&\propto \exp\left(-\frac{(1+\sigma_0^2)\mu^2 - 2(\mu_0 + \sigma_0^2 x)\mu}{2\sigma_0^2}\right) \\
&\propto \exp\left(-\frac{\mu^2 - 2\frac{\mu_0 + x\sigma_0^2}{1+\sigma_0^2}\mu}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\
&\propto \exp\left(-\frac{\left(\mu - \frac{\mu_0 + x\sigma_0^2}{1+\sigma_0^2}\right)^2}{2\sigma_0^2/(1+\sigma_0^2)}\right) \\
&\propto p\left(\mu \,\middle|\, \frac{\mu_0 + x\sigma_0^2}{1+\sigma_0^2}, \frac{\sigma_0}{\sqrt{1+\sigma_0^2}}\right).
\end{aligned}$$

Therefore, $\mu_0, \sigma_0$ parameterize a family of priors over $\mu$ that is conjugate to $X \mid \mu$.
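
Here is a small numerical check of this Gaussian-Gaussian update (my own sketch; the particular numbers are arbitrary): the posterior computed on a grid via Bayes' rule matches the closed-form parameters derived above.

```python
import numpy as np

# Unit-variance likelihood, prior N(mu0, sigma0^2), one observation x_obs.
mu0, sigma0, x_obs = 0.5, 2.0, 3.0
mu = np.linspace(-10, 10, 4001)

# Unnormalized posterior from Bayes' rule, normalized numerically on the grid.
post = np.exp(-(x_obs - mu)**2 / 2) * np.exp(-(mu - mu0)**2 / (2 * sigma0**2))
post /= post.sum() * (mu[1] - mu[0])

# Closed-form posterior parameters from the derivation above.
mu_post = (mu0 + x_obs * sigma0**2) / (1 + sigma0**2)
var_post = sigma0**2 / (1 + sigma0**2)
closed = np.exp(-(mu - mu_post)**2 / (2 * var_post)) / np.sqrt(2 * np.pi * var_post)

print(np.max(np.abs(post - closed)))  # should be near zero
```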

Beta-Bernoulli

Let $X \in \{0,1\}$, $\theta \in [0,1]$, $p(X = 1 \mid \theta) = \theta$, and $p(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. The distribution over $X$ given $\theta$ is then called a Bernoulli distribution, and that of $\theta$ given $\alpha$ and $\beta$ is called a beta distribution. Note that $p(X \mid \theta)$ can also be written as $\theta^X(1-\theta)^{1-X}$. From this, we see that the family of beta distributions is a conjugate prior to the family of Bernoulli distributions, since

$$\begin{aligned}
p(\theta \mid X = x, \alpha, \beta) &\propto \theta^x(1-\theta)^{1-x} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} \\
&= \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1} \\
&\propto p(\theta \mid \alpha + x, \beta + (1-x)).
\end{aligned}$$

Gamma-Poisson

Let $p(X = k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \in \mathbb{Z}_{\geq 0}$. Let $p(\lambda \mid \alpha, \beta) \propto \lambda^{\alpha-1}\exp(-\beta\lambda)$. As noted before, the distribution for $X$ given $\lambda$ is called a Poisson distribution; the distribution for $\lambda$ given $\alpha$ and $\beta$ is called a gamma distribution. We can check that the family of gamma distributions is conjugate to the family of Poisson distributions. Important note: unlike in the last two examples, the normalization constant for the Poisson distribution actually depends on $\lambda$, and so we need to include it in our calculations:

$$\begin{aligned}
p(\lambda \mid X = k, \alpha, \beta) &\propto \frac{\lambda^k e^{-\lambda}}{k!} \times \lambda^{\alpha-1}\exp(-\beta\lambda) \\
&\propto \lambda^{\alpha+k-1}\exp(-(\beta+1)\lambda) \\
&\propto p(\lambda \mid \alpha + k, \beta + 1).
\end{aligned}$$

Note that, in general, a family of distributions will always have some conjugate family, as, if nothing else, the family of all probability distributions over $\theta$ will be a conjugate family. What we really care about is a conjugate family that itself has nice properties, such as tractably computable moments.

Conjugate priors have a very nice relationship to exponential families, established in the following theorem:

Theorem 7 Let $p(x \mid \theta) = h(x)\exp\left(\theta^T\phi(x) - A(\theta)\right)$ be an exponential family. Then $p(\theta \mid \eta, \kappa) \propto h_2(\theta)\exp\left(\eta^T\theta - \kappa A(\theta)\right)$ is a conjugate prior for $x \mid \theta$ for any choice of $h_2$. The update formula is $p(\theta \mid x, \eta, \kappa) = p(\theta \mid \eta + \phi(x), \kappa + 1)$. Furthermore, $\theta \mid \eta, \kappa$ is itself an exponential family, with sufficient statistics $[\theta; A(\theta)]$.

Checking the theorem is a matter of straightforward algebra, so I will leave the proof as an exercise to the reader. Note that, as before, there is no guarantee that $p(\theta \mid \eta, \kappa)$ will be tractable; however, in many cases the conjugate prior given by Theorem 7 is a well-behaved family. See the Wikipedia page on conjugate priors for examples, many of which correspond to exponential family distributions.
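
To make Theorem 7 a little more concrete, here is a numerical sketch (with $h_2 = 1$ and made-up numbers) for the Bernoulli family written in exponential-family form, with $\phi(x) = x$ and $A(\theta) = \log(1 + e^{\theta})$; it checks on a grid that Bayes' rule reproduces the $(\eta + \phi(x), \kappa + 1)$ update:

```python
import numpy as np

def A(theta):
    # Log-normalizer of the Bernoulli family with phi(x) = x.
    return np.log1p(np.exp(theta))

def prior_unnorm(theta, eta, kappa):
    # Conjugate prior from Theorem 7 with h2 = 1: exp(eta*theta - kappa*A(theta)).
    return np.exp(eta * theta - kappa * A(theta))

eta, kappa, x_obs = 1.0, 3.0, 1   # made-up prior parameters and observation
theta = np.linspace(-15, 15, 6001)
dt = theta[1] - theta[0]

# Posterior by Bayes' rule: prior times the Bernoulli likelihood exp(theta*x - A(theta)).
posterior = prior_unnorm(theta, eta, kappa) * np.exp(theta * x_obs - A(theta))
posterior /= posterior.sum() * dt

# Theorem 7's update: the same family with parameters (eta + x, kappa + 1).
updated = prior_unnorm(theta, eta + x_obs, kappa + 1)
updated /= updated.sum() * dt

print(np.max(np.abs(posterior - updated)))  # essentially zero: the two densities coincide
```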

5. Maximum Entropy and Duality

The final property of exponential families I would like to establish is a certain duality property. What I mean by this is that exponential families can be thought of as the maximum entropy distributions subject to a constraint on the expected value of their sufficient statistics. For those unfamiliar with the term, the entropy of a distribution over $X$ with density $p(X)$ is $\mathbb{E}[-\log p(X)] := -\int p(x)\log(p(x))\,dx$. Intuitively, higher entropy corresponds to higher uncertainty, so a maximum entropy distribution is one specifying as much uncertainty as possible given a certain set of information (such as the values of various moments). This makes them appealing, at least in theory, from a modeling perspective, since they "encode exactly as much information as is given and no more". (Caveat: this intuition isn't entirely valid, and in practice maximum-entropy distributions aren't always necessarily appropriate.)

In any case, the duality property is captured in the following theorem:

Theorem 8 The distribution over $X$ with maximum entropy such that $\mathbb{E}[\phi(X)] = T$ lies in the exponential family with sufficient statistic $\phi(X)$ and $h(X) = 1$.

Proving this fully rigorously requires the calculus of variations; I will instead give the "physicist's proof". Proof: Let $p(X)$ be the density for $X$. Then we can view $p$ as the solution to the constrained maximization problem:

$$\begin{aligned}
\text{maximize} \quad & -\int p(X)\log p(X)\,dX \\
\text{subject to} \quad & \int p(X)\,dX = 1 \\
& \int p(X)\phi(X)\,dX = T.
\end{aligned}$$

By the method of Lagrange multipliers, there exist $\alpha$ and $\lambda$ such that

$$\frac{d}{dp}\left(-\int p(X)\log p(X)\,dX - \alpha\left[\int p(X)\,dX - 1\right] - \lambda^T\left[\int \phi(X)p(X)\,dX - T\right]\right) = 0.$$

This simplifies to:

$$-\log p(X) - 1 - \alpha - \lambda^T\phi(X) = 0,$$

which implies

$$p(X) = \exp\left(-1 - \alpha - \lambda^T\phi(X)\right)$$

for some $\alpha$ and $\lambda$. In particular, if we let $\lambda = -\theta$ and $\alpha = A(\theta) - 1$, then we recover the exponential family with $h(X) = 1$, as claimed.
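
To see Theorem 8 in action numerically (my own sketch, using scipy's generic constrained optimizer rather than anything from the post), the code below maximizes entropy over the finite support $\{1,\ldots,6\}$ subject to $\mathbb{E}[X] = 4.5$, and then checks that the solution has the exponential-family form $p(x) \propto \exp(\theta x)$, i.e. constant ratios between successive probabilities:

```python
import numpy as np
from scipy.optimize import minimize

# Maximum-entropy distribution on {1,...,6} with E[X] = 4.5 (the classic loaded-die
# example); the optimizer works directly on the probability vector.
x = np.arange(1, 7)
target_mean = 4.5

def neg_entropy(p):
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p @ x - target_mean},
]
result = minimize(neg_entropy, x0=np.full(6, 1/6), bounds=[(1e-9, 1)] * 6,
                  constraints=constraints, method="SLSQP")
p = result.x

# Theorem 8 predicts p(x) proportional to exp(theta*x), so the ratios p(x+1)/p(x)
# should all be (approximately) equal to exp(theta).
print(p)
print(p[1:] / p[:-1])  # roughly constant ratios, up to optimizer tolerance
```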

6. Conclusion

Hopefully I have by now convinced you that exponential families have many nice properties: they have conjugate priors, simple-to-fit parameters, and easily-computed moments. While exponential families aren't always appropriate models for a given situation, their tractability makes them the model of choice when no other information is present; and, since they can be obtained as maximum-entropy families, they are actually appropriate models in a wide range of circumstances.

Jacob Steinhardt
