Gibbs' inequality


In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.

[Portrait of Josiah Willard Gibbs]

Gibbs' inequality

Suppose that $P = \{ p_1, \ldots, p_n \}$ and $Q = \{ q_1, \ldots, q_n \}$ are discrete probability distributions. Then

$$- \sum_{i=1}^n p_i \log_2 p_i \leq - \sum_{i=1}^n p_i \log_2 q_i$$

with equality if and only if $p_i = q_i$ for $i = 1, \ldots, n$.[1]: 68  Put in words, the information entropy of a distribution $P$ is less than or equal to its cross entropy with any other distribution $Q$.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]: 34

$$D_{\mathrm{KL}}(P \| Q) \equiv \sum_{i=1}^n p_i \log_2 \frac{p_i}{q_i} \geq 0.$$

Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
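As a concrete numerical illustration, the short Python sketch below computes the entropy, the cross entropy, and their difference (the Kullback–Leibler divergence) for two example distributions; the particular values of P and Q are arbitrary illustrative choices, not part of the statement.

    import math

    # Two example discrete probability distributions (arbitrary illustrative values).
    p = [0.5, 0.25, 0.25]
    q = [0.4, 0.4, 0.2]

    # Entropy of P and cross-entropy of P with Q, both in bits (base-2 logarithms).
    entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    # Gibbs' inequality: entropy <= cross-entropy; the gap is the KL divergence.
    kl = cross_entropy - entropy
    print(entropy, cross_entropy, kl)  # kl is non-negative, and zero only when P == Q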

Proof

For simplicity, we prove the statement using the natural logarithm, denoted by ln, since

$$\log_b a = \frac{\ln a}{\ln b},$$

so the particular logarithm base $b > 1$ that we choose only scales the relationship by the factor $1 / \ln b$.

Let $I$ denote the set of all $i$ for which $p_i$ is non-zero. Then, since $\ln x \leq x - 1$ for all $x > 0$, with equality if and only if $x = 1$, we have:

$$- \sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq - \sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right) = - \sum_{i \in I} q_i + \sum_{i \in I} p_i = - \sum_{i \in I} q_i + 1 \geq 0.$$

The last inequality is a consequence of the $p_i$ and $q_i$ being part of a probability distribution. Specifically, the sum of all the non-zero $p_i$ is 1. Some non-zero $q_i$, however, may have been excluded from the sum, since the choice of indices is conditioned upon the $p_i$ being non-zero. Therefore the sum of the $q_i$ may be less than 1.
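The elementary bound $\ln x \leq x - 1$ that drives this chain can be checked numerically; the Python sketch below evaluates it at a few sample points and confirms that the resulting sum is non-negative for a pair of arbitrary example distributions.

    import math

    # Elementary bound ln(x) <= x - 1 for x > 0, with equality only at x = 1.
    for x in [0.1, 0.5, 1.0, 2.0, 10.0]:
        assert math.log(x) <= (x - 1) + 1e-12

    # Applying the bound termwise gives -sum_{i in I} p_i * ln(q_i / p_i) >= 0,
    # where I is the set of indices with p_i > 0 (arbitrary illustrative distributions).
    p = [0.5, 0.25, 0.25]
    q = [0.4, 0.4, 0.2]
    total = -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)
    assert total >= 0
    print(total)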

So far, over the index set $I$, we have:

$$- \sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq 0,$$

or equivalently

$$- \sum_{i \in I} p_i \ln q_i \geq - \sum_{i \in I} p_i \ln p_i.$$

Both sums can be extended to all $i = 1, \ldots, n$, i.e. including $p_i = 0$, by recalling that the expression $p \ln p$ tends to 0 as $p$ tends to 0, and that $- \ln q$ tends to $+\infty$ as $q$ tends to 0. We arrive at

$$- \sum_{i=1}^n p_i \ln q_i \geq - \sum_{i=1}^n p_i \ln p_i.$$

For equality to hold, we require

  1. $\frac{q_i}{p_i} = 1$ for all $i \in I$, so that the equality $\ln \frac{q_i}{p_i} = \frac{q_i}{p_i} - 1$ holds,
  2. and $\sum_{i \in I} q_i = 1$, which means $q_i = 0$ if $i \notin I$, that is, $q_i = 0$ if $p_i = 0$.

This can happen if and only if $p_i = q_i$ for $i = 1, \ldots, n$.

Alternative proofs

The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback–Leibler divergence is a form of Bregman divergence.

Proof by Jensen's inequality

Because $\log$ is a concave function, we have that:

$$\sum_i p_i \log \frac{q_i}{p_i} \leq \log \sum_i p_i \frac{q_i}{p_i} = \log \sum_i q_i = 0,$$

where the first inequality is due to Jensen's inequality, and $Q$ being a probability distribution implies the last equality.

Furthermore, since $\log$ is strictly concave, by the equality condition of Jensen's inequality we get equality when

$$\frac{q_1}{p_1} = \frac{q_2}{p_2} = \cdots = \frac{q_n}{p_n}$$

and

$$\sum_i q_i = 1.$$

Suppose that this common ratio is $\sigma$; then we have that

$$1 = \sum_i q_i = \sum_i \sigma p_i = \sigma,$$

where we use the fact that $P$ and $Q$ are probability distributions. Therefore, the equality happens when $q_i = p_i$ for all $i$.
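The Jensen step can likewise be checked numerically; in the Python sketch below, the example distributions are arbitrary illustrative choices, and the right-hand side reduces to $\log 1 = 0$ because $Q$ sums to 1.

    import math

    # Arbitrary illustrative distributions P and Q.
    p = [0.2, 0.3, 0.5]
    q = [0.25, 0.25, 0.5]

    # Jensen's inequality for the concave log, applied to X = q_i / p_i with weights p_i:
    # E[log X] <= log E[X].  Here E[X] = sum_i q_i = 1, so the right-hand side is 0.
    lhs = sum(pi * math.log(qi / pi) for pi, qi in zip(p, q))
    rhs = math.log(sum(pi * (qi / pi) for pi, qi in zip(p, q)))
    assert lhs <= rhs <= 1e-12
    print(lhs, rhs)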

Proof by Bregman divergence

Alternatively, it can be proved by noting that

$$q - p - p \ln \frac{q}{p} \geq 0$$

for all $p, q > 0$, with equality holding iff $p = q$. Then, summing over the states and using that both $\sum_i p_i$ and $\sum_i q_i$ equal 1, we have

$$\sum_i \left( q_i - p_i - p_i \ln \frac{q_i}{p_i} \right) = \sum_i p_i \ln \frac{p_i}{q_i} = D_{\mathrm{KL}}(P \| Q) \geq 0,$$

with equality holding iff $P = Q$.

This is because the KL divergence is the Bregman divergence generated by the function $t \mapsto t \ln t$.
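The following Python sketch checks the pointwise inequality at a few sample values of $p$ and $q$, and confirms that the summed terms coincide with the Kullback–Leibler divergence for a pair of arbitrary example distributions.

    import math

    # Pointwise inequality q - p - p*ln(q/p) >= 0 for p, q > 0 (equality only when p == q).
    for p_val, q_val in [(0.1, 0.9), (0.5, 0.5), (0.7, 0.2)]:
        assert q_val - p_val - p_val * math.log(q_val / p_val) >= -1e-12

    # Summed over the states of two distributions, these terms add up to D_KL(P || Q),
    # because the p_i and q_i each sum to 1 (arbitrary illustrative distributions).
    p = [0.2, 0.3, 0.5]
    q = [0.25, 0.25, 0.5]
    bregman_sum = sum(qi - pi - pi * math.log(qi / pi) for pi, qi in zip(p, q))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    print(bregman_sum, kl)  # the two printed values agree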

Corollary

The entropy of $P$ is bounded by:[1]: 68

$$H(p_1, \ldots, p_n) \leq \log_2 n.$$

The proof is trivial – simply set $q_i = 1/n$ for all $i$.
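A quick numerical check of the corollary, using an arbitrary example distribution over four outcomes:

    import math

    # Corollary: H(p_1, ..., p_n) <= log2(n), with equality for the uniform distribution.
    p = [0.1, 0.2, 0.3, 0.4]           # arbitrary illustrative distribution, n = 4
    entropy = -sum(pi * math.log2(pi) for pi in p)
    assert entropy <= math.log2(len(p))
    print(entropy, math.log2(len(p)))  # ~1.846 bits versus the bound of 2 bits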

See also

References

  1. Pierre Brémaud (6 December 2012). An Introduction to Probabilistic Modeling. Springer Science & Business Media. ISBN 978-1-4612-1046-7.
  2. David J. C. MacKay (25 September 2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.