Information Theory

Basics

字数统计: 2.1k阅读时长: 9 min

 2020/02/17   Share

Materials

Divergence

散度divergence $d(p,q)$通常被用来表征”surprise”的程度：你相信一个分布为$q$，经过试验却发现它实际上是$p$的惊讶程度。为了引出KL-Divergence，我们首先定义概率单纯型(Probability Simplex).

Definition (Probability Simplex). For any finite alphabet $\mathcal{X}$ with $|\mathcal{X}|=m$, the probability simplex $\mathcal P^{\mathcal X}$ is the set

$\mathcal{P}^{\mathcal X} \triangleq\left\{q \in \mathbb{R}^{m} | \sum_{i} q_{i}=1, q_{i} \geq 0 \quad \forall i\right\}$

注意：Probability Simplex是一个维度为$m-1$的空间，因为这个空间上的向量需要满足$\sum_{i} q_{i}=1$，从而导致其少了一个自由度。

Definition (Kullback-Leibler Divergence). For $p, q \in \mathcal{P}^{\mathcal X}$, the Kullback-Leibler (KL) divergence of $q$ from $p$ is

$d(p, q) \triangleq \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$

where $\log$ represents $\ln$, and we are using the conventions that $\log \infty=\infty$, $\log 0=-\infty$, $0/0=0$, and $0\times \infty=0$.

注意：同entropy不同，KL divergence不仅仅局限于离散型变量，对于连续型变量也是适用的：

$D_{K L}(p(x) \| q(x))=\int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} d x$

以上也是为什么作者会从KL divergence开始介绍Information Theory，因为KL divergence是一个更加广泛的定义。在实际计算两个分布的的KL divergence时，因为采样会不完全，所以需要用到以上数值的习惯性定义，或者使用function approximation来拟合观察到的分布；另一种做法是在计算的时候使用平滑策略：为两个变量missing $x$值赋予一个小的频率常数$\epsilon$，这部分多出来的频率由其他采样过的$x$支付，这样就保证了所有的频率都是非零的。

Theorem (Gibbs’ Inequality). For all $p, q \in \mathcal{P}^{\mathcal{X}}$ , it holds that $d(p, q) \geq 0$, with equality iff $p = q$.

证明这个定理的思路：

等价于对于任意的真实分布$p$，$\min d(p,q) = 0$，且等号成立条件为$q=p$.
使用拉格朗日乘数法计算梯度为0的点即可.

注意：这条定理告诉我们为什么KL divergence和距离的定义非常相似，因为它满足了距离中的正则性。而同时我们需要注意的是它并不满足距离的对称或是三角不等式性质。

在机器学习中，我们经常需要做极大似然估计：

$\theta^{*}=\underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{n} p_{X}^{\theta}\left(x_{i}\right)$

其中，$p_{X}^{\theta}$表示的是由参数$\theta$决定的在随机变量$X$上的概率分布。而其实我们可以证明，极大似然估计的等价就是极小KL divergence函数的优化：

$\theta^{*} = \arg\min_\theta d(\hat{p}_X,p^\theta_X)$

证明：Maximum Likelihood Estimation = KL Divergence Minimization. ($P_G$为估计的分布，$P_\text{data}$为真实分布)

$\begin{aligned} \theta^{*} &=\arg \max _{\theta} \prod_{i=1}^{m} P_{G}\left(x^{i} ; \theta\right) \\ &=\arg \max _{\theta} \log \prod_{i=1}^{m} P_{G}\left(x^{i} ; \theta\right) \\ &=\arg \max _{\theta} \sum_{i=1}^{m} \log P_{G} \\ & \approx \arg \max _{\theta} E_{x \sim P_{\text {data }}}\left[\log P_{G}(x ; \theta)\right] \\ &= \arg \max _{\theta} \int_{x} P_{\text {data }}(x) \log P_{G}(x ; \theta) d x-\int_{x} P_{\text {data }}(x) \log P_{\text {data }}(x) d x \\ &= \arg \min _{\theta} \operatorname{KL}(P_{\text{data}}\|P_G) \end{aligned}$

Jensen-Shannon Divergence

Definition (Jensen-Shannon Divergence) The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence $ d(P|Q)$ . It is defined by:

$\operatorname{JSD}(P \| Q)=\frac{1}{2} d(P \| M)+\frac{1}{2} d(Q \| M)$

where $M=\frac{1}{2}(P+Q)$. The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm.

$0 \leq \mathrm{JSD}(P \| Q) \leq 1$

Entropy

Definition (Shannon Entropy). The Shannon entropy of $p \in \mathcal{P}^{\mathcal X}$ is

$H(p) \triangleq-\sum_{x \in \mathcal{X}} p(x) \log p(x)$

我们也使用 $H(X)$ 来表示随机变量$X$的熵。

Theorem The Shannon entropy is related to the divergence according to the formula

$H(p) = \log |X| - d(p,u)$

其中$u$表示在集合$X$上的均匀分布。通过这条定理我们可以知道一个分布的熵和你相信这个分布是均匀分布时得到的surprise是负相关的，同时因为Gibbs’ Inequality，我们知道当一个分布是均匀分布的时候熵取到最大。另外，这条定理也告诉我们熵是不能够在连续型随机变量上定义的，因为在$\mathbb R$上是没有“均匀分布的”.

而对于更加常规的信息论学习，通常是先介绍Entropy的定义，然后转而介绍KL divergence等。Entropy翻译作信息熵，也就是一个离散型随机变量的不确定性。不确定性的定义可以被表示成一个编码问题：一个alphabet中的每个事件已知其发生的概率分布，该如何对每个事件进行编码从而能够付出最小的代价？以下为结论：如果事件$x$发生的概率为$p(x)$，最优编码位数为$-\log_2{p(x)}$，则该事件的总代价为$-p(x)\log_2{p(x)}$。从而有了以上的Entropy定义。均匀分布的时候拥有最大的不确定性，也就是对其编码的代价最大。

Conditional Entropy

概率论中最美妙的就是引入了Conditional Probability $P(A|B)$，也就是$B$事件发生的前提下$A$事件发生的概率。同样的在信息论中我们也会关注到多变量分布的情况。为了表述清晰，以下是我们会用到的符号及其意义：

$p_{X} \in \mathcal{P}^{\mathcal X}$: 一个在alphabet $\mathcal X$上的随机变量$X$的分布$p$
$p_{Y} \in \mathcal{P}^{\mathcal Y}$: 一个在alphabet $\mathcal Y$上的随机变量$Y$的分布$p$
$p_{X,Y} \in \mathcal{P}^{\mathcal X \times \mathcal Y}$: 一个在他们上面的联合分布$p$
$p_{X} \otimes p_{Y} \in \mathcal{P}^{\mathcal{X} \times \mathcal Y}$: 定义为边缘分布的乘积，也即$\left(p_{X} \otimes p_{Y}\right)(x, y)= p_{X}(x) p_{Y}(y)$.

Definition (Conditional Entropy). The conditional entropy of $X$ given $Y$ is

$H(X | Y) \triangleq \sum_{x, y \in \mathcal{X} \times \mathcal{Y}} p_{X, Y}(x, y) \log p_{X | Y}(x | y)$

注意：有人可能会觉得为了维持对称性，需要将条件熵定义为：

$\tilde{H}(X | Y)=\sum_{x, y \in \mathcal{X} \times \mathcal{Y}} p_{X | Y}(x | y) \log p_{X | Y}(x | y)$

However, a quick think makes clear that this definition isn’t appropriate, because it doesn’t include any information about the distribution of $Y$. If $Y$ is concentrated around some very informative (or uninformative) values, then $\tilde H$ won’t notice that some values are more valuable than others.

Theorem The conditional entropy is related to the unconditional entropy as

$H(X|Y)= H(X,Y)-H(Y)$

where $H(X,Y)$ is the entropy of the joint distribution $p_{X,Y}$.

这一条定理非常容易记住，只需要和条件概率公式进行比对：

$p_{X|Y}(x|y)={p_{X,Y}(x,y)\over p_Y(y)}$

或者用更加通俗易懂的语言来解释它：

The uncertainty you have about $X$ given that you’ve been told $Y$ is equal to the uncertainty you had about both $X$ and $Y$ , less the uncertainty that was resolved when you learned $Y$.

Theorem (Side Information reduces uncertainty).

$H(X|Y) \leq H(X)$

该定理：额外的信息不会增加一个随机变量的不确定性。提供额外的信息$Y$造成的增益$H(X)-H(X|Y) \geq 0$，增益为0时就是$Y$和$X$互不相关(Y carries no information about X)。

Information

Definition (Mutual Information). The mutual information $I(X,Y)$ in $Y$ about $X$ is

$I(X,Y) \triangleq H(X) - H(X|Y).$

Theorem The mutual information may also be written as

$\begin{align} I(X,Y) &= d(p_{X,Y}, p_{X} \otimes p_{Y}) \\ &= \mathbb E _Y[d(p_{X|Y}, p_X)] \end{align}$

Mutual Information在很多情况下也叫做Information Gain（信息增益），是用来衡量两个随机变量之间的dependence的：Mutual Information越大，则两个随机变量间的关联度越高。而根据第一条等式推知：the larger the divergence between the joint and the product of the marginals, the stronger the dependence between the two variables. 而同时我们互换第一条等式$XY$变量可以得到以下：

Corrolary The mutual information is symmetric:

$I(X,Y) = I(Y,X)$

而第二条等式的含义则是如果你选择无视$Y$给出的side information并还是认为分布是按照$p_X$的，那么mutual information或者就等于你选择忽视和接受Y之间的surprise。

Theorem (Data Processing Inequality). Let $X$ and $Y$ be random variables, and let $Z = g(Y)$, where $g$ is any function $g : Y \rightarrow Z$. Then,

$I(X,Z) \leq I(X,Y)$

这条定理告诉我们，对于一个作为side information的随机变量，对这个变量（或者该变量的估计）进行任意的函数映射操作，所得到的结果作为新的side information，其提供的mutual information一定不会超过原来的随机变量。

原文作者：Haizhou Shi

原文链接：http://www.shihaizhou.com/2020/02/17/Information-Theory-Simple-Concepts/

发表日期：February 17th 2020, 3:59:39 pm

更新日期：August 2nd 2020, 12:06:06 am

Next Post

Graph Neural Networks - Basics
Previous Post

Metric Space

CATALOG

1. Divergence
1. 1.1. Jensen-Shannon Divergence
2. Entropy
1. 2.1. Conditional Entropy
3. Information

