Capacity of a Discrete Channel

Definition: Discrete Channel @discrete-channel A system whereby a sequence of choices from a finite set of elementary symbols $S_1, \dots, S_n$ can be transmitted from one point to another. Each of the symbols $S_i$ is assumed to have a certain duration in time $t_i$ seconds (not necessarily the same for different $S_i$). Referenced by (1 direct, 1 transitive)Direct references:CapacityTransitive (depth 1):theorem-1

Definition: Capacity (also: channel-capacity) @capacity The capacity $C$ of a discrete channel is given by $$ C = \lim_{t \to \infty} \frac{\log{N(t)}}{t}, $$ where $N(t)$ is the number of allowed @signals of duration $t.$ Referenced by (1 direct)Direct references:Noiseless channel transmitting discrete symbols

Note @note-3 Given an information source where all symbols are of the same time duration, and each symbol represents $s$ bits of information (because it is chosen freely among $2^s$ symbols), and the channel can transmit $n$ symbols per second then the capacity $C$ of the channel is defined to be $ns$ bits per second. In the more general case we have to deal with symbols of various lengths that take different amounts of time to transmit, and so capacity measures not the number of symbols transmitted per second but the amount of information transmitted per second, retaining bits per second as its unit. If $N(t)$ represents the number of sequences of duration $t,$ then $$ N(t) = N(t - t_1) + N(t - t_2) + \cdots + N(t - t_n). $$ This number is equal to the sum of the numbers of sequences ending in $S_1, S_2, \dots, S_n.$ This is a recursive definition - if the last symbol is $S_i,$ and then we have $t - t_i$ time remaining for the previous symbols in the sequence, and we repeat the same question on that remainder. A "well-known result" in finite differences tells us that $$ \lim_{t \to \infty} N(t) = A W_0^t, $$ where $A$ is constant and $W_0$ is the largest real solution of the characteristic equation: $$ W^{-t_1} + W^{-t_2} + \cdots + W^{-t_n} = 1, $$ and therefore $$ C = \lim_{t \to \infty} \frac{\log{A W_0^t}}{t} = \log{W_0}. $$

Example @example-4 Consider Morse code as used in telegraph. We have the following rules: A dot symbol consists of one time unit of line closure followed by one time unit of line open. A dash symbol consists of three time units of line closure followed by one time unit of line open. A letter space symbol consists of three time units of line open. A word space symbol consists of six time units of line open. We're never allowed to send two space symbols in a row, because two sequential letter space symbols are indistinguishable from a word space symbol. Now, our possible terminating states are Symbol Composition Duration (time units) Dot 1 unit closed + 1 unit open 2 Dash 3 units closed + 1 unit open 4 Letter space preceded by dot 1 unit closed + 4 units open 5 Letter space preceded by dash 3 units closed + 4 units open 7 Word space preceded by dot 1 unit closed + 7 units open 8 Word space preceded by dash 3 units closed + 7 units open 10 Note that we had to consider the last two symbols in the case of the last symbol being a space in order to deal with our "no sequential spaces" constraint. Now we have $$ N(t) = N(t - 2) + N(t - 4) + N(t - 5) + N(t - 7) + N(t - 8) + N(t - 10). $$ So, our characteristic equation is $$ W^{-2} + W^{-4} + W^{-5} + W^{-7} + W^{-8} + W^{-10} = 1. \tag{a} $$ With a substitution of $w = 1/W$ we get $$ w^{2} + w^{4} + w^{5} + w^{7} + w^{8} + w^{10} = 1, $$ and $w_0,$ the largest positive root of this equation, found numerically, is about $1.4529,$ and so $$ C = -\log_{2}{(w_0)} \approx 0.539 \text{ bits per unit of time}. $$ Note that in this Morse telegraphy system, we have two states that the channel can be in, based on what the previous symbol transmitted was. If the previous symbol transmitted was a space, we're in state $1,$ and the next symbol can only be a dot or a dash, and the state will change. If the previous symbol transmitted was not a space, we're in state $2,$ and the next symbol can be anything, and the state may or may not change depending on what the next symbol is. We can think of these states being the @nodes of a @directed-graph with the transitions between them being the @edges. A generalization of this is given in the theorem below.

Symbol	Composition	Duration (time units)
Dot	1 unit closed + 1 unit open	2
Dash	3 units closed + 1 unit open	4
Letter space preceded by dot	1 unit closed + 4 units open	5
Letter space preceded by dash	3 units closed + 4 units open	7
Word space preceded by dot	1 unit closed + 7 units open	8
Word space preceded by dash	3 units closed + 7 units open	10

Theorem @theorem-5 Let $b_{ij}^{(s)}$ be the duration of the $s$-th symbol which is allowable in state $i$ and leads to state $j.$ Then the channel-capacity $C$ is equal to $\log{W}$ where $W$ is the largest real root of the @determinantal-equation $$ \left | \sum_{s} W^{-b_{ij}^{(s)}} - \delta_{ij} \right | = 0, $$ where $\delta_{ij}$ is the @Kronecker-delta.

Example @example-6 For example, for our Morse telegraphy example, we have $$ \begin{vmatrix} -1 & W^{-2} + W^{-4} \\ W^{-3} + W^{-6} & W^{-2} + W^{-4} - 1 \end{vmatrix} = 0. $$ Expanding the @determinant on the LHS gives the characteristic equation $(a)$ above.

Information and Entropy in a Discrete Channel

Definition: Self-Information (also: surprisal, Shannon information, Information content) @self-information Given a real number $b > 1$ and an outcome $x$ of a discrete @random-variable $X$ with probability mass function $p,$ the self-information of $x$ is defined as the negative @log-probability $$ I(x) := - \log_{b}{p(x)}. $$ Referenced by (5 direct, 14 transitive)Direct references:Discrete Entropyremark-8Entropytheorem-15note-22Transitive (depth 1):Cross-EntropyKL Divergenceproof-of-joint-entropy-is-less-than-or-equal-to-entropy-of-partstheorem-18theorem-1note-30proof-of-theorem-18joint-entropy-is-base-plus-conditionalremark-14theorem-12joint-entropy-is-less-than-or-equal-to-entropy-of-partsTransitive (depth 2):Model Trainingproof-of-theorem-33remark-25

Remark @remark-8 We can view self-information as an alternative casting of @probability, like how @odds are, with some desirable properties: An event with probability $1$ provides no information - we knew it was going to happen so it happening tells us nothing new. The less probable an event is, the more surprisal it yields. self-information is @additive - if two individual events are measured separately, the total amount of information is the sum of the self-informations of the individual events. The function given above is the unique function of probability (up to a multiplicative scaling factor) that satisfies these three properties.

Remark @remark-9 In a related vein we can develop a function that asks how much choice is involved in the selection of an event, or equivalently, how much uncertainty there is in the outcome. If $p_1, p_2, \dots, p_n$ are the probabilities of events occurring, we want a measure $H(p_1, p_2, \dots, p_n)$ that has the following properties: $H$ should be continuous in the $p_i.$ That is, a small change in a $p_i$ should result in a small change in $H.$ If all the $p_i$ are equal, $p_i = \frac{1}{n},$ then $H$ should be a @monotonic @increasing function of $n.$ With equally likely events there is more choice, or uncertainty, when there are more possible events. If a choice is broken down into two successive choices, the original $H$ should be the weighted sum of the individual values of $H.$ For example, we should have that $$ H(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}) = H(\frac{1}{2}, \frac{1}{2}) + \frac{1}{2}H(\frac{2}{3}, \frac{1}{3}). $$

Definition: Entropy (also: uncertainty) @entropy Given a discrete @random-variable $X,$ which may be any element $x$ within the set $\mathcal{X},$ and is distributed according to $p: \mathcal{X} \to [0,1],$ the entropy is $$ H(X) := - \sum_{x \in \mathcal{X}} p(x) \log{p(x)}. $$ We can choose different bases for the logarithm; throughout these notes a bare $\log$ means $\log_2,$ giving the unit of @bits. An alternative, equivalent definition is that entropy is the expected value of the self-information of a @random-variable: $$ H(X) = \mathbb{E}\left[I(X) \right ] = \mathbb{E} \left [ -\log p(X) \right ]. $$ The unit for entropy is information per symbol. Referenced by (13 direct, 3 transitive)Direct references:Noiseless channel transmitting discrete symbolsDiscrete Entropytheorem-12remark-14theorem-15proof-of-theorem-18theorem-18proof-of-joint-entropy-is-less-than-or-equal-to-entropy-of-partsjoint-entropy-is-less-than-or-equal-to-entropy-of-partsCross-EntropyKL Divergencenote-30joint-entropy-is-base-plus-conditionalTransitive (depth 1):Model Trainingremark-25proof-of-theorem-33

Definition: Commensurable (also: commeasurable) @commensurable Real numbers $a_1, \dots, a_n$ are commensurable if there exists a common measure $m \in \mathbb{R}_{>0}$ such that each $a_i$ is an integer multiple of it: $$ a_i = k_i\, m, \qquad k_i \in \mathbb{Z}, \quad i = 1, \dots, n. $$ Equivalently, all pairwise ratios are rational: $a_i / a_j \in \mathbb{Q}$ for all $i, j$ (with $a_j \neq 0$). Referenced by (1 direct)Direct references:proof-of-theorem-12

Theorem @theorem-12 The only $H$ satisfying the three required properties above is the entropy function defined above, up to multiplication by a constant. Proof @proof-of-theorem-12 Assume we have a function $A(n)$ that satisfies the three properties listed above, such that $$ A(n) = H(\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n}). $$ Then, by property (3) above, we can decompose a choice from $s^m$ equally likely possibilities into a sequence of $m$ choices each from $s$ equally likely possibilities. For example, if we have $2^4$ equally likely possibilities, the probability of any given event is $1/16$. If we instead we have a series of $4$ choices each with $1/2$ probability, we end up with $1/16$ as the probability of any specific sequence of events. So, we have that $A(s^m) = m A(s).$ Now, with arbitrarily large $n,$ we can also have $t^n$ such that $A(t^n) = nA(t),$ by the same logic, and we can pick $m$ such that $$ s^m \leq t^n < s^{m+1}. $$ Now, we can take the logarithm of each term to get $$ m \log{s} \leq n \log{t} < (m+1)\log{s}, $$ and dividing by $n \log{s}$ gives $$ \frac{m}{n} \leq \frac{\log{t}}{\log{s}} < \frac{m}{n} + \frac{1}{n}, $$ and because $n$ is arbitrarily large, $$ \left | \frac{m}{n} - \frac{\log{t}}{\log{s}} \right | < \epsilon, $$ where $\epsilon$ is arbitrarily small. By property (2) of $A(n)$ (it is a @monotonically-increasing function of $n$,) $$ \begin{aligned} A(s^m) & \leq A(t^n) \leq A(s^{m+1}) \\ mA(s) & \leq nA(t) \leq (m+1)A(s). \end{aligned} $$ Then, dividing by $nA(s)$ gives $$ \frac{m}{n} \leq \frac{A(t)}{A(s)} \leq \frac{m}{n} + \frac{1}{n} \text{ or } \left | \frac{m}{n} - \frac{A(t)}{A(s)} \right | < \epsilon, $$ Now, by the @triangle-inequality, we have that $$ \begin{aligned} \left | \frac{A(t)}{A(s)} - \frac{\log{t}}{\log{s}} \right | & \leq 2 \epsilon \\ \left | A(t) - \frac{A(s)}{\log{s}} \log{t} \right | & \leq 2 \epsilon. \end{aligned} $$ Since $\epsilon$ can be arbitrarily small, we have that $A(t) = K \log(t),$ with $K > 0$ so that property (2) holds. Now we know what $A(n)$ is, and thus what $H$ is when we have equal probabilities for all events. Now let's say that we have a choice from $n$ possible events with commensurable probabilities $p_i = \frac{n_i}{\sum n_i}.$ We can break down a choice from $\sum n_i$ possibilities into a choice from $n$ possibilities with probabilities $p_1, \dots, p_n$ and then, if the $i$th possibility was chosen, $n_i$ choices of equal probability $p_i.$ We do this because above, we found how to find $H$ when all events are equally likely, and property (3) of our desired function lets us break down our overall choice from $\sum n_i$ possibilities. This gives us $$ \begin{aligned} K \log{\sum n_i} & = H(p_1, \dots, p_n) + \sum p_i A(n_i) \\ & = H(p_1, \dots, p_n) + \sum p_i K \log{n_i} \\ & = H(p_1, \dots, p_n) + K \sum p_i \log{n_i}. \\ \end{aligned} $$ Then, $$ \begin{aligned} H(p_1, \dots, p_n) & = K \log{\sum n_i} - K \sum p_i \log{n_i} \\ & = K \left [ \log{\sum n_i} - \sum p_i \log{n_i} \right ] \\ & = K \left [ \sum p_i \left ( \log{\sum n_i} \right ) - \sum p_i \left ( \log{n_i} \right ) \right ] \quad \text{ because } \sum p_i = 1 \\ & = K \left [ \sum p_i \left ( \log{\sum n_i} - \log{n_i} \right ) \right ] \\ & = K \left [ \sum p_i \left ( - \log{\frac{n_i}{\sum n_i}} \right ) \right ] \\ & = - K \sum p_i \log{p_i }. \\ \end{aligned} $$ If the $p_i$ are incommensurable, we can approximate them as closely as we'd like with @rationals, since The rationals are @dense in the reals.. By the first property we assumed for $H,$ it is continuous in the $p_i,$ and so its value at the incommensurable $p_i$ equals its limit as we approach via the @rationals, and so our expression holds in general. $K$ is left to us to pick, picking it is equivalent to picking a base for the logarithm. $\square$

Remark @remark-14 The entropy of a @random-variable quantifies the average level of uncertainty or information associated with the variable's possible outcomes. It measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states.

The demo below makes these quantities concrete. It uses a discrete alphabet of $N$ symbols whose probabilities follow a Zipfian distribution,

$$ p(n) = \frac{n^{-\alpha}}{\sum_{k=1}^{N} k^{-\alpha}}, \qquad n = 1, 2, \dots, N, $$

where the skew $\alpha$ controls how unevenly probability mass is spread across symbols. At $\alpha = 0$ every symbol is equally likely (the uniform distribution), so entropy is maximal at $\log N.$ As $\alpha$ grows the distribution concentrates on the first few symbols, the rare symbols carry ever more self-information, and the entropy falls toward $0.$ The bar chart shows the probability of each symbol alongside its self-information; the second chart traces the entropy as $\alpha$ sweeps from uniform to highly skewed.

Examples to cover: [https://claude.ai/chat/0702d3b2-a181-4b5b-a572-e5464d9d24d5]

Example: Coin Flip

For a fair coin flip, $\mathcal{X} = \{h, t\}$ and $p(h) = p(t) = 0.5.$ Then, $I(h) = I(t) = - \log{(0.5)} = 1 \text{ bit},$ and

$$ H(X) = p(h) \cdot I(h) + p(t) \cdot I(t) = 0.5 \cdot 1 + 0.5 \cdot 1 = 1 \text{ bit}. $$

Let's say we have an unfair coin and $p(h) = 0.9, p(t) = 0.1.$ Then

$$ I(h) = - \log{(0.9)} \approx 0.152 \text{ bits}, \quad I(t) = - \log{(0.1)} \approx 3.32 \text{ bits}, $$

and for entropy of $X$ we get

$$ H(X) = p(h) \cdot I(h) + p(t) \cdot I(t) \approx 0.9 \cdot 0.152 + 0.1 \cdot 3.32 \approx 0.469 \text{ bits}. $$

Example: Six-Sided Die

For a fair six-sided die, $\mathcal{X} = \{1,2,3,4,5,6\}$ and $p(x) = 1/6$ for all $x.$ Then, $I(x) = - \log{(1/6)} = \log(6) \approx 2.58 \text{ bits},$ and

$$ H(X) = 6 \cdot \frac{1}{6} \cdot I(x) = I(x) \approx 2.58 \text{ bits}. $$

Let's say we have an unfair die and $p(1) = 0.5$ and $p(x) = 0.1$ for $x \in \{2,3,4,5,6\}.$ Then

$$ I(1) = - \log{(0.5)} = 1 \text{ bit}, \quad I(x) = - \log{(0.1)} \approx 3.32 \text{ bits for } x \neq 1, $$

and for entropy of $X$ we get

$$ H(X) = p(1) \cdot I(1) + \sum_{x=2}^{6} p(x) \cdot I(x) \approx 0.5 \cdot 1 + 5 \cdot 0.1 \cdot 3.32 \approx 2.16 \text{ bits}. $$

Some Theorems on Entropy

Theorem @theorem-15 When a @random-variable is uniformly distributed over an alphabet of $n$ elements, the self-information of any given element equals the entropy of the @random-variable and is $\log{n}.$

Theorem @theorem-16 $H = 0$ iff all the $p_i$ but one are zero, this one having the value of one. Proof @proof-of-theorem-16 Suppose $H(p_1, \dots, p_n) = 0.$ Then, $0 = - \sum_{i=1}^n p_i \log{p_i}.$ Note that because $0 \leq p_i \leq 1,$ we have $\log{p_i} \leq 0,$ and $-p_i \log{p_i} \geq 0.$ Assume for contradiction that more than one $p_i$ is non-zero. Then, because $\sum p_i = 1,$ each non-zero $p_i$ is in $(0, 1)$ and therefore its $- p_i \log{p_i} > 0,$ and the sum of these terms is therefore non-zero, a contradiction. Now, since all probabilities are required to sum to 1, it can't be the case that all probabilities are zero, which means that exactly one probability must be non-zero and that probability must be $1.$ $\square$

Theorem @theorem-18 A @random-variable has the most entropy when the elements of its alphabet are all equally likely to occur. Proof @proof-of-theorem-18 We will use a @Lagrangian function of entropy to show this. Let $$ \mathcal{L}(\vec{p}, \lambda) = - \sum_i \left [ p_i \ln{p_i} \right ] - \lambda \left [ \left ( \sum_i p_i \right ) - 1 \right ]. $$ Then, for any $p_i$ we want $$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial p_i} & = -\ln{p_i} - 1 - \lambda = 0 \\ & \implies p_i = e^{-1 - \lambda}. \end{aligned} $$ But, this means $p_i$ doesn't depend on $i,$ and so is the same for each $i$ and therefore $p_i = \frac{1}{n}.$ It remains to show that this extreme of $H$ is a maximum. This follows from the facts that the domain is a @convex, compact set and that the entropy function is strictly @concave. TODO: more details on these. $\square$

Theorem @joint-entropy-is-less-than-or-equal-to-entropy-of-parts Let $X$ and $Y$ be discrete @random-variables with alphabets $\mathcal{X}$ and $\mathcal{Y},$ and let $p(x,y)$ be the probability of the joint occurrence of $X = x$ and $Y = y.$ The entropy of the joint event, $H(X,Y),$ is less than or equal to the sum of the individual entropies, i.e. $$ H(X,Y) \leq H(X) + H(Y), $$ with equality iff $X$ and $Y$ are independent, that is, iff $p(x,y) = p(x)p(y).$ Proof @proof-of-joint-entropy-is-less-than-or-equal-to-entropy-of-parts The entropy of the joint event is $$ H(X,Y) = - \sum_{x,y} p(x,y) \log{p(x,y)}. $$ This is just treating each possible pair of individual outcomes as its own outcome, i.e. we have $|\mathcal{X}| \times |\mathcal{Y}|$ possible outcomes. Writing $p(x) = \sum_{y'} p(x,y')$ and $p(y) = \sum_{x'} p(x',y)$ for the marginals (the argument names the distribution), we have $$ H(X) = - \sum_{x,y} p(x,y) \log{\sum_{y'}{p(x,y')}}, \quad H(Y) = - \sum_{x,y} p(x,y) \log{\sum_{x'}{p(x',y)}}. $$ So, $$ \begin{aligned} H(X) + H(Y) - H(X,Y) & = - \sum_{x,y} p(x,y) \log{\sum_{y'}{p(x,y')}} - \sum_{x,y} p(x,y) \log{\sum_{x'}{p(x',y)}} + \sum_{x,y} p(x,y) \log{p(x,y)} \\ & = \sum_{x,y} p(x, y) \left ( \log{p(x,y)} - \log{\sum_{y'}{p(x,y')}} - \log{\sum_{x'}{p(x',y)}} \right ) \\ & = \sum_{x,y} p(x, y) \left ( \log{p(x,y)} - \log{p(x)} - \log{p(y)} \right ) \\ & = \sum_{x,y} p(x, y) \left ( \log{\frac{p(x,y)}{p(x)p(y)}} \right ) \\ & = - \sum_{x,y} p(x, y) \left ( \log{\frac{p(x)p(y)}{p(x,y)}} \right ) \end{aligned} $$ Now, let's define a new random variable for convenience, $Z = \frac{p(X)p(Y)}{p(X,Y)}.$ We now have that $$ \begin{aligned} H(X) + H(Y) - H(X,Y) & = - \sum_{x,y} p(x, y) \left ( \log{\frac{p(x)p(y)}{p(x,y)}} \right ) \\ & = - \sum_{x,y} p(x, y) \left ( \log{Z} \right ) \\ & = -\mathbb{E}[\log{Z}]. \\ \end{aligned}$$ Now, by Jensen's Inequality, because The $\log$ @function is @concave, i.e. $-\log$..., we have that $\mathbb{E}[\log{Z}] \leq \log{\mathbb{E}[Z]}.$ Now, $$ \begin{aligned} \mathbb{E}[Z] & = \sum_{x,y} p(x,y) \cdot \frac{p(x)p(y)}{p(x,y)} \\ & = \sum_{x,y}p(x)p(y) \\ & = \big ( \sum_{x}p(x) \big ) \big ( \sum_{y}p(y) \big ) \\ & = 1 \cdot 1 \\ & = 1. \\ \end{aligned} $$ Therefore $\log{\mathbb{E}[Z]} = \log{1} = 0,$ and we have that $\mathbb{E}[\log{Z}] \leq 0 \implies -\mathbb{E}[\log{Z}] \geq 0,$ and $$ \begin{aligned} H(X) + H(Y) - H(X,Y) & \geq 0 \\ - H(X,Y) & \geq -H(X) -H(Y) \\ H(X,Y) & \leq H(X) + H(Y). \\ \end{aligned} $$ Now, for the equality part. Suppose $H(X) + H(Y) = H(X,Y).$ Then, $\mathbb{E}[\log{Z}] = 0,$ and recalling that $\mathbb{E}[Z] = 1,$ $\mathbb{E}[\log{Z}] = \log{\mathbb{E}[Z]} = 0.$ In Jensen's Inequality, equality holds only if our function is @affine or if $Z$ is constant; since $\log$ is @affine on no interval, $Z$ must be constant, and since $\mathbb{E}[Z] = 1,$ $Z = \frac{p(x)p(y)}{p(x,y)} = 1 \implies p(x)p(y) = p(x,y).$ Now, suppose $p(x)p(y) = p(x,y).$ Then $Z = 1$ and $$ H(X) + H(Y) - H(X,Y) = - \sum_{x,y} p(x, y) \left ( \log{1} \right ) = 0, $$ so $H(X) + H(Y) = H(X,Y).$ note that Jensen's Inequality gives us equality iff $Z$ is (almost surely) constant. Since $Z = \frac{p(x)p(y)}{p(x,y)},$ it's constant iff $p(x,y) = p(x)p(y).$ $\square$ Referenced by (1 direct)Direct references:proof-of-theorem-33

Note @note-22 When we decrease the probability of an element occurring, its self-information increases only logarithmically, while its weight in the entropy sum decreases linearly.

Definition: Cross-Entropy @cross-entropy Given two discrete probability distributions, $p$ and $q,$ that share a support $\mathcal{X},$ cross-entropy measures the average number of bits needed to represent an event drawn from $\mathcal{X}$ when the coding scheme is optimized for an estimated distribution $q$ rather than the true distribution $p.$ Its formula is very similar to that of entropy, except surprisal is calculated using $q$ while expectation uses $p$: $$ H(p,q) = - \sum_{x \in \mathcal{X}} p(x) \log{q(x)}. $$ We can write it in expectation form as $$ H(p,q) = \mathbb{E}_p \left [- \log{q} \right ], $$ where $\mathbb{E}_p$ is expectation under $p.$ Referenced by (3 direct, 1 transitive)Direct references:Discrete EntropyKL DivergenceModel TrainingTransitive (depth 1):remark-25

Note that when $p = q,$ cross-entropy equals entropy. The symbol $H$ is overloaded: $H(p,q)$ with distributions as arguments means cross-entropy, while $H(X,Y)$ with @random-variables as arguments means joint entropy.

Definition: KL Divergence (also: relative-entropy, Kullback-Leibler divergence) @kl-divergence For discrete probability distributions, $p$ and $q,$ that share a support $\mathcal{X},$ the relative entropy from $q$ to $p$ is defined to be $$ D_{KL}(p || q) = \sum_{x \in \mathcal{X}} p(x) \log{\frac{p(x)}{q(x)}}, $$ which is just cross-entropy minus entropy: $$ D(p || q) = H(p,q) - H(p). $$ Note that we don't always use the $KL$ subscript where it's obvious from the context. We can also write it as $$ D(p || q) = \mathbb{E}_p \left [ - \log{q} \right ] - \mathbb{E}_p \left [ - \log{p} \right ] = \mathbb{E}_p \left [ \log{\frac{p}{q}} \right ], $$ which makes it clear that it's the expected inefficiency in an encoding optimized for $q$ rather than $p.$ Referenced by (1 direct)Direct references:remark-25

Remark @remark-25 kl divergence tells us how different an approximating distribution $q$ is from a true probability distribution $p.$ It's not a true distance metric - it's asymmetric, but it does behave like distance in that it gets smaller as $q$ gets more like $p$ and is zero when $q = p.$ It is a sort of directed distance. Another way to think about it is as a measure of how inefficient a coding scheme optimized for $q$ is when the actual distribution is $p.$ If $q$ says some event is rare, and so we use a longer symbol for it (more bits), but the event actually occurs frequently, we'll waste bits representing the event when we could have used a shorter symbol for it.

Theorem: Doubly stochastic maps increase entropy @theorem-26 Take a @doubly-stochastic @matrix $A$ and @probability-vector $p,$ let $q = A p.$ Then $H(q) \geq H(p),$ with equality iff $q$ is a @rearrangement of $p$. Proof @proof-of-theorem-26 Note that $q_i = \sum_{j} A_{ij} p_j,$ by the definition of vector matrix multiplication. Now we'll define a couple of joint distributions: $$ r_{ij} = A_{ij} p_j, \quad s_{ij} = A_{ij} q_i. $$ Now we find the relative-entropy from $s$ to $r$: $$ \begin{aligned} D(r || s) & = \sum_{i,j} r_{ij} \log{\frac{r_{ij}}{s_{ij}}} \\ & = \sum_{i,j} A_{ij} p_j \log{\frac{A_{ij} p_j}{A_{ij} q_i }} \\ & = \sum_{i,j} A_{ij} p_j \log{\frac{p_j}{q_i}} \\ & = \sum_{i,j} A_{ij} p_j \left ( \log{p_j} - \log{q_i} \right ) \\ & = \sum_{i,j} A_{ij} p_j \log{p_j} - \sum_{i,j} A_{ij} p_j \log{q_i} \\ \end{aligned} $$ Now, addressing the first term on the RHS of the last line: $$ \begin{aligned} \sum_{i,j} A_{ij} p_j \log{p_j} & = \sum_j \left [ p_j \log{p_j} \sum_{i} A_{ij} \right ] \\ & = \sum_j \left [ p_j \log{p_j} \cdot 1 \right ] \\ & = \sum_j p_j \log{p_j} \\ & = -H(p). \\ \end{aligned} $$ And the second term on the RHS of the last line: $$ \begin{aligned} \sum_{i,j} A_{ij} p_j \log{q_i} & = \sum_{i} \left [ \log{q_i} \sum_{j} A_{ij} p_j \right ] \\ & = \sum_{i} \left [ \log{q_i} q_i \right ] \\ & = -H(q). \end{aligned} $$ So, we end up with $D(r || s) = H(q) - H(p).$ Since $D(r || s) \geq 0,$ we have $H(q) - H(p) \geq 0,$ which implies that $H(q) \geq H(p),$ which is what we wanted to show. Now, the equivalence case. If $q$ is just a rearrangement of $p,$ then obviously it has the same entropy as we can just re-index to recover $p.$ Now assume $H(p) = H(q).$ Then, $H(q) - H(p) = 0 \implies D(r || s) = 0 \implies r = s \implies A_{ij} p_j = A_{ij} q_i \implies p_j = q_i,$ so $q$ is a rearrangement of $p.$ $\square$ Intuition @theorem-26-intuition Averaging (well, replacing each probability with a convex combination of all the probabilities in a way that retains their summing to 1) the probabilities in $p$ can only bring them closer to each other (if it doesn't just rearrange them), which increases entropy.

Definition: Conditional Entropy @conditional-entropy Let $X$ and $Y$ be @random-variables, not necessarily @independent. The probability that $Y$ takes the value $y$ when $X$ takes the value $x$ (for any specific $x \in \mathcal{X}, y \in \mathcal{Y}$) is called the @conditional-probability $p(y|x)$ and is given by: $$ p(y|x) = \frac{p(x, y)}{p(x)}, \quad p(x) = \sum_{y} p(x,y). $$ (Note that $p(x,y)$ is the @joint-probability of $X = x, Y = y.$) We define the conditional entropy of $Y$ given $X,$ $H(Y|X),$ as the average of the entropy of $Y$ for each value $x$ of $X$ weighted according to the probability of getting that particular $x.$ That is, $$ H(Y|X) = -\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x,y) \log{p(y|x)}. $$ Referenced by (1 direct)Direct references:note-30

Note @note-30 We can obtain the above formula for conditional entropy as follows. First, if $X = x,$ then our entropy for $Y$ is $$ H(Y | X = x) = -\sum_{y} p(y|x) \log{p(y|x)}. $$ Now, summing across $X$ gives: $$ \begin{aligned} H(Y|X) & = \sum_{x} p(x) H(Y | X = x) \\ & = \sum_{x} p(x) \left [ -\sum_{y} p(y|x) \log{p(y|x)} \right ] \\ & = - \sum_{x} \sum_{y}\left [ p(x) p(y|x) \log{p(y|x)} \right ] \\ & = - \sum_{x, y} p(x,y) \log{p(y|x)}. \\ \end{aligned} $$

Theorem @joint-entropy-is-base-plus-conditional The entropy of the joint event $X,Y$ is the entropy of $X$ plus the entropy of $Y$ when $X$ is known. Proof @proof-of-joint-entropy-is-base-plus-conditional Note that $$ \begin{aligned} H(Y | X) & = - \sum_{x, y} p(x,y) \log{\frac{p(x,y)}{p(x)}} \\ & = - \sum_{x, y} p(x,y) \log{p(x,y)} + \sum_{x, y} p(x,y) \log{p(x)}.\\ \end{aligned} $$ The first term on the RHS is just $- \sum_{x, y} p(x,y) \log{p(x,y)} = H(X,Y).$ The second is $$ \begin{aligned} \sum_{x, y} p(x,y) \log{p(x)} & = \sum_{x} \left [ \sum_{y} p(x,y) \right ] \log{p(x)} \\ & = \sum_{x} p(x) \log{p(x)} \\ & = -H(X). \end{aligned} $$ Altogether, we have that $H(Y | X) = H(X,Y) - H(X),$ or $$ H(X,Y) = H(X) + H(Y | X). $$ $\square$ Referenced by (1 direct)Direct references:proof-of-theorem-33

Theorem @theorem-33 The uncertainty of $Y$ is never increased by knowledge of $X.$ It will be decreased unless $X$ and $Y$ are independent events, in which case it is not changed. Proof @proof-of-theorem-33 From Let $X$ and $Y$ be discrete @random-variables... and The @entropy of the joint event $X,Y$..., we have $$ \begin{aligned} H(X) + H(Y) & \geq H(X,Y) \\ & = H(X) + H(Y|X), \\ \end{aligned} $$ hence $H(Y) \geq H(Y|X).$ $\square$