Mrs. Gerber's Lemma: X is a length-n binary word with arbitrary distribution and Y is a corruption of X through a binary symmetric channel with crossover probability p. If H(X) \geq nv then H(Y) \geq nh(v') with v':=p \ast h^{-1}(v), where h^{-1}:[0,1]\to[0,\frac{1}{2}] is the inverse entropy of a Bernoulli RV and a\ast b := a\cdot(1-b) + (1-a)\cdot b.

Mrs. Gerber's lemma says that the channel's output Y is bigger than had the input been an unstructured vector U whose n components are iid with the same entropy-rate as X.

Lay intuition says unstructured random variables have higher entropy than structured ones. Without knowing better, we'd expect the inequality in Mrs. Gerber's lemma to face in the opposite direction. What is going on here?

We can think of outputs as looking like (dictionary word)+(noise).

In the case of X these dictionary words might be spaced quite far apart. In the unstructured case the signal and noise are genuinely undifferentiated. Both dictionaries have the same number of words, but the structure in X allows Y to expose more of the system's entropy. This can be inferred from a statement here.

Consider the set of typical channel inputs after many repetitions of the experiment. Call the typical outcomes of X "S_X". Call the typical outcomes of U "S_U" and notice S_U is like a Hamming ball. By choice S_U and S_X contain the same amount of words.

Also consider the typical channel outputs. The collection of typical channel outputs looks like taking each of the input words and expanding them into Hamming balls.

For convenience say the channel noise is some Hamming ball N around 0. Now which set is going to achieve fuller coverage of the set of all possible output words: the Hamming ball N around S_X, where points of S_X are allowed to be spread out, or the sum of two Hamming balls, S_U+N where all words are maximally close together?