Mrs. Gerber's Lemma: let $X$ be a length-$n$ binary word with arbitrary distribution, and let $Y$ be a corruption of $X$ through a binary symmetric channel with crossover probability $p$. If $H(X) \geq nv$ then $H(Y) \geq n\,h(v')$ with $v' := p \ast h^{-1}(v)$, where $h$ is the binary entropy function, $h^{-1}:[0,1]\to[0,\frac{1}{2}]$ is its inverse, and $a \ast b := a\cdot(1-b) + (1-a)\cdot b$.
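Here is a minimal numerical sketch of how the pieces fit together, assuming the statement above; the block length $n$, the crossover probability $p$, and the (deliberately skewed) distribution of $X$ below are arbitrary illustrative choices. It computes $H(X)$ and $H(Y)$ by brute force over $\{0,1\}^n$ and checks $H(Y)$ against the bound $n\,h(p \ast h^{-1}(H(X)/n))$.

```python
# A small numerical check of (vector) Mrs. Gerber's Lemma as stated above:
# for Y = X xor Z with Z iid Bernoulli(p), H(Y) >= n * h(p * h^{-1}(H(X)/n)).
# The distribution of X below is an arbitrary illustrative choice.
import itertools
import math
import random

def h(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def h_inv(v):
    """Inverse of h restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        if h(mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def conv(a, b):
    """Binary convolution a*b = a(1-b) + (1-a)b."""
    return a * (1 - b) + (1 - a) * b

def entropy(dist):
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

n, p = 3, 0.1
words = list(itertools.product([0, 1], repeat=n))

# Arbitrary (structured) input distribution on {0,1}^n.
random.seed(0)
weights = [random.random() ** 3 for _ in words]   # skewed on purpose
total = sum(weights)
px = {w: wt / total for w, wt in zip(words, weights)}

# Push X through the BSC componentwise: P(Y=y) = sum_x P(x) p^d(x,y) (1-p)^(n-d(x,y)).
py = {y: 0.0 for y in words}
for x, qx in px.items():
    for y in words:
        d = sum(xi != yi for xi, yi in zip(x, y))
        py[y] += qx * (p ** d) * ((1 - p) ** (n - d))

v = entropy(px) / n                    # entropy rate of the input
bound = n * h(conv(p, h_inv(v)))       # Mrs. Gerber bound on H(Y)
print(f"H(X) = {entropy(px):.4f}, H(Y) = {entropy(py):.4f}, bound = {bound:.4f}")
assert entropy(py) >= bound - 1e-9
```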
Mrs. Gerber's lemma says that the channel's set of typical outputs when the input is $X$ is at least as big as it would have been had the input been an unstructured vector $U$ whose $n$ components are iid with the same entropy rate as $X$.
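To see why $U$ is the natural benchmark: write $Z$ for the channel's iid $\mathrm{Bernoulli}(p)$ noise and take $U$ to have iid $\mathrm{Bernoulli}(q)$ components with $q = h^{-1}(v) \le \frac{1}{2}$ (restricting to $q \le \frac{1}{2}$ is without loss of generality), so that $H(U) = n\,h(q) = nv$. Then $U \oplus Z$ is again an iid vector, each component Bernoulli with parameter $p \ast q$, and so

$$H(U \oplus Z) \;=\; \sum_{i=1}^{n} H(U_i \oplus Z_i) \;=\; n\,h(p \ast q) \;=\; n\,h\big(p \ast h^{-1}(v)\big),$$

which is exactly the right-hand side of the lemma. A structured input of the same entropy can therefore only push the output entropy up, never down.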
Lay intuition says unstructured random variables have higher entropy than structured ones, so without knowing better we'd expect the inequality in Mrs. Gerber's lemma to point in the opposite direction. What is going on here?
Consider the set of typical channel inputs after many repetitions of the experiment. Call the typical outcomes of $X$ "$S_X$" and the typical outcomes of $U$ "$S_U$", and notice that $S_U$ looks like a Hamming ball. By construction, $S_U$ and $S_X$ contain the same number of words.
The channel noise by itself will look like some Hamming ball $S_N$ around the all-zeros word, of radius roughly a $p$-fraction of the word length.
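The quantitative fact behind the ball picture is that, for $\rho \le \frac{1}{2}$, a Hamming ball of radius $\rho n$ in $\{0,1\}^n$ contains roughly $2^{n h(\rho)}$ words. A minimal sketch of this count (the value of $\rho$, which plays the role of the crossover probability $p$ for $S_N$, is an arbitrary illustrative choice):

```python
# The counting fact behind the Hamming-ball picture: for rho <= 1/2,
# a Hamming ball of radius rho*n in {0,1}^n has roughly 2^(n*h(rho)) words,
# so (1/n) * log2(ball size) approaches h(rho) as n grows.
import math

def h(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def log2_ball_size(n, radius):
    """log2 of the number of binary words within Hamming distance `radius` of a fixed word."""
    size = sum(math.comb(n, k) for k in range(radius + 1))
    return math.log2(size)

rho = 0.11  # illustrative; think of the crossover probability p
for n in (50, 200, 1000, 5000):
    rate = log2_ball_size(n, int(rho * n)) / n
    print(f"n = {n:5d}: (1/n) log2 |ball| = {rate:.4f}   vs   h(rho) = {h(rho):.4f}")
```

The same count, with $\rho = h^{-1}(v)$, is what makes $S_U$ a ball of the right size, about $2^{nv}$ words.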
Now consider the typical channel outputs. This set looks like what you get by taking each typical input word, expanding it into a Hamming ball, and then taking the union of all those balls. With this in mind, which set is going to achieve fuller coverage of the output word space? (A small simulation after the list below illustrates the comparison.)
- $S_X + S_N$ (points of $S_X$ can be spread out)
- $S_U + S_N$ (points of $S_U$ are as close together as can be)
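Here is a minimal simulation of that comparison; the parameters ($n = 10$, a noise radius of $2$ standing in for $S_N$) and the two hand-picked four-word dictionaries, one spread out and one clumped, are illustrative choices rather than anything canonical. Both dictionaries have the same number of words, mimicking $|S_X| = |S_U|$, and we count how many output words their noise balls cover.

```python
# Toy comparison of the two candidate output sets: union of noise balls around
# a spread-out dictionary vs. around a clumped dictionary of the same size.
import itertools

def ball(center, radius):
    """All binary words within Hamming distance `radius` of `center`."""
    n = len(center)
    words = set()
    for k in range(radius + 1):
        for positions in itertools.combinations(range(n), k):
            w = list(center)
            for i in positions:
                w[i] ^= 1
            words.add(tuple(w))
    return words

def coverage(dictionary, radius):
    """Size of the union of noise balls around each dictionary word."""
    covered = set()
    for word in dictionary:
        covered |= ball(word, radius)
    return len(covered)

n, r = 10, 2
zeros = (0,) * n
spread = [zeros,
          tuple([1] * 5 + [0] * 5),
          tuple([0] * 5 + [1] * 5),
          (1,) * n]                      # words pairwise far apart
clumped = [zeros,
           tuple([1] + [0] * 9),
           tuple([0, 1] + [0] * 8),
           tuple([1, 1] + [0] * 8)]      # words within distance 2 of each other

print("spread dictionary covers :", coverage(spread, r), "output words")
print("clumped dictionary covers:", coverage(clumped, r), "output words")
```

The spread-out dictionary's noise balls are (nearly) disjoint, so their union is as large as it can be; the clumped dictionary's balls pile on top of one another and waste coverage on overlap.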
We can think of outputs as looking like (dictionary word)+(noise).
In the case of $X$ these dictionary words might be spaced quite far apart, so their noise balls overlap little and their union covers more of the output space. In the unstructured case the signal and the noise are genuinely undifferentiated: both $S_U$ and $S_N$ are balls around the all-zeros word. Both dictionaries have the same number of words, but the structure in $X$ allows $Y$ to expose more of the input's entropy. This can be inferred from a statement here.
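To see the same effect directly in entropy terms, here is a minimal sketch with illustrative toy choices (the same two four-word dictionaries as in the coverage simulation above, each used with a uniform input distribution so $H(X) = 2$ bits, with $n = 10$ and $p = 0.1$). It computes $H(Y)$ exactly by enumerating $\{0,1\}^n$:

```python
# Exact H(Y) for two toy dictionaries with identical input entropy H(X) = 2 bits:
# a spread-out dictionary vs. a clumped one, both pushed through a BSC(p).
import itertools
import math

def entropy_bits(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def output_entropy(dictionary, n, p):
    """H(Y) when X is uniform over `dictionary` and Y = X xor iid Bernoulli(p) noise."""
    py = {}
    for y in itertools.product([0, 1], repeat=n):
        total = 0.0
        for x in dictionary:
            d = sum(xi != yi for xi, yi in zip(x, y))
            total += (p ** d) * ((1 - p) ** (n - d)) / len(dictionary)
        py[y] = total
    return entropy_bits(py.values())

n, p = 10, 0.1
zeros = (0,) * n
spread = [zeros, tuple([1]*5 + [0]*5), tuple([0]*5 + [1]*5), (1,) * n]
clumped = [zeros, tuple([1] + [0]*9), tuple([0, 1] + [0]*8), tuple([1, 1] + [0]*8)]

print(f"H(Y) with spread dictionary : {output_entropy(spread, n, p):.3f} bits")
print(f"H(Y) with clumped dictionary: {output_entropy(clumped, n, p):.3f} bits")
print(f"(both dictionaries have H(X) = {math.log2(len(spread)):.0f} bits)")
```

With the clumped dictionary, the noise on the bits where the dictionary words differ gets absorbed into the signal, so $H(Y)$ comes out smaller; with the spread-out dictionary nearly the full $H(X) + n\,h(p)$ shows up at the output.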