Extracting Relevant Information over Many Trials

The information bottleneck scenario goes like this:

We are interested in a quantity Y but only observe X, which we must process down to a rate R \in (0, H(X)). To this end, what random variables (U,Q) minimize H(Y|U,Q) while satisfying U \perp Y \mid X (that is, U - X - Y is a Markov chain), Q \perp (X,Y), and H(U|Q) < R?
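To make the objective concrete, here is a minimal numerical sketch. It restricts X, Y, and U to binary alphabets, drops the time-sharing variable Q, and uses the constraint I(U;X) \leq R (the form that appears in the theorem below); restricting U to two values only gives an upper bound in general, since the true optimum may need |U| as large as |X| + 1. The function name bottleneck_binary and the grid search over channels p(u|x) are illustrative choices of mine, not part of the problem statement.

    import numpy as np

    def H(p):
        """Shannon entropy in bits of a probability array of any shape."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 1e-12]
        return float(-np.sum(p * np.log2(p)))

    def bottleneck_binary(p_xy, R, grid=201):
        """Brute-force min H(Y|U) over binary channels p(u|x), subject to
        the Markov chain U - X - Y and the rate constraint I(U;X) <= R."""
        p_x = p_xy.sum(axis=1)
        best = H(p_xy.sum(axis=0))                 # U independent of X leaves H(Y)
        for a in np.linspace(0, 1, grid):          # a = p(u=0 | x=0)
            for b in np.linspace(0, 1, grid):      # b = p(u=0 | x=1)
                p_u_given_x = np.array([[a, 1 - a], [b, 1 - b]])
                p_ux = p_x[:, None] * p_u_given_x  # joint of (X, U)
                p_u = p_ux.sum(axis=0)
                if H(p_u) + H(p_x) - H(p_ux) <= R:      # I(U;X) = H(U) + H(X) - H(U,X)
                    p_uy = p_u_given_x.T @ p_xy         # joint of (U, Y), using U - X - Y
                    best = min(best, H(p_uy) - H(p_u))  # H(Y|U) = H(U,Y) - H(U)
        return best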

This minimization is characterized by a functional equation discussed at the link above. Does the situation change when you are allowed to process many random trials at once instead of just one? An example application might go like this:

Is it easier to specify the result of 1000 coin tosses with 500 bits than it is to describe one coin toss with half a bit, when you only learn about each coin toss outcome through independent, identically distributed noise?

The answer is no: the solution for many trials (or coin tosses) has the same essential form as the single-letter information bottleneck.

Write n iid trials of (X,Y) as (X^n, Y^n) and fix R \in (0, H(X)). Then

\lim_{n \to \infty} \min_{f:\, \mathrm{Range}\, f = [\exp nR]} \frac{1}{n} H(Y^n | f(X^n)) = \min_{\substack{U:\, U - X - Y \\ I(U;X) \leq R}} H(Y|U).
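As a sanity check on the right-hand side, take the coin-toss question from above with illustrative numbers of my own: Y \sim \mathrm{Bern}(1/2) is the toss, X = Y \oplus N is the noisy observation with N \sim \mathrm{Bern}(0.1), and R = 1/2 bit per toss. In this binary symmetric case the optimum is known in closed form from Mrs. Gerber's Lemma: the best U passes X through a binary symmetric channel with crossover \alpha satisfying h(\alpha) = 1 - R, giving \min H(Y|U) = h(\alpha \star 0.1) \approx 0.70 bits, where h is the binary entropy function and \star is binary convolution. The grid search sketched above should land close to this:

    # The coin-toss instance (rows index x, columns index y):
    # p(x,y) for Y ~ Bern(1/2) and X = Y xor Bern(0.1).
    p_xy = np.array([[0.45, 0.05],
                     [0.05, 0.45]])
    print(bottleneck_binary(p_xy, R=0.5))  # roughly 0.70, versus H(Y) = 1 at zero rate

So 500 bits for 1000 noisy tosses still leaves about 700 bits of residual uncertainty about the tosses, and the theorem says no amount of joint processing over more trials beats this per-trial figure.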

Tom has written a proof sketch here:

Link to MathOverflow
