In this post I will talk about conditional expectation and disintegration of a measure with respect to a $\sigma$-algebra. All of this is classical probability theory, but I think not many people (me included) come across it in a standard course in probability. These tools and ideas are quite useful in ergodic theory. An example is the proof of the ergodic decomposition theorem, which I present in a separate post.

** — 1. Conditional expectation — **

I will start by trying to give some intuition about the notion of conditional expectation with an example. Suppose that you want to know how many people will visit a particular beach on a given day (say you sell ice-cream). This is a random variable $X$. The first approximation you can make for the expected value of $X$ is the statistical average. However, a better approximation for $X$ can be made if you take other factors into account. For instance, let’s say that the number of visitors depends on the temperature of the given day and suppose that you know at the beginning of the day what the temperature will be. Thus, what you want to know is how many visitors to expect, conditional on the event that the temperature is $t$ (say). Let’s call the random variable that represents the temperature $T$. The expected number of visitors on a day where the temperature is $t$ can be denoted by $\mathbb{E}[X\mid T=t]$ or $\mathbb{E}[X\mid T](t)$. The function $t\mapsto\mathbb{E}[X\mid T=t]$ is what is called the conditional expectation of $X$ with respect to $T$.

Note that $\mathbb{E}[X\mid T]$ (or, more precisely, $\mathbb{E}[X\mid T]\circ T$) can also be thought of as a random variable (it is a number that depends on the random temperature $T$). Indeed this random variable is the best approximation for $X$ if all you know is the temperature $T$. It is not hard to see that the expectation of $\mathbb{E}[X\mid T]$ (the average over all possible temperatures of the expected number of visitors in a day with that temperature) is exactly the expectation of $X$ (the expected number of visitors, regardless of temperature).
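
The averaging identity described above can be checked on a toy example. The sketch below uses a made-up joint distribution of temperature and visitors (the numbers are purely illustrative, not data) and exact rational arithmetic; the names `joint` and `conditional_expectation` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical joint distribution: (temperature, visitors) -> probability.
joint = {
    (20, 100): Fraction(1, 4),
    (20, 200): Fraction(1, 4),
    (30, 300): Fraction(1, 8),
    (30, 500): Fraction(3, 8),
}

def expectation_of_visitors(joint):
    """E[X]: the average number of visitors, ignoring the temperature."""
    return sum(p * x for (_, x), p in joint.items())

def conditional_expectation(joint):
    """Return (t -> E[X | T = t], t -> P(T = t)) as two dictionaries."""
    mass = {}      # P(T = t)
    weighted = {}  # sum of x * P(T = t, X = x) over x
    for (t, x), p in joint.items():
        mass[t] = mass.get(t, 0) + p
        weighted[t] = weighted.get(t, 0) + p * x
    return {t: weighted[t] / mass[t] for t in mass}, mass

cond, temp_law = conditional_expectation(joint)

# Averaging E[X | T = t] over the law of T recovers E[X].
tower = sum(cond[t] * temp_law[t] for t in cond)
print(cond, tower, expectation_of_visitors(joint))
```

Here the expected number of visitors is 150 on a 20-degree day and 450 on a 30-degree day, and averaging these two values over the temperature distribution gives back the overall expectation.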

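Definition 1. Let $(\Omega,\mathcal{B},\mu)$ be a probability space, let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra and let $f\in L^1(\Omega,\mathcal{B},\mu)$ be a random variable. The conditional expectation of $f$ with respect to $\mathcal{D}$ is the function $\mathbb{E}[f\mid\mathcal{D}]\in L^1(\Omega,\mathcal{D},\mu)$ such that for all $D\in\mathcal{D}$ we have

$$\int_D \mathbb{E}[f\mid\mathcal{D}]\,d\mu=\int_D f\,d\mu.$$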

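Informally, the conditional expectation $\mathbb{E}[f\mid\mathcal{D}]$ is the function in $L^1(\Omega,\mathcal{D},\mu)$ that best approximates $f$. This sentence is made more precise below in Theorem 6. From an information theory point of view, $\mathbb{E}[f\mid\mathcal{D}](\omega)$ is the best guess of the value of $f(\omega)$ when all the information we have is $\mathcal{D}$. For instance if we have no information at all (so $\mathcal{D}=\{\emptyset,\Omega\}$), then $\mathbb{E}[f\mid\mathcal{D}]$ is the constant function $\int_\Omega f\,d\mu$, and that’s the best guess one can have for $f$. In the other extreme situation, when we have complete information (i.e. when $\mathcal{D}=\mathcal{B}$), then $\mathbb{E}[f\mid\mathcal{D}]=f$ and our “guess” for $f$ is $f$ itself.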
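
On a finite probability space everything can be computed by hand, which may help to make the definition concrete. The following is a minimal sketch, assuming the $\sigma$-algebra is generated by a partition into atoms of positive measure; in that case the conditional expectation is the average of the function over each atom. The helper names `cond_exp` and `integral` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def cond_exp(f, mu, partition):
    """Conditional expectation of f given the sigma-algebra generated by `partition`.

    f and mu are dicts (point -> value, point -> probability); partition is a
    list of disjoint sets of positive measure covering the space. On each atom
    the result is the mu-weighted average of f over that atom.
    """
    g = {}
    for atom in partition:
        mass = sum(mu[w] for w in atom)
        avg = sum(f[w] * mu[w] for w in atom) / mass
        for w in atom:
            g[w] = avg
    return g

def integral(f, mu, D):
    """Integral of f over the set D with respect to mu."""
    return sum(f[w] * mu[w] for w in D)

omega = range(6)
mu = {w: Fraction(1, 6) for w in omega}
f = {w: Fraction(w * w) for w in omega}
partition = [{0, 1}, {2, 3, 4}, {5}]

g = cond_exp(f, mu, partition)

# Every set in the sigma-algebra is a union of atoms; check the defining identity.
sets_in_D = [set().union(*c)
             for r in range(len(partition) + 1)
             for c in combinations(partition, r)]
defining_property = all(integral(f, mu, D) == integral(g, mu, D) for D in sets_in_D)

# The two extremes: no information versus complete information.
trivial = cond_exp(f, mu, [set(omega)])
full = cond_exp(f, mu, [{w} for w in omega])
```

With the trivial partition the result is the constant function equal to the mean of `f`, and with the partition into singletons it is `f` itself, matching the two extreme cases discussed above.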

We need to show that the conditional expectation exists and is unique in the space $L^1(\Omega,\mathcal{D},\mu)$:

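Proposition 2. Let $(\Omega,\mathcal{B},\mu)$ be a probability space, let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra and let $f\in L^1(\Omega,\mathcal{B},\mu)$ be a random variable. The conditional expectation $\mathbb{E}[f\mid\mathcal{D}]$ exists and is unique.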

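*Proof:* To prove existence we define the complex measure $\nu:\mathcal{D}\to\mathbb{C}$ by $\nu(D)=\int_D f\,d\mu$ for every $D\in\mathcal{D}$ (if you are not comfortable with complex measures, split $f=f_1-f_2+i(f_3-f_4)$ with each $f_j$ a non-negative real valued function and apply the proof to each $f_j$ separately). It is easy to check that this is indeed a complex measure. Moreover if $D\in\mathcal{D}$ is such that $\mu(D)=0$ then $\nu(D)=0$ as well. In other words we have $\nu\ll\mu$. Therefore we can apply the Radon-Nikodym theorem to find a derivative $g=\frac{d\nu}{d\mu}\in L^1(\Omega,\mathcal{D},\mu)$. By definition of $g$ we have that for every $D\in\mathcal{D}$

$$\int_D g\,d\mu=\nu(D)=\int_D f\,d\mu.$$

Thus $g$ is a conditional expectation of $f$ with respect to $\mathcal{D}$. To prove uniqueness, assume that $g,h\in L^1(\Omega,\mathcal{D},\mu)$ are both conditional expectations of $f$ with respect to $\mathcal{D}$. Then for each $D\in\mathcal{D}$ we have $\int_D g\,d\mu=\int_D f\,d\mu=\int_D h\,d\mu$, which implies $\int_D(g-h)\,d\mu=0$.

If the set $\{g\neq h\}$ has positive measure then, passing to real and imaginary parts and exchanging $g$ and $h$ if necessary, we can assume without loss of generality that $g$ and $h$ are real valued and the set $\{g>h\}$ has positive measure. Hence there is some $\epsilon>0$ such that the set $D=\{g-h>\epsilon\}$ has positive measure. But since both $g$ and $h$ are measurable in $\mathcal{D}$ we conclude that $D\in\mathcal{D}$ and hence

$$0=\int_D(g-h)\,d\mu\geq\epsilon\,\mu(D)>0,$$

which is a contradiction. This shows the uniqueness of $\mathbb{E}[f\mid\mathcal{D}]$.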

We will need the following basic fact about conditional expectations:

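Proposition 3. Let $(\Omega,\mathcal{B},\mu)$ be a probability space, let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra and let $f\in L^1(\Omega,\mathcal{B},\mu)$ be a real-valued random variable. The conditional expectation satisfies, almost everywhere,

$$\inf f\leq\mathbb{E}[f\mid\mathcal{D}]\leq\sup f.$$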

Indeed if $f$ takes values in a convex set in a Banach space, then $\mathbb{E}[f\mid\mathcal{D}]$ also takes values in that convex set, but the proof is technically more cumbersome.

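*Proof:* We prove only the first inequality; the second can be easily derived from the first one by considering the random variable $-f$. We can assume that $\inf f>-\infty$, otherwise there is nothing to prove. Fix $\epsilon>0$ and let $D=\{\omega\in\Omega:\mathbb{E}[f\mid\mathcal{D}](\omega)\leq\inf f-\epsilon\}\in\mathcal{D}$. We have

$$\mu(D)\inf f\leq\int_D f\,d\mu=\int_D\mathbb{E}[f\mid\mathcal{D}]\,d\mu\leq\mu(D)\left(\inf f-\epsilon\right),$$

which simplifies to $\epsilon\,\mu(D)\leq0$, and thus $\mu(D)=0$. Since $\epsilon>0$ was arbitrary we conclude that almost surely $\mathbb{E}[f\mid\mathcal{D}]\geq\inf f$.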

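Lemma 4. Let $(\Omega,\mathcal{B},\mu)$ be a probability space, let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra and let $f\in L^1(\Omega,\mathcal{B},\mu)$. Then almost everywhere

$$\big|\mathbb{E}[f\mid\mathcal{D}]\big|\leq\mathbb{E}\big[|f|\,\big|\,\mathcal{D}\big].\qquad(1)$$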

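*Proof:* Let $g=\mathbb{E}[f\mid\mathcal{D}]$ and $h=\mathbb{E}\big[|f|\,\big|\,\mathcal{D}\big]$. Let $D=\{|g|>h\}\in\mathcal{D}$ be the set of points where the inequality (1) fails. We treat the case of real-valued $f$. Let $D^+=D\cap\{g\geq0\}$ and let $D^-=D\cap\{g<0\}$; both sets are in $\mathcal{D}$. We have

$$\int_D|g|\,d\mu=\int_{D^+}g\,d\mu-\int_{D^-}g\,d\mu=\int_{D^+}f\,d\mu-\int_{D^-}f\,d\mu\leq\int_D|f|\,d\mu=\int_D h\,d\mu.$$

Since $D$ is the set of points where the inequality (1) fails, the function $|g|-h$ is strictly positive on $D$, while the computation above shows that its integral over $D$ is at most $0$. We conclude that $\mu(D)=0$ as desired.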

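Proposition 5. Let $(\Omega,\mathcal{B},\mu)$ be a probability space and let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra. The operator $\mathbb{E}[\,\cdot\mid\mathcal{D}]:L^1(\Omega,\mathcal{B},\mu)\to L^1(\Omega,\mathcal{D},\mu)$ is continuous.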

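*Proof:* We show that actually the norm of the operator is $1$: Let $f\in L^1(\Omega,\mathcal{B},\mu)$. By Lemma 4 we have

$$\big\|\mathbb{E}[f\mid\mathcal{D}]\big\|_{L^1}=\int_\Omega\big|\mathbb{E}[f\mid\mathcal{D}]\big|\,d\mu\leq\int_\Omega\mathbb{E}\big[|f|\,\big|\,\mathcal{D}\big]\,d\mu=\int_\Omega|f|\,d\mu=\|f\|_{L^1}.$$

Equality holds when $f$ is constant, so the norm is exactly $1$.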
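
The three facts just proved can be sanity-checked numerically on a finite space. The sketch below is only an illustration under stated assumptions: the space has 12 points with randomly chosen weights, the $\sigma$-algebra is generated by a fixed partition into three atoms, and the statements are read as "the conditional expectation lies between the minimum and maximum of $f$", "$|\mathbb{E}[f\mid\mathcal{D}]|\leq\mathbb{E}[|f|\mid\mathcal{D}]$", and "the $L^1$ norm does not increase". All names are hypothetical.

```python
import random
from fractions import Fraction

random.seed(0)  # arbitrary seed, used only for reproducibility

n = 12
weights = [random.randint(1, 9) for _ in range(n)]
total = sum(weights)
mu = [Fraction(w, total) for w in weights]            # a probability measure
f = [Fraction(random.randint(-20, 20)) for _ in range(n)]
atoms = [range(0, 3), range(3, 7), range(7, 12)]      # partition generating D

def cond_exp(h):
    """mu-average of h over each atom, viewed as a function on the whole space."""
    out = [None] * n
    for atom in atoms:
        mass = sum(mu[i] for i in atom)
        avg = sum(h[i] * mu[i] for i in atom) / mass
        for i in atom:
            out[i] = avg
    return out

def l1_norm(h):
    return sum(abs(h[i]) * mu[i] for i in range(n))

g = cond_exp(f)
cond_abs = cond_exp([abs(v) for v in f])

bounds_hold = all(min(f) <= v <= max(f) for v in g)               # Proposition 3
abs_inequality_holds = all(abs(a) <= b for a, b in zip(g, cond_abs))  # Lemma 4
contraction_holds = l1_norm(g) <= l1_norm(f)                      # Proposition 5
constants_fixed = cond_exp([Fraction(1)] * n) == [Fraction(1)] * n
```

Because the arithmetic is exact, the checks do not depend on floating point tolerance; the last line illustrates why the operator norm cannot be smaller than 1.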

Finally I will present another way to think about conditional expectation for a function $f\in L^2(\Omega,\mathcal{B},\mu)$. In this case we can use the Hilbert space structure to give a different characterization of the conditional expectation.

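Theorem 6. Let $(\Omega,\mathcal{B},\mu)$ be a probability space and let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra. Let $P:L^2(\Omega,\mathcal{B},\mu)\to L^2(\Omega,\mathcal{D},\mu)$ be the orthogonal projection (observe that $L^2(\Omega,\mathcal{D},\mu)$ is a closed subspace of $L^2(\Omega,\mathcal{B},\mu)$). Then for every $f\in L^2(\Omega,\mathcal{B},\mu)$ we have $Pf=\mathbb{E}[f\mid\mathcal{D}]$.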

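*Proof:* By definition of orthogonal projection, for any function $g\in L^2(\Omega,\mathcal{D},\mu)$ we have $\langle f-Pf,g\rangle=0$. If $D\in\mathcal{D}$ then the indicator function $1_D$ of $D$ is in $L^2(\Omega,\mathcal{D},\mu)$. Therefore

$$\int_D f\,d\mu=\langle f,1_D\rangle=\langle Pf,1_D\rangle=\int_D Pf\,d\mu,$$

and hence $Pf=\mathbb{E}[f\mid\mathcal{D}]$ as desired.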
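
The projection point of view is easy to test on a small example. The sketch below assumes a four-point space with a $\sigma$-algebra generated by two atoms, so that the subspace of measurable functions is spanned by the two (mutually orthogonal) atom indicators. It computes the orthogonal projection directly from inner products and checks orthogonality of the residual and the Pythagorean identity against one arbitrarily chosen competitor. All names and numbers are hypothetical.

```python
from fractions import Fraction

mu = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
f = [Fraction(5), Fraction(-1), Fraction(2), Fraction(7)]
atoms = [[0, 1], [2, 3]]   # partition generating the sub-sigma-algebra

def inner(u, v):
    """Inner product in L^2 of the finite probability space."""
    return sum(a * b * m for a, b, m in zip(u, v, mu))

def indicator(atom):
    return [Fraction(1) if i in atom else Fraction(0) for i in range(len(mu))]

def project(u):
    """Orthogonal projection onto the span of the atom indicators."""
    out = [Fraction(0)] * len(mu)
    for atom in atoms:
        e = indicator(atom)
        coeff = inner(u, e) / inner(e, e)
        out = [o + coeff * x for o, x in zip(out, e)]
    return out

def dist_sq(u, v):
    diff = [a - b for a, b in zip(u, v)]
    return inner(diff, diff)

Pf = project(f)
residual = [a - b for a, b in zip(f, Pf)]
orthogonal = all(inner(residual, indicator(A)) == 0 for A in atoms)

# Pythagoras against an arbitrary function h that is constant on atoms.
h = [Fraction(3), Fraction(3), Fraction(-2), Fraction(-2)]
pythagoras = dist_sq(f, h) == dist_sq(f, Pf) + dist_sq(Pf, h)
```

The projection `Pf` is constant on each atom and equal there to the weighted average of `f`, which is the conditional expectation; the Pythagorean identity shows that any other function constant on atoms is at least as far from `f`.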

** — 2. Disintegration of measures — **

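A good example to keep in mind when talking about disintegration of measures is the following: Let $X=\{(x,y)\in[0,1]^2:y\leq x\}$ be the lower triangle in the unit square, let $\mathcal{B}$ be the Borel $\sigma$-algebra over $X$ and let $\mu$ be the $2$-dimensional Lebesgue measure on $X$, normalized so that $\mu(X)=1$. Now let $\mathcal{D}$ be the $\sigma$-algebra defined by $D\in\mathcal{D}$ if and only if $D\in\mathcal{B}$ and $D$ is a union of vertical lines (more precisely, for every point $(x,y)\in D$ we have $(x,z)\in D$ for all $z\in[0,x]$).

Let $\mu_{\mathcal{D}}$ be the restriction of $\mu$ to $\mathcal{D}$, let $\mathcal{B}_{[0,1]}$ be the Borel $\sigma$-algebra on $[0,1]$ and let $\nu$ be the measure on $[0,1]$ that has density $2x$ with respect to the Lebesgue measure. Then the probability space $(X,\mathcal{D},\mu_{\mathcal{D}})$ is equivalent to the system $([0,1],\mathcal{B}_{[0,1]},\nu)$. More precisely, the map $\pi:(x,y)\mapsto x$ from $X$ to $[0,1]$ is an isomorphism of probability spaces. The meaning of this is quite intuitive: $\pi$ induces a bijection of the $\sigma$-algebras that matches sets with the same measure.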

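Moreover, we can recover the measure $\mu$ on $\mathcal{B}$ by the formula

$$\mu(B)=\int_{[0,1]}\mu_x(B)\,d\nu(x),\qquad(2)$$

where $\mu_x$ is a probability measure on $X$. More precisely, $\mu_x$ is the normalized Lebesgue measure on the vertical segment $\{x\}\times[0,x]$, so that $\mu_x(\{x\}\times I)=|I|/x$ for any interval $I\subset[0,x]$.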

Note that if we started with the unit square instead of the triangle, then all the measures $\mu_x$ would be the same, and equation (2) is essentially Fubini’s theorem. Thus the disintegration of measures can be viewed as an inverted version of Fubini’s theorem.
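
A discrete analogue of the triangle example makes formula (2) easy to verify exactly. The sketch below is an illustration under assumptions of my own choosing: the triangle is replaced by the grid points $(i,j)$ with $0\leq j\leq i<n$, the measure is uniform on those points, the measure on the base gives column $i$ a mass proportional to its number of points, and the conditional measure on column $i$ is uniform on that column. The test set `B` is arbitrary.

```python
from fractions import Fraction

n = 6
X = [(i, j) for i in range(n) for j in range(i + 1)]   # discrete lower triangle
N = len(X)

def mu(B):
    """Uniform probability measure on the discrete triangle."""
    return Fraction(len(B & set(X)), N)

def nu(i):
    """Law of the first coordinate: column i carries i + 1 of the N points."""
    return Fraction(i + 1, N)

def mu_col(i, B):
    """Uniform probability measure on column i, evaluated at the set B."""
    column = {(i, j) for j in range(i + 1)}
    return Fraction(len(B & column), i + 1)

B = {(i, j) for (i, j) in X if (i + 2 * j) % 3 == 0}   # an arbitrary test set

lhs = mu(B)
rhs = sum(mu_col(i, B) * nu(i) for i in range(n))      # discrete version of (2)
```

The columns get longer as `i` grows, so the measure on the base is not uniform; this is the discrete counterpart of the density in the continuous example. On a square grid all columns would have the same length and the identity would reduce to summing column by column.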

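Theorem 7. Let $X$ be a compact metric space, let $\mathcal{B}$ be the $\sigma$-algebra of Borel sets on $X$ and let $\mu:\mathcal{B}\to[0,1]$ be a probability measure. Let $\mathcal{D}\subset\mathcal{B}$ be a $\sigma$-algebra. Then for almost every $x\in X$ there exists a probability measure $\mu_x$ on $(X,\mathcal{B})$ such that for every $f\in L^1(X,\mathcal{B},\mu)$:

$$\mathbb{E}[f\mid\mathcal{D}](x)=\int_X f\,d\mu_x\quad\text{for almost every }x\in X.\qquad(3)$$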

This result applies more generally than to compact metric spaces, but this restriction makes the proof technically easier.

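*Proof:* Since $X$ is a compact metric space, the space $C(X)$ of continuous functions from $X$ to $\mathbb{R}$ is separable (with the topology of uniform convergence, equivalently, the supremum norm). Let $\{f_n\}_{n\in\mathbb{N}}$ be a countable dense set in $C(X)$. For each $n$, the conditional expectation $\mathbb{E}[f_n\mid\mathcal{D}]$ is defined $\mu$-a.e. on $X$. Thus there is a set $X_0\subset X$ of full measure such that $\mathbb{E}[f_n\mid\mathcal{D}](x)$ is defined on $X_0$ for all $n$.

For each $x\in X_0$ define $L_x(f_n)=\mathbb{E}[f_n\mid\mathcal{D}](x)$. Note that, by Proposition 3 we have

$$|L_x(f_n)|\leq\|f_n\|_\infty.$$

Thus $L_x$ can be extended to a continuous linear functional on $C(X)$ (to make this precise, one can take $\{f_n\}$ closed under rational linear combinations and remove a further null set from $X_0$ so that $L_x$ is $\mathbb{Q}$-linear on $\{f_n\}$). By the Riesz representation theorem there exists a measure $\mu_x$ on $(X,\mathcal{B})$ such that $L_x(f)=\int_X f\,d\mu_x$ for every $f\in C(X)$.

For each $n$ we have that $\int_X f_n\,d\mu_x=\mathbb{E}[f_n\mid\mathcal{D}](x)$, hence the function $x\mapsto\int_X f_n\,d\mu_x$ is in $L^1(X,\mathcal{D},\mu)$. Since $\{f_n\}$ is a dense set in $C(X)$ and $C(X)$ is a dense set in $L^1(X,\mathcal{B},\mu)$, we conclude that $\{f_n\}$ is a dense set in $L^1(X,\mathcal{B},\mu)$. It follows from Proposition 5 that the function $x\mapsto\int_X f\,d\mu_x$ is in $L^1(X,\mathcal{D},\mu)$ for any $f\in C(X)$.

Finally for each $n$ we have

$$\mathbb{E}[f_n\mid\mathcal{D}](x)=L_x(f_n)=\int_X f_n\,d\mu_x,$$

and since the sequence $\{f_n\}$ is a dense set in $C(X)$ we conclude that (3) holds for any $f\in C(X)$. Now given $f\in L^1(X,\mathcal{B},\mu)$, one can find $g_k\in C(X)$ such that $f=\sum_k g_k$ a.e. and $\sum_k\|g_k\|_{L^1}<\infty$. From Proposition 5 this implies that $\sum_k\big\|\mathbb{E}\big[|g_k|\,\big|\,\mathcal{D}\big]\big\|_{L^1}<\infty$ and hence $\sum_k\mathbb{E}\big[|g_k|\,\big|\,\mathcal{D}\big](x)<\infty$ a.e. For each $x\in X_0$ for which this series converges we have

$$\int_X|f|\,d\mu_x\leq\sum_k\int_X|g_k|\,d\mu_x=\sum_k\mathbb{E}\big[|g_k|\,\big|\,\mathcal{D}\big](x)<\infty.$$

We conclude that $f\in L^1(\mu_x)$. Moreover, since (3) holds for each $g_k$, it is easy to deduce that (3) holds for $f$.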
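
In the finite setting the measures in the theorem can be written down explicitly, which may help to see what the statement says. The sketch below is an illustration and not the construction used in the proof: it assumes a five-point space whose sub-$\sigma$-algebra is generated by a partition, and takes the conditional measure at a point to be the original measure restricted to the atom of that point and normalized. It then checks that integrating against this measure gives the atom average (the conditional expectation), and that averaging the conditional measures recovers the original measure. All names and values are hypothetical.

```python
from fractions import Fraction

points = ["a", "b", "c", "d", "e"]
mu = dict(zip(points, [Fraction(k, 15) for k in (1, 2, 3, 4, 5)]))
atoms = [{"a", "b"}, {"c"}, {"d", "e"}]          # partition generating D
f = dict(zip(points, map(Fraction, (4, 1, -3, 0, 9))))

def atom_of(x):
    """The atom of the partition containing x."""
    return next(A for A in atoms if x in A)

def mu_x(x):
    """Conditional measure at x: mu restricted to the atom of x, normalized."""
    A = atom_of(x)
    mass = sum(mu[y] for y in A)
    return {y: (mu[y] / mass if y in A else Fraction(0)) for y in points}

def integrate(h, measure):
    return sum(h[y] * measure[y] for y in points)

def cond_exp(h, x):
    """Conditional expectation at x, computed directly as an atom average."""
    A = atom_of(x)
    return sum(h[y] * mu[y] for y in A) / sum(mu[y] for y in A)

# Finite version of identity (3): integrating against mu_x gives E[f | D](x).
identity_holds = all(integrate(f, mu_x(x)) == cond_exp(f, x) for x in points)

# Averaging the conditional measures over mu recovers mu itself.
recovered = {y: sum(mu_x(x)[y] * mu[x] for x in points) for y in points}
```

Note that `mu_x(x)` depends only on the atom containing `x`, so the map from points to conditional measures is measurable with respect to the smaller $\sigma$-algebra.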


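I am not sure how the last steps work. So $\int_X f_n\,d\mu_x$ is equal to $\mathbb{E}[f_n\mid\mathcal{D}](x)$. In order to prove this equality for all $f\in L^1(X,\mathcal{B},\mu)$, choose a subsequence $f_{n_k}$ that converges to $f$ in the $L^1$ sense. I see that $\mathbb{E}[f_{n_k}\mid\mathcal{D}]$ converges to $\mathbb{E}[f\mid\mathcal{D}]$. But I don’t see how $\int_X f_{n_k}\,d\mu_x$ converges to $\int_X f\,d\mu_x$.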

Hm, the way I wrote it, it’s not very clear.

I think what I had in mind was to prove that first for every $f\in C(X)$ and then invoke the Riesz representation theorem again. I will rewrite that part to make it clearer.
