## Disintegration of measures

In this post I will talk about conditional expectation and disintegration of a measure with respect to a ${\sigma}$-algebra. All of this is classical probability theory, but I think not many people (myself included) come across it in a standard course in probability. These tools and ideas are quite useful in ergodic theory; an example is the proof of the ergodic decomposition theorem that I present in this post.

— 1. Conditional expectation —

I will start by trying to give some intuition about the notion of conditional expectation with an example. Suppose that you want to know how many people will visit a particular beach on a given day (say you sell ice-cream). This is a random variable ${X}$. The first approximation you can make for the expected value for ${X}$ is the statistical average. However, a better approximation for ${X}$ can be made if you take other factors into account. For instance, let’s say that the number of visitors depends on the temperature of the given day and suppose that you know at the beginning of the day what the temperature will be. Thus, what you want to know is how many visitors to expect, conditional on the event that the temperature is ${t}$ (say). Let’s call the random variable that represents the temperature ${T}$. The expected number of visitors on a day where the temperature is ${t}$ can be denoted by ${\mathop{\mathbb E}[X\mid T=t]}$ or ${\mathop{\mathbb E}[X\mid T](t)}$. The function ${\mathop{\mathbb E}[X\mid T]}$ is what is called the conditional expectation of ${X}$ with respect to ${T}$.

Note that ${\mathop{\mathbb E}[X\mid T]}$ (or, more precisely, ${\mathop{\mathbb E}[X\mid T](T)}$) can also be thought of as a random variable (it is a number that depends on the random temperature ${T}$). Indeed, this random variable is the best approximation for ${X}$ if all you know is the temperature ${T}$. It is not hard to see that the expectation of ${\mathop{\mathbb E}[X\mid T]}$ (the average, over all possible temperatures, of the expected number of visitors on a day with that temperature) is exactly the expectation of ${X}$ (the expected number of visitors, regardless of temperature).
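This last identity, ${\mathop{\mathbb E}\big[\mathop{\mathbb E}[X\mid T]\big]=\mathop{\mathbb E}[X]}$ (the tower property), can be checked empirically. Here is a minimal simulation sketch; the model for temperatures and visitor counts is entirely made up for illustration:

```python
import random

random.seed(0)

# Hypothetical model: the temperature T is uniform on {15, 20, 25, 30}
# degrees, and the visitor count X grows with T plus random noise.
samples = []
for _ in range(100_000):
    t = random.choice([15, 20, 25, 30])
    x = 40 * t + random.randint(-100, 100)   # visitors given temperature t
    samples.append((t, x))

# E[X | T = t]: the average of X over the days with temperature t.
by_temp = {}
for t, x in samples:
    by_temp.setdefault(t, []).append(x)
cond_exp = {t: sum(xs) / len(xs) for t, xs in by_temp.items()}

# Averaging E[X | T] over the (empirical) distribution of T recovers E[X].
lhs = sum(len(xs) / len(samples) * cond_exp[t] for t, xs in by_temp.items())
rhs = sum(x for _, x in samples) / len(samples)
print(abs(lhs - rhs) < 1e-6)  # True
```

The equality here is exact (up to floating point), because averaging the group means weighted by the group frequencies always recovers the overall sample mean.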

Definition 1 Let ${(X,{\cal B},\mu)}$ be a probability space, let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra and let ${f\in L^1({\cal B})}$ be a random variable. The conditional expectation of ${f}$ with respect to ${{\cal D}}$ is the function ${\mathop{\mathbb E}[f\mid{\cal D}]\in L^1({\cal D})}$ such that for all ${D\in{\cal D}}$ we have

$\displaystyle \int_Dfd\mu=\int_D\mathop{\mathbb E}[f\mid{\cal D}]d\mu$

Informally, the conditional expectation ${\mathop{\mathbb E}[f\mid{\cal D}]}$ is the function in ${L^1(X,{\cal D})}$ that best approximates ${f}$. This sentence is made precise in Theorem 6 below. From an information theory point of view, ${\mathop{\mathbb E}[f\mid{\cal D}](x)}$ is the best guess of the value of ${f(x)}$ when all the information we have is ${{\cal D}}$. For instance, if we have no information at all (so ${{\cal D}=\{\emptyset,X\}}$), then ${\mathop{\mathbb E}[f\mid{\cal D}]}$ is the constant function ${\mathop{\mathbb E}[f]}$, and that's the best guess one can have for ${f(x)}$. At the other extreme, when we have complete information (i.e. when ${{\cal D}={\cal B}}$) then ${\mathop{\mathbb E}[f\mid{\cal D}]=f}$ and our “guess” for ${f(x)}$ is ${f(x)}$ itself.
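To make Definition 1 concrete, here is a minimal sketch on a finite probability space (all numbers are hypothetical), with ${{\cal D}}$ generated by a partition into atoms: the conditional expectation is constant on each atom, equal to the ${\mu}$-weighted average of ${f}$ there, and the defining integral identity can be checked directly on the atoms.

```python
# Finite probability space with four points and the sigma-algebra D
# generated by the partition {{0,1},{2,3}}; all values are made up.
X = [0, 1, 2, 3]
mu = {x: 0.25 for x in X}            # uniform probability measure
f = {0: 2.0, 1: 4.0, 2: 1.0, 3: 3.0}
atoms = [[0, 1], [2, 3]]             # atoms generating D

def cond_exp(f, mu, atoms):
    g = {}
    for atom in atoms:
        mass = sum(mu[x] for x in atom)
        avg = sum(f[x] * mu[x] for x in atom) / mass
        for x in atom:
            g[x] = avg               # E[f|D] is constant on each atom
    return g

g = cond_exp(f, mu, atoms)

# Defining property: the integrals of f and E[f|D] agree on every set in D;
# by additivity it suffices to check the atoms.
for atom in atoms:
    assert sum(f[x] * mu[x] for x in atom) == sum(g[x] * mu[x] for x in atom)
print(g[0], g[2])  # 3.0 2.0
```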

We need to show that the conditional expectation exists and is unique in the space ${L^1({\cal D})}$:

Proposition 2 Let ${(X,{\cal B},\mu)}$ be a probability space, let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra and let ${f\in L^1({\cal B})}$ be a random variable. The conditional expectation ${\mathop{\mathbb E}[f\mid{\cal D}]\in L^1({\cal D})}$ exists and is unique.

Proof: To prove existence we define the complex measure ${\nu:{\cal D}\rightarrow{\mathbb C}}$ by ${\nu(D)=\int_Dfd\mu}$ for every ${D\in{\cal D}}$ (if you are not comfortable with complex measures, split ${f=f_1-f_2+if_3-if_4}$ with each ${f_i}$ a non-negative real valued function and apply the proof to each ${f_i}$ separately). It is easy to check that this is indeed a complex measure. Moreover if ${D\in{\cal D}}$ is such that ${\mu(D)=0}$ then ${\nu(D)=0}$ as well. In other words we have ${\nu\ll\mu}$. Therefore we can apply the Radon-Nikodym theorem to find a derivative ${g=\frac{d\nu}{d\mu}\in L^1({\cal D})}$. By definition of ${g}$ we have that for every ${D\in{\cal D}}$

$\displaystyle \int_Dgd\mu=\nu(D)=\int_Dfd\mu$

Thus ${g}$ is a conditional expectation of ${f}$ with respect to ${{\cal D}}$. To prove uniqueness, assume that ${g,h\in L^1({\cal D})}$ are both conditional expectations of ${f}$ with respect to ${{\cal D}}$. Then for each ${D\in{\cal D}}$ we have ${\int_Dhd\mu=\int_Dfd\mu=\int_Dgd\mu}$, which implies ${\int_D(h-g)d\mu=0}$.

If the set ${\{x\in X:h(x)-g(x)\neq0\}}$ has positive measure then without loss of generality the set ${\{x\in X:h(x)-g(x)>0\}}$ has positive measure. Hence there is some ${\epsilon>0}$ such that the set ${D:=\{x\in X:h(x)-g(x)>\epsilon\}}$ has positive measure. But since both ${h}$ and ${g}$ are measurable in ${{\cal D}}$ we conclude that ${D\in{\cal D}}$ and hence

$\displaystyle 0=\int_D(h-g)d\mu\geq\epsilon\mu(D)>0$

which is a contradiction. This shows the uniqueness of ${\mathop{\mathbb E}[f\mid{\cal D}]}$. $\Box$

We will need the following basic fact about conditional expectations:

Proposition 3 Let ${(X,{\cal B},\mu)}$ be a probability space, let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra and let ${f\in L^1({\cal B})}$ be a real-valued random variable. The conditional expectation ${\mathop{\mathbb E}[f\mid{\cal D}]}$ satisfies

$\displaystyle \inf_{x\in X}f(x)\leq\mathop{\mathbb E}[f\mid{\cal D}](y)\leq\sup_{x\in X}f(x)\qquad a.s.$

More generally, if ${f}$ takes values in a closed convex subset of a Banach space, then ${\mathop{\mathbb E}[f\mid{\cal D}]}$ also takes values in that set, but the proof is technically more cumbersome.

Proof: We prove only the first inequality, the second can be easily derived from the first one by considering the random variable ${-f(x)}$. Fix ${\epsilon>0}$ and let ${\displaystyle D:=\left\{y\in X:\mathop{\mathbb E}\big[f\mid{\cal D}\big](y)<\inf_{x\in X}f(x)-\epsilon\right\}\in{\cal D}}$. We have

$\displaystyle \mu(D)\left(\inf_{x\in X}f(x)-\epsilon\right)\geq\int_D\mathop{\mathbb E}\big[f\mid{\cal D}\big]d\mu=\int_Dfd\mu\geq\mu(D)\inf_{x\in X}f(x)$

which simplifies to ${\epsilon\mu(D)\leq0}$, and thus ${\mu(D)=0}$. Since ${\epsilon>0}$ was arbitrary we conclude that almost surely ${\inf_{x\in X}f(x)\leq\mathop{\mathbb E}[f\mid{\cal D}](y)}$. $\Box$

Lemma 4 Let ${(X,{\cal B},\mu)}$ be a probability space, let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra and let ${f\in L^1(X,{\cal B})}$. Then almost everywhere

$\displaystyle \left|\mathop{\mathbb E}[f\mid{\cal D}]\right|\leq\mathop{\mathbb E}[|f|\mid{\cal D}]\ \ \ \ \ (1)$

Proof: Let ${A:=\{x\in X:\mathop{\mathbb E}[f\mid{\cal D}](x)>0\}}$ and ${B:=X\setminus A=\{x\in X:\mathop{\mathbb E}[f\mid{\cal D}](x)\leq0\}}$. Let ${D}$ be the set of points where inequality (1) fails; note that ${D\in{\cal D}}$, since both sides of (1) are ${{\cal D}}$-measurable. Let ${D^+=D\cap A}$ and let ${D^-=D\cap B}$. We have

$\displaystyle \begin{array}{rcl} \displaystyle\int_D\left|\mathop{\mathbb E}[f\mid{\cal D}]\right|d\mu&=&\displaystyle\int_{D^+}\left|\mathop{\mathbb E}[f\mid{\cal D}]\right|d\mu+\int_{D^-}\left|\mathop{\mathbb E}[f\mid{\cal D}]\right|d\mu\\&=&\displaystyle\left|\int_{D^+}\mathop{\mathbb E}[f\mid{\cal D}]d\mu\right|+\left|\int_{D^-}\mathop{\mathbb E}[f\mid{\cal D}]d\mu\right|\\&=&\displaystyle\left|\int_{D^+}fd\mu\right|+\left|\int_{D^-}fd\mu\right|\\&\leq&\displaystyle\int_{D^+}|f|d\mu+\int_{D^-}|f|d\mu\\&=&\displaystyle\int_{D^+}\mathop{\mathbb E}[|f|\mid{\cal D}]d\mu+\int_{D^-}\mathop{\mathbb E}[|f|\mid{\cal D}]d\mu\\&=&\displaystyle\int_D\mathop{\mathbb E}[|f|\mid{\cal D}]d\mu\end{array}$

On ${D}$ the left-hand side of (1) strictly exceeds the right-hand side, so if ${\mu(D)>0}$ the first integral above would be strictly larger than the last one, contradicting the chain of (in)equalities just obtained. We conclude that ${\mu(D)=0}$, as desired.

$\Box$

Proposition 5 Let ${(X,{\cal B},\mu)}$ be a probability space and let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra. The operator ${\mathop{\mathbb E}[\,\cdot\mid{\cal D}]:L^1(X,{\cal B})\rightarrow L^1(X,{\cal D})}$ is continuous.

Proof: We show that the operator actually has norm ${1}$ (the computation below shows the norm is at most ${1}$, and taking ${f\equiv1}$ shows it is attained): Let ${f\in L^1(X,{\cal B})}$. By Lemma 4 we have

$\displaystyle \left\|\mathop{\mathbb E}[f\mid{\cal D}]\right\|=\int_X\left|\mathop{\mathbb E}[f\mid{\cal D}]\right|d\mu\leq\int_X\mathop{\mathbb E}\left[|f|\mid{\cal D}\right]d\mu=\int_X|f|d\mu=\|f\|$

$\Box$
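On the same kind of finite example as before one can watch the contraction at work; this sketch (with arbitrary made-up data) computes ${\mathop{\mathbb E}[f\mid{\cal D}]}$ as atom-wise weighted averages and compares ${L^1}$ norms:

```python
# Finite-space check of Proposition 5 (and Lemma 4): atom-wise averaging
# contracts the L^1(mu) norm. All data here are arbitrary.
mu = [0.25, 0.25, 0.25, 0.25]
f = [3.0, -5.0, 2.0, -1.0]
atoms = [[0, 1], [2, 3]]                 # partition generating D

Ef = [0.0] * 4                           # E[f | D], constant on each atom
for atom in atoms:
    avg = sum(f[i] * mu[i] for i in atom) / sum(mu[i] for i in atom)
    for i in atom:
        Ef[i] = avg

def l1(g):                               # L^1(mu) norm
    return sum(abs(g[i]) * mu[i] for i in range(4))

print(l1(Ef), "<=", l1(f))  # 0.75 <= 2.75
assert l1(Ef) <= l1(f)
```

The cancellation inside each atom (positive and negative values of ${f}$ averaging out) is exactly why the inequality can be strict.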

Finally I will present another way to think about conditional expectation for a function ${f\in L^2(X,{\cal B})}$. In this case we can use the Hilbert space structure to give a different characterization of the conditional expectation.

Theorem 6 Let ${(X,{\cal B},\mu)}$ be a probability space and let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra. Let ${P:L^2(X,{\cal B})\rightarrow L^2(X,{\cal D})}$ be the orthogonal projection (observe that ${L^2(X,{\cal D})}$ is a closed subspace of ${L^2(X,{\cal B})}$). Then for every ${f\in L^2(X,{\cal B})}$ we have ${\mathop{\mathbb E}[f\mid{\cal D}]=Pf}$.

Proof: By definition of orthogonal projection, for any function ${g\in L^2(X,{\cal D})}$ we have ${\langle f-Pf,g\rangle=0}$. If ${D\in{\cal D}}$ then the indicator function ${1_D}$ of ${D}$ is in ${L^2(X,{\cal D})}$. Therefore

$\displaystyle \int_DPfd\mu=\int_X1_DPfd\mu=\langle 1_D,Pf\rangle=\langle 1_D,f\rangle=\int_X1_Dfd\mu=\int_Dfd\mu$

and hence, by the uniqueness in Proposition 2, ${Pf=\mathop{\mathbb E}[f\mid{\cal D}]}$ as desired. $\Box$
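Theorem 6 can be seen concretely in finite dimensions, where everything reduces to linear algebra. In the sketch below (data again hypothetical), ${{\cal D}}$ is generated by a partition, so ${L^2(X,{\cal D})}$ is the span of the atom indicators; the atom-wise average is verified to be the orthogonal projection: the residual ${f-Pf}$ is orthogonal to ${L^2(X,{\cal D})}$, and no other ${{\cal D}}$-measurable function is closer to ${f}$.

```python
import random

# Finite probability space; D is generated by the partition {{0,1},{2,3,4}}.
mu = [0.1, 0.2, 0.3, 0.15, 0.25]         # hypothetical probability weights
f = [1.0, -2.0, 3.0, 0.5, 4.0]
atoms = [[0, 1], [2, 3, 4]]

# Pf = E[f | D]: the mu-weighted average of f on each atom.
Pf = [0.0] * len(f)
for atom in atoms:
    mass = sum(mu[i] for i in atom)
    avg = sum(f[i] * mu[i] for i in atom) / mass
    for i in atom:
        Pf[i] = avg

def dist2(g):                            # squared L^2(mu) distance to f
    return sum((f[i] - g[i]) ** 2 * mu[i] for i in range(len(f)))

# f - Pf is orthogonal to every atom indicator, hence to all of L^2(X, D).
for atom in atoms:
    assert abs(sum((f[i] - Pf[i]) * mu[i] for i in atom)) < 1e-12

# Pf is the closest D-measurable function to f: any other assignment of
# constants to the atoms is at least as far away.
random.seed(1)
for _ in range(100):
    g = [0.0] * len(f)
    for atom in atoms:
        c = random.uniform(-5, 5)
        for i in atom:
            g[i] = c
    assert dist2(g) >= dist2(Pf)
print("projection checks passed")
```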

— 2. Disintegration of measures —

A good example to keep in mind when talking about disintegration of measures is the following: Let ${X=\{(x,y)\in[0,1]^2:y\leq x\}}$ be the lower triangle in the unit square, let ${{\cal B}}$ be the Borel ${\sigma}$-algebra over ${X}$ and let ${\mu}$ be the ${2}$-dimensional Lebesgue measure. Now let ${{\cal D}}$ be the ${\sigma}$-algebra defined by ${D\in{\cal D}}$ if and only if ${D\in{\cal B}}$ and ${D}$ is a union of vertical segments (more precisely, for every point ${(x,y)\in D}$ we have ${(x,z)\in D}$ for all ${z\in[0,x]}$).

Let ${\nu}$ be the restriction of ${\mu}$ to ${{\cal D}}$, let ${{\cal A}}$ be the Borel ${\sigma}$-algebra on ${[0,1]}$ and let ${\lambda}$ be the measure on ${[0,1]}$ with density ${f(x)=x}$ with respect to the Lebesgue measure. Then the measure space ${(X,{\cal D},\nu)}$ is equivalent to ${([0,1],{\cal A},\lambda)}$ (both have total mass ${1/2}$). More precisely, the map ${\pi:(x,y)\mapsto x}$ from ${(X,{\cal D},\nu)}$ to ${([0,1],{\cal A},\lambda)}$ is an isomorphism of measure spaces. The meaning of this is quite intuitive: ${\pi}$ induces a bijection of the ${\sigma}$-algebras that matches sets of the same measure.

Moreover, we can recover the measure ${\mu}$ on ${{\cal B}}$ by the formula

$\displaystyle \int_X fd\mu=\int_{[0,1]}\left(\int_{[0,x]}f(x,y)d\ell_x(y)\right)d\lambda(x)\qquad\qquad f\in L^1(X,\mu) \ \ \ \ \ (2)$

where ${\ell_x}$ is a probability measure on ${\pi^{-1}(\{x\})=\{x\}\times[0,x]}$. More precisely, ${\ell_x}$ is the normalized Lebesgue measure on ${\{x\}\times[0,x]}$, so that ${\ell_x([a,b])=(b-a)/x}$ for any interval ${[a,b]\subset[0,x]}$.

Note that if we started with the unit square instead of the triangle, then all the measures ${\ell_x}$ would be the same, and equation (2) would essentially be Fubini’s theorem. Thus the disintegration of measures can be viewed as an inverted version of Fubini’s theorem.
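Formula (2) can be sanity-checked numerically. The sketch below (the test function ${f(x,y)=xy}$ and the grid sizes are arbitrary choices) approximates both sides by midpoint sums; the exact common value is ${\int_0^1\int_0^x xy\,dy\,dx=1/8}$.

```python
# Numerical check of formula (2) on the triangle {0 <= y <= x <= 1},
# with the (arbitrary) test function f(x, y) = x * y.
N = 500            # grid resolution; midpoint rule throughout

def f(x, y):
    return x * y

# Left side of (2): double integral of f over the triangle against
# 2-dimensional Lebesgue measure, summing over grid cells whose
# midpoints lie in the triangle.
h = 1.0 / N
lhs = sum(f((i + 0.5) * h, (j + 0.5) * h) * h * h
          for i in range(N) for j in range(N) if j <= i)

# Right side of (2): for each x, the fiber integral against l_x (the
# normalized Lebesgue measure on [0, x]) is the average of f(x, .) over
# the fiber; then integrate against d(lambda) = x dx.
def fiber_avg(x, m=200):
    return sum(f(x, (k + 0.5) * x / m) for k in range(m)) / m

rhs = sum(fiber_avg((i + 0.5) * h) * ((i + 0.5) * h) * h for i in range(N))

exact = 1 / 8
print(abs(lhs - exact) < 1e-3, abs(rhs - exact) < 1e-3)
```

Both midpoint sums land within about ${10^{-3}}$ of ${1/8}$; the dominant error on the left side comes from the half-counted cells along the diagonal.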

Theorem 7 Let ${X}$ be a compact metric space, let ${{\cal B}}$ be the ${\sigma}$-algebra of Borel sets on ${X}$ and let ${\mu:{\cal B}\rightarrow[0,1]}$ be a probability measure. Let ${{\cal D}\subset{\cal B}}$ be a ${\sigma}$-algebra. Then for almost every ${y\in X}$ there exists a probability measure ${\mu_y}$ on ${(X,{\cal B})}$ such that for every ${f\in L^1(X,{\cal B},\mu)}$:

• The function ${y\mapsto\int_Xf(x)d\mu_y(x)}$ is in ${L^1(X,{\cal D})}$.
• $\displaystyle \int_Xf(x)d\mu(x)=\int_X\left(\int_Xf(x)d\mu_y(x)\right)d\mu(y) \ \ \ \ \ \ \ \ \ \ \ (3)$

This result applies more generally than to compact metric spaces, but this restriction makes the proof technically easier.

Proof: Since ${X}$ is a compact metric space, the space ${C(X)}$ of continuous functions from ${X}$ to ${{\mathbb R}}$ is separable (with the topology of uniform convergence, equivalently, the supremum norm). Let ${(f_n)_{n=1}^\infty}$ be a countable dense set in ${C(X)}$. For each ${n\in{\mathbb N}}$, the conditional expectation ${\mathop{\mathbb E}[f_n\mid{\cal D}]}$ is defined ${\mu}$-a.e. on ${X}$. Thus there is a set of full measure ${Y\subset X}$ such that ${\mathop{\mathbb E}[f_n\mid{\cal D}]}$ is defined on ${Y}$ for all ${n\in{\mathbb N}}$.

For each ${y\in Y}$ define ${L_y(f_n)=\mathop{\mathbb E}[f_n\mid{\cal D}](y)}$. Note that, by Proposition 3 we have

$\displaystyle \left|L_y(f_n)\right|=\left|\mathop{\mathbb E}[f_n\mid{\cal D}](y)\right|\leq\sup_{x\in X}\left|\mathop{\mathbb E}[f_n\mid{\cal D}](x)\right|\leq\sup_{x\in X}\left|f_n(x)\right|=\|f_n\|_\infty$

Thus ${L_y}$ can be extended to a continuous linear functional on ${C(X)}$. Moreover, by Proposition 3 the functional ${L_y}$ is positive, and ${L_y(1)=\mathop{\mathbb E}[1\mid{\cal D}](y)=1}$; hence by the Riesz representation theorem there exists a probability measure ${\mu_y}$ on ${X}$ such that ${L_y(f)=\int_Xfd\mu_y}$.

For each ${n\in{\mathbb N}}$ we have that ${\int_Xf_n(x)d\mu_y(x)=L_y(f_n)=\mathop{\mathbb E}[f_n\mid{\cal D}](y)}$, hence the function ${y\mapsto\int_Xf_n(x)d\mu_y(x)}$ is in ${L^1(X,{\cal D})}$. Since ${(f_n)_{n=1}^\infty}$ is a dense set in ${C(X)}$ and ${C(X)}$ is a dense set in ${L^1(X,{\cal B})}$, we conclude that ${(f_n)_{n=1}^\infty}$ is a dense set in ${L^1(X,{\cal B})}$. It follows from Proposition 5 that the function ${y\mapsto\int_Xf(x)d\mu_y(x)}$ is in ${L^1(X,{\cal D})}$ for any ${f\in L^1(X,{\cal B})}$.

Finally for each ${n\in{\mathbb N}}$ we have

$\displaystyle \int_X\int_Xf_n(x)d\mu_y(x)d\mu(y)=\int_XL_y(f_n)d\mu(y)=\int_X\mathop{\mathbb E}[f_n\mid{\cal D}]d\mu=\int_Xf_nd\mu$

and since the sequence ${(f_n)_{n=1}^\infty}$ is dense in ${C(X)}$ we conclude that (3) holds for any ${f\in C(X)}$. Now given ${f\in L^1(X,{\cal B})}$, one can find ${g_1,g_2,\dots\in C(X)}$ such that ${f=\sum_{i=1}^\infty g_i}$ a.e. and ${\sum_{i=1}^\infty\|g_i\|_{L^1}<\infty}$. By Proposition 5 this implies that ${\sum_{i=1}^\infty\left\|\mathop{\mathbb E}\big[|g_i|\mid{\cal D}\big]\right\|_{L^1}<\infty}$, and hence ${\sum_{i=1}^\infty\mathop{\mathbb E}\big[|g_i|\mid{\cal D}\big]<\infty}$ a.e. For each ${y\in X}$ for which this series converges we have

$\displaystyle \int_X|f|d\mu_y=\int_X\left|\sum_{i=1}^\infty g_i\right|d\mu_y\leq\sum_{i=1}^\infty\int_X|g_i|d\mu_y=\sum_{i=1}^\infty\mathop{\mathbb E}\big[|g_i|\mid{\cal D}\big](y)<\infty$

We conclude that ${f\in L^1(X,{\cal B},\mu_y)}$. Moreover, since (3) holds for each ${g_i}$, it is easy to deduce that (3) holds for ${f}$. $\Box$
