## Entropy of measure preserving systems

A measure preserving system is a quadruple ${(X,{\mathcal B},\mu,T)}$ where ${X}$ is a set, ${{\mathcal B}}$ is a ${\sigma}$-algebra, ${\mu:{\mathcal B}\rightarrow[0,1]}$ is a probability measure and ${T:X\rightarrow X}$ is a measurable map satisfying ${\mu(T^{-1}A)=\mu(A)}$ for every ${A\in{\mathcal B}}$.

The notion of isomorphism in the category of measure preserving systems (defined, for instance, in this earlier post) is fairly flexible, as it allows ${0}$ measure perturbations. For instance, the system ${([0,1],{\mathcal B},\lambda,T_2)}$ where ${\lambda}$ is the Lebesgue measure and ${T_2x=2x\bmod1}$ is isomorphic to the Bernoulli system ${(\{0,1\}^{\mathbb N},\mu,\sigma)}$, where ${\sigma:\{0,1\}^{\mathbb N}\rightarrow\{0,1\}^{\mathbb N}}$ is the left shift and ${\mu}$ is the infinite product of the measure ${p}$ on ${\{0,1\}}$ defined so that ${p(\{0\})=p(\{1\})=1/2}$.

However, this flexibility can make it quite difficult to decide whether two given systems are isomorphic or not. While several spectral invariants can be used, such as ergodicity, weakly mixing or strongly mixing, they are not sufficient to distinguish even between two Bernoulli shifts. To address this problem, Kolmogorov and Sinai developed the notion of entropy in the late 1950’s. Since then, the entropy of measure preserving systems, together with its sister in topological dynamics, has became a fundamental tool in understanding the structure of dynamical systems, with applications far beyond distinguishing non-isomorphic systems.

In this post I will define and then briefly survey some of the basic properties of entropy. I’ll mostly follow the exposition in the upcoming book of Einsiedler, Lindenstrauss and Ward, whose draft is available here; other good sources are Walter’s book and, in a somewhat different perspective, Austin’s recent notes.

— 1. Defining entropy —

The definition of entropy can be a bit intimidating the first time one sees it. In later sections I try to present some ways to understand the notion of entropy, but I decided it would be useful to start with its raw definition.

The theory of entropy is often developed using partitions but I personally prefer to work with ${\sigma}$-algebras, and the following proposition says that there is really no difference.

Proposition 1 A ${\sigma}$-algebra ${\xi}$ in a space ${X}$ is finite if and only if it is generated by a finite partition of ${X}$. Moreover, the finite partition is unique.

Proof: If there exists such a partition then ${\xi}$ is finite. Conversely, if ${\xi}$ is finite, let ${{\mathcal A}}$ be the collection of all non-empty sets ${A}$ in ${\xi}$ such that any proper subset of ${A}$ is not in ${\xi}$. The elements of ${{\mathcal A}}$ form a partition of ${X}$ and the ${\sigma}$-algebra generated by ${{\mathcal A}}$ can be easily shown to equal ${\xi}$.

Since elements of ${{\mathcal A}}$ are disjoint, ${\xi}$ consists of all unions of elements of ${{\mathcal A}}$. If ${\tilde{\mathcal A}}$ is another partition generating ${\xi}$, then for every ${A\in{\mathcal A}}$ and ${\tilde A\in\tilde{\mathcal A}}$, the intersection is a subset of ${A}$ which belongs to ${\xi}$, and hence ${A\cap\tilde A}$ is either equal to ${A}$ or empty; and also it’s equal to ${\tilde A}$ or empty. But it can not be empty for every ${A\in{\mathcal A}}$ because ${{\mathcal A}}$ forms a partition, so we conclude that ${\tilde A\in {\mathcal A}}$ and hence ${\tilde {\mathcal A}\subset{\mathcal A}}$ and by symmetry ${\tilde {\mathcal A}={\mathcal A}}$.

$\Box$

Definition 2 Given two ${\sigma}$-algebras ${\xi,\eta}$ in a space ${X}$, we denote by ${\xi\vee \eta}$ the ${\sigma}$-algebra generated by their union ${\xi\cup\eta}$. More generally, given a collection ${(\xi_i)_{i\in I}}$ of ${\sigma}$-algebras, we denote by ${\bigvee_{i\in I}\xi_i}$ the ${\sigma}$-algebra generated by ${\bigcup_{i\in I}\xi_i}$.

Observe that if ${{\mathcal A}}$ is the generating partition for ${\xi}$ and ${{\mathcal B}}$ is the generating partition for ${\eta}$, then the generating partition for ${\xi\vee\eta}$ consists of all non-empty sets of the form ${A\cap B}$ with ${A\in{\mathcal A}}$ and ${B\in{\mathcal B}}$.

Next let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system. For each ${\sigma}$-algebra ${\xi\subset{\mathcal B}}$ we denote by ${T^{-1}\xi}$ the ${\sigma}$-algebra consisting of all sets ${T^{-1}A}$ with ${A\in\xi}$. We are now ready to define the entropy of a measure preserving system.

Definition 3 Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system, let ${\xi\subset{\mathcal B}}$ be a finite ${\sigma}$-algebra and let ${{\mathcal A}\subset\xi}$ be its generating partition.

• The entropy of ${\xi}$ is the number

$\displaystyle H_\mu(\xi)=-\sum_{A\in{\mathcal A}}\mu(A)\log\big(\mu(A)\big).$

Since ${x\log x\rightarrow0}$ as ${x\rightarrow0}$, we interpret ${\mu(A)\log\big(\mu(A)\big)}$ as ${0}$ whenever ${\mu(A)=0}$.

• ${\displaystyle h_\mu(T,\xi)=\inf_{n\geq1}\frac1n H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)}$
• ${\displaystyle h_\mu(T)=\sup\big\{h_\mu(T,\xi): \xi\subset{\mathcal B}\text{ is a finite }\sigma\text{-algebra}\big\}}$

The quantity ${h_\mu(T,\xi)}$ is usually defined with a limit instead of an infimum. Indeed the sequence ${n\mapsto H_\mu\left(\bigvee_{i=0}^nT^{-i}\xi\right)}$ can be shown to be subadditive and hence the fact that ${ h_\mu(T,\xi)=\lim_{n\rightarrow\infty}\frac1n H_\mu\left(\bigvee_{i=0}^nT^{-i}\xi\right)}$ (and in particular that the limit exists) follows from Fekete’s lemma (we will return to this point after Lemma 7 below).

— 2. Entropy as information —

The word “entropy” has been used in thermodynamics since the 1800’s, and was adapted by Shannon as a substitute for “information”. According to Shannon, the idea to use the word entropy came from von Neumman:

I thought of calling it “information”, but the word was overly used, so I decided to call it “uncertainty”. […] Von Neumann told me, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”

I will now try to briefly explain why the quantity defined in Definition 3 can be seen as quantifying “information”. Consider the following scenario: Alice chooses a point ${x}$ in a set ${X}$ of cardinality ${|X|=n}$ and Bob has to guess ${x}$ by asking ${k}$ yes or no questions (which Alice answers truthfully). It’s not hard to see that Bob can always determine the value of ${x}$ if and only if ${k\geq\log n}$ (where ${\log}$ means the base ${2}$ logarithm) by asking questions of the form “Is ${x}$ in ${A}$?” for subsets ${A\subset X}$ of cardinality ${n/2}$. Therefore we say that determining the exact value of ${x}$ requires ${\log n}$ “bits” of information, and that determining whether ${x}$ is in a set ${A\subset X}$ of cardinality ${n/2}$ requires one bit of information. If ${A\subset X}$ has cardinality ${|A|=n/4}$, then knowing that ${x\in A}$ should provide ${2}$ bits of information. More generally, if ${A\subset X}$ has ${|A|=k}$ then knowing that ${x\in A}$ provides ${\log(n/k)}$ bits of information.

Next suppose that we are given a function ${f:X\rightarrow{\mathbb R}}$, and now Alice chooses ${x\in X}$ randomly with the uniform distribution and tells Bob the value of ${f(x)}$. How much information is Alice giving to Bob? Let’s say ${f}$ takes only ${k}$ values ${\{y_1,\dots,y_k\}}$ and let ${A_i=f^{-1}(y_i)=\{x\in X:f(x)=y_i\}}$. Thus, by knowing ${f(x)}$ we are learning which of the sets ${A_i}$ contains ${x}$, and so the amount of information is ${\log(|X|/|A_i|)}$ whenever ${x\in A_i}$. Since ${x}$ is now being chosen randomly, the average information is

$\displaystyle \sum_{i=1}^k\frac{|A_i|}{|X|}\log\left(\frac{|X|}{|A_i|}\right)=-\sum_{i=1}^k\mu(A_i)\log\big(\mu(A_i)\big)$

where ${\mu}$ is the uniform probability measure on ${X}$.

More generally, let ${(X,{\mathcal B},\mu)}$ be a probability space and let ${f:X\rightarrow{\mathbb R}}$ be a random variable which only takes finitely many values. Let ${\xi}$ be the (finite) ${\sigma}$-algebra generated by ${f}$ — i.e. the smallest ${\sigma}$-algebra such that the function ${f:(X,\xi)\rightarrow({\mathbb R},\text{Borel})}$ is measurable. Then the quantity ${H_\mu(\xi)}$ can be though of as the amount of information obtained from learning the value of ${f(x)}$ for a randomly chosen point ${x\in X}$ (with respect to ${\mu}$).

— 2.1. Conditional entropy —

Suppose now that again we have a probability space ${(X,{\mathcal B},\mu)}$ and two finitely valued random variables ${f,g:X\rightarrow{\mathbb R}}$. If we already know the value of ${f(x)}$ where ${x\in X}$ is randomly chosen according to ${\mu}$, how much new information (or entropy) will we learn from knowing also the value of ${g(x)}$? A reasonable answer is that it should be the entropy of the random variable ${(f,g)}$ (which takes values in ${{\mathbb R}^2}$) representing all the information obtained from knowing ${f(x)}$ and ${g(x)}$, minus the entropy of ${f}$. Intuitively, if ${f}$ and ${g}$ are independent, then knowing ${f(x)}$ should in no way influence how much information is contained in ${g(x)}$. At the other end of the spectrum, if ${g}$ is a function of ${f}$ (i.e. ${g=h\circ f}$ for some ${h:{\mathbb R}\rightarrow{\mathbb R}}$) then ${f(x)}$ determines ${g(x)}$ and so there is no new information. Indeed if ${f}$ and ${g}$ are independent, then the entropy of ${(f,g)}$ is precisely the sum of the entropies of ${f}$ and ${g}$, and if ${g=h\circ f}$ the the entropy of ${(f,g)}$ is the same as the entropy of ${f}$.

While random variables might seem an intuitive way to talk about information, it is actually easier (and somewhat less misleading) to think in terms of the ${\sigma}$-algebras they generate.

Definition 4 Let ${(X,{\mathcal B},\mu)}$ be a probability space and let ${\xi}$ and ${\eta}$ be finite ${\sigma}$-subalgebras of ${{\mathcal B}}$. We define the conditional entropy of ${\xi}$ given ${\eta}$ to be the quantity

$\displaystyle H_\mu(\xi|\eta):=H_\mu(\xi\vee\eta)- H_\mu(\eta)$

This definition is usually presented as a theorem, the usual definition being as follows: let ${{\mathcal A}=\{A_1,\dots,A_n\}}$ be the generating partition for ${\eta}$ and, for each ${i=1,\dots,n}$, let ${\mu_i}$ be the probability measure on ${{\mathcal B}}$ obtained by conditioning ${\mu}$ on ${A_i}$ (i.e. ${\mu_i(B)=\mu(B\cap A_i)/\mu(A_i)}$). Then

$\displaystyle H_\mu(\xi|\eta)=\sum_{i=1}^n\mu(A_i)H_{\mu_i}(\xi). \ \ \ \ \ (1)$

A straightforward computation shows that the two definitions agree, the advantage of the definition given in Definition 4 is that it does not make explicit use of the partition which generates ${\eta}$. On the other hand, the alternative definition given by (1) implies in particular that ${H_\mu(\xi|\eta)\geq0}$, and hence that

$\displaystyle H_\mu(\eta)\leq H_\mu(\xi)\quad\text{ whenever }\quad\eta\subset\xi. \ \ \ \ \ (2)$

The following proposition confirms the natural heuristics from the beginning of this subsection.

Proposition 5 Let ${(X,{\mathcal B},\mu)}$ be a probability space and let ${\xi,\eta\subset{\mathcal B}}$ be finite ${\sigma}$-algebras. Then

$\displaystyle H_\mu(\xi\vee\eta)\leq H_\mu(\xi)+H_\mu(\eta)\ \ \ \ \ (3)$

and

$\displaystyle H_\mu(\xi|\eta)\leq H_\mu(\xi),\ \ \ \ \ (4)$

with equality occurring if and only if ${\xi}$ and ${\eta}$ are independent (meaning that ${\mu(A\cap B)=\mu(A)\mu(B)}$ whenever ${A\in\xi}$ and ${B\in\eta}$).

Proof:

Notice that from Definition 4 if follows directly that(3) and (4) are equivalent.

Let ${\{A_1,\dots,A_n\}}$ and ${\{B_1,\dots,B_m\}}$ be the generating partitions for ${\xi}$ and ${\eta}$ respectively. Notice that the nonempty sets of the form ${A_i\cap B_j}$ with ${i=1,\dots,n}$, ${j=1,\dots,m}$ form the generating partition for ${\xi\vee\eta}$. Let ${\phi(x)=x\log x}$ and notice that ${\phi''(x)=1/x>0}$ for all ${x\in(0,1]}$, which implies that ${\phi}$ is convex. In the following computation we use Jensen’s inequality in this form and the identity ${\mu(A)=\sum_{j=1}^m\mu(A\cap B_j)}$. We have

$\displaystyle \begin{array}{rcl} \displaystyle H_\mu(\xi\mid\eta) &=& \displaystyle -\sum_{i=1}^n\sum_{j=1}^m\mu(A_i\cap B_j)\log\big(\mu(A_i\cap B_j)\big) -H_\mu(\eta) \\&=& \displaystyle -\sum_{i=1}^n\sum_{j=1}^m\mu(A_i\cap B_j)\left[\log\left(\frac{\mu(A_i\cap B_j)}{\mu(B_j)}\right) +\log\big(\mu(B_j)\big)\right]-H_\mu(\eta) \\&=& \displaystyle -\sum_{i=1}^n\sum_{j=1}^m\mu(B_j)\phi\left(\frac{\mu(A_i\cap B_j)}{\mu(B_j)}\right) \\&\leq& \displaystyle -\sum_{i=1}^n\phi\left(\sum_{j=1}^m\mu(B_j)\frac{\mu(A_i\cap B_j)}{\mu(B_j)}\right) \\&=& H_\mu(\xi). \end{array}$

$\Box$

The following corollary confirms two other properties that one would expect from thinking of entropy as information. The first of these, (5), states roughly speaking that the more we already know, the less we learn from ${\xi}$. The second, (6), says essentially that regardless of what we already know, the amount of information contained in ${\xi\vee\eta}$ is at most the sum of the information of ${\xi}$ and of ${\eta}$ (which is the conditional version of (3)).

Corollary 6 let ${(X,{\mathcal B},\mu)}$ be a probability space and let ${\xi,\eta_1,\eta_2\subset{\mathcal B}}$ be finite ${\sigma}$-algebras. Then

$\displaystyle H_\mu(\xi\mid\eta_1\vee\eta_2)\leq H_\mu(\xi\mid\eta_1) \ \ \ \ \ (5)$

and

$\displaystyle H_\mu(\xi\vee\eta_2\mid\eta_1)\leq H_\mu(\xi\mid\eta_1)+H_\mu(\eta_2\mid\eta_1) \ \ \ \ \ (6)$

Proof: From Definition 4 we have

$\displaystyle H_\mu(\xi\mid\eta_1\vee\eta_2)=H_\mu(\xi\vee\eta_2\mid\eta_1)-H_\mu(\eta_2\mid\eta_1)$

and so (5) follows from (6). Let ${\{A_1,\dots,A_n\}}$ be the positive measure atoms of ${\eta_1}$ and let ${\mu_i}$ be the conditional measure on ${A_i}$, for each ${i=1,\dots,n}$. Proposition 5 now implies that

$\displaystyle \begin{array}{rcl} \displaystyle H_\mu(\xi\vee\eta_2\mid\eta_1) &=& \displaystyle \sum_{i=1}^n\mu(A_i)H_{\mu_i}(\xi\vee\eta_2) \\&\leq& \displaystyle \sum_{i=1}^n\mu(A_i)\Big(H_{\mu_i}(\xi)+H_{\mu_i}(\eta_2)\Big) \\&=& \displaystyle H_\mu(\xi\mid\eta_1)+H_\mu(\eta_2\mid\eta_1) \end{array}$

$\Box$

— 2.2. Dynamical entropy —

Lemma 7 Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system and let ${\xi,\eta\subset{\mathcal B}}$ be finite ${\sigma}$-subalgebras. Then

$\displaystyle H_\mu(T^{-1}\xi)=H_\mu(\xi)\qquad\text{and}\qquad H_\mu(T^{-1}\xi\mid T^{-1}\eta)=H_\mu(\xi\mid\eta)$

Proof: The first equality follows directly from the Definition 3 and the fact that ${T}$ preserves the measure. The second equality follows from the first and Definition 4. $\Box$

Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system and let ${\xi\subset{\mathcal B}}$ be a finite ${\sigma}$-algebra. Lemma 7 implies that each of the (finite) ${\sigma}$-algebras ${\xi,T^{-1}\xi,T^{-2}\xi,\dots}$ have the same entropy. Therefore, if all those ${\sigma}$-algebras are independent, it follows from Proposition 5 that the entropy of ${\bigvee_{i=0}^{n-1}T^{-i}\xi}$ is ${nH_\mu(\xi)}$. Often it is the case that they are not independent, in which case the entropy of ${\bigvee_{i=0}^{n-1}T^{-i}\xi}$ is smaller than ${nH_\mu(\xi)}$, again by Proposition 5. In any case,

$\displaystyle \begin{array}{rcl} \displaystyle H_\mu\left(\bigvee_{i=0}^{n+m-1}T^{-i}\xi\right)&\leq& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)+ H_\mu\left(\bigvee_{i=0}^{m-1}T^{-i-n}\xi\right) \\&=& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)+ H_\mu\left(\bigvee_{i=0}^{m-1}T^{-i}\xi\right) \end{array}$

showing that the function ${n\mapsto H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)}$ is subadditive and hence, in view of Fekete’s lemma, the infimum in the definition of ${h_\mu(T,\xi)}$ is also the limit of ${\frac1nH_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)}$ as ${n\rightarrow\infty}$.

Recall that ${H_\mu(\xi)}$ can be interpreted as the average information needed to know which atom of ${\xi}$ contains a randomly chosen point ${x\in X}$. Therefore ${H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)}$ represents the average information needed to know in which atom of ${\xi}$ each of the points ${x,Tx,T^2x,\dots,T^{n-1}x}$ are. On the other hand, using Definition 4, we can write

$\displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)=H_\mu(\xi)+\sum_{i=1}^{n-1}H_\mu\left(T^{-i}\xi~\bigg|~\bigvee_{j=0}^{i-1}T^{-j}\xi\right),\ \ \ \ \ (7)$

and so ${h_\mu(T,\xi)=\lim_{n\rightarrow\infty}\frac1nH_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)}$ can be though of the average information needed to know which atom of ${\xi}$ contains the point ${T^ix}$, knowing which atom contains each of the points ${x,Tx,\dots,T^{i-1}x}$ (and where ${x\in X}$ and ${i\in{\mathbb N}}$ are now both being randomly chosen).

In a different direction, instead of decomposing ${H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)}$ as a telescoping sum as in (7) (essentially conditioning on the past), we can also use Definition 4 to write

$\displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)=H_\mu(T^{-n+1}\xi)+\sum_{i=0}^{n-2}H_\mu\left(T^{-i}\xi~\bigg|~\bigvee_{j=i+1}^{n-1}T^{-j}\xi\right).$

Then, using Lemma 7 and re-indexing we deduce that

$\displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)= H_\mu(\xi)+\sum_{i=1}^{n-1}H_\mu\left(\xi~\bigg|~\bigvee_{j=1}^{i}T^{-j}\xi\right).$

From Corollary 6 the (non-negative) quantity ${H_\mu\left(\xi~\bigg|~\bigvee_{j=1}^{i}T^{-j}\xi\right)}$ decreases as ${i}$ increases and so it converges. Therefore

$\displaystyle h_\mu(T,\xi)=\lim_{n\rightarrow\infty}H_\mu\left(\xi~\bigg|~\bigvee_{i=1}^{n}T^{-i}\xi\right), \ \ \ \ \ (8)$

which we can interpret as the average information needed to know which atom of ${\xi}$ contains a randomly chosen ${x\in X}$ knowing which atom contains each of the future iterates ${Tx,T^2x,T^3x,\dots}$. More succinctly (but somewhat less accurately), one could say that entropy is the amount of information needed to recover the present from the future.

— 3. The Kolmogorov-Sinai theorem —

In the previous two sections we defined the entropy of measure preserving systems and interpreted it in terms of information, but we are still unable to compute the entropy of even simple systems, mainly because the entropy ${h_\mu(T)}$ is defined as the supremum of ${h_\mu(T,\xi)}$ over all finite ${\sigma}$-algebras. The Kolmogorov-Sinai theorem addresses precisely this difficulty, at least in the case when a finite generator exists (which is often the case).

Theorem 8 Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system and let ${\xi\subset{\mathcal B}}$ be a one-sided generator i.e., a finite ${\sigma}$-algebra satisfying

$\displaystyle \bigvee_{n=0}^\infty T^{-n}\xi={\mathcal B}\mod\mu\ \ \ \ \ (9)$

then ${h_\mu(T)=h_\mu(T,\xi)}$.

Moreover, if the system is invertible and ${\xi}$ is a two-sided generator, i.e. it satisfies

$\displaystyle \bigvee_{n=-\infty}^\infty T^{-n}\xi={\mathcal B}\mod\mu,$

then ${h_\mu(T)=h_\mu(T,\xi)}$

In order to prove Theorem 8 it will be convenient to first establish a few lemmas.

Lemma 9 Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system and let ${\xi,\eta\subset{\mathcal B}}$ be two finite ${\sigma}$-algebras. Then

$\displaystyle h_\mu(T,\eta)\leq h_\mu(T,\xi)+H_\mu(\eta\mid\xi).$

Proof: We use, in this order, (2), then Definition 4, then (6), then (5) and finally Lemma 7 to estimate

$\displaystyle \begin{array}{rcl} \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\eta\right) &\leq& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\eta~\vee~\bigvee_{i=0}^{n-1}T^{-i}\xi\right) \\&=& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)+ H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\eta~\bigg|~\bigvee_{i=0}^{n-1}T^{-i}\xi\right) \\&\leq& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)+ \sum_{i=0}^{n-1}H_\mu\left(T^{-i}\eta~\bigg|~\bigvee_{j=0}^{n-1}T^{-j}\xi\right) \\&\leq& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)+ \sum_{i=0}^{n-1}H_\mu\left(T^{-i}\eta\mid T^{-i}\xi\right) \\&=& \displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)+ \sum_{i=0}^{n-1}H_\mu\left(\eta\mid \xi\right) \end{array}$

Dividing by ${n}$ and taking the limit as ${n\rightarrow\infty}$ we obtain the desired conclusion. $\Box$

Lemma 10 Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system and let ${\xi\subset{\mathcal B}}$ be a finite ${\sigma}$-algebra. Then for every ${n\in{\mathbb N}}$,

$\displaystyle h_\mu\left(T,\bigvee_{i=0}^nT^{-i}\xi\right)=h_\mu(T,\xi)$

Proof: Using (2) and Definition 3, it follows that ${h_\mu\left(T,\bigvee_{i=0}^nT^{-i}\xi\right)\geq h_\mu(T,\xi)}$. To prove the converse inequality we use the fact that ${H_\mu(\eta_1\mid\eta_2)=0}$ whenever ${\eta_1\subset\eta_2}$ together with (8) and (5) to obtain

$\displaystyle \begin{array}{rcl} \displaystyle h_\mu\left(T,\bigvee_{i=0}^nT^{-i}\xi\right) &=& \displaystyle \lim_{m\rightarrow\infty}H_\mu\left(\bigvee_{i=0}^nT^{-i}\xi~\bigg|~\bigvee_{j=1}^m\bigvee_{i=0}^nT^{-i-j}\xi\right) \\&=& \displaystyle \lim_{m\rightarrow\infty}H_\mu\left(\bigvee_{i=0}^nT^{-i}\xi~\bigg|~\bigvee_{j=1}^{n+m}T^{-j}\xi\right) \\&\leq& \displaystyle \lim_{m\rightarrow\infty}\sum_{i=0}^nH_\mu\left(T^{-i}\xi~\bigg|~\bigvee_{j=1}^{n+m}T^{-j}\xi\right) \\&=& \displaystyle \lim_{m\rightarrow\infty}H_\mu\left(\xi~\bigg|~\bigvee_{j=1}^{n+m}T^{-j}\xi\right)=h_\mu(T,\xi) \end{array}$

$\Box$

We are now ready to prove Theorem 8.

Proof of Theorem 8: Assume that ${\xi}$ is a one-sided generator (the two-sided case can be proved analogously). Let ${\eta\subset{\mathcal B}}$ be any finite ${\sigma}$-algebra; we need to show that ${h_\mu(T,\eta)\leq h_\mu(T,\xi)}$. In view of Lemma 10 this is equivalent to show that ${h_\mu(T,\eta)\leq \lim_{n\rightarrow\infty}h_\mu\left(T,\bigvee_{i=0}^nT^{-i}\xi\right)}$. Using Lemma 9 it thus suffices to show that

$\displaystyle \lim_{n\rightarrow\infty}H_\mu\left(\eta~\bigg|~\bigvee_{i=0}^nT^{-i}\xi\right)=0.$

Observe that the family

$\displaystyle \left\{A\in{\mathcal B}:(\forall\epsilon>0)(\exists n\in{\mathbb N})\left(\exists B\in \bigvee_{i=0}^nT^{-i}\xi\right):\mu(A\triangle B)<\epsilon\right\}$

is a ${\sigma}$-algebra containing ${T^{-n}\xi}$ for every ${n}$, and hence it equals ${{\mathcal B}\bmod\mu}$.

Let ${\{A_1,\dots,A_m\}}$ be the generating partition for ${\eta}$ and, for each ${n\in{\mathbb N}}$ and ${i=1,\dots,m}$, let ${B_{i,n}\in \bigvee_{i=0}^nT^{-i}\xi}$ be the set that minimizes ${\mu(A_i\triangle B_{i,n})}$. After changing each ${B_{i,n}}$ by ${o_{n\rightarrow\infty}(1)}$ measure if needed we can assume that the collection ${\{B_{1,n},\dots,B_{m,n}\}}$ forms a partition of ${X}$. Let ${\zeta_n\subset\bigvee_{i=0}^nT^{-i}\xi}$ be the ${\sigma}$-algebra generated by that partition. From (5) it follows that

$\displaystyle H_\mu\left(\eta~\bigg|~\bigvee_{i=0}^nT^{-i}\xi\right)\leq H_\mu(\eta\mid\zeta_n).$

A quick computation shows that ${H_\mu(\eta\mid\zeta_n)\rightarrow0}$ as ${n\rightarrow\infty}$, finishing the proof. $\Box$

Corollary 11 Let ${(X,{\mathcal B},\mu,T)}$ be an invertible measure preserving system and assume that it admits a one sided finite generator ${\xi}$. Then ${h_\mu(T)=0}$.

Proof: Using Theorem 8, then (8), then Lemma 7 and finally the fact ${H_\mu(\xi\mid\eta)=0}$ whenever ${\xi\subset\eta}$, we conclude that

$\displaystyle h_\mu(T)=h_\mu(T,\xi)=\lim_{n\rightarrow\infty}H_\mu\left(\xi~\bigg|~\bigvee_{i=1}^nT^{-i}\xi\right)=\lim_{n\rightarrow\infty}H_\mu\left(T\xi~\bigg|~\bigvee_{i=0}^{n-1}T^{-i}\xi\right)=0$

$\Box$

— 4. Examples —

We are finally ready to start computing the entropy of some concrete systems.

Example 1 (Rotation) Let ${X={\mathbb T}={\mathbb R}/{\mathbb Z}}$, let ${{\mathcal B}}$ denote the Borel ${\sigma}$-algebra on ${X}$, let ${\mu}$ be the Lebesgue measure and let ${T:x\mapsto x+\alpha\bmod1}$, where ${\alpha\in{\mathbb R}}$. Then the ${h_\mu(T)=0}$.

Proof: If ${\alpha}$ is rational, then for some ${n\in{\mathbb N}}$ we have that ${T^n}$ is the identity and hence for any finite ${\sigma}$-algebra ${\xi}$ and every ${m\geq n}$, ${\bigvee_{i=0}^mT^{-i}\xi=\bigvee_{i=0}^nT^{-i}\xi}$, so clearly ${h_\mu(T,\xi)=0}$. If ${\alpha}$ is irrational, then the ${\sigma}$-algebra ${\xi}$ generated by the sets ${[0,1/2)}$ and ${[1/2,1)}$ satisfies ${\bigvee_{n=0}^\infty T^{-n}\xi ={\mathcal B}\bmod\mu}$ (for it contains arbitrarily small intervals around every point). Since the system is invertible, Corollary 11 implies that ${h_\mu(T)=0}$. $\Box$

Example 2 (Two-sided Bernoulli shift) Let ${{\mathcal A}}$ be a finite set, endowed with the discrete topology and a Borel probability measure given by the weight ${p:{\mathcal A}\rightarrow[0,1]}$. Let ${X={\mathcal A}^{\mathbb Z}}$, endowed with the product topology, let ${{\mathcal B}}$ be the Borel ${\sigma}$-algebra, let ${\mu}$ be the product measure (denoted perhaps by ${p^{\otimes{\mathbb Z}}}$) and let ${T:(x_n)_{n\in{\mathbb Z}}\mapsto(x_{n+1})_{n\in{\mathbb Z}}}$ be the shift map.

Then the entropy of ${(X,{\mathcal B},\mu,T)}$ is ${h_\mu(T)=-\sum_{a\in A}p(a)\log\big(p(a)\big)}$.

Proof: We will use the finite ${\sigma}$-algebra ${\xi}$ generated by the family ${\Big\{C_a=\{x\in X:x_0=a\}:a\in{\mathcal A}\Big\}}$. This is a (two sided) generator and hence

$\displaystyle h_\mu(T)=h_\mu(T,\xi)=\lim_{n\rightarrow\infty}H_\mu\left(\xi~\bigg|~\bigvee_{i=1}^n T^{-i}\xi\right)$

It is clear that ${\xi}$ is independent from ${\bigvee_{i=1}^n T^{-i}\xi}$ for every ${n\in{\mathbb N}}$; therefore

$\displaystyle H_\mu\left(\xi~\bigg|~\bigvee_{i=1}^n T^{-i}\xi\right)=H_\mu(\xi)=-\sum_{a\in A}p(a)\log\big(p(a)\big)$

$\Box$

Pretty much the same proof can be used for one sided Bernoulli shifts, although now the generator is one-sided.

Example 3 (One-sided Bernoulli shift) As above, let ${{\mathcal A}}$ be a finite set, endowed with the discrete topology and a Borel probability measure given by the weight ${p:{\mathcal A}\rightarrow[0,1]}$. Let ${X={\mathcal A}^{\mathbb N}}$, endowed with the product topology, let ${{\mathcal B}}$ be the Borel ${\sigma}$-algebra, let ${\mu}$ be the product measure and let ${T:(x_n)_{n\in{\mathbb N}}\mapsto(x_{n+1})_{n\in{\mathbb N}}}$ be the shift map.

Then the entropy of ${(X,{\mathcal B},\mu,T)}$ is ${h_\mu(T)=-\sum_{a\in A}p(a)\log\big(p(a)\big)}$.

Notice that the Bernoulli systems defined in Example 2 are invertible and the ones defined in Example 3 have a one-sided generator. Nevertheless they both have positive entropy. This shows that both hypothesis in Corollary 11 are necessary to obtain ${0}$ entropy.

Example 4 (Sub-shifts) Let ${(X,{\mathcal B},\mu,T)}$ be a measure preserving system where, as before, ${X={\mathcal A}^{\mathbb N}}$, for some finite alphabet ${{\mathcal A}}$, ${{\mathcal B}}$ is the Borel ${\sigma}$-algebra (for the product topology) and ${T:X\rightarrow X}$ is the shift map.

For each finite set ${F\subset{\mathbb N}}$ and every function ${\omega\in{\mathcal A}^F}$, denote by ${C(F,\omega)}$ the cylinder set

$\displaystyle C(F,\omega)=\big\{x\in X:x_n=\omega_n)\text{ for all }n\in F\big\}$

From the Kolmogorov extension theorem, the measure is defined by the values of ${\mu(C(F,\omega))}$ for every cylinder set, and so it is not surprising that we can find the entropy of the system as a function of the values ${\mu(C(F,\omega))}$.

The finite ${\sigma}$-algebra ${\xi}$ generated by the cylinder sets ${C(\{1\},a)}$ with ${a\in{\mathcal A}}$ is a one-sided generator, and hence ${h_\mu(T)=h_\mu(T,\xi)}$. Observe that for every ${n\in{\mathbb N}}$, the generating partition for ${\bigvee_{i=0}^{n-1}T^{-i}\xi}$ is the collection of all cylinder sets of the form ${C(\{1,\dots,n\},\omega)}$. Therefore

$\displaystyle H_\mu\left(\bigvee_{i=0}^{n-1}T^{-i}\xi\right)= -\sum_{\omega\in{\mathcal A}^n}\mu\big(C(\{1,\dots,n\},\omega)\big)\log\Big(\mu\big(C(\{1,\dots,n\},\omega)\big)\Big)$

and so

$\displaystyle h_\mu(T)=-\lim_{n\rightarrow\infty}\frac1n\sum_{\omega\in{\mathcal A}^n}\mu\big(C(\{1,\dots,n\},\omega)\big)\log\Big(\mu\big(C(\{1,\dots,n\},\omega)\big)\Big)$