A measure preserving system is a quadruple where is a set, is a -algebra, is a probability measure and is a measurable map satisfying for every .

The notion of isomorphism in the category of measure preserving systems (defined, for instance, in this earlier post) is fairly flexible, as it allows measure perturbations. For instance, the system where is the Lebesgue measure and is isomorphic to the Bernoulli system , where is the left shift and is the infinite product of the measure on defined so that .

However, this flexibility can make it quite difficult to decide whether two given systems are isomorphic or not. While several spectral invariants can be used, such as ergodicity, weakly mixing or strongly mixing, they are not sufficient to distinguish even between two Bernoulli shifts. To address this problem, Kolmogorov and Sinai developed the notion of entropy in the late 1950’s. Since then, the entropy of measure preserving systems, together with its sister in topological dynamics, has became a fundamental tool in understanding the structure of dynamical systems, with applications far beyond distinguishing non-isomorphic systems.

In this post I will define and then briefly survey some of the basic properties of entropy. I’ll mostly follow the exposition in the upcoming book of Einsiedler, Lindenstrauss and Ward, whose draft is available here; other good sources are Walter’s book and, in a somewhat different perspective, Austin’s recent notes.

** — 1. Defining entropy — **

The definition of entropy can be a bit intimidating the first time one sees it. In later sections I try to present some ways to *understand* the notion of entropy, but I decided it would be useful to start with its raw definition.

The theory of entropy is often developed using partitions but I personally prefer to work with -algebras, and the following proposition says that there is really no difference.

Proposition 1A -algebra in a space is finite if and only if it is generated by a finite partition of . Moreover, the finite partition is unique.

*Proof:* If there exists such a partition then is finite. Conversely, if is finite, let be the collection of all non-empty sets in such that any proper subset of is not in . The elements of form a partition of and the -algebra generated by can be easily shown to equal .

Since elements of are disjoint, consists of all unions of elements of . If is another partition generating , then for every and , the intersection is a subset of which belongs to , and hence is either equal to or empty; and also it’s equal to or empty. But it can not be empty for every because forms a partition, so we conclude that and hence and by symmetry .

Definition 2Given two -algebras in a space , we denote by the -algebra generated by their union . More generally, given a collection of -algebras, we denote by the -algebra generated by .

Observe that if is the generating partition for and is the generating partition for , then the generating partition for consists of all non-empty sets of the form with and .

Next let be a measure preserving system. For each -algebra we denote by the -algebra consisting of all sets with . We are now ready to define the entropy of a measure preserving system.

Definition 3Let be a measure preserving system, let be a finite -algebra and let be its generating partition.

- The entropy of is the number
Since as , we interpret as whenever .

The quantity is usually defined with a limit instead of an infimum. Indeed the sequence can be shown to be subadditive and hence the fact that (and in particular that the limit exists) follows from Fekete’s lemma (we will return to this point after Lemma 7 below).

** — 2. Entropy as information — **

The word “entropy” has been used in thermodynamics since the 1800’s, and was adapted by Shannon as a substitute for “information”. According to Shannon, the idea to use the word entropy came from von Neumman:

*I thought of calling it “information”, but the word was overly used, so I decided to call it “uncertainty”. […] Von Neumann told me, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”*

I will now try to briefly explain why the quantity defined in Definition 3 can be seen as quantifying “information”. Consider the following scenario: Alice chooses a point in a set of cardinality and Bob has to guess by asking yes or no questions (which Alice answers truthfully). It’s not hard to see that Bob can always determine the value of if and only if (where means the base logarithm) by asking questions of the form “Is in ?” for subsets of cardinality . Therefore we say that determining the exact value of requires “bits” of information, and that determining whether is in a set of cardinality requires one bit of information. If has cardinality , then knowing that should provide bits of information. More generally, if has then knowing that provides bits of information.

Next suppose that we are given a function , and now Alice chooses randomly with the uniform distribution and tells Bob the value of . How much information is Alice giving to Bob? Let’s say takes only values and let . Thus, by knowing we are learning which of the sets contains , and so the amount of information is whenever . Since is now being chosen randomly, the average information is

where is the uniform probability measure on .

More generally, let be a probability space and let be a random variable which only takes finitely many values. Let be the (finite) -algebra generated by — i.e. the smallest -algebra such that the function is measurable. Then the quantity can be though of as the amount of information obtained from learning the value of for a randomly chosen point (with respect to ).

** — 2.1. Conditional entropy — **

Suppose now that again we have a probability space and two finitely valued random variables . If we already know the value of where is randomly chosen according to , how much *new* information (or entropy) will we learn from knowing also the value of ? A reasonable answer is that it should be the entropy of the random variable (which takes values in ) representing all the information obtained from knowing and , minus the entropy of . Intuitively, if and are independent, then knowing should in no way influence how much information is contained in . At the other end of the spectrum, if is a function of (i.e. for some ) then determines and so there is no new information. Indeed if and are independent, then the entropy of is precisely the sum of the entropies of and , and if the the entropy of is the same as the entropy of .

While random variables might seem an intuitive way to talk about information, it is actually easier (and somewhat less misleading) to think in terms of the -algebras they generate.

Definition 4Let be a probability space and let and be finite -subalgebras of . We define theconditional entropyof given to be the quantity

This definition is usually presented as a theorem, the usual definition being as follows: let be the generating partition for and, for each , let be the probability measure on obtained by conditioning on (i.e. ). Then

A straightforward computation shows that the two definitions agree, the advantage of the definition given in Definition 4 is that it does not make explicit use of the partition which generates . On the other hand, the alternative definition given by (1) implies in particular that , and hence that

The following proposition confirms the natural heuristics from the beginning of this subsection.

Proposition 5Let be a probability space and let be finite -algebras. Then

with equality occurring if and only if and are independent (meaning that whenever and ).

*Proof:*

Notice that from Definition 4 if follows directly that(3) and (4) are equivalent.

Let and be the generating partitions for and respectively. Notice that the nonempty sets of the form with , form the generating partition for . Let and notice that for all , which implies that is convex. In the following computation we use Jensen’s inequality in this form and the identity . We have

The following corollary confirms two other properties that one would expect from thinking of entropy as information. The first of these, (5), states roughly speaking that the more we already know, the less we learn from . The second, (6), says essentially that regardless of what we already know, the amount of information contained in is at most the sum of the information of and of (which is the conditional version of (3)).

Corollary 6let be a probability space and let be finite -algebras. Then

*Proof:* From Definition 4 we have

and so (5) follows from (6). Let be the positive measure atoms of and let be the conditional measure on , for each . Proposition 5 now implies that

** — 2.2. Dynamical entropy — **

Lemma 7Let be a measure preserving system and let be finite -subalgebras. Then

*Proof:* The first equality follows directly from the Definition 3 and the fact that preserves the measure. The second equality follows from the first and Definition 4.

Let be a measure preserving system and let be a finite -algebra. Lemma 7 implies that each of the (finite) -algebras have the same entropy. Therefore, if all those -algebras are independent, it follows from Proposition 5 that the entropy of is . Often it is the case that they are not independent, in which case the entropy of is smaller than , again by Proposition 5. In any case,

showing that the function is subadditive and hence, in view of Fekete’s lemma, the infimum in the definition of is also the limit of as .

Recall that can be interpreted as the average information needed to know which atom of contains a randomly chosen point . Therefore represents the average information needed to know in which atom of each of the points are. On the other hand, using Definition 4, we can write

and so can be though of the average information needed to know which atom of contains the point , knowing which atom contains each of the points (and where and are now both being randomly chosen).

In a different direction, instead of decomposing as a telescoping sum as in (7) (essentially conditioning on the past), we can also use Definition 4 to write

Then, using Lemma 7 and re-indexing we deduce that

From Corollary 6 the (non-negative) quantity decreases as increases and so it converges. Therefore

which we can interpret as the average information needed to know which atom of contains a randomly chosen knowing which atom contains each of the future iterates . More succinctly (but somewhat less accurately), one could say that entropy is the amount of information needed to recover the present from the future.

** — 3. The Kolmogorov-Sinai theorem — **

In the previous two sections we defined the entropy of measure preserving systems and interpreted it in terms of information, but we are still unable to compute the entropy of even simple systems, mainly because the entropy is defined as the supremum of over *all* finite -algebras. The Kolmogorov-Sinai theorem addresses precisely this difficulty, at least in the case when a finite generator exists (which is often the case).

Theorem 8Let be a measure preserving system and let be a one-sided generator i.e., a finite -algebra satisfyingMoreover, if the system is invertible and is a two-sided generator, i.e. it satisfies

then

In order to prove Theorem 8 it will be convenient to first establish a few lemmas.

Lemma 9Let be a measure preserving system and let be two finite -algebras. Then

*Proof:* We use, in this order, (2), then Definition 4, then (6), then (5) and finally Lemma 7 to estimate

Dividing by and taking the limit as we obtain the desired conclusion.

Lemma 10Let be a measure preserving system and let be a finite -algebra. Then for every ,

*Proof:* Using (2) and Definition 3, it follows that . To prove the converse inequality we use the fact that whenever together with (8) and (5) to obtain

We are now ready to prove Theorem 8.

*Proof of Theorem 8:* Assume that is a one-sided generator (the two-sided case can be proved analogously). Let be any finite -algebra; we need to show that . In view of Lemma 10 this is equivalent to show that . Using Lemma 9 it thus suffices to show that

Observe that the family

is a -algebra containing for every , and hence it equals .

Let be the generating partition for and, for each and , let be the set that minimizes . After changing each by measure if needed we can assume that the collection forms a partition of . Let be the -algebra generated by that partition. From (5) it follows that

A quick computation shows that as , finishing the proof.

Corollary 11Let be an invertible measure preserving system and assume that it admits a one sided finite generator . Then .

*Proof:* Using Theorem 8, then (8), then Lemma 7 and finally the fact whenever , we conclude that

** — 4. Examples — **

We are finally ready to start computing the entropy of some concrete systems.

Example 1 (Rotation)Let , let denote the Borel -algebra on , let be the Lebesgue measure and let , where . Then the .

*Proof:* If is rational, then for some we have that is the identity and hence for any finite -algebra and every , , so clearly . If is irrational, then the -algebra generated by the sets and satisfies (for it contains arbitrarily small intervals around every point). Since the system is invertible, Corollary 11 implies that .

Example 2 (Two-sided Bernoulli shift)Let be a finite set, endowed with the discrete topology and a Borel probability measure given by the weight . Let , endowed with the product topology, let be the Borel -algebra, let be the product measure (denoted perhaps by ) and let be the shift map.Then the entropy of is .

*Proof:* We will use the finite -algebra generated by the family . This is a (two sided) generator and hence

It is clear that is independent from for every ; therefore

Pretty much the same proof can be used for one sided Bernoulli shifts, although now the generator is one-sided.

Example 3 (One-sided Bernoulli shift)As above, let be a finite set, endowed with the discrete topology and a Borel probability measure given by the weight . Let , endowed with the product topology, let be the Borel -algebra, let be the product measure and let be the shift map.Then the entropy of is .

Notice that the Bernoulli systems defined in Example 2 are invertible and the ones defined in Example 3 have a one-sided generator. Nevertheless they both have positive entropy. This shows that both hypothesis in Corollary 11 are necessary to obtain entropy.

Example 4 (Sub-shifts)Let be a measure preserving system where, as before, , for some finite alphabet , is the Borel -algebra (for the product topology) and is the shift map.For each finite set and every function , denote by the

cylinder setFrom the Kolmogorov extension theorem, the measure is defined by the values of for every cylinder set, and so it is not surprising that we can find the entropy of the system as a function of the values .

The finite -algebra generated by the cylinder sets with is a one-sided generator, and hence . Observe that for every , the generating partition for is the collection of all cylinder sets of the form . Therefore

and so

Pingback: Tao’s Proof of (logarithmically averaged) Chowla’s conjecture for two point correlations | I Can't Believe It's Not Random!