The notion of isomorphism in the category of measure preserving systems (defined, for instance, in this earlier post) is fairly flexible, as it allows measure perturbations. For instance, the system where is the Lebesgue measure and is isomorphic to the Bernoulli system , where is the left shift and is the infinite product of the measure on defined so that .
However, this flexibility can make it quite difficult to decide whether two given systems are isomorphic or not. While several spectral invariants can be used, such as ergodicity, weakly mixing or strongly mixing, they are not sufficient to distinguish even between two Bernoulli shifts. To address this problem, Kolmogorov and Sinai developed the notion of entropy in the late 1950’s. Since then, the entropy of measure preserving systems, together with its sister in topological dynamics, has became a fundamental tool in understanding the structure of dynamical systems, with applications far beyond distinguishing non-isomorphic systems.
In this post I will define and then briefly survey some of the basic properties of entropy. I’ll mostly follow the exposition in the upcoming book of Einsiedler, Lindenstrauss and Ward, whose draft is available here; other good sources are Walter’s book and, in a somewhat different perspective, Austin’s recent notes.
— 1. Defining entropy —
The definition of entropy can be a bit intimidating the first time one sees it. In later sections I try to present some ways to understand the notion of entropy, but I decided it would be useful to start with its raw definition.
The theory of entropy is often developed using partitions but I personally prefer to work with -algebras, and the following proposition says that there is really no difference.
Proposition 1 A -algebra in a space is finite if and only if it is generated by a finite partition of . Moreover, the finite partition is unique.
Proof: If there exists such a partition then is finite. Conversely, if is finite, let be the collection of all non-empty sets in such that any proper subset of is not in . The elements of form a partition of and the -algebra generated by can be easily shown to equal .
Since elements of are disjoint, consists of all unions of elements of . If is another partition generating , then for every and , the intersection is a subset of which belongs to , and hence is either equal to or empty; and also it’s equal to or empty. But it can not be empty for every because forms a partition, so we conclude that and hence and by symmetry .
Definition 2 Given two -algebras in a space , we denote by the -algebra generated by their union . More generally, given a collection of -algebras, we denote by the -algebra generated by .
Observe that if is the generating partition for and is the generating partition for , then the generating partition for consists of all non-empty sets of the form with and .
Next let be a measure preserving system. For each -algebra we denote by the -algebra consisting of all sets with . We are now ready to define the entropy of a measure preserving system.
- The entropy of is the number
Since as , we interpret as whenever .
The quantity is usually defined with a limit instead of an infimum. Indeed the sequence can be shown to be subadditive and hence the fact that (and in particular that the limit exists) follows from Fekete’s lemma (we will return to this point after Lemma 7 below).
— 2. Entropy as information —
The word “entropy” has been used in thermodynamics since the 1800’s, and was adapted by Shannon as a substitute for “information”. According to Shannon, the idea to use the word entropy came from von Neumman:
I thought of calling it “information”, but the word was overly used, so I decided to call it “uncertainty”. […] Von Neumann told me, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”
I will now try to briefly explain why the quantity defined in Definition 3 can be seen as quantifying “information”. Consider the following scenario: Alice chooses a point in a set of cardinality and Bob has to guess by asking yes or no questions (which Alice answers truthfully). It’s not hard to see that Bob can always determine the value of if and only if (where means the base logarithm) by asking questions of the form “Is in ?” for subsets of cardinality . Therefore we say that determining the exact value of requires “bits” of information, and that determining whether is in a set of cardinality requires one bit of information. If has cardinality , then knowing that should provide bits of information. More generally, if has then knowing that provides bits of information.
Next suppose that we are given a function , and now Alice chooses randomly with the uniform distribution and tells Bob the value of . How much information is Alice giving to Bob? Let’s say takes only values and let . Thus, by knowing we are learning which of the sets contains , and so the amount of information is whenever . Since is now being chosen randomly, the average information is
where is the uniform probability measure on .
More generally, let be a probability space and let be a random variable which only takes finitely many values. Let be the (finite) -algebra generated by — i.e. the smallest -algebra such that the function is measurable. Then the quantity can be though of as the amount of information obtained from learning the value of for a randomly chosen point (with respect to ).
— 2.1. Conditional entropy —
Suppose now that again we have a probability space and two finitely valued random variables . If we already know the value of where is randomly chosen according to , how much new information (or entropy) will we learn from knowing also the value of ? A reasonable answer is that it should be the entropy of the random variable (which takes values in ) representing all the information obtained from knowing and , minus the entropy of . Intuitively, if and are independent, then knowing should in no way influence how much information is contained in . At the other end of the spectrum, if is a function of (i.e. for some ) then determines and so there is no new information. Indeed if and are independent, then the entropy of is precisely the sum of the entropies of and , and if the the entropy of is the same as the entropy of .
While random variables might seem an intuitive way to talk about information, it is actually easier (and somewhat less misleading) to think in terms of the -algebras they generate.
This definition is usually presented as a theorem, the usual definition being as follows: let be the generating partition for and, for each , let be the probability measure on obtained by conditioning on (i.e. ). Then
A straightforward computation shows that the two definitions agree, the advantage of the definition given in Definition 4 is that it does not make explicit use of the partition which generates . On the other hand, the alternative definition given by (1) implies in particular that , and hence that
The following proposition confirms the natural heuristics from the beginning of this subsection.
Let and be the generating partitions for and respectively. Notice that the nonempty sets of the form with , form the generating partition for . Let and notice that for all , which implies that is convex. In the following computation we use Jensen’s inequality in this form and the identity . We have
The following corollary confirms two other properties that one would expect from thinking of entropy as information. The first of these, (5), states roughly speaking that the more we already know, the less we learn from . The second, (6), says essentially that regardless of what we already know, the amount of information contained in is at most the sum of the information of and of (which is the conditional version of (3)).
Proof: From Definition 4 we have
— 2.2. Dynamical entropy —
Let be a measure preserving system and let be a finite -algebra. Lemma 7 implies that each of the (finite) -algebras have the same entropy. Therefore, if all those -algebras are independent, it follows from Proposition 5 that the entropy of is . Often it is the case that they are not independent, in which case the entropy of is smaller than , again by Proposition 5. In any case,
showing that the function is subadditive and hence, in view of Fekete’s lemma, the infimum in the definition of is also the limit of as .
Recall that can be interpreted as the average information needed to know which atom of contains a randomly chosen point . Therefore represents the average information needed to know in which atom of each of the points are. On the other hand, using Definition 4, we can write
Then, using Lemma 7 and re-indexing we deduce that
From Corollary 6 the (non-negative) quantity decreases as increases and so it converges. Therefore
which we can interpret as the average information needed to know which atom of contains a randomly chosen knowing which atom contains each of the future iterates . More succinctly (but somewhat less accurately), one could say that entropy is the amount of information needed to recover the present from the future.
— 3. The Kolmogorov-Sinai theorem —
In the previous two sections we defined the entropy of measure preserving systems and interpreted it in terms of information, but we are still unable to compute the entropy of even simple systems, mainly because the entropy is defined as the supremum of over all finite -algebras. The Kolmogorov-Sinai theorem addresses precisely this difficulty, at least in the case when a finite generator exists (which is often the case).
In order to prove Theorem 8 it will be convenient to first establish a few lemmas.
Dividing by and taking the limit as we obtain the desired conclusion.
We are now ready to prove Theorem 8.
Proof of Theorem 8: Assume that is a one-sided generator (the two-sided case can be proved analogously). Let be any finite -algebra; we need to show that . In view of Lemma 10 this is equivalent to show that . Using Lemma 9 it thus suffices to show that
Observe that the family
is a -algebra containing for every , and hence it equals .
Let be the generating partition for and, for each and , let be the set that minimizes . After changing each by measure if needed we can assume that the collection forms a partition of . Let be the -algebra generated by that partition. From (5) it follows that
A quick computation shows that as , finishing the proof.
— 4. Examples —
We are finally ready to start computing the entropy of some concrete systems.
Example 1 (Rotation) Let , let denote the Borel -algebra on , let be the Lebesgue measure and let , where . Then the .
Proof: If is rational, then for some we have that is the identity and hence for any finite -algebra and every , , so clearly . If is irrational, then the -algebra generated by the sets and satisfies (for it contains arbitrarily small intervals around every point). Since the system is invertible, Corollary 11 implies that .
Example 2 (Two-sided Bernoulli shift) Let be a finite set, endowed with the discrete topology and a Borel probability measure given by the weight . Let , endowed with the product topology, let be the Borel -algebra, let be the product measure (denoted perhaps by ) and let be the shift map.
Then the entropy of is .
Proof: We will use the finite -algebra generated by the family . This is a (two sided) generator and hence
It is clear that is independent from for every ; therefore
Pretty much the same proof can be used for one sided Bernoulli shifts, although now the generator is one-sided.
Example 3 (One-sided Bernoulli shift) As above, let be a finite set, endowed with the discrete topology and a Borel probability measure given by the weight . Let , endowed with the product topology, let be the Borel -algebra, let be the product measure and let be the shift map.
Then the entropy of is .
Notice that the Bernoulli systems defined in Example 2 are invertible and the ones defined in Example 3 have a one-sided generator. Nevertheless they both have positive entropy. This shows that both hypothesis in Corollary 11 are necessary to obtain entropy.
Example 4 (Sub-shifts) Let be a measure preserving system where, as before, , for some finite alphabet , is the Borel -algebra (for the product topology) and is the shift map.
For each finite set and every function , denote by the cylinder set
From the Kolmogorov extension theorem, the measure is defined by the values of for every cylinder set, and so it is not surprising that we can find the entropy of the system as a function of the values .
The finite -algebra generated by the cylinder sets with is a one-sided generator, and hence . Observe that for every , the generating partition for is the collection of all cylinder sets of the form . Therefore