Tao’s Proof of (logarithmically averaged) Chowla’s conjecture for two point correlations

The Liouville function is the completely multiplicative function {\lambda} with {\lambda(p)=-1} for every prime {p}. The Chowla conjecture predicts that this function behaves randomly. Here is a version of this conjecture.

Conjecture 1 (Chowla) For every finite set {F\subset\{0,1,\dots\}},

\displaystyle \lim_{N\rightarrow\infty}\frac1N\sum_{n=1}^N\prod_{i\in F}\lambda(n+i)=0.

This conjecture received a lot of attention in the last decade, since Peter Sarnak derived from it an orthogonality criterion for the Liouville (or, similarly, the Mobius function), now known as the “Sarnak conjecture”. Both Chowla’s and Sarnak’s conjectures remain open, however many weaker versions and special cases have been derived (see, for instance, this survey). Perhaps the most impressive result so far is the following theorem of Tao, which was in turn the starting point for many of the more recent developments.

Theorem 2 (Tao) For every (distinct) {a,b\in{\mathbb N}},

\displaystyle \lim_{N/M\rightarrow\infty}\frac1{\log{(N/M)}}\sum_{n=M}^N\frac1n\lambda(n+a)\lambda(n+b)=0.

Tao’s main theorem in his paper is actually significantly more general than Theorem 2, for it allows more general affine maps inside the argument of {\lambda}, and in fact replaces {\lambda} with a much wider class of multiplicative functions. To see why Theorem 2 follows from Conjecture 1, observe that any sequence {x(n)} satisfying

\displaystyle \lim_{N\rightarrow\infty}\frac1N\sum_{n=1}^Nx(n)=0

must also satisfy

\displaystyle \lim_{N/M\rightarrow\infty}\frac1{\log{N/M}}\sum_{n=M}^N\frac1n x(n)=0.

In the terminology and notation of my earlier post, the conclusion of this theorem states that the uniform logarithmic average of the sequence {n\mapsto\lambda(n+a)\lambda(n+b)} is {0}, i.e.,

\displaystyle \underset{n\rightarrow\infty}{UW\text{-}\lim}~\lambda(n+a)\lambda(n+b)=0

for {W(n)=\log n}.

In this post I will present Tao’s proof restricted to the following special case. (Most of the proof remains unchanged in the general case, but the presentation becomes easier when we avoid general {a,b}).

Theorem 3 Let {\lambda} denote the Liouville function. Then

\displaystyle \lim_{N/M\rightarrow\infty}\frac1{\log(N/M)}\sum_{n=M}^N\frac1n\lambda(n)\lambda(n+1)=0

We will use a correspondence principle to reformulate the problem in the language of dynamical systems to, among other things, simplify the notation. This approach was also carried out by Tao in his blog, and the main motivation for me to write this post was to try to use an easier construction of a Furstenberg system, which would still keep track of the “congruence classes of the points” in the system, as detailed below the fold.

— 1. The correspondence principle —

Since the proof of Theorem 3 requires the use of both addition and multiplication (unlike, say, certain proofs of Szemerédi’s theorem in arithmetic progressions, which don’t use the multiplicative action of the integers on themselves), we can not simply quote the classical Furstenberg Correspondence principle, which translates the combinatorial problem into an ergodic theoretic situation which only allows one to use (the dynamical version of) addition. Instead we will build a measure preserving system from scratch using the Stone-Čech compactification {\beta{\mathbb N}} of the natural numbers (see my previous post for a construction of {\beta{\mathbb N}} using ultrafilters). The idea of obtaining a correspondence principle using {\beta{\mathbb N}} was used by Beigleböck in his short proof of Jin’s theorem (discussed in this previous post). I also adapted a similar idea in my paper on {\{x+y,xy\}} patterns to construct a dynamical system where both addition and multiplication were acting.

I will use the fact that {\beta{\mathbb N}} can be identified with the space of all ultrafilters {{\mathtt p}} over {{\mathbb N}}. Recall that an ultrafilter is a collection {{\mathtt p}} of subsets of {{\mathbb N}} with {{\mathbb N}\in{\mathtt p}} and satisfying the properties that if {A\in{\mathtt p}} and {A\subset B} then {B\in{\mathtt p}}, if {A,B\in{\mathtt p}} then {A\cap B\in{\mathtt p}} and {A\in {\mathtt p}\iff({\mathbb N}\setminus A)\notin{\mathtt p}}. Each natural number {n\in{\mathbb N}} gives rise to an ultrafilter, called a principal ultrafilter, defined by {{\mathtt p}_n:=\{A\subset{\mathbb N}:n\in A\}}. Any ultrafilter not of this form is called non-principal. We identify {{\mathbb N}} with the subset of {\beta{\mathbb N}} consisting of the principal ultrafilters in the obvious way.

The topology on {\beta{\mathbb N}} can be described as follows. For every {A\subset{\mathbb N}}, the collection of ultrafilters {\overline{A}:=\{{\mathtt p}\in\beta{\mathbb N}:A\in{\mathtt p}\}} is an open set. In fact open sets of this kind form a basis for the topology on {\beta{\mathbb N}}. It is clear that {\overline{A}=\beta{\mathbb N}\setminus\overline{{\mathbb N}\setminus A}}, which implies that {\overline{A}} is also closed. In fact it is the closure of {A} in {\beta{\mathbb N}}. Observe that {{\mathbb N}} is a dense subset of {\beta{\mathbb N}}.

With this topology, {\beta{\mathbb N}} is a compact Hausdorff space, so the Riesz Representation theorem applies and hence Radon measures on {\beta{\mathbb N}} can be identified with positive linear functionals on the space {C(\beta{\mathbb N})} of continuous functions on {\beta{\mathbb N}}. For each {n\in{\mathbb N}}, the point mass {\delta_n} is the probability measure on {\beta{\mathbb N}} defined by {\int_{\beta{\mathbb N}}f\,d\delta_n=f({\mathtt p}_n)} for every {f\in C(\beta{\mathbb N})}. Let {\mu_{M,N}} denote the logarithmic average of the measures {\delta_n} as {n} ranges over {[M,N]}. In other words

\displaystyle \mu_{M,N}:=\frac1{\log N/M}\sum_{n=M}^N\frac1n\delta_n.

Observe that, in order to establish Theorem 3 it suffices to show that for every sequences {(M_k)_{k\in{\mathbb N}},(N_k)_{k\in{\mathbb N}}}, with {N_k/M_k\rightarrow\infty}, the sequence

\displaystyle \frac1{\log N_k/M_k}\sum_{n=M_k}^{N_k}\frac1n\lambda(n)\lambda(n+1)

has {0} as an accumulation point. Next fix arbitrary increasing sequences {(M_k)_{k\in{\mathbb N}},(N_k)_{k\in{\mathbb N}}} and take {\mu} to be any weak{^*} accumulation point of the sequence {\big(\mu_{M_k,N_k}\big)_{k\in{\mathbb N}}}. Let {T:\beta{\mathbb N}\rightarrow\beta{\mathbb N}} be the {\mu}-preserving transformation defined by {T{\mathtt p}={\mathtt p}+1:=\{A\subset{\mathbb N}:A-1\in{\mathtt p}\}}. Given a function {f:\beta{\mathbb N}\rightarrow{\mathbb C}}, we denote by {Tf} the composition {f\circ T}.

Since {\beta{\mathbb N}} is the Stone-Čech compactification of {{\mathbb N}}, every bounded function {f:{\mathbb N}\rightarrow{\mathbb C}} can be uniquely identified with a continuous function {\overline{f}\in C(\beta{\mathbb N})}, with the property that {\overline{f}({\mathtt p}_n)=f(n)} for every {n\in{\mathbb N}}. We will abuse the notation by referring to {{\mathtt p}_n} simply as {n} and to {\overline{f}} simply as {f}. Therefore, Theorem 3 is equivalent to the statement that for every probability measure {\mu} obtained as above

\displaystyle \int_{\beta{\mathbb N}}\lambda\cdot T\lambda\,d\mu=0.\ \ \ \ \ (1)

Next we introduce multiplication in the system. Given an ultrafilter {{\mathtt p}\in\beta{\mathbb N}} and {n\in{\mathbb N}}, denote by {n{\mathtt p}} the ultrafilter {n{\mathtt p}:=\{A\subset{\mathbb N}:A/n\in{\mathtt p}\}}, where {A/n:=\{x\in{\mathbb N}:nx\in A\}}. We now introduce the transformation {M^n:\beta{\mathbb N}\rightarrow\beta{\mathbb N}} given by {M^n{\mathtt p}:=n{\mathtt p}}, and given a function {f:\beta{\mathbb N}\rightarrow{\mathbb C}} we denote by {M^nf} the composition {f\circ M^n}. To understand how {\mu} behaves under the multiplication {M^n} action, take {N\in{\mathbb N}} and {f\in C(\beta{\mathbb N})} and observe that

\displaystyle \int_{\beta{\mathbb N}}M^nf\,d\mu_{M,N} = \frac1{\log N/M}\sum_{i=M}^N\frac{f(in)}i = \frac1{\log N/M}\sum_{i=nM}^{nN}\frac{f(i)}{i/n}1_{n{\mathbb N}}(i).


\displaystyle \sum_{i\in[M,N]\triangle[nM,nN]}\frac1i\leq\sum_{i=M}^{nM}\frac1i+\sum_{i=N}^{nN}\frac1i\leq 2\log n+2

we deduce that

\displaystyle \int_{\beta{\mathbb N}}M^nf\,d\mu_{M,N} =\frac1{\log\tfrac NM}\sum_{i=M}^{N}\frac{f(i)}{i}n1_{n{\mathbb N}}(i)+O\left(\frac{\log n}{\log\tfrac NM}\right) =\int_{\beta{\mathbb N}}f\cdot n\cdot 1_{n{\mathbb N}}\,d\mu_{M,N}+O\left(\frac{\log n}{\log\tfrac NM}\right)

Taking {N/M\rightarrow\infty}, we conclude that

\displaystyle  \int_{\beta{\mathbb N}}M^nf\,d\mu=\int_{\beta{\mathbb N}}f\cdot n\cdot 1_{n{\mathbb N}}\,d\mu \ \ \ \ \ (2)

for every measure {\mu} constructed as above. This equation would not hold in general for measures {\mu} which correspond to arbitrary invariant means and is the reason why the logarithmic averaging are used in this proof; indeed we will not make reference to the method of summation beyond using (2).

Given a prime {p}, notice that {\lambda(pn)=-\lambda(n)}. This means that the continuous functions {M^p\lambda} and {-\lambda} on {\beta{\mathbb N}} agree on the dense set {{\mathbb N}}, and hence are the same function. Using (2) and the distributivity law {T^uM^v=M^vT^{uv}} we have that

\displaystyle \int_{\beta{\mathbb N}}\lambda\cdot T\lambda\,d\mu= \int_{\beta{\mathbb N}}M^p\lambda\cdot TM^p\lambda\,d\mu= \int_{\beta{\mathbb N}}M^p\big(\lambda\cdot T^p\lambda\big)\,d\mu= \int_{\beta{\mathbb N}}\lambda\cdot T^p\lambda\cdot p\cdot 1_{p{\mathbb N}}\,d\mu.

More generally, for any finite set {P} of primes we conclude

\displaystyle  \int_{\beta{\mathbb N}}\lambda\cdot T\lambda\,d\mu=\frac1{|P|}\sum_{p\in P}\int_{\beta{\mathbb N}}\lambda\cdot T^p\lambda\cdot p\cdot1_{p{\mathbb N}}\,d\mu. \ \ \ \ \ (3)

To establish (1), which will finish the proof of Theorem 3, we need to show that the left hand side of (3) is {0}. We will do this by showing that the right hand side of (3) gets arbitrarily small for appropriate choices of the set {P}. To achieve this we will use two ingredients: the Matomäki-Radziwill theorem on averages of the Liouville function on short intervals, and the entropy decrement argument of Tao.

— 2. The Matomäki-Radziwill theorem —

The first case of Chowla’s conjecture, Conjecture 1 above, is when {F} is a singleton. In this case it states that the averages {\frac1N\sum_{n=1}^N\lambda(n)} tend to {0} as {N\rightarrow\infty}, and this is known to be equivalent to the prime number theorem. On the other hand, if the Chowla conjecture holds, then there will be arbitrarily long intervals where {\lambda} is constant, so one can not replace the average over the initial interval {[1,N]} with averages over arbitrary intervals of the form {[M,M+N]}. Nevertheless, Matomäki and Radziwill showed that the average of {\lambda} is small for “most” intervals. As a corollary of their main result, using now logarithmic averages, we have

Theorem 4 (Matomäki-Radziwill)

\displaystyle \lim_{H\rightarrow\infty}~\limsup_{N/M\rightarrow\infty}~\frac1{\log\tfrac NM}~\sum_{n=M}^N~\frac1n\left|\frac1H\sum_{h=1}^H\lambda(n+h)\right|=0.

This result was shortly after improved by Matomäki Radziwill and Tao, allowing some “twisting” by exponential functions.

Theorem 5 (Matomäki-Radziwill-Tao) Let {\alpha\in{\mathbb R}} and denote by {e(x):=e^{2\pi ix}}. Then

\displaystyle \lim_{H\rightarrow\infty}~\limsup_{N/M\rightarrow\infty}~\frac1{\log\tfrac NM}~\sum_{n=M}^N~\frac1n\left|\frac1H\sum_{h=1}^H\lambda(n+h)e(h\alpha)\right|=0.

We can interpret this theorem in the context of the measure preserving system constructed in the previous section as saying that

\displaystyle \lim_{H\rightarrow\infty}\int_{\beta{\mathbb N}}\left|\frac1H\sum_{h=1}^He(h\alpha)T^h\lambda \right|=0.

If {g} is any eigenfunction of the system {(\beta{\mathbb N},\mu,T)}, say {Tg=e(\alpha)g}, then

\displaystyle \int_{\beta{\mathbb N}}\lambda\cdot g\,d\mu= \frac1H\sum_{h=1}^N\int_{\beta{\mathbb N}}T^h\lambda\cdot T^h g\,d\mu= \int_{\beta{\mathbb N}}\frac1H\sum_{h=1}^Ne(h\alpha)T^h\lambda\cdot g\,d\mu.

Since eigenfunctions are bounded, we can use Holder’s inequality now to conclude that {\lambda} is orthogonal to every eigenfunction of {(\beta{\mathbb N},\mu,T)}.

It now follows from a theorem of Frantzikinakis, Host and Kra, which is in turn based on deep results of Green and Tao, and of Green, Tao and Ziegler; that

\displaystyle  \lim_{N\rightarrow\infty}\frac1{\pi(N)}\sum_{p\leq N}T^p\lambda=0 \qquad\text{in }L^2(\beta{\mathbb N},\mu) \ \ \ \ \ (4)

where the sum is taken over primes {p}.

For each {N\in{\mathbb N}}, let {P_N} be the set of primes that lie between {N/2} and {N}. In view of (4), in order to show that the left hand side of (3) is {0} (which will finish the proof of Theorem 3), we need to show that

\displaystyle  \liminf_{N\rightarrow\infty}\left|\frac1{|P_N|}\sum_{p\in P_N}\int_{\beta{\mathbb N}}\lambda\cdot T^p\lambda\cdot\big(p\cdot1_{p{\mathbb N}}-1\big)\,d\mu\right|=0 \ \ \ \ \ (5)

— 3. A probabilistic interpretation —

In order to establish (5) we will use a probabilistic argument. Let {X_N:\beta{\mathbb N}\rightarrow\{-1,1\}^{[0,N]}} be the function

\displaystyle X_N=\big(T^n\lambda\big)_{n\in[0,N]},

let {Y_N:\beta{\mathbb N}\rightarrow\prod_{p\in P_N}({\mathbb Z}/p{\mathbb Z})} be the function

\displaystyle Y_N({\mathtt p})=\big({\mathtt p}\bmod p\big)_{p\in P_N},

and let {F_N:\{-1,1\}^{[0,N]}\times\prod_{p\in P_N}({\mathbb Z}/p{\mathbb Z})\rightarrow{\mathbb R}} be the function

\displaystyle F_N\big((\xi_n)_{n\in[0,N]},(a_p)_{p\in P_N}\Big):=\frac1{|P_N|}\sum_{p\in P_N}\xi_0\xi_p\big(p\cdot 1_{\{0\}}(a_p)-1\big)

Then we can rewrite (5) as

\displaystyle  \liminf_{N\rightarrow\infty}\left|\int_{\beta{\mathbb N}}F_N(X_N,Y_N)\,d\mu\right|=0 \ \ \ \ \ (6)

If for some {N\in{\mathbb N}} the random variables {X_N} and {Y_N} are independent, then in particular {\lambda\cdot T^p\lambda} and {p1_{p{\mathbb N}}-1} are orthogonal for every {p\in P_N} and hence {\int_{\beta{\mathbb N}}F_N(X_N,Y_N)\,d\mu=0}. Unfortunately, we will not be able to show that {X_N} and {Y_N} are independent infinitely often (or even for some {N}), which would imply (6), so instead we will need a quantitative version of this observation.

A quantitative version of independence is provided by the notion of mutual information, defined below. As we will show, as long as the mutual information between {X_N} and {Y_N} doesn’t grow too fast along some subsequence, then we still obtain (6). I explore the notion of entropy and its connection with “information” in my previous post; here are the relevant definitions.

Definition 6 Let {(\Omega,\mu)} be a probability space and let {X} and {Y} be two (measurable) functions on {\Omega} with a finite range.

  • The entropy of {X} is

    \displaystyle H(X):=-\sum_{x}\mu(X^{-1}(x))\log\big(\mu(X^{-1}(x))\big),

    where the sum is over the (essential) range of {X}.

  • The mutual information between {X} and {Y} is the quantity

    \displaystyle I(X:Y):=H(X)+H(Y)-H(X,Y)

  • Given a set {A\subset\Omega} with positive measure, we denote by {H(Y\mid A)} the conditional entropy of {Y} on {A}, defined as the entropy of {Y} with respect to the conditional probability measure {\mu_A}, which gives {B\subset\Omega} the measure {\mu_A(B)=\mu(A\cap B)/\mu(A)}.

Observe that {X} and {Y} are independent if and only if {H(X,Y)=H(X)+H(Y)} (see, for instance, this proposition in my earlier post), which is also equivalent to {I(X:Y)=0}. An application of Jensen’s inequality shows that any random variable taking values in a finite set {R} has entropy at most {\log|R|}, with maximum attained when the random variable is uniformly distributed. This and the fact that {Y_N} is uniformly distributed in its range will be important in what follows. In order to show that a lack of mutual information between {X_N} and {Y_N} suffices to obtain a useful upper bound on {\left|\int_{\beta{\mathbb N}}F_N(X_N,Y_N)\,d\mu\right|}, we will use the following Pinsker type inequality.

Lemma 7 Let {(\Omega,\mu)} be a probability space, let {R} be a finite set and let {Y:\Omega\rightarrow R}. For every {E\subset R_Y},

\displaystyle \mu\big(Y^{-1}(E)\big)\leq\frac{\log |R|-H(Y)+1}{\log(|R|/|E|)}

Proof: This is Lemma 1 in this post of Tao. \Box

Two random variables on the probability space {(\Omega,\mu)}, say {X:\Omega\rightarrow R_X} and {Y:\Omega\rightarrow R_Y}, are independent if and only if the diagonal coupling is the same as the product coupling, in the sense that for every function {F:R_X\times R_Y\rightarrow{\mathbb R}},

\displaystyle \int_\Omega F(X(\omega),Y(\omega))\,d\mu(\omega)=\int_\Omega\int_\Omega F(X(\omega_1),Y(\omega_2))\,d\mu(\omega_1)\,d\mu(\omega_2).

The next lemma exploits a lack of mutual information between two random variables to establish a quantitative version of this fact (for certain functions {F}).

Lemma 8 Let {(\Omega,\mu)} be a probability space, let {X:\Omega\rightarrow R_X} and {Y:\Omega\rightarrow R_Y} for some finite sets {R_X} and {R_Y}, let {\epsilon>0} and {P\in{\mathbb N}}. Assume that {Y} is uniformly distributed on {R_Y} and that {I(X:Y)<\epsilon^4 P}. Let {F:R_X\times R_Y\rightarrow[0,1]} be such that for every {x\in R_X}, the random variable {F(x,Y)} is a sum of {P} independent random variables all taking values in {[0,1]}. Then

\displaystyle \left|\int_\Omega F\big(X(\omega),Y(\omega)\big)\,d\mu(\omega) - \int_{\Omega^2}F\big(X(\omega_1),Y(\omega_2)\big)\,d\mu(\omega_2)\,d\mu(\omega_1)\right| \leq3\epsilon+\frac1{\epsilon^2P}

Proof: Since {Y} is uniformly distributed, for any set {A\subset\Omega} we have {H(Y\mid A)\leq H(Y)}. A quick computation shows that

\displaystyle \epsilon^4P\geq I(X:Y)=\sum_{x\in R_X}\mu\big(X^{-1}(x)\big)\Big(H(Y)-H\big(Y\mid X^{-1}(x)\big)\Big)

Applying Markov’s inequality to the function {f(\omega)=H(Y)-H\big(Y\mid X^{-1}(X(\omega))\big)}, whose expectation is {I(X:Y)}, we find that the set {A\subset R_X} consisting of those {x\in R_X} with {H(Y)-H\big(Y\mid X^{-1}(x)\big)<\epsilon^3P} is large, in the sense that the union

\displaystyle S_1:=\bigcup_{a\in F_X\setminus A}X^{-1}(a)

has measure {\mu(S_1)<\epsilon}.

For each {a\in A}, denote by {\mu_a} the conditional measure on {X^{-1}(a)}. Then Lemma 7 implies that for each {a\in A} and every set {B\subset R_Y},

\displaystyle  \mu_a\big(Y^{-1}(B)\big)\leq\frac{\epsilon^3P+1}{\log\big(|R_Y|/|B|\big)}.\ \ \ \ \ (7)

On the other hand, we can apply Hoeffding’s inequality to deduce that, for each {a\in A}, the set {B_a\subset R_Y} defined by

\displaystyle B_a:=\left\{b:\left|F(a,b)-\int_\Omega F\big(a,Y(\omega)\big)\,d\mu(\omega)\right|\geq\epsilon\right\}

has cardinality at most {|B_a|\leq e^{-\epsilon^2P}|R_Y|}. In other words, {\log\big(|R_Y|/|B_a|\big)\geq\epsilon^2P}. Using (7) it follows that for each {a\in A},

\displaystyle \mu_a\big(Y^{-1}(B_a)\big)\leq\frac{\epsilon^3P+1}{\epsilon^2P}=\epsilon+\frac1{\epsilon^2P}.

Now let {S_2=\bigcup_{a\in A}\big\{\omega\in X^{-1}(a):Y(\omega)\in B_a\big\}}. We can estimate its measure by

\displaystyle \mu(S_2)=\sum_{a\in A}\mu\big(X^{-1}(a)\cap Y^{-1}(B_a)\big)=\sum_{a\in A}\mu\big(X^{-1}(a)\big)\mu_a\big(Y^{-1}(B_a)\big)\leq\epsilon+\frac1{\epsilon^2P}.

Finally, if {\omega\in\Omega} is not in {S_1\cup S_2}, then {X(\omega)\in A} and {Y(\omega)\notin B_a}, which implies that

\displaystyle \left|F\big(X(\omega),Y(\omega)\big)-\int_\Omega F(X(\omega),Y(s)\,d\mu(s)\right|\leq \epsilon.

Since {\mu(S_1\cup S_2)\leq2\epsilon+\frac1{\epsilon^2P}} and {\|F\|_\infty\leq1}, integrating over {\Omega}, we conclude that

\displaystyle \left|\int_\Omega F\big(X(\omega),Y(\omega)\big)\,d\mu(\omega) - \int_{\Omega^2}F\big(X(\omega_1),Y(\omega_2)\big)\,d\mu(\omega_2)\,d\mu(\omega_1)\right| \leq3\epsilon+\frac1{\epsilon^2P}


Applying Lemma 8 to {F=F_N}, {X=X_N}, {Y=Y_N}, {P=|P_N|=\frac N{2\log N}(1+o(1))}, and {\epsilon=N^{-1/3}}, it follows that, in order to establish (6) it suffices to show that

\displaystyle  \liminf_{N\rightarrow\infty}\frac{I(X_N,Y_N)}{|P_N|}=0,\ \ \ \ \ (8)

which is the goal of the last section.

— 4. The Entropy Decrement Argument —

In this section we make no use of the fact that {\lambda} is the Liouville function. In particular, the argument works (and (8) holds) if we replace {\lambda} with any function with a finite image!

The definition of the entropy of a random variable {X:\Omega\rightarrow R} only depends on the partition {\mathcal{X}:=\{X^{-1}(x):x\in R\}} of {\Omega} induced by {X}. If {T:\Omega\rightarrow\Omega} is a measure preserving transformation, then the random variable {TX} induces a different partition then {X}, but there is a bijection between the two partitions that preserves the measure. Therefore {H(X)=H(TX)}. Given two random variables {X,Y}, the conditional entropy is defined as {H(X\mid Y)=H(X,Y)-H(Y)=H(X)-I(X,Y)}. Therefore

\displaystyle H(TX\mid TY)=H(TX,TY)-H(TY)=H(X,Y)-H(Y)=H(X\mid Y).

Recall that {X_N=\big(T^n\lambda\big)_{n\in[0,N]}}. Introducing the new random variables {X_{M,M+N}:=\big(T^n\lambda\big)_{n\in[M+1,M+N]}} it is clear that {X_{M,M+N}=T^MX_N} and so its entropy does not depend on {M}. On the other hand, {X_{M+N}=(X_M,X_{M,M+N})}, so {H(X_{M+N})\leq H(X_M)+H(X_N)}.

The partition of {\beta{\mathbb N}} induced by {Y_N} is invariant under {T}, and therefore

\displaystyle H(X_{M,M+N}\mid Y_N)=H(T^MX_N\mid Y_N)=H(T^MX_N\mid T^MY_N)=H(X_N\mid Y_N)

which in turn implies that

\displaystyle  \begin{array}{rcl}  \displaystyle H(X_{M+N}\mid Y_N) &\leq& \displaystyle H(X_M\mid Y_N)+H(X_{M,M+N}\mid Y_N) \\&=& \displaystyle H(X_M\mid Y_N)+H(X_N\mid Y_N) \end{array}

Iterating this fact we deduce that, for every {N,k\in{\mathbb N}},

\displaystyle H(X_{kN}\mid Y_N)\leq kH(X_N\mid Y_N).

Finally we deduce

\displaystyle  \begin{array}{rcl}  \displaystyle H(X_{kN}) &\leq& \displaystyle H(X_{kN},Y_N)= H(X_{kN}\mid Y_N)+H(Y_N)\leq kH(X_N\mid Y_N)+H(Y_N) \\&=& \displaystyle kH(X_N)-kI(X_N,Y_N)+H(Y_N) \end{array}

Recall that we are trying to establish (8). Suppose, for the sake of a contradiction, that

\displaystyle \liminf_{N\rightarrow\infty}\frac{I(X_N,Y_N)}{|P_N|}>0

and let \epsilon<\liminf_{N\to\infty}I(X_N,Y_N)/|P_N|. Then for every large enough N we have I(X_N,Y_N)\geq\epsilon|P_N| and thus

\displaystyle  H(X_{kN})\leq kH(X_N)-k\epsilon|P_N|+H(Y_N) \ \ \ \ \ (9)

On the other hand, using the prime number theorem we can estimate {|P_N|=(1/2+o(1))N/\log N} and

\displaystyle H(Y_N)=\log\left(\prod_{p\in P_N}p\right)\leq \log N|P_N|=N\big(1/2+o(1)\big)

so for every {N} large enough, dividing (9) by {kN} we obtain

\displaystyle \frac{H(X_{kN})}{kN}\leq\frac{H(X_N)}N-\frac{\epsilon}{3\log N}+\frac1k

and choosing {k=6\log N/\epsilon}, we deduce that

\displaystyle  \frac{H(X_{kN})}{kN}\leq \frac{H(X_N)}N-\frac{\epsilon}{6\log N} \ \ \ \ \ (10)

Finally, let {N_1} be the first value of {N} for which (10) holds, and let {N_{i+1}=N_i(6\log N_i/\epsilon)} we deduce that

\displaystyle  \frac{H(X_{N_i})}{N_i}\leq\frac{H(X_{N_1})}{N_1}-\frac{\epsilon}6\sum_{j=1}^i\frac{1}{\log N_j}. \ \ \ \ \ (11)

Since the sequence {M_i:=\log N_i} satisfies the recurrence relation {M_{i+1}=M_i+\log(6M_i/\epsilon)}, a quick induction implies that {M_i\leq Ci\log i} for some absolute constant {C}, and hence the right hand side of (11) becomes negative for large enough {i}, which is clearly a contradiction. We conclude that (8) holds and this finishes the proof of Theorem 3.

This entry was posted in Number Theory, Probability and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s