## The restricted isometry property.

I have recently worked on a paper about RIP matrices, and decided to learn better about these objects. This post is the adaptation of some notes I wrote on the subject.

Let ${K<n}$ be natural numbers. A vector ${x\in{\mathbb R}^n}$ is ${K}$-sparse if it has at most ${K}$ non-zero entries.

Definition 1 (RIP matrices) Let ${n>m}$ be positive integers, let ${\Phi}$ be an ${m\times n}$ matrix with real entries, let ${\delta>0}$ and let ${K<m}$ be an integer. We say that ${\Phi}$ is ${(K,\delta)}$-RIP if for every ${K}$-sparse vector ${x\in{\mathbb R}^n}$ we have

$\displaystyle (1-\delta)\|x\|\leq\|\Phi x\|\leq(1+\delta)\|x\|$

Thus ${\Phi}$ acts almost as an isometry when we restrict our attention to ${K}$-sparse vectors.
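For small dimensions the constant ${\delta}$ can be computed by brute force: for a fixed support ${S}$ of size ${K}$, the extreme values of ${\|\Phi x\|/\|x\|}$ over vectors supported on ${S}$ are the largest and smallest singular values of the column submatrix ${\Phi_S}$. A minimal sketch, assuming NumPy (the function name `rip_constant` is mine):

```python
import itertools
import numpy as np

def rip_constant(Phi, K):
    """Smallest delta for which Phi is (K, delta)-RIP in the sense above.

    For each support S of size K, the extremes of ||Phi x|| / ||x|| over
    vectors supported on S are the largest and smallest singular values
    of Phi[:, S].  Brute force over all supports -- small cases only.
    """
    n = Phi.shape[1]
    delta = 0.0
    for S in itertools.combinations(range(n), K):
        sv = np.linalg.svd(Phi[:, list(S)], compute_uv=False)
        delta = max(delta, sv[0] - 1.0, 1.0 - sv[-1])
    return delta

# a matrix with orthonormal columns restricts to an exact isometry: delta = 0
Q = np.linalg.qr(np.random.default_rng(0).normal(size=(8, 4)))[0]
print(rip_constant(Q, 2))
```

Note that this enumeration is exponential in ${K}$; it is only a way to see the definition in action, not a practical test.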

The restricted isometry property is perhaps the 'simplest' property of a matrix which makes it 'good' for compressed sensing. It appeared as part of the outcome of a series of papers by Donoho and by Candès, Romberg and Tao in 2004-2006, where other slightly different properties were studied (see these papers and also this survey). The earliest appearance of the name RIP that I could find was here, but even there the exact property is a little different from the modern one (which first appeared, I believe, here).

In the next section I present a very short introduction to the ideas behind compressed sensing and how RIP matrices naturally arise. A more complete exposition of this material is Tao's blog post on this subject. In the final section I address the (pure) mathematical interest in RIP matrices, describe some random procedures to construct RIP matrices, and mention some recent efforts to find deterministic constructions.

— 1. Digital signals, compressed sensing and RIP —

A signal is a vector ${x\in{\mathbb R}^n}$ and can represent a digital sound or image. A measurement of ${x}$ is the inner product ${\langle x,\psi\rangle}$ with some given vector ${\psi\in{\mathbb R}^n}$. If we want to acquire all the entries (think pixels) of ${x}$, we need to make ${n}$ measurements, each time taking ${\psi}$ to be an element of the canonical basis and obtaining one entry (pixel). In this case the measurement matrix, i.e., the matrix whose rows are the vectors ${\psi}$, is just the identity matrix.

Taking the example when the signal ${x}$ is an image, note that the “interesting” images have a large degree of structure. Indeed, a typical (random) image will have each pixel independent of its neighbors, resulting in something like this picture.

A random image

In general, computer scientists realized that, after an orthonormal change of coordinates, an “interesting” signal ${x}$ will have very few non-zero coordinates, when compared to the dimension ${n}$. In other words, interesting signals are ${K}$-sparse for a relatively small ${K}$.

This fact has been used for several years to store signals (which are assumed to be sparse) in computers. Instead of storing all the entries of ${x}$, one stores only the non-zero entries together with their positions. However, during the process of acquiring the signal (for instance, taking a photograph) one still needs to capture all the coordinates of ${x}$.
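This storage scheme can be sketched in a few lines of Python (the helper names `sparse_encode` and `sparse_decode` are mine):

```python
def sparse_encode(x):
    """Store only the non-zero entries of x, as (position, value) pairs."""
    return len(x), [(i, v) for i, v in enumerate(x) if v != 0]

def sparse_decode(n, pairs):
    """Rebuild the full signal from its length and the stored pairs."""
    x = [0.0] * n
    for i, v in pairs:
        x[i] = v
    return x

n, pairs = sparse_encode([0.0, 0.0, 3.5, 0.0, -1.0, 0.0])
# pairs == [(2, 3.5), (4, -1.0)]; decoding recovers the signal exactly
```

For a ${K}$-sparse signal this needs only ${2K}$ numbers instead of ${n}$, but note that the full signal had to be in hand before encoding.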

The idea of compressed sensing is to use far fewer than ${n}$ measurements and still recover ${x\in{\mathbb R}^n}$, assuming it is sparse. If one knows which ${K}$ coordinates of ${x}$ are non-zero, one only needs ${K}$ measurements to encode ${x}$. However, different signals will have their non-zero entries in different places. Thus what is needed is a measurement matrix ${\Phi}$ such that ${x}$ can be uniquely recovered from ${\Phi x}$, assuming only that ${x}$ is ${K}$-sparse.

Lemma 2 Let ${\Phi}$ be an ${m\times n}$ measurement matrix such that any ${K}$ columns of ${\Phi}$ are linearly independent. Then any ${K/2}$-sparse vector ${x}$ can be uniquely determined from ${\Phi x}$.

Proof: Assume ${\Phi}$ is as described and ${\Phi x=\Phi y}$ for two ${K/2}$-sparse signals ${x,y\in{\mathbb R}^n}$. Then ${\Phi(x-y)=0}$. Since ${x-y}$ is ${K}$-sparse, this gives a vanishing linear combination of at most ${K}$ of the columns of ${\Phi}$; by the hypothesis on ${\Phi}$, that combination must be trivial, so ${x-y=0}$. $\Box$

Thus if any ${K}$ columns of a ${K\times n}$ matrix ${\Phi}$ are linearly independent, one can recover a ${K/2}$-sparse signal ${x\in{\mathbb R}^n}$ from only ${b=\Phi x\in{\mathbb R}^K}$. Unfortunately, this theoretical result gives a very slow algorithm:

Proposition 3 Let ${\Phi}$ be an ${m\times n}$ measurement matrix such that any ${K}$ columns of ${\Phi}$ are linearly independent. Let ${x\in{\mathbb R}^n}$, let ${b=\Phi x}$ and let

$\displaystyle x_0\in\{z\in{\mathbb R}^n:\Phi z=b\}\text{ be the vector with fewest non-zero coordinates}$

If ${x}$ is ${K/2}$-sparse, then ${x_0=x}$.

However, finding ${x_0}$ is an NP-hard problem, and thus impractical for real-life problems. An alternative is to find the vector in the affine subspace ${\{z\in{\mathbb R}^n:\Phi z=b\}}$ which minimizes the ${\|\cdot\|_1}$ norm:

$\displaystyle \text{Let }x_1\in\{z\in{\mathbb R}^n:\Phi z=b\}\text{ be the vector which minimizes }\|z\|_1 \ \ \ \ \ (1)$

There is, in general, no reason to believe that whenever ${x}$ is a sparse vector and ${b=\Phi x}$ we have ${x_1=x}$. However, that holds true for special types of measurement matrices, and in particular when the measurement matrix ${\Phi}$ is RIP:

Theorem 4 Let ${K<m<n}$ be positive integers, let ${\delta<1/3}$ and let ${\Phi}$ be an ${m\times n}$ matrix satisfying the ${(K,\delta)}$-RIP. If ${x\in{\mathbb R}^n}$ is ${K/4}$-sparse, ${b=\Phi x}$ and ${x_1}$ is given by (1), then ${x=x_1}$.

It is hard to decide whom to credit for this theorem. In this form it appeared in this paper, but it is only the latest stage in a long evolution of similar results, where the RIP is replaced with a slightly different property, some noise is present, or the conclusions are weakened. A good source for more information is this survey.

Theorem 4 tells us that, if one can easily construct RIP matrices, one can use significantly fewer than ${n}$ measurements of a signal ${x\in{\mathbb R}^n}$, and still recover the signal (assuming it is sparse). Moreover, it is computationally feasible to recover ${x}$ from the measurement ${b=\Phi x}$.
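The feasibility claim deserves a word: the minimization in (1) is a linear program, obtained by introducing slack variables ${t}$ with ${-t\leq z\leq t}$ and minimizing ${\sum_i t_i}$. A sketch using SciPy's `linprog` (the function name `basis_pursuit` and the small Gaussian ${\Phi}$ below are mine, for illustration; nothing certifies that this particular ${\Phi}$ is RIP):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, b):
    """Solve min ||z||_1 subject to Phi z = b, as a linear program.

    Variables (z, t) with -t <= z <= t; at the optimum t = |z|,
    so minimizing sum(t) minimizes the l1 norm of z.
    """
    m, n = Phi.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])   # objective: sum(t)
    A_ub = np.block([[np.eye(n), -np.eye(n)],       #  z - t <= 0
                     [-np.eye(n), -np.eye(n)]])     # -z - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([Phi, np.zeros((m, n))])       # Phi z = b
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]

# try to recover a 2-sparse signal in R^30 from 10 random measurements
rng = np.random.default_rng(0)
n, m = 30, 10
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))
x = np.zeros(n)
x[3], x[17] = 1.0, -2.0
b = Phi @ x
x1 = basis_pursuit(Phi, b)
print("recovery error:", np.linalg.norm(x1 - x))
```

The returned ${x_1}$ is always feasible (${\Phi x_1=b}$) and has ${\ell^1}$ norm at most that of ${x}$; whether it actually equals ${x}$ is exactly what Theorem 4 guarantees under the RIP hypothesis.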

This is most important in applications where taking measurements is 'expensive', such as an MRI scan or an astronomical observation. Another possible application is in digital cameras, which would not be required to process the captured signal and hence would save battery.

— 2. Constructing RIP matrices —

In view of Theorem 4, it is desirable to be able to construct RIP matrices. Specifically, one wants to find a ${(K,\delta)}$-RIP matrix of dimensions ${m\times n}$ with the number of measurements ${m}$ as small as possible relative to the dimension ${n}$ of the signal and the sparsity level ${K}$. For some reason, this is usually expressed as trying to maximize the sparsity level ${K}$ as a function of the number of measurements ${m}$ and the dimension ${n}$.

— 2.1. Random constructions —

The constructions of RIP matrices with the lowest number of measurements use randomness. Here I present just a few examples:

Theorem 5 Let ${m<n}$ be positive integers, let ${\delta>0}$ and let ${K=O\left(\frac m{\log^4n}\right)}$. Let ${\Phi}$ be the random matrix defined by one of the following methods:

• (Normal entries) Let the entries of ${\Phi}$ be i.i.d. with a normal distribution ${N(0,1/m)}$.
• (Bernoulli entries) Let the entries of ${\Phi}$ be i.i.d. with a Bernoulli distribution taking the values ${\pm1/\sqrt{m}}$, each with ${50\%}$ probability.
• (Random rows of the Discrete Fourier transform) Let ${A\subset\{0,\dots,n-1\}}$ be a random subset of size ${m}$. Let ${\Phi}$ be the matrix obtained from the Discrete Fourier transform matrix (i.e. the matrix ${F}$ with entries ${F[\ell,j]=e^{-2\pi i\ell j/n}/\sqrt{n}}$ for ${\ell,j\in\{0,\dots,n-1\}}$) by selecting the rows indexed by ${A}$.

Then ${\Phi}$ is ${(K,\delta)}$-RIP with probability ${p\approx1-e^{-n}}$.
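All three ensembles are straightforward to sample; a sketch assuming NumPy (the sizes ${m=64}$, ${n=256}$ are arbitrary, and of course sampling a matrix does not verify its RIP):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 64, 256

# (i) i.i.d. entries ~ N(0, 1/m)
Phi_normal = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))

# (ii) i.i.d. Bernoulli entries, +1/sqrt(m) or -1/sqrt(m) with probability 1/2
Phi_bernoulli = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

# (iii) m rows of the n x n DFT matrix F[l, j] = e^{-2 pi i l j / n} / sqrt(n),
# chosen uniformly at random without replacement
idx = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)
Phi_dft = F[rng.choice(n, size=m, replace=False)]
```

The normalizations are chosen so that the columns have (expected) unit norm, matching the near-isometry in Definition 1.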

The first two cases of Theorem 5 are from this paper, and the last case is from this one. We will not prove this theorem here, but it is worth noticing that the proof involves an old lemma of W. B. Johnson and J. Lindenstrauss (the father of the Fields Medalist E. Lindenstrauss) on Lipschitz mappings into Hilbert spaces.

Lemma 6 (Johnson-Lindenstrauss) Let ${k<n}$ be positive integers and let ${X\subset{\mathbb R}^n}$ be a finite set of cardinality ${k}$. For every ${\epsilon>0}$ there exists a linear map ${f:{\mathbb R}^n\rightarrow{\mathbb R}^m}$, with ${m=\frac{8\log k}{\epsilon^2}}$, such that

$\displaystyle (1-\epsilon)\|x-y\|^2\leq\|f(x)-f(y)\|^2\leq(1+\epsilon)\|x-y\|^2$

for all ${x,y\in X}$.
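A quick empirical illustration of the lemma, assuming NumPy: a random Gaussian map into the dimension ${m=\frac{8\log k}{\epsilon^2}}$ preserves the vast majority of pairwise distances (the lemma asserts the existence of a map preserving all of them; a single random draw may miss a few pairs):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, eps = 1000, 50, 0.5
m = int(8 * np.log(k) / eps ** 2)    # target dimension from the lemma (~125)

X = rng.normal(size=(k, n))                      # k points in R^n
L = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))    # candidate random linear map

preserved, total = 0, 0
for i in range(k):
    for j in range(i + 1, k):
        d = np.sum((X[i] - X[j]) ** 2)           # squared original distance
        d_proj = np.sum((L @ (X[i] - X[j])) ** 2)
        total += 1
        preserved += (1 - eps) * d <= d_proj <= (1 + eps) * d
print(preserved, "of", total, "pairwise distances preserved")
```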

— 2.2. Deterministic constructions —

It is somewhat surprising that, while almost every matrix is RIP, explicitly constructing one seems to be a very challenging problem. This situation appears in many other places in mathematics, for instance in explicitly finding normal numbers, or finding Ramsey graphs (graphs on ${n}$ vertices without any clique or independent set of size ${\log n}$). The general problem of constructing deterministic objects with a certain property, when it is known that a random object will have that property, has been called finding hay in a haystack.

The problem of finding a deterministic family of matrices of increasing dimensions which satisfy the restricted isometry property was explicitly formulated by Tao on his blog (he uses the term UUP, standing for uniform uncertainty principle, instead of RIP). This paper surveys several different approaches that have been used to make progress on this problem.

One of the most successful techniques to construct deterministic RIP matrices is based on the Gershgorin circle theorem:

Theorem 7 (Gershgorin) For each eigenvalue ${\lambda}$ of a ${K\times K}$ matrix ${A}$, there is an index ${i\in\{1,\dots,K\}}$ such that

$\displaystyle \left|\lambda-A[i,i]\right|\leq\sum_{j\neq i}\Big|A[i,j]\Big|$
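The theorem is easy to check numerically on a random example, assuming NumPy (the slack `1e-9` only absorbs floating-point error in the computed eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6
A = rng.normal(size=(K, K))

# radius of the i-th Gershgorin disc: sum of |A[i, j]| over j != i
radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))

# every eigenvalue lies in at least one disc centered at a diagonal entry
for lam in np.linalg.eigvals(A):
    assert min(abs(lam - A[i, i]) - radii[i] for i in range(K)) <= 1e-9
```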

Using this result one can show the following:

Proposition 8 Let ${n,m}$ be positive integers and let ${\Phi}$ be an ${m\times n}$ matrix each of whose columns has ${\ell^2}$ norm ${1}$. Then for any ${K}$ we have that ${\Phi}$ is ${\big(K,(K-1)\mu\big)}$-RIP, where ${\mu}$ is the maximum absolute inner product between two distinct columns of ${\Phi}$.
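In code, the quantity ${\mu}$ (the coherence of ${\Phi}$) is just the largest off-diagonal entry of the Gram matrix in absolute value. A sketch assuming NumPy (the function name `coherence` and the random example are mine; Proposition 8 is only useful when ${(K-1)\mu<1}$):

```python
import numpy as np

def coherence(Phi):
    """Largest |<column_i, column_j>| over pairs of distinct columns."""
    G = Phi.T @ Phi                      # Gram matrix; unit diagonal here
    return np.max(np.abs(G - np.diag(np.diag(G))))

rng = np.random.default_rng(3)
m, n = 20, 40
Phi = rng.normal(size=(m, n))
Phi /= np.linalg.norm(Phi, axis=0)       # normalize columns to unit l2 norm

mu = coherence(Phi)
K = 2
print("coherence:", mu, "-> (K, (K-1)*mu)-RIP with delta =", (K - 1) * mu)
```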

This proposition has been used to construct RIP matrices with sparsity ${K=\sqrt{m}}$. However, it has been shown that this is the highest level of sparsity one can get from this approach; this follows from the Welch bound.

This 'square root bottleneck' is present in most other deterministic constructions. The only construction (to date) to beat this bottleneck is due to J. Bourgain, S. Dilworth, K. Ford, S. Konyagin and D. Kutzarova, but even there they obtain only ${K=m^{1/2+\epsilon}}$ for an explicit ${\epsilon}$ of size

$\displaystyle \epsilon\approx 5.5169\times10^{-28}$

In a series of blog posts, Dustin Mixon and Afonso Bandeira increased this to

$\displaystyle \epsilon\approx4.4466\times10^{-24}$

which seems to be the best bound currently known.