Sparse & Redundant Representations and Their Applications in Signal and Image Processing
Sparseland: An Estimation Point of View
Michael Elad, The Computer Science Department, The Technion - Israel Institute of Technology, Haifa 32000, Israel
A Strange Experiment
What If?
Consider the denoising problem
$\min_\alpha \|\alpha\|_0 \;\text{s.t.}\; \|D\alpha - z\|_2^2 \le \varepsilon^2$,
and suppose that we can find a group of $J$ candidate solutions $\{\alpha_j\}_{j=1}^J$ such that
$\|\alpha_j\|_0 \ll n$ and $\|D\alpha_j - z\|_2^2 \le \varepsilon^2$.
Basic questions:
o What could we do with such a set of competing solutions in order to better denoise z?
o Why should this help?
o How shall we practically find such a set of solutions?
Relevant work: [Leung & Barron ('06)] [Larsson & Selen ('07)] [Schintter et al. ('08)] [Elad and Yavneh ('08)] [Giraud ('08)] [Protter et al. ('10)]
Why Bother?
o Because each representation conveys a different story about the signal.
o Because pursuit algorithms are often wrong in finding the sparsest representation, and relying on their solution alone is too sensitive.
o And then, maybe, there are deeper reasons?
Generating Many Sparse Solutions
Our answer: randomizing the OMP.
Initialization: $k = 0$, $\alpha_0 = 0$, $r_0 = z - D\alpha_0 = z$, and $S_0 = \emptyset$.
Main iteration ($k \leftarrow k+1$):
1. Compute $p(i) = |d_i^T r_{k-1}|$ for $1 \le i \le m$.
2. Choose $i_0$ s.t. $\forall\, 1 \le i \le m$, $p(i_0) \ge p(i)$.
3. Update the support: $S_k = S_{k-1} \cup \{i_0\}$.
4. LS: $\alpha_k = \arg\min_\alpha \|D\alpha - z\|_2^2$ s.t. $\mathrm{supp}(\alpha) = S_k$.
5. Update the residual: $r_k = z - D\alpha_k$.
If $\|r_k\|_2 \le \varepsilon$, stop; otherwise iterate.
We randomize step 2: choose $i_0$ at random, with probability proportional to $\exp\{c \cdot |d_i^T r_{k-1}|^2\}$.
For now, let's set the parameter c manually for best performance. Later we shall define a way to set it automatically.
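To make the randomized selection step concrete, here is a minimal Python sketch of Random-OMP written against the notation above (D, z, ε, c). It is an illustration under the stated model, not the authors' reference code; the squared correlation in the exponent matches the |s| = 1 analysis that appears later in this section, and the stability shift is an implementation choice.

```python
import numpy as np

def random_omp(D, z, eps, c, rng):
    """Random-OMP sketch: like OMP, but the atom at each step is drawn at
    random with probability ~ exp(c * (d_i^T r)^2) instead of the argmax."""
    n, m = D.shape
    support = []
    alpha = np.zeros(m)
    r = z.copy()
    while np.linalg.norm(r) > eps and len(support) < n:
        p = D.T @ r                                  # correlations d_i^T r
        w = np.exp(c * (p**2 - np.max(p**2)))        # shifted for numerical stability
        w[support] = 0.0                             # never re-pick a chosen atom
        i0 = rng.choice(m, p=w / w.sum())            # the randomized selection step
        support.append(int(i0))
        coef, *_ = np.linalg.lstsq(D[:, support], z, rcond=None)  # LS on the support
        alpha = np.zeros(m)
        alpha[support] = coef
        r = z - D[:, support] @ coef                 # residual update
    return alpha
```

Taking c to infinity recovers the deterministic OMP choice, while c = 0 draws atoms uniformly; intermediate values trade exploration against greediness.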
Let's Try The Following:
o Form a random dictionary $D$ of size $100 \times 200$.
o Multiply it by a sparse vector $\alpha_0$ having 10 non-zeros.
o Add Gaussian iid noise $v$ with $\sigma = 1$, and obtain $z = D\alpha_0 + v$.
o Solve the $P_0$ problem, $\min_\alpha \|\alpha\|_0$ s.t. $\|D\alpha - z\|_2^2 \le \varepsilon^2$, by OMP, and obtain $\alpha^{\mathrm{OMP}}$.
o Use Random-OMP and obtain the set $\{\alpha_j^{\mathrm{RandOMP}}\}$.
Let's look at the obtained representations.
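A possible script for this experiment, reusing random_omp from above. The seed, the choice c = 0.5, and the stopping threshold ε = σ√n are illustrative picks, not values taken from the slides:

```python
rng = np.random.default_rng(0)
n, m, k, sigma = 100, 200, 10, 1.0
D = rng.standard_normal((n, m))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
alpha0 = np.zeros(m)
alpha0[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
z = D @ alpha0 + sigma * rng.standard_normal(n)

eps = sigma * np.sqrt(n)                         # stop once the residual hits the noise level
cands = [random_omp(D, z, eps, c=0.5, rng=rng) for _ in range(1000)]

def rel_err(a):                                  # the denoising measure used below
    return np.sum((D @ a - D @ alpha0)**2) / np.sum((z - D @ alpha0)**2)
```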
Results: The Obtained Set $\{\alpha_j^{\mathrm{RandOMP}}\}$
o The OMP gives the sparsest solution; the cardinalities of the Random-OMP solutions span a somewhat wider range.
o As expected, all the representations satisfy $\|D\alpha - z\|_2^2 \le 100$.
Results: Denoising Performance
$\frac{\|D\hat{\alpha}^{\mathrm{OMP}} - D\alpha_0\|_2^2}{\|z - D\alpha_0\|_2^2} = 0.1753$
o Even though the OMP solution is the sparsest, it is not the most effective for denoising.
o The cardinality of a representation does not reveal its denoising efficiency.
And Now to the Surprise
Let's propose the average
$\hat{\alpha} = \frac{1}{1000} \sum_{j=1}^{1000} \alpha_j^{\mathrm{RandOMP}}$
as our representation. This representation is not sparse at all, and yet
$\frac{\|D\hat{\alpha} - D\alpha_0\|_2^2}{\|z - D\alpha_0\|_2^2} = 0.05$,
compared to 0.1753 for the OMP.
Is It Consistent? Let's repeat this experiment many (1000) times:
o Dictionary (random) of size $n = 100$, $m = 200$.
o The true support of $\alpha$ is of size 10.
o We run OMP for denoising.
o We run RandOMP $J = 1000$ times and average: $\hat{\alpha} = \frac{1}{J} \sum_{j=1}^{J} \alpha_j^{\mathrm{RandOMP}}$.
o Denoising is assessed as before.
Average results: OMP: 0.1808; RandOMP: 0.1077. [Scatter plot of the per-trial errors; a few points mark cases of a zero solution.]
The results of 1000 trials lead to the same conclusion. How could we explain this?
A Crash-Course on Estimation Theory
Defining Our Goal
o We are interested in the signal x. Unfortunately, we get instead a measurement z.
o We do know that z is related to x via the conditional probabilities P(z|x) or P(x|z).
o Estimation theory is all about algorithms for inferring x from z based on these probabilities.
o Obviously, a key element in our story is the need to know these P's; sometimes this is a tough task by itself.
The Maximum-Likelihood (ML)
o The conditional P(z|x) is known as the likelihood function, describing the probability of getting the measurements z if x is given.
o A natural estimator to propose is the Maximum Likelihood:
$\hat{x}_{\mathrm{ML}} = \arg\max_x P(z|x)$
o ML is very popular, but in many situations it is quite weak or even useless. This brings us to the Bayesian approach.
The MAP Estimation
o This is a deep philosophical change in the approach we take, as now we consider x as random as well.
o Due to Bayes' rule we have that
$P(x|z) = \frac{P(z|x)\, P(x)}{P(z)}$
o The Maximum A-posteriori Probability (MAP) estimator suggests
$\hat{x}_{\mathrm{MAP}} = \arg\max_x P(x|z) = \arg\max_x P(z|x)\, P(x)$
The MAP: A Closer Look
The MAP estimator is given by
$\hat{x}_{\mathrm{MAP}} = \arg\max_x P(x|z) = \arg\max_x P(z|x)\, P(x)$
o In words, it seeks the most probable x given z, which fits exactly our story (z is given while x is unknown).
o MAP resembles ML with one major distinction: P(x) enters and affects the outcome. This is known as the prior.
o This is the Bayesian approach in estimation, but MAP is not the ONLY Bayesian estimator.
MMSE Estimation
Given the posterior P(x|z), what is the best one could do? Answer: find the estimator that minimizes the Mean-Squared-Error, given by
$\mathrm{MSE} = \int \|\hat{x} - x\|_2^2\, P(x|z)\, dx = E\left[\|\hat{x} - x\|_2^2 \mid z\right]$
Let's take a derivative w.r.t. the unknown $\hat{x}$ and equate to zero:
$2 \int (\hat{x} - x)\, P(x|z)\, dx = 0 \;\Rightarrow\; \hat{x} = \frac{\int x\, P(x|z)\, dx}{\int P(x|z)\, dx} = E[x \mid z]$
The MMSE estimation is a conditional expectation.
MMSE versus MAP
o In general, these two estimators are different.
o They align if the posterior is symmetric and unimodal (e.g., a Gaussian distribution).
o Typically, the MMSE is much harder to compute, which explains the popularity of MAP.
[Figure: a skewed posterior $P(x|z)$, with $\hat{x}_{\mathrm{MAP}}$ at its peak and $\hat{x}_{\mathrm{MMSE}}$ at its mean.]
Sparseland: An Estimation Point of View
Our Signal Model
o $D \in \mathbb{R}^{n \times m}$ is fixed and known.
o Assume that $\alpha$ is built by:
  o Choosing the support s with probability P(s): the entries enter the support independently, with $P(k \in S) = p$.
  o Choosing the coefficients $\alpha_s$ as iid Gaussian entries, $\alpha_k \sim \mathcal{N}(0, \sigma_x^2)$.
o The ideal signal is $x = D\alpha = D_s \alpha_s$.
o Consequently, $P(\alpha)$ and $P(x)$ are known.
Adding Noise
o The noise v is additive, white, and Gaussian: $z = x + v = D\alpha + v$, with
$P(z|x) = C \cdot \exp\left\{-\frac{\|x - z\|_2^2}{2\sigma^2}\right\}$
o Thus $P(z|\alpha)$ and $P(\alpha|z)$, and even $P(z|s)$ and $P(s|z)$, could all be derived.
So, Let's Estimate
Given $P(\alpha|z)$ or $P(s|z)$, we may consider:
o MAP: $\hat{\alpha}_{\mathrm{MAP}} = \arg\max_\alpha P(\alpha|z)$, or $\hat{s}_{\mathrm{MAP}} = \arg\max_s P(s|z)$.
o MMSE: $\hat{\alpha}_{\mathrm{MMSE}} = E[\alpha \mid z]$.
o The oracle: estimation with a known support s.
Why the oracle? Because:
o It is a building block in the derivation of both the MAP and the MMSE.
o It poses a bound on the best achievable performance.
Deriving the Oracle Estimate
$P(\alpha_s \mid z, s) = \frac{P(z \mid \alpha_s, s)\, P(\alpha_s \mid s)}{P(z \mid s)}$, where
$P(z \mid \alpha_s, s) \propto \exp\left\{-\frac{\|D_s \alpha_s - z\|_2^2}{2\sigma^2}\right\}$ and $P(\alpha_s \mid s) \propto \exp\left\{-\frac{\|\alpha_s\|_2^2}{2\sigma_x^2}\right\}$.
Defining
$Q_s = \frac{1}{\sigma^2} D_s^T D_s + \frac{1}{\sigma_x^2} I$ and $h_s = \frac{1}{\sigma^2} D_s^T z$,
the oracle estimate is
$\hat{\alpha}_s = Q_s^{-1} h_s$.
This is both the MAP and the MMSE (for a known support), and the estimate of x is obtained by $\hat{x} = D_s \hat{\alpha}_s$.
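In code, the oracle is a single regularized least-squares solve. A minimal sketch of the formula above (the function name is mine; it reuses numpy from the earlier snippets):

```python
def oracle(D, z, support, sigma, sigma_x):
    """Oracle estimate for a known support s (here the MAP and the MMSE
    coincide): alpha_s = Q_s^{-1} h_s, with Q_s and h_s as defined above."""
    Ds = D[:, list(support)]
    Q = Ds.T @ Ds / sigma**2 + np.eye(Ds.shape[1]) / sigma_x**2
    h = Ds.T @ z / sigma**2
    alpha_s = np.linalg.solve(Q, h)
    return alpha_s, Ds @ alpha_s    # the coefficients, and x_hat = D_s alpha_s
```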
The MAP Estimate of the Support
$\hat{s}_{\mathrm{MAP}} = \arg\max_s P(s|z) = \arg\max_s \frac{P(z|s)\, P(s)}{P(z)}$
Using a marginalization trick we get
$P(z|s) = \int P(z \mid s, \alpha_s)\, P(\alpha_s)\, d\alpha_s \propto \exp\left\{\tfrac{1}{2} h_s^T Q_s^{-1} h_s - \tfrac{1}{2}\log(\det(Q_s)) - |s| \log \sigma_x\right\}$
The expression within the integral is purely Gaussian, and thus a closed-form expression is within reach.
The MAP Estimate of the Support (cont.)
Based on our prior for generating the support s:
$P(s) = p^{|s|} (1-p)^{m-|s|} \propto \left(\frac{p}{1-p}\right)^{|s|}$
Combining the two, we get
$\hat{s}_{\mathrm{MAP}} = \arg\max_s\; \exp\left\{\tfrac{1}{2} h_s^T Q_s^{-1} h_s - \tfrac{1}{2}\log(\det(Q_s))\right\} \cdot \left(\frac{p}{(1-p)\,\sigma_x}\right)^{|s|}$
The MAP Estimate: Implications
$\hat{s}_{\mathrm{MAP}} = \arg\max_s\; \exp\left\{\tfrac{1}{2} h_s^T Q_s^{-1} h_s - \tfrac{1}{2}\log(\det(Q_s))\right\} \cdot \left(\frac{p}{(1-p)\,\sigma_x}\right)^{|s|}$
o The MAP estimator requires testing all the possible supports for the maximization.
o Once the support is found, the oracle formula is used to obtain $\hat{\alpha}_s$.
o This process is usually impossible due to the combinatorial number of possibilities.
o This is why we rarely use the exact MAP, and replace it with an approximation (e.g., OMP).
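For a tiny dictionary the exhaustive search is actually feasible, which is how the demo later in this section evaluates the exact estimators. A sketch, assuming the oracle helper above and a cap kmax on the support size (both function names are mine):

```python
from itertools import combinations
from math import log

def log_score(D, z, s, sigma, sigma_x, p):
    """Unnormalized log P(s|z), following the closed form above."""
    if len(s) == 0:
        return 0.0                   # the empty support carries only the constant term
    Ds = D[:, list(s)]
    Q = Ds.T @ Ds / sigma**2 + np.eye(len(s)) / sigma_x**2
    h = Ds.T @ z / sigma**2
    return (0.5 * h @ np.linalg.solve(Q, h)
            - 0.5 * np.linalg.slogdet(Q)[1]
            + len(s) * log(p / ((1 - p) * sigma_x)))

def map_support(D, z, sigma, sigma_x, p, kmax):
    """Exact MAP of the support by exhaustive search: viable only for tiny m."""
    m = D.shape[1]
    cands = (s for k in range(kmax + 1) for s in combinations(range(m), k))
    return max(cands, key=lambda s: log_score(D, z, s, sigma, sigma_x, p))
```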
The MMSE Estimation
$\hat{\alpha}_{\mathrm{MMSE}} = E[\alpha \mid z] = \sum_s P(s|z)\, E[\alpha \mid z, s]$
o $E[\alpha \mid z, s] = Q_s^{-1} h_s$ on the support s: this is the oracle for s.
o $P(s|z) \propto P(s)\, P(z|s) \propto \exp\left\{\tfrac{1}{2} h_s^T Q_s^{-1} h_s - \tfrac{1}{2}\log(\det(Q_s))\right\} \cdot \left(\frac{p}{(1-p)\,\sigma_x}\right)^{|s|}$
Thus
$\hat{\alpha}_{\mathrm{MMSE}} = \sum_s P(s|z)\, \hat{\alpha}_s$
The MMSE Estimation: Implications
$\hat{\alpha}_{\mathrm{MMSE}} = \sum_s P(s|z)\, \hat{\alpha}_s$
o The best estimator (in terms of the $L_2$ error) is a weighted average of many sparse representations!!!
o As such, it is not expected to be sparse at all.
o As in the MAP case, one cannot compute this expression, as the summation is over a combinatorial set of possibilities. We should propose approximations here as well.
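The same enumeration that served the exact MAP also gives the exact MMSE, now as a posterior-weighted average of oracle solutions. A sketch reusing log_score and oracle from above (again for tiny dictionaries only):

```python
def mmse_estimate(D, z, sigma, sigma_x, p, kmax):
    """Exact MMSE by enumeration: a posterior-weighted average of the oracle
    solutions over all supports up to size kmax (combinatorial in m!)."""
    m = D.shape[1]
    scores, estimates = [], []
    for k in range(kmax + 1):
        for s in combinations(range(m), k):
            scores.append(log_score(D, z, s, sigma, sigma_x, p))
            a = np.zeros(m)
            if k > 0:
                a[list(s)] = oracle(D, z, s, sigma, sigma_x)[0]
            estimates.append(a)
    w = np.exp(np.array(scores) - max(scores))   # stable, unnormalized P(s|z)
    return (w / w.sum()) @ np.vstack(estimates)  # the weighted average
```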
Sparseland: Approximate Estimation
The Case of |s| = 1
$P(s|z) \propto \exp\left\{\tfrac{1}{2} h_s^T Q_s^{-1} h_s - \tfrac{1}{2}\log(\det(Q_s))\right\} \cdot \left(\frac{p}{(1-p)\,\sigma_x}\right)^{|s|}$, where
$Q_s = \frac{1}{\sigma^2} D_s^T D_s + \frac{1}{\sigma_x^2} I$ and $h_s = \frac{1}{\sigma^2} D_s^T z$.
o |s| = 1 implies that $D_s$ has only one column $d_k$, and thus $Q_s = \frac{1}{\sigma^2} + \frac{1}{\sigma_x^2}$ (a scalar, for normalized atoms) and $h_s = \frac{d_k^T z}{\sigma^2}$.
o The right-most term is a constant and can be omitted.
o A little bit of algebra and we get
$P(s|z) \propto \exp\left\{\frac{\sigma_x^2}{2\sigma^2(\sigma^2 + \sigma_x^2)}\, (d_k^T z)^2\right\}$
The Case of |s| = 1: A Closer Look
$P(s|z) \propto \exp\left\{\frac{\sigma_x^2}{2\sigma^2(\sigma^2 + \sigma_x^2)}\, (d_k^T z)^2\right\}$; the constant multiplying $(d_k^T z)^2$ is exactly the c in the Random-OMP.
Based on this we can propose the first step of a greedy algorithm, for both the MAP and the MMSE:
o MAP: choose the atom with the largest $|d_k^T z|$ value (out of m). This is the same as what the OMP does.
o MMSE: compute these m probabilities, and draw an atom at random from this distribution. This is exactly what the Random-OMP does.
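This closes the loop on the "set c manually" remark from the Random-OMP slide: the model parameters dictate it. A one-liner (the function name is mine) that can be passed as the c argument of the random_omp sketch above:

```python
def randomp_c(sigma, sigma_x):
    """The value of c suggested by the |s| = 1 analysis above; plugging it
    into random_omp replaces the manual tuning with an automatic choice."""
    return sigma_x**2 / (2 * sigma**2 * (sigma**2 + sigma_x**2))
```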
What About the Next Steps?
Suppose we have k-1 non-zeros, and we are about to choose the k-th one:
o Option 1: compute the probabilities
$P(s|z) \propto \exp\left\{\tfrac{1}{2} h_s^T Q_s^{-1} h_s - \tfrac{1}{2}\log(\det(Q_s))\right\}$
for the $m-k+1$ candidate supports in which the k-1 already-chosen atoms are fixed, and use this distribution to either maximize or draw a random atom.
o Option 2 (simpler): use the same rule as in the first step to proceed, with the current residual replacing z:
$P(s|z) \propto \exp\left\{\frac{\sigma_x^2}{2\sigma^2(\sigma^2 + \sigma_x^2)}\, (d_k^T r_{k-1})^2\right\}$
This second option is what the Random-OMP implements.
A Demo
These results correspond to a small dictionary ($10 \times 16$), where the combinatorial formulas can be evaluated exactly as well.
Parameters: n = 10, m = 16, p = 0.1, $\sigma_x$ = 1, J = 50 (RandOMP), averaged over 1000 experiments.
[Figure: relative representation mean-squared-error as a function of the noise level, for the Oracle, MMSE, MAP, OMP, and Rand-OMP estimators.]
A Few Words on the Unitary Case
We will not derive the equations for this case, and simply show the outcome: both the MAP and the MMSE have closed-form solutions.
With c as defined above and the Wiener factor $\gamma^2 = \frac{\sigma_x^2}{\sigma^2 + \sigma_x^2}$, the MAP estimate is obtained by hard thresholding:
$\hat{\alpha}_k^{\mathrm{MAP}} = \begin{cases} \gamma^2\, d_k^T z, & c\,(d_k^T z)^2 \ge \log\left(\frac{1-p}{p} \cdot \frac{\sqrt{\sigma^2 + \sigma_x^2}}{\sigma}\right) \\ 0, & \text{otherwise} \end{cases}$
[Figure: the resulting hard-thresholding curve $\hat{\alpha}_k^{\mathrm{MAP}}$ vs. $d_k^T z$, for p = 0.1, $\sigma$ = 0.3, $\sigma_x$ = 1.]
A Few Words on the Unitary Case (cont.)
As for the MMSE estimate, we get a smoothed shrinkage curve:
$\hat{\alpha}_k^{\mathrm{MMSE}} = \frac{\frac{p}{1-p}\,\tilde{c}\,\exp\{c\,(d_k^T z)^2\}}{1 + \frac{p}{1-p}\,\tilde{c}\,\exp\{c\,(d_k^T z)^2\}} \cdot \gamma^2\, d_k^T z$, with $\tilde{c} = \frac{\sigma}{\sqrt{\sigma^2 + \sigma_x^2}}$.
This leads to a dense representation vector, just like the ones we have seen earlier, both in the deblurring result and in the synthetic experiment.
[Figure: the resulting shrinkage curve $\hat{\alpha}_k^{\mathrm{MMSE}}$ vs. $d_k^T z$, for p = 0.1, $\sigma$ = 0.3, $\sigma_x$ = 1.]
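Both closed forms act coordinate-by-coordinate on $\beta_k = d_k^T z$, so they are a few lines of vectorized code. A sketch of my reconstruction above (the function name is mine):

```python
def unitary_map_mmse(beta, sigma, sigma_x, p):
    """Per-coordinate MAP (hard threshold) and MMSE (shrinkage) estimates for
    a unitary dictionary; beta = d_k^T z, possibly an array."""
    c = sigma_x**2 / (2 * sigma**2 * (sigma**2 + sigma_x**2))
    gamma2 = sigma_x**2 / (sigma**2 + sigma_x**2)        # the Wiener factor
    ctil = sigma / np.sqrt(sigma**2 + sigma_x**2)
    q = (p / (1 - p)) * ctil * np.exp(c * beta**2)       # posterior odds that k is in s
    mmse = q / (1 + q) * gamma2 * beta                   # smoothed shrinkage
    thr2 = np.log((1 - p) / p / ctil) / c                # threshold on beta^2
    map_ = np.where(beta**2 > thr2, gamma2 * beta, 0.0)  # hard thresholding
    return map_, mmse
```

Evaluating it on beta = np.linspace(-3, 3, 601) with p = 0.1, σ = 0.3, σ_x = 1 reproduces curves of the kind shown in the figures.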
A Demo
These results correspond to a unitary dictionary of size n = 100. In this case the exact MAP and MMSE are accessible, and we also have formulae for their errors.
Parameters: n = 100, m = 100, p = 0.1, $\sigma_x$ = 1, 5000 experiments.
[Figure: relative representation mean-squared-error as a function of the noise level; empirical vs. theoretical curves for the Oracle, MMSE, and MAP estimators.]
MMSE: Back to Reality
The Main Lesson
The main lesson to take from the above discussion is this: even under the Sparseland model, which assumes that our signals are built as sparse combinations of atoms, estimation of such signals is better done using a dense representation.
Implications
o This is a counter-intuitive statement, and yet it is critical to the use of Sparseland.
o This may explain a few of the results we saw earlier:
  o In the deblurring experiment, we got the best result for a very dense representation.
  o In the thresholding-based denoising experiment, the best threshold led to a dense representation.
o Warning: this does not mean that any dense representation is good!!
Merging MMSE Estimation into Our Algorithms
The concept of MMSE estimation can be added to various algorithms in order to boost their performance. For example:
o In the deblurring task, one may seek several sparse explanations for the signal by Random-OMP and then average them. This is challenging due to the dimensions involved.
o When denoising patches (e.g., in the K-SVD algorithm), one could replace the OMP by a Random-OMP and get a better denoising outcome. In practice, the benefit of this is low, due to the later patch-averaging.
Here is Another Manifestation of MMSE: Consider the Following Rationale
o Observation: K-SVD denoising applied to portions of an image leads to improved results.
o Thus: it seems that a locally adaptive dictionary is beneficial. At the extreme, use a different dictionary for each pixel, with the surrounding patches serving as the atoms.
o However, such atoms are noisy and too hard to train, so a sparse representation in this case is a bad idea.
o Alternative: the MMSE estimate, allowing a dense combination of atoms.
The NLM Algorithm: The Practice
$\hat{x}_{k_0} = \frac{\sum_k w_{k,k_0}\, R_k z}{\sum_k w_{k,k_0}}$, where $w_{k,k_0} = \exp\left\{-\frac{\|R_k z - R_{k_0} z\|_2^2}{h^2}\right\}$
($R_k$ extracts the patch centered at location k.)
Comments:
o Once the cleaned patches are created, they are averaged as they overlap. Note that the original NLM simply uses the center pixel, and thus averaging is not done.
o Our interpretation of the NLM is very far from the original idea the authors had in mind.
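A minimal 1-D sketch of this "practice" variant, where every patch is rebuilt as a weighted average of nearby patches and the overlapping estimates are then averaged. The patch size, search window, and h are illustrative defaults, and the function name is mine:

```python
def nlm_denoise_1d(z, patch=7, search=10, h=0.5):
    """NLM sketch on a 1-D signal: weights follow w = exp(-||R_k z - R_k0 z||^2 / h^2),
    and the overlapping cleaned patches are averaged at the end."""
    half = patch // 2
    zp = np.pad(z, half, mode="reflect")
    patches = np.stack([zp[i:i + patch] for i in range(len(z))])
    acc = np.zeros(len(z) + 2 * half)                 # accumulated patch estimates
    cnt = np.zeros_like(acc)                          # overlap counts
    for k0 in range(len(z)):
        lo, hi = max(0, k0 - search), min(len(z), k0 + search + 1)
        d2 = ((patches[lo:hi] - patches[k0]) ** 2).sum(axis=1)
        w = np.exp(-d2 / h**2)                        # the weights w_{k,k0}
        est = w @ patches[lo:hi] / w.sum()            # the cleaned patch at k0
        acc[k0:k0 + patch] += est
        cnt[k0:k0 + patch] += 1
    return (acc / cnt)[half:-half]
```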
Summary
o Sparsity and redundancy are used for denoising of signals/images. How?
o Estimation theory tells us exactly what should be done (or approximated).
o So, can we do better than OMP? Averaging leads to better denoising, as it approximates the MMSE.
o All this provides an interesting explanation of earlier observations.