Coded Acquisition of High Speed Videos with Multiple Cameras



CODED ACQUISITION OF HIGH SPEED VIDEOS WITH MULTIPLE CAMERAS

By Reza Pournaghi, M.A.Sc. (Electrical Engineering), McMaster University, Hamilton, Canada

A Thesis Submitted to the Department of Electrical & Computer Engineering and the School of Graduate Studies of McMaster University in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy

© Copyright by Reza Pournaghi, January 2015
All Rights Reserved

Doctor of Philosophy (2015)
(Electrical & Computer Engineering)
McMaster University, Hamilton, Ontario, Canada

TITLE: Coded Acquisition of High Speed Videos with Multiple Cameras
AUTHOR: Reza Pournaghi, B.Sc. (Information Technology Engineering), Sharif University of Technology, Tehran, Iran; M.A.Sc. (Electrical Engineering), McMaster University, Hamilton, Canada
SUPERVISOR: Dr. Xiaolin Wu
NUMBER OF PAGES: xiii, 162

To my lovely wife Azin, without whose caring support this would not have been possible. To my family, Farideh, Parviz and Mona, who always offered me unconditional love and support.

Abstract

High frame rate video (HFV) is an important investigational tool in science, engineering and the military. In ultrahigh speed imaging, the obtainable temporal, spatial and spectral resolutions are limited by the sustainable throughput of in-camera mass memory, the lower bound of exposure time, and illumination conditions. In order to break these bottlenecks, we propose a new coded video acquisition framework that employs K ≥ 2 cameras, each of which makes random measurements of the video signal in both the temporal and spatial domains. For each of the K cameras, this multi-camera strategy greatly relaxes the stringent requirements on memory speed, shutter speed, and illumination strength. The recovery of HFV from these random measurements is posed and solved as a large-scale ℓ1 minimization problem by exploiting the joint temporal and spatial sparsities of the 3D signal. Three coded video acquisition techniques with varied trade-offs between performance and hardware complexity are developed: frame-wise coded acquisition, pixel-wise coded acquisition, and column-row-wise coded acquisition. The performances of these techniques are analyzed in relation to the sparsity of the underlying video signal. To make ultra-high-speed cameras of coded exposure more practical and affordable, we develop a coded exposure video/image acquisition system by innovatively assembling multiple rolling shutter cameras. Each of the constituent rolling shutter cameras adopts a random pixel read-out mechanism by simply changing the read-out order of pixel rows from sequential to random. Simulations of these new image/video coded acquisition techniques are carried out and experimental results are reported.

Acknowledgements

I would like to express my great gratitude to my supervisor, Dr. Xiaolin Wu, for his support and encouragement. This thesis would not have been possible without his guidance, encouragement and patience. I would also like to thank my committee members, Dr. Shirani and Dr. Zhang, for their encouragement and insightful comments. I am also grateful to Dr. Davidson and Dr. Dumitrescu for helpful discussions. My appreciation goes to my colleagues at the Multimedia Signal Processing Laboratory, Dr. Dadkhah, Y. Fazliani and H. Rezaee, my friends at the ECE Department, H. Meshgi, P. Taatizadeh, A. Vali, Dr. Mousakhani, and all my friends who supported me and made my time a lot more fun. I would also like to acknowledge the administrative and technical staff of the ECE department of McMaster University, Cheryl Gies, Terry Greenlay, and Dan Manolescu, for their friendly assistance and expert technical support.

Notation and abbreviations

ADC: Analog-to-Digital Converter
AE: Auto-Exposure
CCD: Charge-Coupled Device
CFA: Color Filter Array
CMOS: Complementary Metal-Oxide-Semiconductor
CRW: Column-Row Wise
CS: Compressive Sensing
CSR: Centralized Sparse Representation
DBD: Distinct Block Diagonal
DCV: Deconvolution
DFT: Discrete Fourier Transform
FPS: Frames Per Second
FW: Frame Wise
HDR: High Dynamic Range
HFV: High Frame Rate Video
i.i.d.: Independent Identically Distributed
LFSR: Linear Feedback Shift Register

NCSR: Non-locally Centralized Sparse Representation
NSP: Null Space Property
PCA: Principal Component Analysis
PW: Pixel Wise
PSF: Point Spread Function
PSNR: Peak Signal-to-Noise Ratio
RBD: Repeated Block Diagonal
SCN: Sparse Coding Noise
SNR: Signal-to-Noise Ratio
SSD: Solid State Drive

Contents

Abstract iv
Acknowledgements v
Notation and abbreviations vi

1 Introduction
1.1 Relation to Compressive Sensing
1.2 Related Works
1.3 Contribution
1.4 Outline of Thesis
1.5 Overview of Compressive Sensing Theory
1.5.1 Compressive Sensing Framework
1.5.2 Conditions for Sparse Recovery
1.5.3 Designing a Stable Measurement Matrix

2 Multi-camera Coded Exposure Systems
2.1 Frame-wise Coded Exposure
2.2 Pixel-wise Coded Exposure
2.3 Column-row-wise Coded Exposure

3 Theoretical Analysis of the Coded Exposure Systems
3.1 RIP of Measurement Matrices of the Coded Exposure Systems
3.1.1 RIP for Frame-wise and Pixel-wise Coded Exposure Systems
3.1.2 RIP for Column-row-wise Coded Exposure System
3.1.3 Remarks on the RIP of Ȧf, Ȧp and Ȧcr

4 Coded Exposure System with Random Rolling Shutter
4.1 Coded Rolling Shutter Architecture in the Literature
4.2 Multi-camera Coded Exposure System with Coded Rolling Shutter Cameras

5 Recovery of High Frame-rate Video Signals
5.1 Sparse Analysis Model
5.2 Dictionary Based Sparse Coding
5.2.1 Dictionary Based Sparse Coding for HFV Recovery
5.2.2 Building PCA Sub-dictionaries
5.2.3 Non-local Estimate of Unknown Sparse Code
5.3 Algorithm Complexity
5.3.1 Complexity of Sparse Analysis Recovery Algorithm
5.3.2 Complexity of the Dictionary based Recovery Algorithm

6 System Attributes
6.1 HFV Recovery using Relative Displacement of Cameras
6.2 HFV Recovery with Parallel Processing
6.3 Capture and Recovery of Color Information
6.3.1 Capturing Color Information in Coded Exposure Systems

6.3.2 Recovering Color Information

7 Simulation Results and Discussions
7.1 Experiment 1
7.2 Experiment 2
7.3 Experiment 3
7.4 Experiment 4

8 Conclusion and Future Works
8.1 Future Works

References 143

A Toolbox 152
A.1 General Results
A.2 Proof of Lemmas

List of Figures

1.1 The schematic of proposed multiple coded exposure cameras
1.2 CS sampling process for compressible signals
1.3 Geometry of ℓ1 and ℓ2 minimization in 3-dimensional space
2.1 Capturing process of a camera of frame-wise coded exposure
2.2 Capturing process of a camera of pixel-wise coded exposure
2.3 Capturing process of a camera of the column-row-wise coded exposure
2.4 Schematic design of column-row-wise coded exposure
4.1 Typical CMOS rolling shutter imager with different exposure times
4.2 Multi-camera coded rolling shutter system
4.3 Coded readout schemes proposed in [32]
4.4 Coded exposure schemes proposed in [32]
4.5 Sample timing setting for reset and select signals in a coded rolling shutter camera system
4.6 Calculating random binary sequences of two coded rolling shutter cameras
6.1 Relative positioning of the cameras in the proposed multi-camera systems
6.2 Capturing color information with separate sensor arrays
6.3 Capturing color information with rotating color filters
6.4 Color information captured with a single sensor array with Bayer pattern

6.5 Different CFA patterns used in digital cameras
6.6 Multi-camera coded exposure system with random pattern CFA
6.7 Band difference images of natural signals
7.1 PSNR vs frame index of Airbag Explosion 1 HFV signal
7.2 PSNR vs frame index of Airbag Explosion 2 HFV signal
7.3 Average PSNR vs. number of cameras of column-row-wise coded exposure and coded rolling shutter schemes
7.4 Sorted magnitude of 3D DFT coefficients of HFV signals
7.5 PSNR vs frame index of the recovered HFV signals for recovery algorithms based on sparse analysis and sparse synthesis models
7.6 Snapshots of recovered HFV signals using recovery algorithms based on sparse synthesis and sparse analysis models
7.7 Snapshots of recovered HFV signals using recovery algorithms based on sparse synthesis and sparse analysis models
7.8 PSNR vs frame index of the recovered HFV signals using different exposure schemes and different number of cameras
7.9 PSNR vs frame index of the recovered HFV signals using different exposure schemes and different number of cameras
7.10 Snapshots of recovered Airbag Explosion 1 (1000fps) HFV sequence for different exposure schemes and for different number of cameras
7.11 Snapshots of recovered Car Crash (1000fps) HFV sequence for different exposure schemes and for different number of cameras
7.12 Snapshots of recovered Flying Bird (1000fps) HFV sequence for different exposure schemes and for different number of cameras

7.13 Snapshots of recovered Airbag Explosion 2 (2000fps) HFV sequence for different exposure schemes and for different number of cameras
7.14 Snapshots of recovered Cutting Apple (2000fps) HFV sequence for different exposure schemes and for different number of cameras
7.15 Snapshots of recovered Water Drop (2000fps) HFV sequence for different exposure schemes and for different number of cameras
7.16 PSNR vs frame index of the recovered HFV signals using the recovery algorithm with dictionary learning and the one with fixed sparsity basis
7.17 PSNR vs frame index of the recovered HFV signals using the recovery algorithm with dictionary learning and the one with fixed sparsity basis
7.18 Snapshots of recovered Airbag Explosion 1 (1000fps) HFV sequence for different recovery algorithms
7.19 Snapshots of recovered Car Crash (1000fps) HFV sequence for different recovery algorithms
7.20 Snapshots of recovered Flying Bird (1000fps) HFV sequence for different recovery algorithms
7.21 Snapshots of recovered Airbag Explosion 2 (2000fps) HFV sequence for different recovery algorithms
7.22 Snapshots of recovered Cutting Apple (2000fps) HFV sequence for different recovery algorithms
7.23 Snapshots of recovered Water Drop (2000fps) HFV sequence for different recovery algorithms

Chapter 1

Introduction

High speed photography enables investigations of high speed physical phenomena such as explosions, collisions, and animal kinesiology. High speed cameras find many applications in science, engineering research, safety studies, entertainment and defense [1]. Compared with conventional (low-speed) cameras, high frame rate video (HFV) cameras are very expensive. Despite their high costs, HFV cameras are still limited in obtainable joint temporal-spatial resolution, because current fast mass data storage devices (e.g., SSD) do not have high enough write speed to continuously record HFV at high spatial resolution. In other words, HFV cameras have to compromise spatial resolution in the quest for high frame rate. For instance, the HFV camera Phantom v710 of Vision Research can offer its full spatial resolution at 7530 frames per second (fps), but it has to reduce the spatial resolution when operating at 215,600 fps. This trade-off between frame rate and spatial resolution is forced by the mismatch between the ultra-high data rate of HFV and the limited bandwidth of in-camera memory. In addition, application scenarios exist in which the raw shutter speed is restricted by low illumination of the scene. Needless to say, these problems are aggravated if high spectral resolution of HFV is also desired. No matter how

sophisticated sensor and memory technologies become, new, more exciting and exotic applications will always present themselves that require imaging of ever more minuscule and subtle details of object dynamics. It is, therefore, worthwhile to research camera systems and accompanying image/video processing techniques that can push spatial, temporal and spectral resolutions to new limits. One way of breaking the bottlenecks for extreme imaging of high temporal-spatial-spectral fidelity is to use multiple cameras. In this thesis we propose a novel multi-camera coded video acquisition system that can capture a video at very high frame rate without sacrificing spatial resolution. In the proposed system, K component cameras are employed to collectively shoot a video of the same scene, but each camera adopts a different digitally modulated exposure pattern, called coded exposure or strobing in the literature. Coded exposure is a technique of computational photography which performs multiple exposures of the sensors at random during a frame time [2-5]. In our design of HFV acquisition by multiple coded exposures, the sequence of target HFV frames is partitioned into groups of T consecutive target frames each. Each of the K component cameras of the system is modulated by a random binary sequence of length T to open and close its shutter, and meanwhile the pixel sensors accumulate charge. The camera only reads out accumulated sensor values once per T target frames. In net effect, this coded acquisition strategy reduces the memory bandwidth requirement of all K cameras T-fold. Every T target frames are mapped by each of the K cameras to a different coded exposure image that is the result of summing some randomly selected target (sharp) HFV frames. The objective is to use these K coded exposure images to recover the corresponding T consecutive HFV frames by exploiting both spatial and temporal sparsities of the HFV video signal

and by solving a large-scale ℓ1 minimization problem. The architecture of the coded HFV acquisition system is depicted in Fig. 1.1.

Figure 1.1: The schematic of proposed multiple coded exposure cameras (showing the exposure time and temporal resolution of the cameras, the captured random measurements, the recovery algorithm, and the temporal resolution of the recovered signal).

1.1 Relation to Compressive Sensing

The proposed multi-camera coded video acquisition approach is a way of randomly sampling the video signal independent of signal structures. It recovers the HFV signal from a far smaller number of measurements than the total number of pixels of the video sequence. In this spirit of reduced sampling, the proposed coded HFV acquisition approach is similar to compressive sensing (CS) [6, 7]. However, our research is motivated not because the spatial resolution of the camera cannot be made sufficiently high, as assumed by CS researchers in their promotion of the single-pixel camera concept, but rather because no existing mass storage device is fast enough to accommodate the huge data throughput of high-resolution HFV. Granted, the proposed coded HFV acquisition approach in general does not satisfy the restricted isometry property (RIP) of CS. In general, it needs more measurements than are sufficient for CS coded acquisition to recover the underlying 3D video signal. But the CS sampling methodology is, in general, unsuitable

for acquiring HFV of complex scenes. The reason is that classic CS relies on dense random Gaussian and Bernoulli matrices, that is, dense random matrices with independent identically distributed (i.i.d.) entries drawn from a standard normal or Bernoulli distribution, to capture measurements. While these dense random matrices satisfy the properties imposed by CS theory with the optimal bound on the number of measurements (the bound on the number of measurements cannot get any smaller) [8], it is not practical to use such dense random measurement matrices in many applications, due to storage limitations and computational and/or physical constraints. As an example, consider the CS camera proposed in [9], which uses dense random measurement matrices. This camera sequentially makes random measurements of a single image (i.e., a frame in our case) one at a time. As such, the CS camera acquires an image in a duration that is orders of magnitude longer than conventional cameras, and it is useless for high-speed imaging. Therefore, for video acquisition, applying a dense random measurement matrix to the 3D data volume of an entire video incurs both high time and space complexities. For this reason, many CS based video acquisition systems use what is called coded exposure to take random measurements of pixels either in a temporal neighborhood [10-14] or a spatial neighborhood [15, 16]. Such limitations are not exclusive to image/video acquisition applications. For example, in some distributed sensing systems, each sensor can only access part of the signal being acquired, due to communication and/or environmental constraints [17]. For hyper-spectral imaging, applying a dense random measurement matrix to the entire signal requires simultaneous multiplexing in the spectral and spatial dimensions, which is a challenge with current optical and spectral modulators [18, 19]. These constraints make the measurements depend on a subset of signal elements

rather than the entire signal, i.e. each measurement is acquired via a local measurement operator [17]. This yields a new class of measurement matrices, known as structured random measurement matrices, including sub-sampled bounded orthonormal systems [20, 21], partial random circulant [22, 23] and Toeplitz [24] matrices (for acquisition systems involving convolution), and random block diagonal measurement matrices [17] in which only the diagonal blocks contain random values. The proposed HFV acquisition system acquires local measurements along the temporal dimension of the HFV signal. The measurement matrices of the proposed coded exposure schemes are random block diagonal matrices rather than dense ones. In Chapter 3 we prove that these measurement matrices satisfy the RIP. However, the number of required measurements depends on certain properties of the basis in which the HFV signals are sparse. Specifically, we prove that the measurement matrices of the pixel-wise and column-row-wise coded exposure schemes satisfy the RIP with approximately the same number of measurements (within a log factor) required by a dense random measurement matrix, despite having many fewer random entries.

1.2 Related Works

The idea of using multiple cameras to image high-speed phenomena was pioneered by Muybridge, who used multiple triggered cameras to capture the high speed motion of animals [25]. Ben-Ezra and Nayar [26] combined a high spatial resolution still camera and a high-speed but low resolution video camera to achieve high temporal and spatial resolution. Wilburn et al. [1] proposed a multi-camera HFV acquisition system. They used a dense array of K cameras of frame rate r to capture high speed videos of frame rate h = rK. The idea is to stagger the start times of the exposure durations of

these K cameras by 1/h. The captured frames are then interleaved in chronological order to generate the HFV. The drawback of this approach is that it is light-inefficient, as the cameras' exposure time is set to 1/h. Shechtman et al. [27] developed a super-resolution technique that fuses information from multiple low resolution videos of the same scene to construct a video sequence of high space-time resolution. Compared to [1], this method has a larger exposure time, allowing more light to be collected. However, the large exposure time acts as a box filter which destroys the high temporal frequencies of the captured signal. Agrawal et al. [28] proposed a coded sampling technique for temporal super-resolution. K cameras of frame rate r are used, each of which captures a different linear combination of K HFV frames in a time duration of 1/r. The K HFV frames are then recovered from the K captured frames by solving a system of linear equations. Compared to [1], the exposure time of this approach is increased by a factor of K/2, and the linear system is made invertible by employing a sampling strategy based on S-matrices. But this method can achieve a frame rate no higher than rK, just as in [1]. In contrast, the coded acquisition scheme proposed here aims to recover HFV signals at a frame rate much higher than rK, and it does so with much less motion blur than the method of [27], thanks to the broad-band property of temporal random sampling. Prior art on HFV acquisition also includes a class of methods that employ a single camera of coded exposure to acquire and recover HFV. Veeraraghavan et al. [29] proposed an interesting technique to capture periodic videos by a single camera of coded exposure. Their success relies on the strong sparsity afforded by the signal periodicity. Gupta et al. [30] and Bub et al. [31] adopt a type of random sampling which is classified by this thesis as pixel-wise coded exposure; namely, different pixels of the camera

are set on and off in time according to different random binary sequences. The two methods are designed to capture HFV signals at the expense of lower spatial resolution, and they also allow a post-capture trade-off between spatial and temporal resolutions; the trade-off is flexible in [30] and fixed in [31]. Gu et al. [32] proposed a coded rolling shutter architecture for CMOS image sensors that allows a spatio-temporal resolution trade-off. The authors used cubic and bidirectional interpolation as well as optical flow estimation to recover the HFV signal. Hitomi et al. [11] and Reddy et al. [12] proposed CS-based HFV acquisition techniques using a single camera of pixel-wise coded exposure, but they improved the performance of the previous methods by exploiting the sparsity of HFV signals more thoroughly. Specifically, a data-dependent over-complete dictionary was used in [11]; sparse representation of the spatial signal in the wavelet domain and brightness constancy in the temporal domain were used in [12]. Holloway et al. [13] proposed to acquire HFV with a camera that sums up randomly selected frames according to a binary sequence; this is the same approach adopted by the first paper on coded exposure [4], which we classify as frame-wise coded exposure to contrast with pixel-wise coded exposure. They used two different recovery algorithms: one based on the total variation of the spatio-temporal slices of the video (for higher speed), and the other based on a data-dependent over-complete dictionary (for higher reconstruction quality). Unlike the above methods, which use temporal multiplexing to realize coded exposure, the method [16] by Sankaranarayanan et al. adopts spatial multiplexing and recovers the HFV signal in two stages. In the first stage, a simple recovery algorithm is used to generate a low resolution preview signal; the preview signal is then used to estimate the motion of the full resolution video. In the next stage, the resulting motion estimates are used to recover the full-resolution video by

a CS-based convex-optimization algorithm. Compared to the multi-camera approach, the above single-camera HFV acquisition techniques have the advantages of lower cost and no need for image registration. As ingenious as some of these single-camera coded exposure techniques are, their achievable frame rate is inherently limited by the maximum amount of information a single camera can possibly obtain under the hardware constraints, such as read-out time, memory bandwidth, minimum exposure time, etc. This is reflected by the fact that the test videos used by the above reviewed papers on single-camera HFV acquisition do not involve extremely high speed phenomena (e.g., explosion, crash). In order to achieve frame rates of 1000 fps and higher, which is the main objective of this research, it is necessary to employ multiple cameras. Properly registering the captured frames prior to recovery is one of the challenges for multi-camera HFV acquisition systems. In this thesis it is assumed that the scene is either relatively planar or far away from the cameras, so that the captured images can be aligned using projective transforms [28]. When this assumption is not valid, more sophisticated registration methods are required.

1.3 Contribution

A family of multi-camera coded HFV acquisition techniques with different trade-offs between sampling efficiency and system complexity is investigated. The simplest one is called frame-wise coded exposure, in which all pixels of each camera share the same binary modulation pattern. The other is pixel-wise coded exposure, in which each pixel is modulated by a different binary random sequence in time during the exposure. Pixel-wise coded exposure is superior to frame-wise coded exposure

in reconstruction quality for a given number of measurements. However, the former requires considerably more complex and expensive control circuitry embedded in the pixel sensor array than the latter. To make the hardware complexity manageable, we propose a column-row-wise coded exposure strategy that can match the performance of pixel-wise coded exposure but at a fraction of the cost. A theoretical analysis of the proposed coded HFV acquisition schemes has been conducted. The resulting theorems explain the large gap in recovery performance between the frame-wise and the other two schemes. We prove that the random measurement matrix of column-row-wise coded exposure satisfies the Restricted Isometry Property with almost the same number of random measurements as pixel-wise coded exposure. This makes the column-row-wise coded exposure scheme an elegant HFV acquisition system that strikes a good balance between the prowess of the coded exposure scheme and the complexity/cost of the required imaging sensors. The theoretical results are corroborated by our experimental findings. We also propose a coded exposure technique called random coded exposure rolling shutter. Unlike the other coded exposure schemes, which use a global shutter, the random coded exposure rolling shutter uses a rolling shutter. Compared to the other schemes, this one is simpler to implement on CMOS sensor arrays, as it only requires modifying the logic of the address generator unit [32]. The proposed coded acquisition system has many advantages over existing HFV cameras. First, inexpensive conventional cameras can be used to capture high-speed videos. For example, the Phantom Flex high speed video camera of Vision Research, with a temporal resolution of >1000 fps, costs between $50,000 and $150,000, while a consumer camera with 60 fps and comparable spatial resolution costs less than $2,000.

This makes HFV camera systems more cost effective. Second, the new system does not compromise the spatial resolution of HFV, thanks to the drastic shortening of sensor readout time amortized over all target frames by coded acquisition. Third, the proposed camera system architecture is highly scalable: more cameras (possibly high-speed ones) can be employed to achieve extremely high frame rates, which are otherwise unobtainable. This is very much analogous to the multiprocessor technology that improves computation throughput when the speed of a single CPU cannot go any higher. Fourth, the multi-camera coded acquisition system can capture HFV in low illumination conditions, which is a beneficial side effect of using low-speed cameras. Finally, the proposed coded exposure system can also be used to capture images without motion blur, by recovering multiple (sharp) images at a higher temporal resolution than what is achievable by an individual camera. We have also developed techniques to capture and recover the color information of HFV signals. In particular, we capture color information by placing multi-band random color filter arrays (CFAs) in front of the sensor arrays. The recovery of color information is made possible by introducing additional inter-band sparsity constraints in the objective function of the recovery algorithm. The following publications were a result of work conducted during doctoral study:

Xiaolin Wu, Reza Pournaghi, "High Frame Rate Video Capture by Multiple Cameras with Coded Exposure," ICIP, 2010.

Reza Pournaghi, Xiaolin Wu, Xianming Liu, "Low Bit-rate Image Coding via Local Random Down-sampling," PCS, 2013.

Reza Pournaghi, Xiaolin Wu, "Coded Acquisition of High Frame Rate Video," IEEE Transactions on Image Processing, vol. 23, no. 12, 2014.
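As a rough illustration of the three acquisition schemes described above, the following NumPy sketch forms the coded exposure images for one group of T target frames. This is a toy model, not the actual sensor implementation: the sizes, the random codes, and in particular the outer-product form of the column-row-wise mask (a pixel assumed exposed only when both its row code and its column code are on) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, K = 8, 32, 32, 4          # target frames per group, frame size, cameras
video = rng.random((T, H, W))      # stand-in for the T sharp target HFV frames

# Frame-wise: all pixels of a camera share one binary code of length T,
# so the captured image is the sum of the randomly selected frames.
def frame_wise(video, code):
    return np.tensordot(code, video, axes=1)          # (H, W)

# Pixel-wise: every pixel has its own binary on/off sequence over the T frames,
# and integrates only the frames during which it is "on".
def pixel_wise(video, codes):
    return (codes * video).sum(axis=0)                # codes: (T, H, W)

# Column-row-wise: the per-frame mask is the outer product of a row code and a
# column code, so only two 1D random sequences per frame are needed.
def column_row_wise(video, row_codes, col_codes):
    masks = row_codes[:, :, None] * col_codes[:, None, :]   # (T, H, W)
    return (masks * video).sum(axis=0)

# Each of the K cameras captures one coded exposure image per group of T frames.
captured = np.stack([frame_wise(video, rng.integers(0, 2, T)) for _ in range(K)])
print(captured.shape)              # (4, 32, 32)
```

The recovery problem then amounts to estimating the T×H×W frames from these K images, which is where the sparsity-based ℓ1 recovery of the following chapters comes in.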

1.4 Outline of Thesis

The remainder of this thesis is structured as follows. In the next section, a brief introductory discussion of CS theory is presented, which is required for the theoretical analysis of the proposed schemes. In the next chapter, we introduce the concept of coded exposure and propose three HFV acquisition systems with multiple coded exposure cameras: the frame-wise, pixel-wise and column-row-wise coded exposure schemes. In Chapter 3, a CS based theoretical analysis of the proposed coded exposure systems is presented. In particular, we prove the RIP for the measurement matrices of the three coded exposure schemes. In Chapter 4, a coded exposure system with random coded exposure rolling shutter is presented. Unlike the other coded exposure systems introduced in Chapter 2, this one uses an electronic rolling shutter rather than a global shutter. Therefore, it is simpler to implement, as it does not require any hardware modification in the sensor array and can readily be implemented with standard CMOS sensors by altering the address generator logic of the control unit of the CMOS sensor array. We formulate the recovery of the HFV signal in the context of the sparse analysis model in Chapter 5. Two recovery algorithms are presented in this chapter. The first one uses total variation (TV) and Laplacian operators to exploit the temporal and spatial sparsities of the signal. The second recovery algorithm is a sparse coding algorithm that uses a learned dictionary for sparse representation of the signal. In Chapter 6, we show how the proposed coded exposure systems and the recovery algorithms can be modified to capture and recover the color information of the HFV signal. In this chapter we also reveal an important side benefit of the proposed sparsity-based HFV recovery algorithm and explain how it can be used to add a degree of randomness to

the measurements in the spatial domain. This side benefit enables us to recover HFV signals without the need for perfect registration. Simulation results are reported in Chapter 7 and we conclude in Chapter 8.

1.5 Overview of Compressive Sensing Theory

In this section, a brief introductory discussion of CS theory and of the conditions under which CS can recover compressible/sparse signals is presented. Before proceeding, let us first introduce some notations that will be used throughout this thesis. We reserve the letters $c, c_1, c_2, \dots$ to represent universal positive constants. $\mathbb{C}$ is used to denote the set of complex numbers and $\mathbb{R}$ is used to denote the set of real numbers. For $0 < p < \infty$, the $\ell_p$-norm of a vector $v \in \mathbb{C}^N$ is defined as

$$\|v\|_p = \left( \sum_{i=1}^{N} |v_i|^p \right)^{1/p}. \qquad (1.1)$$

The infinity norm, $\|v\|_\infty$, is defined as

$$\|v\|_\infty = \max_{i=1,\dots,N} |v_i| \qquad (1.2)$$

and the quantity $\|v\|_0$ is the number of nonzero elements in $v$, i.e.

$$\|v\|_0 = \#\{i : v_i \neq 0\}. \qquad (1.3)$$

The quantity $\|v\|_0$ is usually called the $\ell_0$-norm of vector $v$, even though it is not a norm, not even a quasi-norm. We reserve $x$ to represent a sparse vector and $f$ to represent

a general vector (not necessarily sparse). A vector $x$ is called sparse if it has only a few nonzero elements, and is called $S$-sparse if it has at most $S$ nonzero elements, i.e. $\|x\|_0 \le S$. $x$ is called nearly sparse if it has a few large elements and many zero or near-zero ones. A signal $f$ is said to have a sparse (or nearly sparse) representation in basis $\Psi$ if its vector of transform coefficients $x = \Psi^* f$ is sparse (or nearly sparse). We use $\Psi^*$ to denote the conjugate transpose of a complex matrix $\Psi$ and use $A^T$ to denote the transpose of a real valued matrix $A$. Let $I \subseteq \{1, 2, \dots, N\}$ be a subset of indices. We denote by $|I|$ the cardinality of the set $I$. By $v_I$ we mean a vector of length $N$ obtained by setting the entries of $v$ indexed by $I^c$ to zero, where $I^c$ is the complement of the set $I$ in $\{1, 2, \dots, N\}$. Similarly, by $A_I$ we mean an $M \times N$ matrix obtained by setting the columns of $A$ indexed by $I^c$ to zero. The null space (or kernel) of a matrix $A$ is defined as

$$\ker(A) = \{v : Av = 0\}. \qquad (1.4)$$

1.5.1 Compressive Sensing Framework

The research on compressive sensing, also known as compressed sensing, compressive sampling or sparse sampling, was pioneered by Candes, Romberg and Tao [6] and Donoho [7]. It questions the wisdom of conventional signal acquisition systems and introduces a new data acquisition method that captures and compresses data simultaneously. CS theory claims that, under certain conditions, a signal can be reconstructed with high probability from far fewer measurements than what is required in traditional methods. The conditions on which CS relies are signal sparsity

or compressibility and incoherence of the measurements to the original signal. A signal is sparse if it has only a few nonzero elements. A signal is compressible if it has a few large coefficients and many zero or near-zero ones when represented in an appropriate basis. Put differently, a signal is compressible if the sorted magnitudes of its coefficients in an appropriate basis decay quickly, typically like a power law. Such compressible signals can be well approximated by sparse signals. Incoherence means that the sampled signal, which has a (nearly) sparse representation in some basis Ψ, should have a spread-out representation in the domain in which it is acquired. In other words, if A is the measurement matrix used to acquire the signal, then the rows {A_i} of A should have an extremely dense representation in the sparsity basis Ψ. Let f be the signal of interest that has an S-sparse representation in an orthonormal basis Ψ. The CS capturing process, depicted in Fig. 1.2, can then be formulated as

y = Af = AΨx = Ωx   (1.5)

where the matrix A of size M × N represents the CS linear measurement process of capturing M ≪ N measurements from signal f (or x), and Ω = AΨ. The objective is to recover x from the under-sampled measurements y, assuming Ω is known. Once x* is recovered, f can easily be obtained as f* = Ψx*. Since M ≪ N, the recovery of x from y is of course an ill-posed inverse problem that has an infinite number of solutions. In fact, given y = Ωx, x + v for any vector v in the null space of Ω is also a solution to (1.5). The recovery algorithm intends to find x in the (N − M)-dimensional translated null space H = ker(Ω) + x. The classical approach to this type of ill-posed inverse problem is to find a solution

[Figure 1.2: CS sampling process for compressible signals with a nearly sparse representation in the orthonormal basis Ψ.]

x* with minimum energy via the following l_2 minimization problem

x* = arg min_{x̂} ‖x̂‖_2 subject to Ωx̂ = y.   (1.6)

The above optimization problem has the closed-form solution x* = Ω*(ΩΩ*)^{-1} y. However, it can almost never find a sparse solution but instead finds a non-sparse x* with many nonzero elements [33]. When x is sparse, a reasonable recovery algorithm is to search for the sparsest vector x* that agrees with the M measurements in y. This can be achieved by solving the following l_0 minimization

x* = arg min_{x̂} ‖x̂‖_0 subject to Ωx̂ = y.   (1.7)

The hope is that the solution to the above minimization problem coincides with the original signal of interest x. In fact, Donoho and Elad [34] proved that, under some condition on Ω, the minimization problem (1.7) can uniquely recover the original S-sparse signal x, given M ≥ 2S. Unfortunately, this minimization problem is NP-hard and

[Figure 1.3: Geometry of l_1 and l_2 minimization in 3-dimensional space. (a) The set of all 2-sparse vectors in R^3 is a highly nonlinear space consisting of all 2-dimensional hyperplanes that are aligned with the coordinate axes. (b) The l_1 ball finds the desired sparse vector x from the solution space H. (c) The l_2 ball finds a solution with many nonzero elements instead of the sparse vector.]

numerically unstable. An alternative approach [35] is to substitute the l_0 norm with the closest convex norm, the l_1 norm, and solve the following computationally tractable linear program called basis pursuit

x* = arg min_{x̂} ‖x̂‖_1 subject to Ωx̂ = y,   (1.8)

which can indeed recover sparse signals due to the shape of the l_1 ball. To help visualize why l_2 minimization (1.6) cannot find the sparse solution that can be recovered via l_1 minimization (1.8), we draw in Fig. 1.3 the geometry of the CS problem in 3-dimensional space. In Fig. 1.3, x is the desired sparse vector, which is aligned with the coordinate axes, and x* is the solution to the minimization problem. The plane H is the solution space for (1.5), that is, H = ker(Ω) + x. Fig. 1.3b shows the l_1 minimization process. The l_1 ball is an octahedron that contains all x̂ ∈ R^3 such that |x̂(1)| + |x̂(2)| + |x̂(3)| ≤ r, where r is the radius of the l_1 ball. One can

imagine the l_1 minimization process as blowing up the octahedron (l_1 ball) by gradually increasing its radius. The first intersection of the l_1 ball and the solution plane H is the solution to the l_1 minimization problem. Since the l_1 ball has its points aligned with the coordinate axes, its first contact with the solution space H will be at a point near the coordinate axes. This is where the sparse vector x is located. Fig. 1.3c shows the l_2 minimization process. The l_2 ball of radius r contains all x̂ ∈ R^3 such that |x̂(1)|^2 + |x̂(2)|^2 + |x̂(3)|^2 ≤ r^2. One can imagine the l_2 minimization process as blowing up the sphere (l_2 ball) by gradually increasing the radius. The first intersection between the l_2 ball and the solution plane H is the solution to the l_2 minimization problem. Depending on the orientation of the solution plane H, the solution to the l_2 minimization problem will most probably be away from the coordinate axes. Therefore, it will neither be sparse nor be close to the correct sparse vector x. In the next section, we introduce incoherence conditions for measurement matrices which, when satisfied, guarantee the recovery of the sparse signal x via solving the l_1 minimization problem (1.8).

Conditions for Sparse Recovery

One of the most widely used properties of matrix Ω in analyzing the uniqueness of the solution to the l_1 minimization problem (1.8) is the so-called null space property (NSP). It was first introduced in [36] and is defined as follows:

Definition 1. Let Ω be an M × N matrix. Then Ω satisfies the null space property of order S if, for all index sets I ⊆ {1, 2, ..., N} with |I| ≤ S, it holds that

‖v_I‖_1 ≤ ‖v_{I^c}‖_1 for all v ∈ ker(Ω)\{0}.   (1.9)

The following theorem, which is based on the above notation, guarantees unique solvability of the l_1 minimization (1.8).

Theorem 1 ([36]). Let Ω be an M × N matrix. Then every S-sparse vector x is the unique solution to the l_1 minimization problem (1.8) with y = Ωx if, and only if, Ω satisfies the NSP of order S.

The NSP is both necessary and sufficient to guarantee recovery of S-sparse signals via (1.8), and it can handle nearly sparse signals as well [36]. However, the NSP is usually difficult to show directly [2] and it does not account for measurement noise [8]. Since in any real application measurements are corrupted by some amount of noise, such as quantization, it is important that CS remain robust in such noisy environments. In the presence of noise, the CS measurement process and the l_1 minimization problem become

y = Ωx + n   (1.10)

and

x* = arg min_{x̂} ‖x̂‖_1 subject to ‖Ωx̂ − y‖_2 ≤ σ   (1.11)

where n in (1.10) is the error term and σ in (1.11) is a bound on the magnitude of the measurement error. It is clear that when measurements are contaminated by error, it is no longer possible to guarantee unique recovery. However, it is desirable to consider stronger conditions that are tolerant to the error. At the very least, small perturbations in the data should cause small perturbations in the reconstruction [37]. Candes et al. [6]

introduced the following matrix property, called the restricted isometry property (RIP), which has proved to be very useful in studying the general robustness of CS.

Definition 2. For each integer S = 1, 2, ..., define the isometry constant δ_S of matrix Ω as the smallest number such that

(1 − δ_S) ‖x‖_2^2 ≤ ‖Ωx‖_2^2 ≤ (1 + δ_S) ‖x‖_2^2   (1.12)

holds for all S-sparse vectors x.

We loosely say that matrix Ω satisfies the RIP of order S if δ_S is not too close to 1. This property requires matrix Ω to approximately preserve the Euclidean length of S-sparse signals. This in turn implies that S-sparse vectors cannot be in the null space of Ω, which is necessary, as otherwise there would be no hope of recovering them. The following theorem, proposed by Candes [37], provides a sufficient condition for stable recovery via the l_1 minimization problem (1.11). It works with all kinds of signals (exactly sparse signals as well as nearly sparse ones) and handles noise gracefully.

Theorem 2. Let x be a nearly sparse signal and x_S be the S-sparse approximation of x, i.e., x_S is the vector x with all but the largest S components set to 0. Assume that Ω satisfies δ_{2S} ≤ √2 − 1. Then the solution x* to the l_1 minimization problem (1.11) obeys the following inequality:

‖x* − x‖_2 ≤ c ‖x − x_S‖_1 / √S + c_1 σ.   (1.13)
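The noise-free, exactly sparse case of Theorem 2 (exact recovery) can be reproduced numerically. The sketch below is not part of the thesis: it recovers a 3-sparse vector from M = 20 Gaussian measurements by casting basis pursuit (1.8) as a linear program (x = u − v with u, v ≥ 0), and computes the minimum-energy solution (1.6) for contrast. The dimensions and the use of SciPy's `linprog` are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
M, N, S = 20, 40, 3

# S-sparse ground truth and a Gaussian measurement matrix (variance 1/M)
x = np.zeros(N)
x[rng.choice(N, S, replace=False)] = [1.5, -2.0, 1.0]
Omega = rng.standard_normal((M, N)) / np.sqrt(M)
y = Omega @ x

# Minimum-energy solution (1.6): x* = Omega'(Omega Omega')^{-1} y
x_l2 = Omega.T @ np.linalg.solve(Omega @ Omega.T, y)

# Basis pursuit (1.8) as a linear program: write x = u - v, u, v >= 0,
# minimize sum(u) + sum(v) subject to [Omega, -Omega][u; v] = y
res = linprog(c=np.ones(2 * N), A_eq=np.hstack([Omega, -Omega]), b_eq=y,
              bounds=[(0, None)] * (2 * N))
x_l1 = res.x[:N] - res.x[N:]

print("l2 error:", np.linalg.norm(x_l2 - x))   # large: dense, wrong solution
print("l1 error:", np.linalg.norm(x_l1 - x))   # near zero: sparse recovery
```

The l_2 solution lands far from the coordinate axes, exactly as the geometry of Fig. 1.3c predicts, while the l_1 program returns the sparse vector.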

In Theorem 2, the reconstruction error is bounded by the sum of two terms. The first term is the error due to the S-sparse approximation of the nearly sparse signal x, and the second is proportional to the noise level. If the measurements are noise-free and the signal is S-sparse, then Theorem 2 asserts that the recovery is exact. If the measurements are noise-free but the signal is only nearly sparse, then the theorem asserts that the quality of the recovered signal x* is as good as the S-sparse approximation of the original signal. Finally, since the constants in Theorem 2 are typically small (with δ_{2S} = 1/4 we have c ≤ 5.5 and c_1 ≤ 6 [37]), Theorem 2 asserts that small perturbations in the data cause only small perturbations in the reconstruction.

Designing a Stable Measurement Matrix

In the previous section, we showed that a sufficient condition for stable and robust recovery of (nearly) sparse signals in the presence of noise is that the matrix Ω = AΨ satisfies the RIP with isometry constant δ_{2S} ≤ √2 − 1. What remains to be done is to design a measurement matrix that satisfies the RIP. We start with the case where the signal itself is sparse, i.e., Ψ = I (the identity matrix), and then extend the results to the more general case. When Ψ = I, we have Ω = A. The goal is then to design a measurement matrix A which satisfies the RIP of order 2S. Direct construction of a measurement matrix A that satisfies the RIP of order 2S requires verifying (1.12) for each of the (N choose 2S) combinations of 2S nonzero elements of vector x, which is not practical for large signals. This is where randomness comes into play, as it is much easier to verify (1.12) for random matrices using probabilistic methods. Baraniuk et al. [38] showed that Gaussian and Bernoulli

random matrices satisfy the RIP with high probability. The Gaussian measurement matrix of [38] is a random matrix of size M × N whose entries are i.i.d. random variables drawn from a Gaussian distribution with zero mean and variance 1/M. The Bernoulli measurement matrix of [38] is a random matrix of size M × N whose entries φ_{i,j} are i.i.d. random variables drawn from a symmetric Bernoulli distribution, i.e., Pr(φ_{i,j} = ±1/√M) = 1/2. Based on their work we can derive the following theorem:

Theorem 3. Let A be an M × N random Gaussian or random symmetric Bernoulli matrix as defined above. Matrix A satisfies the RIP of order S with a prescribed isometry constant 0 < δ_S < 1 with probability > 1 − 2 exp(−c_1 M), provided that

M ≥ c_2 S log(N/S)

where S is the sparsity level of the signal; c_1 and c_2 are constants depending only on δ_S.

The above theorem asserts that when the signal of interest is (nearly) sparse in the canonical basis, i.e., Ψ = I, the l_1 minimization problem (1.11) can recover a good approximation of signal x with high probability, with the approximation error bounded by (1.13), given M = O(S log(N/S)). In most real applications, however, the signal itself is not (nearly) sparse but has a (nearly) sparse representation in some basis Ψ ≠ I. In such cases, we would like the RIP to hold for the matrix AΨ. Baraniuk et al. [38] proved that, given an arbitrary but fixed orthonormal basis Ψ and a random Gaussian or random Bernoulli measurement matrix A, the matrix Ω = AΨ satisfies the RIP with the same probability and the same number of measurements as A does. This property, which is usually called universality, is one of the important advantages of using dense random matrices to construct matrix A. It makes the sampling process extremely easy: just take M ≪ N random measurements from the signal of

interest using a random Gaussian or Bernoulli distribution, and with very high probability a good approximation of the original signal can be recovered. All that needs to be known is that the signal has a nearly sparse representation in some orthonormal basis. The sparsity basis does not need to be known during the sampling process. It is only used during the recovery process, and one can even use different sparsity bases with the same measurements, as random Gaussian and Bernoulli matrices are incoherent with any fixed orthonormal sparsity basis.
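This universality can be illustrated with a quick numerical experiment (a sketch with hypothetical dimensions, not from the thesis): draw a symmetric Bernoulli matrix A, pick an orthonormal DCT matrix as the fixed sparsity basis Ψ that A never sees, and check empirically that Ω = AΨ nearly preserves the norms of random S-sparse coefficient vectors, i.e., that its empirical isometry constant stays well below 1.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(1)
M, N, S, trials = 256, 512, 8, 500

# Symmetric Bernoulli measurement matrix: Pr(phi_ij = +-1/sqrt(M)) = 1/2
A = rng.choice([-1.0, 1.0], size=(M, N)) / np.sqrt(M)

# A fixed orthonormal sparsity basis (DCT-II); A is drawn independently of it
Psi = dct(np.eye(N), norm='ortho', axis=0)
Omega = A @ Psi

# Empirical isometry deviation of Omega over random S-sparse unit vectors
worst = 0.0
for _ in range(trials):
    x = np.zeros(N)
    x[rng.choice(N, S, replace=False)] = rng.standard_normal(S)
    x /= np.linalg.norm(x)
    worst = max(worst, abs(np.linalg.norm(Omega @ x) ** 2 - 1.0))
print(f"empirical isometry deviation: {worst:.3f}")  # well below 1
```

The same Bernoulli draw could be paired with any other fixed orthonormal basis with a similar outcome, which is the point of universality.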

Chapter 2

Multi-camera Coded Exposure Systems

In this chapter, the frame-wise, pixel-wise and column-row-wise coded exposure schemes will be discussed in detail. The proposed coded exposure schemes employ K ≥ 2 cameras, each of which acquires local measurements along the temporal dimension of the 3D data volume. Unlike regular cameras, where the sensor array keeps accumulating light during the exposure time, each camera of the proposed coded exposure systems uses a pseudo-random binary sequence of length T to open and close the electronic shutter of the sensor array during the exposure time. The objective is to use the random measurements captured by the cameras to recover multiple images of the 3D data volume at a higher temporal resolution than what is dictated by the camera frame rate. In other words, if r is the camera frame rate, the temporal resolution of the recovered 3D data volume will be rT, with T being the number of target frames to be recovered. The recovery, which is discussed in Chapter 5, is made possible by exploiting both spatial and temporal sparsities of the 3D data volume and by solving a large-scale l_1 minimization problem. Three coded video acquisition techniques of varied trade-offs between performance and hardware complexity are developed: frame-wise coded acquisition, pixel-wise coded acquisition, and column-row-wise coded acquisition.

2.1 Frame-wise Coded Exposure

In the K-camera frame-wise coded exposure system, camera k, 1 ≤ k ≤ K, opens and closes its shutter according to a binary pseudo-random sequence

b_k = [ b_k(1), b_k(2), ..., b_k(T) ].   (2.1)

Through the above coded exposure process, camera k produces a coded frame I_k out of every T target frames f_1, f_2, ..., f_T. The coded (blended, blurred) image I_k is a function of the corresponding T target (sharp) frames f_t:

I_k = Σ_{t=1}^{T} b_k(t) f_t.   (2.2)

Let the vector

f_{u,v} = [ f_1(u, v), f_2(u, v), ..., f_T(u, v) ]'   (2.3)

be the time series of pixels at spatial location (u, v). Camera k modulates the time signal f_{u,v} with b_k, and makes a random measurement of f_{u,v}:

y_k(u, v) = ⟨b_k, f_{u,v}⟩ + n_k(u, v),   (2.4)

where n_k(u, v) is the measurement noise of camera k at pixel location (u, v). Let the vector

y_{u,v} = [ y_1(u, v), y_2(u, v), ..., y_K(u, v) ]'   (2.5)

collect the K random measurements of f_{u,v} made by the K coded exposure cameras. The measurement vector y_{u,v} of f_{u,v} is

y_{u,v} = B f_{u,v} + n   (2.6)

where B is the K × T binary measurement matrix made of the K binary pseudo-random sequences, i.e., row k of matrix B is the coded exposure control sequence b_k of camera k, and n is the measurement error vector. Let the width and height of the video frame be N_x and N_y. For notational convenience, the three-dimensional T × N_x × N_y pixel grid of the T target frames is written as a super vector f, formed by stacking all the T N_x N_y pixels in question. In our design, K synchronized cameras of coded exposure are used to make K N_x N_y measurements of f. Let y be the vector formed by stacking all N_x N_y K-dimensional measurement vectors y_{u,v}, 1 ≤ u ≤ N_x and 1 ≤ v ≤ N_y. Then, the K-camera frame-wise coded acquisition of HFV can be stated as

y = A_f f + n   (2.7)

[Figure 2.1: Capturing process of a camera of the frame-wise coded exposure system.]

where A_f is a K N_x N_y × T N_x N_y block diagonal matrix

A_f = diag( B, B, ..., B )   (2.8)

with the same K × T block B repeated N_x N_y times along the diagonal and all off-diagonal blocks equal to the zero matrix 0_{K×T}, in which B is a K × T matrix whose elements are drawn from an i.i.d. symmetric binary distribution. The capturing process of a camera of the frame-wise coded exposure system is depicted in Fig. 2.1.
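The frame-wise capturing model of (2.2) and (2.6) can be simulated in a few lines of NumPy. The sketch below is illustrative only; the dimensions (K = 4 cameras, T = 8 target frames, 16 × 16 pixels) are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
K, T, Ny, Nx = 4, 8, 16, 16          # hypothetical: 4 cameras, 8 target frames

# T sharp target frames f_1, ..., f_T (random stand-ins for the HFV signal)
frames = rng.random((T, Ny, Nx))

# One pseudo-random binary shutter sequence b_k per camera; in the frame-wise
# scheme the same sequence modulates every pixel of camera k
B = rng.integers(0, 2, size=(K, T)).astype(float)

# Coded frame of camera k, Eq. (2.2): I_k = sum_t b_k(t) f_t
coded = np.einsum('kt,tvu->kvu', B, frames)

# Per-pixel view, Eq. (2.6): y_{u,v} = B f_{u,v} (noise-free here)
u, v = 5, 7
y_uv = B @ frames[:, v, u]
print(np.allclose(y_uv, coded[:, v, u]))  # True: the two views agree
```

The image-level blend (2.2) and the per-pixel measurement (2.6) are the same operation read in two different orders, which is why the block diagonal structure of A_f in (2.8) repeats a single B.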

The frame-wise coded exposure is mostly motivated by the ease of hardware implementation; its ability to recover HFV signals is limited because, in A_f, all spatial pixels of a camera share the same temporal modulation sequence and thus have a high degree of correlation.

2.2 Pixel-wise Coded Exposure

In order to recover higher quality images of the 3D data volume using multi-camera coded video acquisition, we propose pixel-wise coded exposure, which lets the temporal random exposure sequences vary across spatial pixel locations and across cameras. The random exposure sequence of length T for camera k and pixel location (u, v) is denoted by b^k_{u,v}. In other words, camera k modulates the time signal f_{u,v} with the random binary sequence b^k_{u,v}, and generates a random measurement of f_{u,v}:

y_k(u, v) = ⟨b^k_{u,v}, f_{u,v}⟩ + n_k(u, v).   (2.9)

In contrast, the frame-wise coded exposure of (2.4) imposes the same random exposure pattern b_k on all pixels (u, v). In the new pixel-wise coded exposure, the measurement vector y_{u,v} of f_{u,v} becomes

y_{u,v} = B_{u,v} f_{u,v} + n   (2.10)

where B_{u,v} is the K × T binary measurement matrix made of K binary pseudo-random sequences, row k being b^k_{u,v}, 1 ≤ k ≤ K. Therefore, the measurement matrix A_p for

pixel-wise coded exposure, where y = A_p f + n, is

A_p = diag( B_{1,1}, B_{2,1}, ..., B_{N_x,N_y} )   (2.11)

with the K × T blocks B_{u,v} on the diagonal and all off-diagonal blocks equal to 0_{K×T}.

[Figure 2.2: Capturing process of a camera of the pixel-wise coded exposure system.]

The above design of multi-camera pixel-wise coded exposure, which is depicted in Fig. 2.2, ensures that all acquired pixel values are independent random measurements of the HFV signal f; thus it is superior to frame-wise coded exposure in the reconstruction of f. Rigorous evaluations of the performances of these two schemes can be found in Chapter 3. However, from the frame-wise to pixel-wise coded exposure

the granularity of pixel shutter control jumps by a factor of O(N_x N_y). This makes the complexity and cost of the latter sensor design drastically higher than that of the former. In order to drive individual pixels with different random sequences, N_x N_y separate signal paths are required to connect each pixel to the binary random generator. These signal paths are embedded in CMOS metal layers. Since the CMOS technology allows only a limited number of metal layers, each of which can contain a limited number of signal paths, pixel-wise coded exposure becomes infeasible even for modest spatial resolutions. An alternative is to store the binary random sequence b^k_{u,v} of length T at each pixel location. But this approach has two drawbacks: 1) the fill factor of the pixels is greatly reduced, and 2) with a fixed b^k_{u,v} the camera can only capture HFV signals at frame rate rT, where r is the camera frame rate.

2.3 Column-row-wise Coded Exposure

To simplify the shutter control mechanism of pixel-wise coded exposure, while retaining the same HFV recovery capability in theory and practice, we introduce a new multi-camera random sampling scheme called column-row-wise coded exposure. In column-row-wise coded exposure, the temporal random modulation of an N_x × N_y pixel array is controlled by N_x + N_y rather than N_x N_y random binary sequences, as required by pixel-wise coded exposure. Specifically, for camera k, N_y random binary sequences of

length T, denoted by

r^k_v = [ r^k_v(1), r^k_v(2), ..., r^k_v(T) ]',  1 ≤ v ≤ N_y,   (2.12)

are used to control the N_y rows of pixels, and N_x random binary sequences of length T, denoted by

c^k_u = [ c^k_u(1), c^k_u(2), ..., c^k_u(T) ]',  1 ≤ u ≤ N_x,   (2.13)

are used to control the N_x columns of pixels. The row and column binary control signals jointly produce the binary random coded exposure sequence for camera k at pixel location (u, v):

φ^k_{u,v} = [ φ^k_{u,v}(1), φ^k_{u,v}(2), ..., φ^k_{u,v}(T) ]   (2.14)

where

φ^k_{u,v}(t) = r^k_v(t) ⊕ c^k_u(t),  1 ≤ t ≤ T,   (2.15)

and ⊕ is the exclusive OR operator. The measurement vector y_{u,v} of f_{u,v} for the column-row-wise coded exposure is

y_{u,v} = Φ_{u,v} f_{u,v} + n   (2.16)

where Φ_{u,v} is the K × T binary measurement matrix with row k being φ^k_{u,v}, 1 ≤ k ≤ K. Therefore, the measurement matrix A_cr for column-row-wise coded exposure, where

y = A_cr f + n, is

A_cr = diag( Φ_{1,1}, Φ_{2,1}, ..., Φ_{N_x,N_y} )   (2.17)

with the K × T blocks Φ_{u,v} on the diagonal and all off-diagonal blocks equal to 0_{K×T}.

[Figure 2.3: Capturing process of a camera of the column-row-wise coded exposure system.]

The capturing process of a camera of the column-row-wise coded exposure system is depicted in Fig. 2.3. Fig. 2.4 is the sketch of a CMOS implementation of the column-row-wise coded exposure by our colleague Dadkhah [39]. In this design, only N_x + N_y control signals are needed to realize random coded exposures of N_x N_y pixels. The control signals are generated during the exposure time using the row and column linear feedback shift registers (LFSRs), which are placed outside of the pixel array.

[Figure 2.4: Schematic design of the column-row-wise coded exposure: (a) individual pixel, (b) pixel array with row and column LFSRs and addressing circuits.]

At each pixel, only a compact XOR block rather than a T-bit memory is required. This drastically reduces the complexity and increases the fill factor. Moreover, as the external binary random sequences in both row and column directions can vary both in frequency and value, the user can set the capture frame rate at will. We refer the reader to Dadkhah's PhD thesis [39] for the implementation details. In the next chapter, we prove that the pixel-wise and column-row-wise coded exposures asymptotically require the same number of random measurements to recover the 3D data volume with the same temporal resolution. This theoretical result is corroborated by our simulation results in Chapter 7, further establishing the viability of the multi-camera HFV acquisition system based on column-row-wise coded exposure. The recovery algorithms are discussed in Chapter 5.
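A minimal simulation of the column-row-wise exposure patterns (2.12)-(2.15), with hypothetical dimensions, shows how only N_x + N_y control sequences per camera generate a distinct code at every one of the N_x N_y pixels:

```python
import numpy as np

rng = np.random.default_rng(4)
K, T, Ny, Nx = 4, 8, 16, 16          # hypothetical dimensions

# Per camera k: N_y row sequences r^k_v and N_x column sequences c^k_u,
# Eqs. (2.12)-(2.13), each of length T
r = rng.integers(0, 2, size=(K, Ny, T))
c = rng.integers(0, 2, size=(K, Nx, T))

# Eq. (2.15): the exposure code at pixel (u, v) of camera k is the
# bitwise XOR of its row and column control signals
phi = np.bitwise_xor(r[:, :, None, :], c[:, None, :, :])   # (K, Ny, Nx, T)

# K*(Nx + Ny) control sequences produce K*Nx*Ny per-pixel codes
print(phi.shape)
```

Pixel-wise coded exposure would instead draw all K N_x N_y sequences independently; here, codes at pixels sharing a row or a column are correlated, which is the price paid for the much simpler control circuitry.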

Chapter 3

Theoretical Analysis of the Coded Exposure Systems

The coded exposure systems proposed in the previous chapter are based on CS theory, which asserts that if the signal is sparse (or compressible) and the sensing matrix is incoherent with the sparsity basis, one can recover the original signal (or a good approximation of it) from a number of measurements that is far smaller than the signal length. In this chapter, a CS-based theoretical analysis of the proposed frame-wise, pixel-wise and column-row-wise coded exposure systems will be presented.

3.1 RIP of Measurement Matrices of the Coded Exposure Systems

The random measurement matrices of the proposed HFV acquisition systems (A_f, A_p and A_cr) are random block diagonal matrices with the elements of the diagonal blocks drawn from a random Bernoulli distribution. Compared to dense unstructured random matrices, these structured random matrices require more measurements to satisfy

the RIP, and they lack the universality property. Also, in general, proving the RIP for these matrices requires analytic tools beyond the elementary approaches that suffice for unstructured random matrices [17]. The application of such structured measurement matrices is not limited to our case. As stated in [4], there are different signal acquisition settings that require such matrices, either because of architectural constraints or in order to reduce the computational cost of the system. For example, in a network of sensors, communication constraints may force the measurements taken by each sensor to depend only on the sensor's own incident signal rather than on the signals from all of the sensors [41]. For video acquisition, applying a dense random measurement matrix to the 3D data volume of an entire video incurs both high time and space complexities. A more practical approach is to take random measurements of pixels either in a temporal neighborhood (our case) or a spatial neighborhood (e.g., a fixed frame as in [15]). In these cases, the random measurement matrix is block diagonal rather than a dense random matrix. Eftekhari et al. noticed the importance of such structured random measurement matrices and derived the RIP for M × N random block diagonal matrices of the form

A = diag( A_1, A_2, ..., A_J ),   (3.1)

with all off-diagonal blocks equal to 0_{K×T}, where J is the number of diagonal blocks, M = KJ and N = TJ. Each diagonal block A_j, 1 ≤ j ≤ J, is a K × T matrix with entries populated with i.i.d. sub-Gaussian random variables with zero mean and standard deviation 1/√K. They considered two

types of random block diagonal matrices. The first type, called distinct block diagonal (DBD) matrices, are random block diagonal matrices with distinct blocks, i.e., {A_j}_{j=1}^{J} are distinct and independent of each other. The second type, called repeated block diagonal (RBD) matrices, are random block diagonal matrices with repeated blocks, i.e., A_j = A_{j'} for all 1 ≤ j, j' ≤ J. In their theoretical analysis, rather than using the general definition of the RIP (Definition 2), they used the following more convenient notion of the Ψ-RIP.

Definition 3. Let Ψ denote an orthonormal basis in which signal f has a sparse representation. For each integer S = 1, 2, ..., define the isometry constant of matrix A in the basis Ψ, δ_S = δ_S(A, Ψ), as the smallest number such that

(1 − δ_S) ‖f‖_2^2 ≤ ‖Af‖_2^2 ≤ (1 + δ_S) ‖f‖_2^2   (3.2)

holds for all f with ‖Ψ*f‖_0 ≤ S.

The results in [17] also depend on two properties of the sparsity basis Ψ, called coherence and block-coherence, which are defined as follows.

Definition 4 ([37]). The coherence of an orthonormal basis Ψ ∈ C^{N×N} is defined as

μ(Ψ) = √N max_{i,j} |Ψ(i, j)|   (3.3)

where Ψ(i, j) is the entry of Ψ at row i and column j.

A few more definitions are required before we can define block-coherence. For an

N × N orthonormal basis Ψ with N = TJ, define Ψ_j ∈ C^{N×T} such that

Ψ = [Ψ_1, Ψ_2, ..., Ψ_J].   (3.4)

Also, for all x ∈ C^N, define X(x, Ψ) ∈ C^{T×J} as

X(x, Ψ) = [Ψ_1* x, Ψ_2* x, ..., Ψ_J* x].   (3.5)

The block-coherence of Ψ can now be defined as follows.

Definition 5 ([17]). Let {e_n}_{n=1}^{N} be the canonical unit vectors for C^N, i.e., the n-th entry of e_n is one and the rest are zeros. The block-coherence of an orthonormal basis Ψ ∈ C^{N×N} is

γ(Ψ) = √J max_{n=1,...,N} ‖X(e_n, Ψ)‖   (3.6)

where ‖·‖ denotes the spectral norm. In words, γ(Ψ) is proportional to the maximal spectral norm obtained when any column of Ψ is reshaped into a T × J matrix. As for the ranges of μ(Ψ) and γ(Ψ), it follows from linear algebra that

1 ≤ μ(Ψ) ≤ √N   (3.7)

and, since every column of Ψ has unit l_2-norm, it is easy to verify that

1 ≤ γ(Ψ) ≤ √J.   (3.8)

Using the above notation, Eftekhari et al. proved that random DBD matrices satisfy

the Ψ-RIP with high probability, given

M = O( μ^2(Ψ) S log^2 S log^2 N ).   (3.9)

They also proved that random RBD matrices satisfy the Ψ-RIP with high probability, given

M = O( γ^2(Ψ) S log^2 S log^2 N ).   (3.10)

In the next section, we use the results in [17] to prove the Ψ-RIP for the random block diagonal measurement matrices of the frame-wise, pixel-wise and column-row-wise coded exposure schemes.

RIP for Frame-wise and Pixel-wise Coded Exposure Systems

The measurement matrices of the coded exposure schemes are random block diagonal matrices with J = N_x N_y diagonal blocks. However, the entries of the diagonal blocks are not sub-Gaussian random variables with zero mean and standard deviation 1/√K. In order to satisfy the sub-Gaussian requirement of [17], we define new random measurement matrices Ȧ_f, Ȧ_p and Ȧ_cr, which are generated by replacing the 1s with 1/√K and the 0s with −1/√K in the diagonal blocks of A_f, A_p and A_cr, respectively. Let ẏ be the random measurement vector of signal f generated with any of the new random measurement matrices, i.e., ẏ = Ȧf, where Ȧ is any of Ȧ_f, Ȧ_p or Ȧ_cr. It is easy to verify that for camera 1 ≤ k ≤ K at pixel location (u, v) we have

ẏ_k(u, v) = ( 2 y_k(u, v) − y_dc(u, v) ) / √K

where y_dc(u, v) is the DC component in time of the signal f at location (u, v), and y_k(u, v) is the original random measurement captured by camera k at location (u, v). The DC component of the original signal can be captured by adding another camera which is always exposed to the light during each exposure time (all entries of the diagonal blocks of the measurement matrix of this camera are set to 1). The new random measurement vector and matrix of the coded exposure schemes carry the same information as the original ones while satisfying the sub-Gaussian requirements of [17], i.e., the diagonal blocks of the new random measurement matrices are populated from an i.i.d. symmetric Bernoulli distribution, a sub-Gaussian distribution, with zero mean and standard deviation 1/√K. The results in [17] treat the signal f as a 1-dimensional signal which has a (nearly) sparse representation in an orthonormal basis Ψ. Here, however, the original signal f is a vector representation of a 3-dimensional space-time volume. As such, we assume Ψ_3D is the matrix representation of a 3D transform. Specifically, the sparsity basis Ψ_3D is

Ψ_3D = Ψ_Y ⊗ Ψ_X ⊗ Ψ_T   (3.11)

where Ψ_Y ∈ C^{N_y×N_y} is the 1D transform matrix in the vertical direction, Ψ_X ∈ C^{N_x×N_x} is the 1D transform matrix in the horizontal direction, Ψ_T ∈ C^{T×T} is the 1D transform matrix in the temporal direction, and ⊗ is the Kronecker product. It is easy to verify that the coherence of Ψ_3D is

μ(Ψ_3D) = √N max_{t,t',u,u',v,v'} |Ψ_T(t, t') Ψ_X(u, u') Ψ_Y(v, v')|   (3.12)

where N = T N_x N_y, 1 ≤ t, t' ≤ T, 1 ≤ u, u' ≤ N_x and 1 ≤ v, v' ≤ N_y.
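Equation (3.12) says that the coherence of the Kronecker basis factors into the coherences of the three 1D bases. A quick numerical check of this (not part of the thesis; orthonormal DCT matrices are assumed for Ψ_T, Ψ_X and Ψ_Y, with small illustrative dimensions):

```python
import numpy as np
from scipy.fft import dct

def dct_basis(n):
    # n x n orthonormal DCT-II matrix, a convenient stand-in 1D basis
    return dct(np.eye(n), norm='ortho', axis=0)

T, Nx, Ny = 4, 6, 8
PT, PX, PY = dct_basis(T), dct_basis(Nx), dct_basis(Ny)

def mu(P):
    # Coherence of an orthonormal basis, Eq. (3.3): sqrt(N) * max |entry|
    return np.sqrt(P.shape[0]) * np.abs(P).max()

# 3D sparsity basis of Eq. (3.11) built from Kronecker products
P3 = np.kron(PY, np.kron(PX, PT))

# Eq. (3.12): mu(Psi_3D) equals the product of the three 1D coherences
print(mu(P3), mu(PT) * mu(PX) * mu(PY))
```

The factorization follows because every entry of a Kronecker product is a product of one entry from each factor, so the maximum magnitude of the product is the product of the maxima.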

As for the block-coherence of Ψ_3D, recall that γ(Ψ_3D) is proportional to the maximal spectral norm obtained when any column of Ψ_3D is reshaped into a T × J = N_x N_y matrix. Let Ψ_{T,t}, Ψ_{X,u} and Ψ_{Y,v} denote columns t, u and v of Ψ_T, Ψ_X and Ψ_Y, respectively. It is easy to verify that

X(e_n, Ψ_3D) = Ψ_{T,t} ( Ψ̃_{Y,v} ⊗ Ψ̃_{X,u} )   (3.13)

where n = (v − 1) T N_x + (u − 1) T + t, and Ψ̃_{Y,v} and Ψ̃_{X,u} are the transposes of the vectors Ψ_{Y,v} and Ψ_{X,u}, without taking the complex conjugates of the entries. We then have

‖X(e_n, Ψ_3D)‖^2 = ‖X(e_n, Ψ_3D)* X(e_n, Ψ_3D)‖   (3.14)
= ‖( Ψ_{T,t} (Ψ̃_{Y,v} ⊗ Ψ̃_{X,u}) )* ( Ψ_{T,t} (Ψ̃_{Y,v} ⊗ Ψ̃_{X,u}) )‖   (3.15)
= ‖( Ψ̃_{Y,v} ⊗ Ψ̃_{X,u} )* Ψ_{T,t}* Ψ_{T,t} ( Ψ̃_{Y,v} ⊗ Ψ̃_{X,u} )‖   (3.16)
= ‖( Ψ̃_{Y,v} ⊗ Ψ̃_{X,u} )* ( Ψ̃_{Y,v} ⊗ Ψ̃_{X,u} )‖   (3.17)
= ‖( Ψ̃_{Y,v}* ⊗ Ψ̃_{X,u}* )( Ψ̃_{Y,v} ⊗ Ψ̃_{X,u} )‖   (3.18)
= ‖( Ψ̃_{Y,v}* Ψ̃_{Y,v} ) ⊗ ( Ψ̃_{X,u}* Ψ̃_{X,u} )‖   (3.19)

where, in passing from (3.16) to (3.17), we used the fact that Ψ_{T,t}* Ψ_{T,t} = 1 (Ψ_T is orthonormal), and (3.19) follows from the matrix-product property of the Kronecker product. Now denote by λ^max_{Y,v} and λ^max_{X,u} the maximum magnitudes of the eigenvalues of Ψ̃_{Y,v}* Ψ̃_{Y,v} and Ψ̃_{X,u}* Ψ̃_{X,u}, respectively. From the Kronecker product property for eigenvalues we get

‖X(e_n, Ψ_3D)‖^2 = λ^max_{Y,v} λ^max_{X,u}.   (3.20)

Since the matrices Ψ̃_{Y,v}* Ψ̃_{Y,v} and Ψ̃_{X,u}* Ψ̃_{X,u} are rank one, their only nonzero eigenvalues are ‖Ψ_{Y,v}‖^2 = 1 and ‖Ψ_{X,u}‖^2 = 1, respectively. Therefore, we have

‖X(e_n, Ψ_3D)‖ = √( λ^max_{Y,v} λ^max_{X,u} ) = 1.   (3.21)

The block-coherence of Ψ_3D is then

γ(Ψ_3D) = √(N_x N_y) max_{n=1,...,N} ‖X(e_n, Ψ_3D)‖ = √(N_x N_y).   (3.22)

Now, we can draw the following conclusions for the measurement matrices of the frame-wise and pixel-wise coded exposure schemes:

Theorem 4 (RIP for frame-wise coded exposure). Let Ψ_3D denote the matrix representation of a 3D orthonormal transform as defined in (3.11). For a given 0 < δ < 1, the isometry constant δ_S of matrix Ȧ_f, for the RIP of order S in the basis Ψ_3D, satisfies 0 < δ_S ≤ δ with probability > 1 − 2 exp(−c_1 log^2 S log^2 N), provided that

M ≥ c_2 δ^{-2} N_x N_y S log^2 S log^2 N   (3.23)

where M = K N_x N_y, N = T N_x N_y, S is the sparsity level of the signal, and c_1 and c_2 are some constants.

Theorem 5 (RIP for pixel-wise coded exposure). Let Ψ_3D denote the matrix representation of a 3D orthonormal transform as defined in (3.11). For a given 0 < δ < 1, the isometry constant δ_S of matrix Ȧ_p, for the RIP of order S in the basis Ψ_3D,

satisfies 0 < δ_S ≤ δ with probability > 1 − 2 exp(−c_1 log² S log² N), provided that

M ≳ c_2 δ⁻² µ² S log² S log² N   (3.24)

where M = K N_x N_y, N = T N_x N_y, S is the sparsity level of the signal, µ = µ(Ψ_3D) is the coherence of Ψ_3D, and c_1 and c_2 are some constants.

RIP for the Column-row-wise Coded Exposure System

In this section, we evaluate the RIP of the random measurement matrix of column-row-wise coded exposure, Ȧ_cr, in the basis Ψ_3D. The results here depend on what we call the dimensional-coherence of the sparsity basis Ψ_3D, which is defined as follows:

Definition 6. Let Ψ_3D ∈ C^{N×N} be the following 3D orthonormal basis

Ψ_3D = Ψ_Y ⊗ (Ψ_X ⊗ Ψ_T)   (3.25)

where N = T N_x N_y and Ψ_T ∈ C^{T×T}, Ψ_Y ∈ C^{N_y×N_y} and Ψ_X ∈ C^{N_x×N_x} are 1D orthonormal transform matrices in the temporal, vertical and horizontal directions, respectively. The dimensional-coherence of the matrix Ψ_3D is defined as

ξ = ξ(Ψ_3D) = √N ( max_{t,t′} |Ψ_T(t, t′)| ) ( max_{u,u′} |Ψ_X(u, u′)| ) ( max_{v,v′} |Ψ_Y(v, v′)| )   (3.26)

where Ψ_T(t, t′) is the entry of Ψ_T at row t and column t′; Ψ_X(u, u′) and Ψ_Y(v, v′) are defined similarly.

Also, recall that the matrix Ȧ_cr is an M × N block diagonal matrix generated from A_cr, where M = K N_x N_y and N = T N_x N_y. For 1 ≤ u ≤ N_x and 1 ≤ v ≤ N_y,

the diagonal block Φ̇_{u,v} of Ȧ_cr is generated from the diagonal block Φ_{u,v} of A_cr by replacing the 0s with −1/√K and the 1s with +1/√K. Denote by φ̇^k_{u,v} row k, 1 ≤ k ≤ K, of Φ̇_{u,v}, and let φ̇^k_{u,v}(t), 1 ≤ t ≤ T, be the t-th entry of the vector φ̇^k_{u,v}. We then have

φ̇^k_{u,v}(t) = ṙ^k_v(t) ċ^k_u(t) / √K   (3.27)

where

ṙ^k_v(t) = −1 if r^k_v(t) = 0, and +1 if r^k_v(t) = 1   (3.28)

and

ċ^k_u(t) = −1 if c^k_u(t) = 0, and +1 if c^k_u(t) = 1   (3.29)

with r^k_v(t) being the binary control signal of camera k at row v and time t, and c^k_u(t) being the binary control signal of camera k at column u and time t. We can now state the following theorem for the RIP of the random measurement matrix of column-row-wise coded exposure, Ȧ_cr, in the sparsity basis Ψ_3D:

Theorem 6 (RIP for column-row-wise coded exposure). Let Ψ_3D denote the matrix representation of a 3D orthonormal transform defined in (3.11). For a given 0 < δ < 1, the isometry constant δ_S of the matrix Ȧ_cr, for the RIP of order S in the basis Ψ_3D, satisfies 0 < δ_S ≤ δ with probability

≥ ( 1 − 2 exp(−c_1 log² S log² N) ) ( 1 − 2 exp(−c_2 N_y) )^K   (3.30)

provided that

M ≳ c_3 δ⁻² ξ² (N_y / T) S log² S log² N   (3.31)

where M = K N_x N_y, N = T N_x N_y, S is the sparsity level of the signal, ξ = ξ(Ψ_3D) is the dimensional-coherence of Ψ_3D, and c_1, c_2 and c_3 are some constants.

Proof. The proof here resembles that of Eftekhari et al. [17], which applies a powerful theorem in [22]. First, we present a special case of this theorem to facilitate our proof.

Theorem 7. Let F ⊂ C^{M×N} be a set of matrices, and let c be a Rademacher vector, whose entries are i.i.d. random variables that take the values ±1 with equal probability. Denote by ‖·‖_F and ‖·‖_2 the Frobenius and spectral norms of a matrix. Set

d_F(F) = sup_{F∈F} ‖F‖_F   (3.32)
d_2(F) = sup_{F∈F} ‖F‖_2   (3.33)

and

E_1 = γ_2(F, ‖·‖_2) ( γ_2(F, ‖·‖_2) + d_F(F) ) + d_F(F) d_2(F)   (3.34)
E_2 = d_2(F) ( γ_2(F, ‖·‖_2) + d_F(F) )   (3.35)
E_3 = d_2²(F).   (3.36)

Then, for t > 0, it holds that

Pr( sup_{F∈F} | ‖F c‖²_2 − E ‖F c‖²_2 | ≥ c_4 E_1 + t ) ≤ 2 exp( −c_5 min( t²/E_2², t/E_3 ) )   (3.37)

where c_4 and c_5 are positive constants and E denotes the expectation of a random variable. The term γ_2(F, ‖·‖_2) is the γ_2-function of F, which is a geometrical property of F and is widely used in the context of probability in Banach spaces [42], [43]. In the proof here, we only need an estimate of this quantity, which is given in the following lemma presented in [43]:

Lemma 1 ([43]). Let F denote a set of matrices. A set C(F, ‖·‖, r) is called a cover for the set F at resolution r with respect to the metric ‖·‖ if, for every F ∈ F, there exists F′ ∈ C(F, ‖·‖, r) such that ‖F − F′‖ ≤ r. The minimum cardinality of all such covers is called the covering number of F at resolution r with respect to the metric ‖·‖, and is denoted by N(F, ‖·‖, r). The γ_2-function of F is then bounded as follows:

γ_2(F, ‖·‖_2) ≲ ∫_0^{d_2(F)} log^{1/2}( N(F, ‖·‖_2, ν) ) dν.   (3.38)

In the above lemma and throughout this section, the notation a ≲ b means that there is an absolute constant c ≥ 1 such that a ≤ c b; a ≳ b is defined similarly.

To prove the RIP for the random measurement matrix of column-row-wise coded acquisition, we need to express the problem in the context of Theorem 7. For 1 ≤ t ≤ T, 1 ≤ u ≤ N_x and 1 ≤ v ≤ N_y, let Ψ_{T,t}, Ψ_{X,u} and Ψ_{Y,v} be columns t, u and v of the 1D transform matrices Ψ_T, Ψ_X and Ψ_Y defined in (3.25), respectively. It is clear

that column n = (v − 1)T N_x + (u − 1)T + t of Ψ_3D, denoted by Ψ_{v,u,t}, is

Ψ_{v,u,t} = Ψ_{Y,v} ⊗ Ψ_{X,u} ⊗ Ψ_{T,t}.   (3.39)

For all x ∈ C^N, define the T × T diagonal matrix F_{u,v}(x) as

F_{u,v}(x) = diag( Ψ*_{v,u,1} x, Ψ*_{v,u,2} x, …, Ψ*_{v,u,T} x )   (3.40)

and let Ḟ^k_u(x) denote the following T × N_y matrix

Ḟ^k_u(x) = [ F_{u,1}(x) ṙ^k_1, F_{u,2}(x) ṙ^k_2, …, F_{u,N_y}(x) ṙ^k_{N_y} ]   (3.41)

where the vector ṙ^k_v of length T is the Bernoulli random sequence of row v of camera k in time. Define the M × T K N_x matrix Ḟ(x) as the block diagonal matrix

Ḟ(x) = (1/√K) diag( Ḟ^1_1(x), …, Ḟ^K_1(x), Ḟ^1_2(x), …, Ḟ^K_{N_x}(x) )   (3.42)

and let

ċ = [ ċ^1_1; …; ċ^K_1; ċ^1_2; …; ċ^K_{N_x} ]   (3.43)

be the random vector of length T K N_x representing the Bernoulli random coded exposure of the columns of the K cameras.

For all x ∈ C^N, let f(x) = Ψ*_3D x. For 1 ≤ t ≤ T, 1 ≤ u ≤ N_x and 1 ≤ v ≤ N_y, let n = (v − 1)T N_x + (u − 1)T + t and denote by f_{v,u,t}(x) the n-th entry of f(x). From the definition of Ψ_{v,u,t} in (3.39), it is clear that

f_{v,u,t}(x) = Ψ*_{v,u,t} x.   (3.44)

Now, define the vector f_{u,v}(x) of length T as

f_{u,v}(x) = [ Ψ*_{v,u,1} x; Ψ*_{v,u,2} x; …; Ψ*_{v,u,T} x ] = [ f_{v,u,1}(x); f_{v,u,2}(x); …; f_{v,u,T}(x) ].   (3.45)

To express the RIP in the setting of Theorem 7, we use the following lemma, which is proved in the Appendix.

Lemma 2. For all x ∈ C^N, and Ȧ_cr, f(x), Ḟ(x) and ċ defined as above, it holds that

E ‖Ḟ(x) ċ‖²_2 = ‖x‖²_2   (3.46)
‖Ȧ_cr f(x)‖²_2 = ‖Ḟ(x) ċ‖²_2.   (3.47)

By defining the set of all S-sparse signals with unit norm as

Ω_S = { x ∈ C^N : ‖x‖_0 ≤ S, ‖x‖_2 = 1 }   (3.48)

and using Lemma 2, we can write the restricted isometry constant of Ȧ_cr (see Definition 3) as

δ_S = sup_{x∈Ω_S} | ‖Ȧ_cr f(x)‖²_2 − 1 |   (3.49)
= sup_{x∈Ω_S} | ‖Ḟ(x) ċ‖²_2 − E ‖Ḟ(x) ċ‖²_2 |.   (3.50)

Now, by setting the index set of the random process as

F = { Ḟ(x) : x ∈ Ω_S }   (3.51)

and using (3.50), we can express the RIP of column-row-wise coded exposure in the

setting of Theorem 7:

Pr( δ_S = sup_{F∈F} | ‖F ċ‖²_2 − E ‖F ċ‖²_2 | ≥ c_4 E_1 + t ) ≤ 2 exp( −c_5 min( t²/E_2², t/E_3 ) ).   (3.52)

The next step is to estimate the quantities involved in Theorem 7. We start by estimating ‖Ḟ(x)‖_2:

‖Ḟ(x)‖_2 = (1/√K) max_{u,k} ‖Ḟ^k_u(x)‖_2   (3.53)
= (1/√K) max_{u,k} ‖Ḟ^k_u( Σ_{n=1}^N x(n) e_n )‖_2   (3.54)
= (1/√K) max_{u,k} ‖ Σ_{n=1}^N x(n) Ḟ^k_u(e_n) ‖_2   (3.55)
≤ (1/√K) max_{u,k} Σ_{n=1}^N |x(n)| ‖Ḟ^k_u(e_n)‖_2   (3.56)
≤ (1/√K) max_{u,k} ( ‖x‖_1 max_n ‖Ḟ^k_u(e_n)‖_2 )   (3.57)
= (1/√K) ‖x‖_1 max_{n,u,k} ‖Ḟ^k_u(e_n)‖_2   (3.58)

where {e_n}_{n=1}^N ⊂ C^N is the canonical basis. In passing from (3.53) to (3.55), we used the linearity of Ḟ^k_u(·); passing from (3.55) to (3.56) follows from the triangle inequality; and (3.57) is the result of Hölder's inequality. The following lemma, which is proved in the Appendix, gives an estimate for ‖Ḟ^k_u(e_n)‖_2.

Lemma 3. For 1 ≤ t′ ≤ T, 1 ≤ u′ ≤ N_x and 1 ≤ v′ ≤ N_y, let

n = (v′ − 1)T N_x + (u′ − 1)T + t′   (3.59)

and denote by e_n ∈ C^N the canonical basis vector, i.e., the n-th entry of e_n is 1 and the rest are zeros. It follows that

‖Ḟ^k_u(e_n)‖_2 ≤ |Ψ_X(u′, u)| ‖Ṙ^k‖_2 max_t |Ψ_T(t′, t)| max_v |Ψ_Y(v′, v)|   (3.60)

where Ψ_X(u′, u) is the entry of the 1D transform matrix in the horizontal direction, Ψ_X ∈ C^{N_x×N_x}, at row u′ and column u; Ψ_T(t′, t) and Ψ_Y(v′, v) are similarly defined for the 1D transform matrices in the temporal direction, Ψ_T ∈ C^{T×T}, and in the vertical direction, Ψ_Y ∈ C^{N_y×N_y}, respectively.

Using the result of Lemma 3 in (3.58) gives

‖Ḟ(x)‖_2 ≤ (1/√K) ‖x‖_1 max_k ‖Ṙ^k‖_2 max_{u,u′} |Ψ_X(u, u′)| max_{t,t′} |Ψ_T(t, t′)| max_{v,v′} |Ψ_Y(v, v′)|   (3.61)
= ‖x‖_1 √(N/(MT)) max_k ‖Ṙ^k‖_2 max_{u,u′} |Ψ_X(u, u′)| max_{t,t′} |Ψ_T(t, t′)| max_{v,v′} |Ψ_Y(v, v′)|   (3.62)
= (ξ/√(MT)) ‖x‖_1 max_k ‖Ṙ^k‖_2   (3.63)
= (ξ/√(MT)) Ṙ ‖x‖_1   (3.64)

where ξ = ξ(Ψ_3D) is the dimensional-coherence of the orthonormal matrix Ψ_3D, Ṙ = max_k ‖Ṙ^k‖_2, and (3.63) follows from the definition of ξ (see Definition 6); in (3.62) we used 1/√K = √(N_x N_y / M) = √(N/(MT)).

To complete the proof, we need to calculate d_F(F), d_2(F) and γ_2(F, ‖·‖_2). For

d_F(F), we have

d_F(F) = sup_{Ḟ(x)∈F} ‖Ḟ(x)‖_F
= sup_{x∈Ω_S} ( (1/K) Σ_{u,k} ‖Ḟ^k_u(x)‖²_F )^{1/2}   (3.65)
= sup_{x∈Ω_S} ( (1/K) Σ_{u,k} Σ_v ‖F_{u,v}(x) ṙ^k_v‖²_2 )^{1/2}   (3.66)
= sup_{x∈Ω_S} ( (1/K) Σ_{u,k,v} Σ_t | ṙ^k_v(t) Ψ*_{v,u,t} x |² )^{1/2}   (3.67)
= sup_{x∈Ω_S} ( (1/K) Σ_k Σ_{u,v,t} | Ψ*_{v,u,t} x |² )^{1/2}   (3.68)
= sup_{x∈Ω_S} ( (1/K) Σ_k ‖Ψ*_3D x‖²_2 )^{1/2}   (3.69)
= sup_{x∈Ω_S} ( (1/K) Σ_k ‖x‖²_2 )^{1/2} = sup_{x∈Ω_S} ‖x‖_2 = 1   (3.70)

where (3.68) uses |ṙ^k_v(t)| = 1 and (3.70) uses the orthonormality of Ψ_3D. For d_2(F), using (3.64) as an estimate for ‖Ḟ(x)‖_2 gives

d_2(F) = sup_{Ḟ(x)∈F} ‖Ḟ(x)‖_2   (3.71)
≤ (ξ/√(MT)) Ṙ sup_{x∈Ω_S} ‖x‖_1   (3.72)
= ξ √(S/(MT)) Ṙ   (3.73)

where (3.73) follows because for x ∈ Ω_S we have ‖x‖_2 = 1 and ‖x‖_0 ≤ S, giving ‖x‖_1 ≤ √S. The following lemma, proved in the Appendix, gives an estimate for γ_2(F, ‖·‖_2).

Lemma 4. Let F = { Ḟ(x) : x ∈ Ω_S }, where Ḟ(x) is defined in (3.42). We have that

γ_2(F, ‖·‖_2) ≲ ξ √(S/(MT)) Ṙ log S log N.   (3.74)

Our goal is to calculate Pr(δ_S > δ) for a prescribed 0 < δ < 1. Assuming M ≳ δ⁻² ξ² Ṙ² T⁻¹ S log² S log² N and using (3.70), (3.73) and (3.74), we can estimate E_1, E_2 and E_3 in Theorem 7:

E_1 = γ_2(F, ‖·‖_2) ( γ_2(F, ‖·‖_2) + d_F(F) ) + d_F(F) d_2(F)   (3.75)
≲ ξ √(S/(MT)) Ṙ log S log N ( ξ √(S/(MT)) Ṙ log S log N + 1 ) + ξ √(S/(MT)) Ṙ   (3.76)
≲ δ(δ + 1) + δ / (log S log N)   (3.77)
≲ c_6 δ   (3.78)

where we assumed S ≥ 1 and used the hypothesis that δ < 1. For E_2 we have

E_2 = d_2(F) ( γ_2(F, ‖·‖_2) + d_F(F) )   (3.79)
≲ ξ √(S/(MT)) Ṙ ( ξ √(S/(MT)) Ṙ log S log N + 1 )   (3.80)
≲ δ(δ + 1) / (log S log N)   (3.81)
≲ c_7 δ / (log S log N)   (3.82)

and finally

E_3 = d_2²(F)   (3.83)
≲ ξ² S Ṙ² / (MT)   (3.84)
≲ c_8 δ² / (log² S log² N).   (3.85)

Plugging the above estimates into Theorem 7, we obtain

Pr( δ_S ≥ c_4 c_6 δ + t ) ≤ Pr( δ_S ≥ c_4 E_1 + t )   (3.86)
= Pr( sup_{Ḟ∈F} | ‖Ḟ ċ‖²_2 − E ‖Ḟ ċ‖²_2 | ≥ c_4 E_1 + t )   (3.87)
≤ 2 exp( −c_5 min( t²/E_2², t/E_3 ) )   (3.88)
≤ 2 exp( −c_5 log² S log² N min( t² / (c_7² δ²), t / (c_8 δ²) ) )   (3.89)

where in (3.86) we used the fact that E_1 ≲ c_6 δ, and (3.87) and (3.88) follow from (3.52). Setting t = δ and using the hypothesis that δ < 1 gives

Pr( δ_S ≥ (c_4 c_6 + 1) δ ) ≤ 2 exp( −c_5 log² S log² N min( c_7⁻², c_8⁻¹ ) )   (3.90)
= 2 exp( −c_1 log² S log² N )   (3.91)

where c_1 = c_5 min(c_7⁻², c_8⁻¹). By defining δ̂ = (c_4 c_6 + 1) δ, we obtain

Pr( δ_S ≥ δ̂ ) ≤ 2 exp( −c_1 log² S log² N ).   (3.92)

Note that we can choose δ small enough to guarantee that δ̂ < 1.

We now want to bound the random variable Ṙ appearing in the bound on the number of measurements M. To achieve this goal, we use the following lemma:

Lemma 5. Let Ṙ = max_k ‖Ṙ^k‖_2, where Ṙ^k, 1 ≤ k ≤ K, is a T × N_y random symmetric Bernoulli matrix defined in (A.27). Also, assume T < N_y. We then have

Pr( Ṙ ≤ c_9 √N_y ) ≥ ( 1 − 2 exp(−c_2 N_y) )^K   (3.93)

for some positive constants c_9 and c_2.

Now, assuming M ≳ c_9² δ̂⁻² ξ² N_y T⁻¹ S log² S log² N, the probability of success is

Pr( δ_S < δ̂ ) = Pr( δ_S < δ̂, Ṙ ≤ c_9 √N_y ) + Pr( δ_S < δ̂, Ṙ > c_9 √N_y )   (3.94)
≥ Pr( δ_S < δ̂, Ṙ ≤ c_9 √N_y )   (3.95)
= Pr( δ_S < δ̂ | Ṙ ≤ c_9 √N_y ) Pr( Ṙ ≤ c_9 √N_y )   (3.96)

where in the last line we used the definition of joint probability. Given Ṙ ≤ c_9 √N_y, we have

M ≳ c_9² N_y ξ² S log² S log² N / (T δ̂²)   (3.97)
≥ Ṙ² ξ² S log² S log² N / (T δ̂²)   (3.98)

which lets us use the probability bound in (3.92) for the right-hand probability in (3.96). Plugging (3.92) and (3.93) into (3.96) gives

Pr( δ_S < δ̂ ) ≥ ( 1 − 2 exp(−c_1 log² S log² N) ) ( 1 − 2 exp(−c_2 N_y) )^K   (3.99)

for M ≳ c_3 δ̂⁻² ξ² N_y T⁻¹ S log² S log² N with some positive constant c_3, which concludes the proof.

Remarks on the RIP of Ȧ_f, Ȧ_p and Ȧ_cr

Like Ȧ_p for pixel-wise coded exposure, the matrix Ȧ_cr is a block diagonal matrix with distinct blocks. Here there seemingly is an issue: the diagonal blocks in Ȧ_cr are not strictly independent of each other, because any four random elements

{ φ̇^k_{u,v}, φ̇^k_{u,v′}, φ̇^k_{u′,v}, φ̇^k_{u′,v′} }   (3.100)

used in camera k have three degrees of freedom instead of four, due to the way the random variables { φ̇^k_{u,v} }_{u,v,k} are produced. But this correlation is negligibly weak; indeed, out of the total of K T N_x N_y random elements in Ȧ_cr, the probability that four randomly drawn elements form such a dependent quadruple is

K T C(N_x, 2) C(N_y, 2) / C(K T N_x N_y, 4) = O( K⁻³ T⁻³ N_x⁻² N_y⁻² ),   (3.101)

which approaches zero. As such, one can expect the random measurement matrix A_cr to perform almost the same as the random measurement matrix A_p.

In fact, Theorem 6 affirms the above conjecture. The bound on the probability of success in Theorem 6 is smaller than the one in Theorem 5 by a factor of

( 1 − 2 exp(−c_2 N_y) )^K.   (3.102)
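The size of this factor, and the spectral-norm behavior asserted by Lemma 5, are both easy to probe numerically. The following sketch is illustrative only: the value c₂N_y = 16 and the toy matrix sizes are assumptions, not quantities fixed by the analysis.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def success_factor(c2_Ny: float, K: int) -> float:
    # The extra multiplicative factor (1 - 2*exp(-c2*Ny))**K of Theorem 6.
    return (1.0 - 2.0 * math.exp(-c2_Ny)) ** K

# With c2*Ny = 16 (an assumed value), even many cameras leave the
# factor essentially at one.
for K in (1, 8, 64):
    print(K, success_factor(16.0, K))

# Empirical look at Lemma 5: the spectral norm of a T x Ny random +/-1
# matrix is on the order of sqrt(Ny) when T < Ny.
T, Ny = 16, 64
R = rng.choice([-1.0, 1.0], size=(T, Ny))
print(np.linalg.norm(R, 2) / math.sqrt(Ny))  # a modest constant, i.e. c9
```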

However, this factor is very close to one. For example, if c_2 N_y > 16, then

1 − 2 exp(−c_2 N_y) > 0.9999997.   (3.103)

Regarding the required number of measurements, note that when the sparsity basis is a three-dimensional transform of the form

Ψ_3D = Ψ_Y ⊗ Ψ_X ⊗ Ψ_T   (3.104)

we have µ(Ψ_3D) = ξ(Ψ_3D). Therefore, the bound on the number of measurements in Theorem 6 is larger than the one in Theorem 5 by a factor of N_y/T. As such, assuming N_y/T = O(1), the matrix Ȧ_cr satisfies the RIP with asymptotically the same number of measurements as Ȧ_p and with almost the same probability.

In a change of perspective, let us discuss the required number of measurements for Ȧ_f, Ȧ_p and Ȧ_cr with respect to the coherence of the matrices. The requisite number of measurements for Ȧ_f, Ȧ_p and Ȧ_cr scales with γ(Ψ_3D), µ(Ψ_3D) and ξ(Ψ_3D), respectively. When the sparsity basis is a 3D transform matrix, no matter what the underlying 1D transform matrices Ψ_X, Ψ_Y and Ψ_T are, we showed that γ(Ψ_3D) = √(N_x N_y). However, µ(Ψ_3D) and ξ(Ψ_3D), which are equal for a 3D basis, can take a value in the interval [1, √N]. Therefore, in the best-case scenario, where µ(Ψ_3D) = ξ(Ψ_3D) = 1, the measurement matrices Ȧ_p and Ȧ_cr compare favorably to a dense Gaussian matrix of the same size, in the sense that Ȧ_p and Ȧ_cr require the same number of measurements as a dense Gaussian matrix does to satisfy the RIP, up to a polylogarithmic factor [17].

The question now is how to find a 3D sparsity basis with the lowest coherence (and

dimensional-coherence)? From the definition of coherence, we can easily conclude that any 3D basis that is maximally incoherent with the canonical basis achieves the lowest coherence. Therefore, a good candidate is the Fourier basis, as it is known to be incoherent with the canonical basis. In fact, it is easy to verify that µ(Ψ_3D) = ξ(Ψ_3D) = 1 if Ψ_3D = Ψ_Y ⊗ Ψ_X ⊗ Ψ_T is the matrix representation of the 3D Fourier transform, with Ψ_Y, Ψ_X and Ψ_T being the 1D Fourier transform matrices in the vertical, horizontal and temporal directions.

Choosing the 3D Fourier basis for sparse representation of the HFV signal is plausible, as there exists high sample correlation in the spatio-temporal domain. This high correlation occurs due to the very high frame rate of the HFV signal. Therefore, it is expected that most of the 3D Fourier transform coefficients of the HFV signal are zero or near zero. Indeed, we have examined the power spectra of HFV signals found on the internet; they all exhibit exponential decay, and more than 94 per cent of their Fourier transform coefficients have a magnitude less than 0.1 after normalization.
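Both observations — unit (dimensional-)coherence of the 3D Fourier basis and the heavy concentration of the spectrum of a smooth video volume — can be sanity-checked numerically. The sketch below uses a synthetic translating Gaussian blob as a stand-in for a real HFV clip (an assumption; the percentage it prints is therefore illustrative, not the 94 per cent figure quoted above).

```python
import numpy as np

T, Ny, Nx = 16, 32, 32

# Synthetic HFV toy volume: a Gaussian blob translating 0.2 px/frame.
v, u = np.meshgrid(np.arange(Ny), np.arange(Nx), indexing="ij")
frames = np.stack([
    np.exp(-(((u - 12 - 0.2 * t) ** 2) + (v - 16) ** 2) / (2 * 3.0 ** 2))
    for t in range(T)
])

coeffs = np.fft.fftn(frames)                  # 3D Fourier transform
mags = np.abs(coeffs) / np.abs(coeffs).max()  # normalize to peak magnitude
fraction_small = np.mean(mags < 0.1)
print(f"fraction of coefficients below 0.1: {fraction_small:.3f}")

# Coherence of the 3D unitary DFT basis: sqrt(N) * max |entry|.
# Every entry of the unitary DFT has magnitude 1/sqrt(n) per dimension,
# so the product below equals 1 exactly.
N = T * Ny * Nx
max_entry = (1 / np.sqrt(T)) * (1 / np.sqrt(Ny)) * (1 / np.sqrt(Nx))
print("mu(Psi_3D) =", np.sqrt(N) * max_entry)
```

The blob's spectrum decays rapidly in all three dimensions, so the printed fraction is well above 0.9 for this toy volume.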

Chapter 4

Coded Exposure System with Random Rolling Shutter

In the multi-camera coded exposure systems proposed in Chapter 2, the coded exposure images of all constituent cameras are acquired in the same time duration. In other words, the pixels of all constituent cameras are assumed to start and stop their exposure at the same time. This can be realized by global shutter technology. In a camera system with a global shutter, all the pixels of the sensor array are exposed to the light in the same time duration. Once the exposure is finished, the accumulated charge is transferred to a single analog-to-digital converter (ADC) one pixel at a time. This requires the camera pixels to hold their charge until it can be sent to the ADC. Since the pixels of charge-coupled device (CCD) sensor arrays are capable of holding the accumulated charge after the exposure is finished, the global shutter is mostly used in CCD cameras. However, the cameras of the proposed multi-camera coded exposure systems use complementary metal-oxide semiconductor (CMOS) sensor arrays. Unlike the pixels of CCD sensor arrays, the pixels of CMOS sensor arrays cannot hold the accumulated charge, i.e., the accumulated charge should be sent to the ADC immediately

[Figure 4.1: Typical CMOS rolling shutter imager with different exposure times: (a) single line exposure, (b) multiple line exposure, (c) full frame exposure.]

after the exposure is finished. As such, most CMOS sensor arrays use what is called a rolling shutter. A CMOS camera with a rolling shutter is equipped with multiple ADCs, one for each column of pixels. The rows (or scanlines in general) of a CMOS camera with a rolling shutter start capturing light after receiving a reset signal. The reset signal is sent to the scanlines in sequence, starting from the top of the sensor array to the bottom. After the reset process has progressed some distance down the sensor array, the readout process begins by sending the select signal to the scanlines in the same order and at the same speed as the reset process. When a scanline receives the select signal, all the pixels of the scanline transfer their charges to their corresponding ADCs. Once the transfer is finished, the pixels are ready for the next exposure. The reset and select signals are generated by an address generator, which is a simple shift register. The shift register causes the reset and select signals to sweep the scanlines from top to bottom in an orderly fashion. The time difference between the reset and select signals of a scanline determines the exposure (integration) time. As depicted in Fig. 4.1, the exposure time of a CMOS rolling shutter camera can vary from a single line to a couple of lines or to a full frame time.

Although most CMOS cameras use rolling shutters, it is still possible to use global

shutters in CMOS sensor arrays. The most popular approach is called memory-in-pixel, which adds an extra memory element to each pixel in order to temporarily store the accumulated charge. In this scheme, which is similar to the interline-transfer approach in CCD sensor arrays, all pixels of the sensor array start and stop their exposure at the same time. When the exposure is finished, the accumulated charges are simultaneously transferred from the photo-diodes to the pixel-level memories. Once the transfer is complete, the readout process starts. Meanwhile, the photo-diodes can start capturing light for the next frame. Adding pixel-level memory increases the cost and complexity of the CMOS sensor array. It also reduces the in-pixel fill-factor, which results in a decrease in quantum efficiency.

As reviewed above, the rolling shutter CMOS image sensor architecture is among the simplest and most cost-effective designs in terms of semiconductor chip manufacturing. In comparison, all the different designs of coded exposure cameras up to now are far more complex and expensive. Even the frame-wise coded exposure scheme, which has the simplest implementation of coded exposure so far, requires the use of a global shutter. Aiming to make ultra-high speed cameras of coded exposure more practical and affordable, in this chapter we develop a coded exposure video/image acquisition system by an innovative assembling of multiple rolling shutter cameras. Each of the constituent rolling shutter cameras adopts a slightly modified pixel read-out mechanism: pixel rows are read out not sequentially from top to bottom as in current rolling shutter cameras, but rather in a random order. By simply changing the read-out order of pixel rows from sequential to random, the rolling shutter architecture realizes an apparatus of random coded exposure.

The aforementioned random row scanning can be effected by a straightforward

[Figure 4.2: Multi-camera coded rolling shutter system, showing top and side views of the row-wise and column-wise scanning cameras (cameras 1 through K_c and K_r), their exposure and readout timing, and the resulting sampling coverage over rows, columns and time.]

modification of the row addressing circuit of current rolling shutter image sensor chips, so that the reset and select signals are sent to traverse pixel rows at random rather than consecutively from top to bottom. Upon a row being selected, the pixels in it are read out sequentially via shift registers, much the same way as in conventional rolling shutter cameras. As such, the above modification does not use any additional in-pixel memory, nor does it decrease the fill-factor of the CMOS sensor array. In other words, one can turn conventional rolling shutter cameras into random coded exposure cameras without incurring any extra cost or reducing the light efficiency of image acquisition.

Conceptually, randomizing the exposure times of different pixel rows can be viewed as an inexpensive way of increasing the temporal sampling frequency, albeit at a low spatial resolution. But the low spatial resolution can be compensated by the use of multiple rolling shutter cameras that operate on different random row scanning sequences. Fig. 4.2 depicts an embodiment of the proposed multi-camera system. In this embodiment, the ultra-high speed video acquisition system is comprised of K = K_r + K_c rolling shutter CMOS image sensor arrays, each of which has its pixel rows read out in some random order. K_r of these K cameras are oriented horizontally, so that the scan-lines of the pixel read-out correspond to rows of the acquired image; K_c of these K cameras are oriented vertically, so that the scan-lines of the pixel read-out correspond to columns of the acquired image. By placing the random coded exposure rolling shutter cameras perpendicularly to one another, the imaging system can capture high frequency features of an image in different directions.

Collectively, at a given instance, or more precisely in a given very short duration of time, the pixels from the constituent cameras constitute a set of samples randomly

distributed in the two-dimensional image space. This set of random samples can be used to recover the underlying image/frame free of motion blur by sparsity-based image restoration methods or other computational imaging methods.

4.1 Coded Rolling Shutter Architecture in the Literature

The idea of a coded rolling shutter was first proposed by Gu et al. [32]. In their work, they proposed two coding schemes for CMOS camera systems with rolling shutter, called interlaced readout and staggered readout, for better sampling of the time dimension. The interlaced readout scheme of [32], depicted in Fig. 4.3a, is similar to interlacing in video broadcasting systems. In this scheme, the total readout time for one frame is uniformly distributed over F sub-images. Therefore, assuming the vertical resolution of a full image is L, each sub-image has L/F rows. Like the previous scheme, the staggered readout scheme of [32], depicted in Fig. 4.3b, captures F sub-images with a vertical resolution of L/F. However, this scheme reverses the order of readout within every F neighboring rows. Cubic and bidirectional interpolation is applied to the captured sub-images to recover images without skew at a higher temporal resolution.

Gu et al. also proposed two coded exposure schemes, called adaptive row-wise auto-exposure (AE) and staggered readout and coded exposure, for high dynamic range (HDR) imaging. The adaptive row-wise auto-exposure scheme of [32], depicted in Fig. 4.4a, estimates the optimal exposure time for each row of the sensor array. Then, the timing of the reset signal of each row is altered such that each row is exposed to the

[Figure 4.3: Coded readout schemes proposed in [32] for better sampling of the time dimension: (a) interlaced readout (F = 2), (b) staggered readout (F = 2).]

[Figure 4.4: Coded exposure schemes proposed in [32] for HDR imaging: (a) adaptive row-wise AE, (b) staggered readout and coded exposure.]

light for its estimated optimal duration. As depicted in Fig. 4.4b, the staggered readout and coded exposure scheme of [32] uses staggered readout with F = 3 to capture 3 sub-images, each of which has a predefined exposure time. These sub-images are resized vertically to full resolution using cubic interpolation. To remove motion blur, they first estimate the motion vector using the captured sub-images. The computed flow is used to estimate the two blur kernels, which are then used in a deblurring algorithm.

Although it is possible to achieve a random coded exposure rolling shutter with the design in [32], they did not propose a design similar to our random coded exposure rolling shutter. None of the proposed schemes in [32] performs random readout of

scanlines. Only the adaptive row-wise auto-exposure scheme utilizes some kind of randomness for the exposure time of scanlines. Also, our proposed camera system utilizes multiple cameras with random coded rolling shutters, as opposed to the single-camera approach in [32].

4.2 Multi-camera Coded Exposure System with Coded Rolling Shutter Cameras

The proposed multi-camera coded exposure system consists of K = K_r + K_c coded rolling shutter cameras, each of which has a spatial resolution of L × L pixels. K_r of the cameras scan the scene in a row-wise fashion and K_c of them are rotated 90 degrees to sweep the scene in a column-wise fashion. The readout and exposure times of the scanlines of each row-wise scanning camera are controlled by two random vectors, p^{k_r}_r and e^{k_r}_r, 1 ≤ k_r ≤ K_r. Similarly, the readout and exposure times of the scanlines of each column-wise scanning camera are controlled by two random vectors, p^{k_c}_c and e^{k_c}_c, 1 ≤ k_c ≤ K_c. The random vectors {p^{k_r}_r}_{k_r=1}^{K_r} and {p^{k_c}_c}_{k_c=1}^{K_c} are random permutations of the integers from 1 to L which determine the readout order of the scanlines, i.e., scanline l is the p^{k_r}_r(l)-th one to be read out in row-wise scanning camera k_r. Let t_r denote the readout time of a single scanline. For some positive integer J, let E_max = J t_r denote the maximum exposure time of the scanlines. The elements of the random vectors {e^{k_r}_r}_{k_r=1}^{K_r} and {e^{k_c}_c}_{k_c=1}^{K_c}, which are chosen from the set {1, 2, …, J} uniformly at random, determine the exposure time of the scanlines. Specifically, the exposure time of scanline l for row-wise scanning camera k_r is e^{k_r}_r(l) t_r. The address generator of each row-wise (column-wise) scanning camera uses p^{k_r}_r and e^{k_r}_r (p^{k_c}_c and e^{k_c}_c) to determine

the time instances at which the reset and select signals should be sent to the scanlines. Specifically, for scanline l of row-wise scanning camera k_r, 1 ≤ k_r ≤ K_r, the address generator sends the reset signal at time

t^{k_r}_r(l) = E_max + ( p^{k_r}_r(l) − e^{k_r}_r(l) − 1 ) t_r   (4.1)
= ( J + p^{k_r}_r(l) − e^{k_r}_r(l) − 1 ) t_r   (4.2)

and sends the select signal at time

t̂^{k_r}_r(l) = E_max + ( p^{k_r}_r(l) − 1 ) t_r   (4.3)
= ( J + p^{k_r}_r(l) − 1 ) t_r.   (4.4)

To represent the above equations in matrix form, let

t^{k_r}_r = [ t^{k_r}_r(1); t^{k_r}_r(2); …; t^{k_r}_r(L) ]   (4.5)

and

t̂^{k_r}_r = [ t̂^{k_r}_r(1); t̂^{k_r}_r(2); …; t̂^{k_r}_r(L) ].   (4.6)

We then have

t^{k_r}_r = ( p^{k_r}_r − e^{k_r}_r + J − 1 ) t_r   (4.7)

[Figure 4.5: Timing setting for the reset and select signals in a coded rolling shutter camera system with L = 10 and J = 9; for scanline l = 8, the figure marks the frame time t_f, the maximum exposure time E_max, the frame readout time, the exposure window, and the addresses for the reset and select signals.]

and

t̂^{k_r}_r = ( p^{k_r}_r + J − 1 ) t_r.   (4.8)

Similarly, for column-wise scanning camera k_c, 1 ≤ k_c ≤ K_c, we have

t^{k_c}_c = ( p^{k_c}_c − e^{k_c}_c + J − 1 ) t_r   (4.9)

and

t̂^{k_c}_c = ( p^{k_c}_c + J − 1 ) t_r.   (4.10)

To better understand this process, we show a simple example for a coded rolling shutter with L = 10 and J = 9 in Fig. 4.5. For scanline l = 8, we have e(8) = 4 and

p(8) = 8. The address generator sends the reset signal to scanline l = 8 at time

t(8) = ( p(8) − e(8) + 9 − 1 ) t_r   (4.11)
= ( 8 − 4 + 9 − 1 ) t_r   (4.12)
= 12 t_r   (4.13)

and sends the select signal to it at time

t̂(8) = ( p(8) + 9 − 1 ) t_r   (4.14)
= ( 8 + 9 − 1 ) t_r   (4.15)
= 16 t_r.   (4.16)

Now define the frame time, denoted by t_f, as the amount of time required to capture and read out the whole frame. It is clear that

t_f = E_max + L t_r = (L + J) t_r.   (4.17)

Assume we want to recover T target frames in the frame time t_f. To simplify the problem, we choose T ≤ (J + L) such that (J + L) is divisible by T. The temporal resolution of the T target frames, denoted by t_t, is

t_t = t_f / T = ((L + J)/T) t_r.   (4.18)

Let r^{k_r}_v = [ r^{k_r}_v(1), r^{k_r}_v(2), …, r^{k_r}_v(T) ] be the binary random sequence of row v, 1 ≤

v ≤ L, of the row-wise scanning camera k_r, 1 ≤ k_r ≤ K_r. We then have

r^{k_r}_v(t) = 1 if ⌊ t^{k_r}_r(v)/t_t ⌉ ≤ t − 1 < ⌊ t̂^{k_r}_r(v)/t_t ⌉, and 0 otherwise   (4.19)

where ⌊x⌉ gives the closest integer to x. Similarly, for the binary random sequence c^{k_c}_u = [ c^{k_c}_u(1), c^{k_c}_u(2), …, c^{k_c}_u(T) ] of column u, 1 ≤ u ≤ L, of column-wise scanning camera k_c, 1 ≤ k_c ≤ K_c, we have

c^{k_c}_u(t) = 1 if ⌊ t^{k_c}_c(u)/t_t ⌉ ≤ t − 1 < ⌊ t̂^{k_c}_c(u)/t_t ⌉, and 0 otherwise.   (4.20)

The role of the equations in (4.19) and (4.20) is to adjust the resolution of the exposure and readout process to the desired temporal resolution of the T target images. To better understand how r^{k_r}_v and c^{k_c}_u are calculated, we plot the capturing process of a row-wise and a column-wise scanning camera in Fig. 4.6. In Fig. 4.6, L = 12, J = 12 and T = 8. Therefore, from (4.18) we have t_t = 3 t_r. The start of exposure and readout of row v = 9 of the row-wise scanning camera are t_r(9) = t_r and t̂_r(9) = 14 t_r, respectively. Therefore,

⌊ t_r(9)/t_t ⌉ = 0   (4.21)

and

⌊ t̂_r(9)/t_t ⌉ = 5.   (4.22)
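The timing and quantization rules above are simple enough to check in a few lines. The sketch below (illustrative code, not camera firmware) re-derives the two worked examples, with the scanline indices and parameters taken from Figs. 4.5 and 4.6:

```python
def reset_time(p, e, J):
    # Reset-signal time of a scanline in units of t_r (Eq. 4.2).
    return J + p - e - 1

def select_time(p, J):
    # Select-signal time of a scanline in units of t_r (Eq. 4.4).
    return J + p - 1

def binary_sequence(reset_t, select_t, t_t, T):
    # Eq. (4.19): quantize the exposure window [reset, select), given in
    # units of t_r, to the grid of T target frames with step t_t.
    # Python's round() is nearest-integer (ties to even), which matches
    # the closest-integer rounding of the text away from exact .5 ties.
    start, stop = round(reset_t / t_t), round(select_t / t_t)
    return [1 if start <= t - 1 < stop else 0 for t in range(1, T + 1)]

# Worked example of Fig. 4.5: J = 9, p(8) = 8, e(8) = 4.
print(reset_time(8, 4, 9), select_time(8, 9))   # -> 12 16

# Worked example of Fig. 4.6: t_t = 3*t_r, row v = 9 with reset at t_r
# and select at 14*t_r, quantized to T = 8 target frames.
print(binary_sequence(1, 14, 3, 8))             # -> [1, 1, 1, 1, 1, 0, 0, 0]
```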

[Figure 4.6: Calculating the random binary sequences of a row-wise scanning camera and a column-wise scanning camera with coded rolling shutter. The number of scanlines is L = 12, J = 12 and the number of target frames is T = 8. The 0s and 1s are the values of r_v and c_u determined using (4.19) and (4.20); the mismatch between the exposure windows and the target-frame grid is due to rounding.]

From (4.19), we get the following binary random sequence for row v = 9:

r_9 = [ 1, 1, 1, 1, 1, 0, 0, 0 ].   (4.23)

The above process of adjusting the temporal resolution introduces additional noise in the measurements, which is due to rounding the exposure time up or down. Increasing T, i.e., increasing the temporal resolution of the target frames, will reduce this noise. However, increasing the temporal resolution reduces the sub-sampling ratio, i.e., the number of random measurements over the signal length. This, in turn, can lead to reduced quality of the recovered target frames. Nevertheless, since the readout time of a single scanline

is very short (on the scale of microseconds), the magnitude of the noise caused by the temporal resolution adjustment is usually small.

Now let f_1, f_2, …, f_T denote the T target sharp images and let the vector

f_{u,v} = [ f_1(u, v); f_2(u, v); …; f_T(u, v) ]   (4.24)

be the time series of pixels at spatial location (u, v). Row-wise scanning camera k_r, 1 ≤ k_r ≤ K_r, modulates the time signal f_{u,v} with r^{k_r}_v and makes the random measurement

y^r_{k_r}(u, v) = ⟨ r^{k_r}_v, f_{u,v} ⟩ + n^r_{k_r}(u, v)   (4.25)

where n^r_{k_r}(u, v) is the measurement noise of the camera at location (u, v). Similarly, column-wise scanning camera k_c, 1 ≤ k_c ≤ K_c, modulates the time signal f_{u,v} with c^{k_c}_u and makes the random measurement

y^c_{k_c}(u, v) = ⟨ c^{k_c}_u, f_{u,v} ⟩ + n^c_{k_c}(u, v)   (4.26)

where n^c_{k_c}(u, v) is the measurement noise of the camera at location (u, v). Now let the vector

y_{u,v} = [ y^r_1(u, v); …; y^r_{K_r}(u, v); y^c_1(u, v); …; y^c_{K_c}(u, v) ]   (4.27)

be the K = K_r + K_c random measurements of f_{u,v} captured by the K_r row-wise scanning and K_c column-wise scanning coded rolling shutter cameras. The measurement vector y_{u,v} of f_{u,v} is

y_{u,v} = B^{rl}_{u,v} f_{u,v} + n_{u,v}   (4.28)

where

n_{u,v} = [ n^r_1(u, v); …; n^r_{K_r}(u, v); n^c_1(u, v); …; n^c_{K_c}(u, v) ]   (4.29)

is the measurement error vector and B^{rl}_{u,v} is the following K × T binary random

measurement matrix

B_{u,v}^{rl} = [r_v^1; ...; r_v^{K_r}; c_u^1; ...; c_u^{K_c}]   (4.30)

whose K rows are the binary exposure sequences of the K cameras. For notational convenience, the three-dimensional T × L × L pixel grid of the T target images is written as a super vector f, formed by stacking all T·L·L pixels in question. Also, let y be the vector formed by stacking all L² measurement vectors y_{u,v} of length K, 1 ≤ u, v ≤ L. Then the capturing process of the K = K_r + K_c camera coded exposure system with coded rolling shutter can be stated as

y = A^{rl} f + n   (4.31)

where A^{rl} is the KL² × TL² block diagonal matrix

A^{rl} = diag( B_{1,1}^{rl}, B_{1,2}^{rl}, ..., B_{L,L}^{rl} )   (4.32)

with K × T zero blocks off the diagonal. This way, the K-camera coded exposure system with coded rolling shutter captures K random projections of the 3D spatio-temporal volume in the temporal direction. The recovery of the T target frames from the K captured images, by exploiting the temporal and spatial sparsity of the 3D data volume, is the subject of the next chapter.
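As a toy illustration, the per-pixel measurement process (4.28)-(4.30), together with the rounding of an exposure interval onto the target-frame grid, can be sketched as follows. The majority-overlap rounding in `interval_to_binary` is an assumed reading of (4.19)/(4.20), and all sizes and sequences are toy values rather than the thesis configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, Kr, Kc = 12, 8, 2, 2      # scanlines, target frames, cameras

def interval_to_binary(t0, t1, T, total):
    # Round an exposure interval [t0, t1) onto T target-frame slots:
    # slot t gets a 1 when the interval covers more than half of it
    # (an assumed reading of the rounding in (4.19)/(4.20)).
    slot = total / T
    return np.array([1 if min(t1, (t + 1) * slot) - max(t0, t * slot) > slot / 2
                     else 0 for t in range(T)])

# an exposure window covering the first 5 of 8 slots gives the
# pattern of r_9 in (4.23)
r9 = interval_to_binary(0.0, 5.0, 8, 8.0)       # [1 1 1 1 1 0 0 0]

# toy binary exposure sequences r_v^k (row-wise) and c_u^k (column-wise)
r = rng.integers(0, 2, size=(Kr, L, T))
c = rng.integers(0, 2, size=(Kc, L, T))
f = rng.random((T, L, L))                       # toy HFV signal f_t(u, v)

def B_rl(u, v):
    # K x T measurement matrix of pixel (u, v), eq. (4.30):
    # one binary exposure sequence per camera.
    return np.vstack([r[:, v, :], c[:, u, :]])

# noise-free per-pixel measurements y_{u,v} = B_{u,v}^{rl} f_{u,v}, eq. (4.28)
y = np.stack([[B_rl(u, v) @ f[:, u, v] for v in range(L)] for u in range(L)])
```

Stacking the `B_rl(u, v)` blocks along the diagonal reproduces the block diagonal matrix A^{rl} of (4.32).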

Chapter 5

Recovery of High Frame-rate Video Signals

5.1 Sparse Analysis Model

The recovery of the original 3D signal f (the T target HFV frames) from the K images of coded exposure is, of course, a severely underdetermined inverse problem and has an infinite number of solutions. In Chapter 3 we show that if the coefficient vector x of signal f in basis Ψ is nearly sparse, it is possible, with high probability, to recover a good approximation of x by solving the following ℓ₁ minimization problem:

x* = arg min_{x̂} ‖x̂‖₁ subject to ‖AΨx̂ − y‖₂ ≤ σ.   (5.1)

The recovery model defined in (5.1) is usually called the sparse synthesis model, which by now has solid theoretical foundations and is a well-established field [44]. Alongside this approach, there is the sparse analysis model, which uses a possibly redundant analysis operator Θ ∈ R^{P×N} (P ≥ N) to exploit the sparsity of signal f, i.e., signal f belongs to the

analysis model if ‖Θf‖₀ is small enough [45]. In this section, we develop an HFV recovery method based on a sparse analysis model which exploits the strong temporal and spatial correlations of the HFV signal.

If we assume that the object(s) in the video scene have flat surfaces and are illuminated by a parallel light source such as the sun, then the 2D intensity function f_t(u, v) of frame t can be approximated by a piecewise constant function based on the Lambertian illumination model. By the same reasoning, we can use a piecewise linear model of f_t(u, v) if the light source is close to the objects and/or the objects in the scene have surfaces of small curvature. Assuming that each target frame f_t(u, v) is a 2D piecewise linear function, the Laplacian ∇²_{u,v} f_t of f_t(u, v) is zero or near zero at most pixel positions of the uv plane and takes on large magnitudes only at object boundaries and in texture areas. In other words, ∇²_{u,v} f_t offers a sparse representation of f_t which results from intra-frame spatial correlations.

The other source of sparsity is rooted in the temporal correlations of HFV signals. Precisely because of the high frame rate, the object motion between two adjacent frames is very small in magnitude. As such, most pixels will remain in the same object from frame f_t to f_{t+1}. Moreover, general affine motion can be satisfactorily approximated by translational motion (du, dv) if it is small enough. As long as the small motion (du, dv) does not move a pixel outside of an object whose intensity function is linear, i.e., f_t(u, v) = au + bv + c, we have

Δ_t f_t(u, v) = f_{t+1}(u, v) − f_t(u, v) = f_t(u + du, v + dv) − f_t(u, v) = a·du + b·dv.   (5.2)
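The two sparsity sources just described (the per-frame Laplacian and the near-constancy of the temporal difference) can be checked numerically on a toy sequence satisfying the assumptions above. This is only a sketch; forward differences and wrapped borders are implementation choices, not the thesis' discretization:

```python
import numpy as np

def analysis_ops(f):
    # Per-frame spatial Laplacian of a T x N x N video (borders wrap
    # in this sketch) ...
    lap = (np.roll(f, 1, axis=1) + np.roll(f, -1, axis=1) +
           np.roll(f, 1, axis=2) + np.roll(f, -1, axis=2) - 4 * f)
    # ... and the spatial gradient of the temporal first difference.
    dt = np.diff(f, axis=0)          # Delta_t f
    gu = np.diff(dt, axis=1)         # d(Delta_t f)/du
    gv = np.diff(dt, axis=2)         # d(Delta_t f)/dv
    return lap, gu, gv

# a linear intensity ramp whose brightness changes uniformly in time:
# both operators vanish away from the wrapped borders of the Laplacian
T, N = 4, 8
u, v = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
f = np.stack([2.0 * u + 3.0 * v + 5.0 * t for t in range(T)])
lap, gu, gv = analysis_ops(f)
```

On a real HFV signal these outputs are not exactly zero, but they are near zero almost everywhere, which is the sparsity the recovery exploits.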

This implies that the first-order temporal difference Δ_t f_t(u, v) remains constant in the intersection region of an object segment across two adjacent frames. By considering Δ_t f_t as a 2D function in the uv plane, it follows from (5.2) that Δ_t f is piecewise constant. Therefore, the total variation of the 2D function Δ_t f, namely ∇_{u,v}(Δ_t f), is another sparse representation of f.

Using the two sparsity models described above, we can now define the redundant analysis operator Θ of size (2T·N_x·N_y) × (T·N_x·N_y) as

Θ = [Θ₁; Θ₂]   (5.3)

where Θ₁ and Θ₂, each of size (T·N_x·N_y) × (T·N_x·N_y), are matrix representations of the Laplacian operator ∇²_{u,v} and the ∇_{u,v}(Δ_t) operator, respectively. Since, as explained above, Θf is sparse, we can consider Θ to be the redundant analysis operator for the HFV signal f. Therefore, we can formulate the HFV recovery problem in the context of this sparse analysis model as

f* = arg min_{f̂} ‖Θf̂‖₁ subject to ‖Af̂ − y‖₂ ≤ σ   (5.4)

where σ is determined by the variance of the measurement error. The above minimization problem is a convex optimization problem, which can be solved efficiently using one of many standard optimization techniques.

In general, the sparse synthesis model defined in (5.1) and the above sparse analysis model are different. In the special case where Θ is orthonormal, the two models are the same, with Θ = Ψ⁻¹ [45]. Although a large number of applications for (5.4) are found, the

theoretical study of the sparse analysis model is not as thorough and well established as that of the sparse synthesis model in the compressive sensing literature. Recently, Li in [44] addressed this gap by introducing the following generalized RIP.

Definition 7. Measurement matrix A satisfies the generalized RIP of order S with isometry constant 0 < δ_S < 1 if

(1 − δ_S)‖Θf‖₂² ≤ ‖Af‖₂² ≤ (1 + δ_S)‖Θf‖₂²   (5.5)

holds for all f which are S-sparse after transformation by Θ, i.e., ‖Θf‖₀ ≤ S.

Li shows that if measurement matrix A satisfies the generalized RIP with δ_{2S} < √2 − 1, it is guaranteed that the sparse analysis model defined in (5.4) can accurately recover a signal f which is sparse in an arbitrary over-complete and coherent operator Θ. It is also shown that nearly all random matrices satisfying the RIP satisfy the generalized RIP as well. However, to the best of our knowledge, no theoretical results are known on the generalized RIP for the type of random measurement matrices used in this thesis. Nevertheless, the experimental results (Chapter 7) show that, for a given number of cameras, the quality of HFV signals recovered using (5.4) is better than that of HFV signals recovered using the sparse synthesis model defined in (5.1).

5.2 Dictionary Based Sparse Coding

The sparse analysis recovery model (5.4) uses a fixed global sparse representation operator Θ. However, natural images/videos are statistically non-stationary and their sparsity exhibits itself in spaces that vary across the spatial domain. The quality of the recovered signal can be increased if one can identify and exploit the varying local structures of the

signal f. This can be done by employing universal dictionaries learned from example image patches, as they can better adapt to local image structure. Mathematically, the sparse representation model with a learned dictionary D assumes that a signal f ∈ R^N can be represented as f ≈ Dx, where x is nearly sparse. The sparse decomposition of f is obtained by solving the following ℓ₁ minimization problem

x_f = arg min_{x̂} ‖f − Dx̂‖₂² + λ‖x̂‖₁   (5.6)

where λ is the regularization parameter that balances the sparse approximation error against the sparsity of x. In the context of compressive sensing, one can recover f from its random measurements y by first sparsely coding y with respect to D via the following minimization problem

x_y = arg min_{x̂} ‖ADx̂ − y‖₂² + λ‖x̂‖₁   (5.7)

where A is the random measurement matrix. An estimate of f can then be reconstructed as f* = D x_y.

In order to represent the various local image structures, many dictionary learning methods use a universal and over-complete dictionary. However, it has been shown that sparse coding with an over-complete dictionary is unstable [46]. Dong et al. [47] proposed an adaptive sparse domain selection (ASDS) scheme for sparse representation which uses a set of compact sub-dictionaries rather than a single universal over-complete dictionary. Specifically, they cluster the training patches into D clusters. Since the patches in a cluster are similar, there is no need to learn an over-complete dictionary for each cluster [48]. As such, the simple principal component analysis (PCA)

technique can be used to learn a sub-dictionary for each cluster. Then, for a given patch, a compact PCA sub-dictionary is adaptively selected to code it. Since the given patch can be better represented by the adaptively selected sub-dictionary, the whole image can be reconstructed more accurately than with a universal dictionary [47].

In general, it is difficult to recover the original true sparse code x_f from y, due to the presence of noise in the measurement vector y. To address this problem, Dong et al. [48] extended their work in [47] and proposed a non-locally centralized sparse representation (NCSR) model that reduces the sparse coding noise (SCN), x_y − x_f, by centralizing the sparse codes to some good estimate of x_f. In the next section, we propose a recovery algorithm based on the NCSR model of [48] to recover HFV signals.

5.2.1 Dictionary Based Sparse Coding for HFV Recovery

For now, assume the D orthonormal sub-dictionaries {D_d}_{d=1}^D are given. We will later explain how the set of D sub-dictionaries can be built. Let f denote the vector representation of the N_x × N_y × T HFV signal. Let f_{u,v,t} ∈ R^n, 1 ≤ u ≤ N_x, 1 ≤ v ≤ N_y and 1 ≤ t ≤ T, denote the vector representation of the √n × √n patch centered at location (u, v) and time t of the HFV signal. We then have

f_{u,v,t} = P_{u,v,t} f   (5.8)

where P_{u,v,t} ∈ R^{n×N} is the matrix extracting patch f_{u,v,t} from f and N = T·N_x·N_y. Assume that for patch f_{u,v,t} a sub-dictionary D_{d_{u,v,t}} is selected. Patch f_{u,v,t} can be sparsely represented as f_{u,v,t} ≈ D_{d_{u,v,t}} x_{f_{u,v,t}} by solving (5.6). Once all the sparse codes x_{f_{u,v,t}},

1 ≤ u ≤ N_x, 1 ≤ v ≤ N_y and 1 ≤ t ≤ T, are estimated, the whole HFV signal can be reconstructed by solving an over-determined system. According to [49], this over-determined system has the following closed-form solution

f̂ = ( Σ_{u,v,t} P_{u,v,t}ᵀ P_{u,v,t} )⁻¹ ( Σ_{u,v,t} P_{u,v,t}ᵀ D_{d_{u,v,t}} x_{f_{u,v,t}} ).   (5.9)

In words, the above equation reconstructs f by simply averaging the reconstructed patches f_{u,v,t}. Note that the matrix to be inverted in the above equation is diagonal. As such, the calculation in (5.9) can be done in a pixel-by-pixel fashion [49]. Also, one can use overlapping patches to better suppress noise and blocking artifacts. For notational convenience, let

D ∘ x_f = ( Σ_{u,v,t} P_{u,v,t}ᵀ P_{u,v,t} )⁻¹ ( Σ_{u,v,t} P_{u,v,t}ᵀ D_{d_{u,v,t}} x_{f_{u,v,t}} )   (5.10)

where D is the concatenation of all sub-dictionaries {D_d} and x_f denotes the concatenation of all {x_{f_{u,v,t}}}.

In the proposed coded exposure acquisition systems, the objective is to recover f from the random measurements y = Af + n, where A is any of A^f, A^p, A^{cr} and A^{rl}. Using the above sparse coding, one can recover f from y by solving the following minimization problem

x_y = arg min_{x̂} ‖y − AD ∘ x̂‖₂² + λ‖x̂‖₁.   (5.11)

Once x_y is recovered, the HFV signal f can be reconstructed as f* = D ∘ x_y. The hope is that x_y is as close as possible to x_f (the true sparse code of the HFV signal), i.e. that

ν_x = x_y − x_f is as close as possible to zero. However, as stated in [48], this is usually not the case, due to the presence of noise in the captured measurements. If the true sparse codes x_f were known, one could improve the quality of the recovered signal f by suppressing ν_x. In general, x_f is unknown, but we can obtain a good estimate of it, as we will explain later. Let β denote the estimate of x_f. Using β and the centralized sparse representation (CSR) model proposed by Dong et al. in [50], we can improve the accuracy of x_y by solving the following minimization problem

x_y = arg min_{x̂} ‖y − AD ∘ x̂‖₂² + λ Σ_{u,v,t} ‖x̂_{u,v,t}‖₁ + γ Σ_{u,v,t} ‖x̂_{u,v,t} − β_{u,v,t}‖₁   (5.12)

where β_{u,v,t} is a good estimate of x_{f_{u,v,t}}, γ is the regularization parameter and x̂ is the concatenation of all {x̂_{u,v,t}}. The last term in the above minimization problem suppresses ν_x by centralizing the sparse codes to the estimate β of x_f.

In (5.12), the term ‖x̂_{u,v,t}‖₁, which is conventionally used in sparse representation models, ensures that x̂_{u,v,t} is sparse. In other words, it ensures that only a small number of atoms are selected from the over-complete dictionary. Our recovery algorithm does not use an over-complete dictionary but a collection of sub-dictionaries. By adaptively selecting a sub-dictionary for each patch, we are indirectly setting the coefficients of this patch to zero for all other sub-dictionaries. This already ensures that the coding coefficients are sparse. As such, the term ‖x̂_{u,v,t}‖₁ can be dropped, leading to the following minimization problem

x_y = arg min_{x̂} ‖y − AD ∘ x̂‖₂² + γ Σ_{u,v,t} ‖x̂_{u,v,t} − β_{u,v,t}‖₁.   (5.13)

We use the above minimization problem, which is called the non-locally centralized sparse

representation (NCSR) model, to recover the HFV signal f* = D ∘ x_y. Next, we discuss how to build the PCA sub-dictionaries and how to obtain a good estimate β of the unknown sparse codes x_f.

5.2.2 Building PCA Sub-dictionaries

We use the adaptive sparse domain selection strategy proposed in [47] to build the PCA sub-dictionaries. However, rather than using a set of high-quality natural images as the training set, we use frames of an initial estimate of the HFV signal. This initial estimate can be obtained by solving the minimization problem (5.4). Given the estimated HFV signal, we crop a large number of √n × √n patches from it. In building the PCA sub-dictionaries, smooth patches are discarded by selecting only patches with large enough intensity variance, i.e., patches whose intensity variance is greater than some threshold. Let C = {s_1, s_2, ..., s_Q} be the set of selected patches, where s_i ∈ R^n is the vector representation of patch i. As in [47], using the output of a high-pass filter as the feature for clustering, the Q patches are partitioned into D clusters as follows. Let C^h = {s_1^h, s_2^h, ..., s_Q^h} denote the outputs of the high-pass filter applied to the members of the set C. We partition C^h into D subsets {C_1^h, C_2^h, ..., C_D^h} using the K-means algorithm. Once C^h is partitioned, we can partition C into D subsets {C_1, C_2, ..., C_D} accordingly.

The next step is to learn a sub-dictionary D_d for the cluster C_d, 1 ≤ d ≤ D. Conventionally, a dictionary is learned from a set of patches by jointly optimizing the dictionary and the representation coefficients. This joint optimization problem is non-convex and computationally costly [47]. As stated in [47], since the elements of C_d have similar patterns, there is no need to learn an over-complete

dictionary for each dataset C_d. As such, Dong et al. [47] proposed to use PCA to approximate the joint optimization problem of dictionary learning. Specifically, they obtained the orthonormal sub-dictionary D_d for dataset C_d by applying PCA to the covariance matrix of C_d.

Once all D orthonormal sub-dictionaries are formed, one can adaptively assign a sub-dictionary to each patch that is to be coded. Since the HFV signal f is unknown, an initial estimate of the HFV signal, denoted by f̃, is used. Let f̃_{u,v,t} denote a local patch of f̃ at location (u, v) and time t. The best fitting sub-dictionary for f̃_{u,v,t} is selected by solving the following minimization problem

d_{u,v,t} = arg min_d ‖f̃_{u,v,t}^h − µ_d‖₂   (5.14)

where f̃_{u,v,t}^h is the high-pass filtered patch f̃_{u,v,t} and {µ_d}_{d=1}^D are the centroids of the datasets {C_d^h}_{d=1}^D. Once a best fitting sub-dictionary is selected for each patch of f̃, x_y can be updated by solving the minimization problem (5.13). The estimate of f can then be updated by letting f̃ = D ∘ x_y. The updated f̃ is then used to learn a new set of PCA sub-dictionaries, which are then assigned to the patches of the updated f̃. This iterative process continues until the estimate f̃ converges.

5.2.3 Non-local Estimate of the Unknown Sparse Code

As discussed before, the minimization problem (5.13) uses an estimate of the true sparse code x_f to suppress the sparse coding noise. In this section we explain how a good estimate β of the true sparse codes can be obtained.

While there are many ways to estimate x_f, such as using training images which

are similar to the frames of the target HFV signal, we use the method proposed in [48], which estimates β from the input data f̃. To find a good estimate of the true sparse code for a given patch f̃_{u,v,t} at location (u, v) and time t, one first searches for non-local similar patches in a large 3D window centered at (u, v, t). It is expected that many patches similar to f̃_{u,v,t} will be found, based on the fact that natural images usually contain repetitive structures [51]. Let Λ_{u,v,t} denote the set of sparse codes of the patches similar to f̃_{u,v,t}. An estimate of the true sparse code of the patch at location (u, v) and time t can then be obtained as

β_{u,v,t} = Σ_{x_{u,v,t}^j ∈ Λ_{u,v,t}} ω_{u,v,t}^j x_{u,v,t}^j   (5.15)

where ω_{u,v,t}^j is a weight calculated as follows [51]:

ω_{u,v,t}^j = (1/W) exp( −‖f̃_{u,v,t} − f̃_{u,v,t}^j‖₂² / h )   (5.16)

where h is a predetermined scalar and W is the normalization factor. As discussed before, the minimization problem (5.13) is solved iteratively, using the HFV signal recovered by solving (5.4) as the initial estimate of the HFV signal. Using the initial estimate, the PCA sub-dictionaries are built and the non-local estimates {β_{u,v,t}^{(0)}} are obtained. At iteration l, the estimate of signal f is improved, which in turn improves the accuracy of {β_{u,v,t}^{(l)}}, and so on. This iterative process stops once the optimization problem falls into a local minimum.
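The non-local estimate (5.15)-(5.16) amounts to a normalized, exponentially weighted average of the sparse codes of similar patches. A minimal sketch (the search for similar patches inside the 3D window is omitted; `h` is the predetermined scalar of (5.16)):

```python
import numpy as np

def nonlocal_estimate(patch, similar_patches, codes, h=10.0):
    # Weights of (5.16): exp(-||patch difference||_2^2 / h),
    # normalized by the factor W (here, the sum of the raw weights).
    d2 = np.array([np.sum((patch - p) ** 2) for p in similar_patches])
    w = np.exp(-d2 / h)
    w /= w.sum()
    # Weighted average of the sparse codes, eq. (5.15).
    return w @ np.asarray(codes)
```

For identical candidate patches the weights are equal and the estimate reduces to the plain mean of their codes.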

5.3 Algorithm Complexity

The recovery algorithm proposed in Section 5.1 is based on Total Variation minimization by Augmented Lagrangian and Alternating direction Algorithms (TVAL3), which was developed by Li [52]. The dictionary based recovery algorithm in Section 5.2 also uses TVAL3 in each iteration. As stated in [52], the theoretical analysis of the convergence and convergence rate of the TVAL3 scheme has not yet been fully investigated. As such, we cannot precisely calculate the running time of the recovery algorithms proposed in this thesis. In this section we aim to give an estimate of the running time of the proposed recovery algorithms.

5.3.1 Complexity of the Sparse Analysis Recovery Algorithm

Recall that the recovery algorithm based on the sparse analysis model solves the following optimization problem:

min_f ‖Θf‖₁ subject to ‖Af − y‖₂ ≤ σ   (5.17)

which is equivalent to

min_{w,f} ‖w‖₁ subject to ‖Af − y‖₂ ≤ σ and w = Θf.   (5.18)

The augmented Lagrangian problem corresponding to (5.18) is

min_{w,f} ‖w‖₁ − νᵀ(Θf − w) − λᵀ(Af − y) + (β/2)‖Θf − w‖₂² + (µ/2)‖Af − y‖₂².   (5.19)

The pseudocode of the recovery algorithm based on the sparse analysis model is presented in Algorithm 1. Computing the step length (Line 5) requires O(KN) arithmetic operations. Line 6 also requires O(KN) arithmetic operations. Lines 7 to 11 require O(N) operations each. Therefore, the total number of arithmetic operations inside the loop is O(KN). Note that if the matrix Θ were dense, the number of arithmetic operations would be O(N²); but since Θ is a sparse matrix, with O(1) non-zeros in each row, we only require O(N) arithmetic operations for the multiplication. Similarly, since A is a block diagonal matrix rather than a dense one, matrix multiplication by A requires O(KN) arithmetic operations.

In order to estimate the complexity of TVAL3, we need to know how many iterations are required to converge. The exact number of iterations depends on each problem's settings and, as stated in [52], it has not yet been fully determined. However, based on our experiments, the while loop usually terminates after O(√N) iterations. As such, an estimate of the number of arithmetic operations of the TVAL3 optimization function is O(KN^{3/2}).

5.3.2 Complexity of the Dictionary Based Recovery Algorithm

The pseudocode of the dictionary based recovery algorithm is presented in Algorithm 2. The running time of generating the dictionary with the K-means algorithm

Algorithm 1 TVAL3
1: procedure TVAL3(Θ, A, y)
2:   initialize w, f, ν, λ, k to zero
3:   while not converged do
4:     // fix w^(k), do gradient descent
5:     compute step lengths τ, α
6:     compute g^(k) = −Θᵀν − Aᵀλ + βΘᵀ(Θf^(k) − w^(k)) + µAᵀ(Af^(k) − y)
7:     determine f^(k+1) = f^(k) − αg^(k)
8:     given f^(k+1), compute w^(k+1) by shrinkage
9:     // update Lagrangian multipliers
10:    ν = ν − β(Θf^(k+1) − w^(k+1))
11:    λ = λ − µ(Af^(k+1) − y)
12:  end while
13: end procedure

is O(N^c log N), where c > 3/2 is a constant. The running time of the block matching and sparsity estimate operations is O(N). For the running time of TVAL3 we have O(KN^{3/2}). Therefore, the running time of the loop is dominated by O(N^c log N). Since in the implementation, rather than waiting for convergence, we terminate the loop after 4 iterations, the overall running time of the dictionary based recovery algorithm is approximately O(N^c log N).

Algorithm 2 Dictionary Based Recovery Algorithm
1: procedure DicBase(Θ, A, y)
2:   set f^(0) to initial estimate
3:   set k = 0
4:   while not converged do
5:     generate dictionary from f^(k)
6:     block matching for patches in f^(k)
7:     sparsity estimate
8:     TVAL3(Θ, A, y) with f^(k) as initial estimate
9:     set f^(k+1) to output of TVAL3
10:    k = k + 1
11:  end while
12: end procedure
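The "shrinkage" in line 8 of Algorithm 1 is the standard soft-thresholding operator, i.e., the proximal map of the ℓ₁ norm. The exact threshold used by TVAL3 also involves the multiplier ν and the penalty β, so the sketch below shows only the operator itself:

```python
import numpy as np

def shrinkage(z, t):
    # Componentwise soft-thresholding: the proximal map of t * ||.||_1.
    # Entries with |z_i| <= t are set to zero; the rest move toward zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

This O(N) operation is what keeps the per-iteration cost of the w-update linear in the signal length.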

Chapter 6

System Attributes

In this chapter, we discuss some of the attributes of the proposed sparsity-based HFV recovery algorithm. We first examine how the point-spread function (PSF) of the coded exposure cameras, along with the random perturbation of the camera placement on the mounting panel, can be used to eliminate the need for precise spatial registration of the cameras. Next, we explain how the recovery can be sped up with parallel processing, achieved by splitting the recovery into smaller sub-problems. Finally, we discuss how the proposed coded acquisition systems and the recovery algorithm can be modified to capture and recover color information.

6.1 HFV Recovery using Relative Displacement of Cameras

The proposed sparsity-based HFV recovery algorithm from random measurements has an important side benefit: the relative simplicity of camera assembly and calibration compared with other multi-camera systems such as the one in [1]. After the K cameras are compactly mounted, the relative displacements among these cameras

Figure 6.1: (a) Relative positioning of the cameras in the image plane. Black circles are the pixels of the HFV signal we want to reconstruct. Blue and orange circles are the pixels captured by different cameras. (b) Cylindrical view of a specific pixel of a camera in time.

can be measured by imaging a calibration pattern. But there is no need for precise spatial registration of all the K cameras. As illustrated in Fig. 6.1, once the relative positioning of the cameras in the image plane is determined and given the point spread function (PSF) of the cameras, each of the K·N_x·N_y random measurements of the HFV signal f can be viewed, via coded acquisition, as a random projection of the pixels in a cylindrical spatial-temporal neighborhood. The pixel value recorded by camera k at location (x, y) and time t is

g_t^k(x, y) = Σ_{(u,v)∈W} f_t(u, v) h(x − u, y − v)   (6.1)

where h(·,·) is the PSF of the camera with convolution window W. By forming a super vector g out of all the K·T·N_x·N_y pixels captured by the K cameras, the same

way as in f, we can write (6.1) in matrix form as

g = Hf   (6.2)

where H is the K·T·N_x·N_y × T·N_x·N_y convolution matrix that is determined by the relative positioning of the cameras and the camera PSF. Each row of H consists of the weights h(x − u, y − v) in (6.1) for a distinct k and (x, y). Given pixel location (x, y), camera k modulates the time signal

g_{x,y}^k = [g_1^k(x, y), g_2^k(x, y), ..., g_T^k(x, y)]ᵀ   (6.3)

by the binary coded exposure sequence b_k^{x,y}, and generates a random measurement of the f_{u,v}'s, (u, v) ∈ W,

y_k(x, y) = ⟨b_k^{x,y}, g_{x,y}^k⟩ + n_k.   (6.4)

Using the super vector g, we can also represent (6.4) in matrix form as

y = Bg + n   (6.5)

where B is the K·N_x·N_y × K·T·N_x·N_y binary matrix made of the K·N_x·N_y binary pseudo-

random sequences: each row of B contains a single binary exposure sequence b_k^{i,j}, placed in the T columns corresponding to the time series g_{i,j}^k, with 1 × T zero vectors 0_T everywhere else,

B =
[ b_1^{1,1}  0_T   ...   0_T        ]
[   ...                             ]
[ 0_T  ...  b_k^{i,j}  ...  0_T     ]
[   ...                             ]
[ 0_T   ...   0_T  b_K^{N_x,N_y}    ].   (6.6)

We can combine the operations (6.2) and (6.5) and express the K·N_x·N_y random measurements in the following matrix form

y = Bg + n = BHf + n = Af + n   (6.7)

where A is the K·N_x·N_y × T·N_x·N_y random measurement matrix, which can be determined once the relative positioning of the K cameras is measured via the calibration of the camera assembly. Experiments show that the above fractional pixel registration does not degrade the quality of the recovered HFV signals.

6.2 HFV Recovery with Parallel Processing

The ability of the proposed multi-camera system to acquire very high speed video with conventional cameras comes at the expense of the very high complexity of the HFV reconstruction algorithm. In this system aspect, the new HFV acquisition technique

seems, on the surface, similar to compressive sensing. But the former can be made computationally far more practical than the latter, thanks to a high degree of parallelism in the solution of

f* = arg min_{f̂} ‖Θf̂‖₁ subject to ‖Af̂ − y‖₂ ≤ σ.   (6.8)

Recalling the previous discussions and Fig. 6.1, a random measurement of the 3D signal f made by coded exposure is a linear combination of the pixels in a cylindrical spatial-temporal neighborhood. Therefore, unlike in compressive sensing, for which the signal f has to be recovered as a whole via ℓ₁ minimization, our inverse problem (6.8) can be broken into subproblems; a large number of W × H × T 3D sample blocks, W < N_x, H < N_y, can be processed in parallel if high speed recovery of HFV is required. Separately solving (6.8) for the W × H × T blocks does not compromise the quality of the recovered HFV as long as the 2D image signal is sparse in the W × H cross section of the block. On the contrary, this strategy can improve the quality of the recovered HFV if overlapping domain blocks are used when solving (6.8). If a pixel is covered by m such domain blocks, then the solutions of the multiple instances of the optimization, one per domain block, yield m estimates of the pixel. These m estimates can be fused to generate a more robust final estimate of the pixel.

6.3 Capture and Recovery of Color Information

The sensor arrays of digital cameras are color blind, i.e., they capture the total light intensity that strikes their surface without knowing the color of the light that actually hits them.

Figure 6.2: Capturing color information with separate sensor arrays.

To capture color information, most cameras use color filters to look at the light in its three primary colors: red, green and blue. The captured primary colors can then be used to create the full spectrum of visible light.

Digital cameras use different methods to capture the color information of image/video signals. One of the methods, used in high quality professional and usually expensive cameras, is to use a different sensor array for each primary color. In this method, depicted in Fig. 6.2, a beam splitter splits and directs the light to three sensor arrays. A color filter is placed in front of each sensor array which allows only one of the primary colors to reach the sensor array underneath. This way, all sensor arrays have the same view of the scene but each responds to only one of the primary colors. The advantage of this method is that it captures all three colors at each pixel location. However, the beam splitter reduces the light intensity reaching the color filters to approximately 1/3 of the input intensity, which results in a decrease in signal-to-noise ratio. To improve the quality of the captured signal, these cameras are usually equipped with expensive sensor arrays that can operate well at low light intensity.
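The 1/3 light loss translates into a modest but real SNR penalty. As a back-of-the-envelope sketch, assuming shot-noise-limited capture, where SNR grows as the square root of the collected signal (an assumption for illustration, not a claim from this thesis):

```python
import math

signal_ratio = 1.0 / 3.0                 # each sensor sees 1/3 of the light
snr_ratio = math.sqrt(signal_ratio)      # shot-noise-limited SNR scaling
snr_drop_db = -10.0 * math.log10(snr_ratio)
# roughly a 2.4 dB SNR loss per channel relative to an unsplit sensor
```

In practice read noise and sensor differences change the exact figure, which is why these cameras pair the splitter with high-sensitivity sensor arrays.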

Figure 6.3: Capturing color information with rotating color filters. The figure shows the capture of the red color of the scene. Two more consecutive captures follow to record the blue and green colors as well.

Another method, as depicted in Fig. 6.3, is to use a single sensor array and place rotating red, blue and green filters in front of it. As with the previous method, this one captures all three colors at each pixel location, but it does not suffer from light deficiency, as the input light is not split. However, the drawback is that, since the three red, blue and green images are not taken at the same time, the camera and the objects in the scene must remain motionless during the capture process.

A more economical and practical approach to capturing color information, which is widely used in consumer digital cameras, is to place a mosaic of tiny color filters, called a color filter array (CFA), over the sensor array. Each color filter of the CFA filters the light by wavelength range and only allows a specific color to reach its corresponding pixel of the sensor array. The advantage of this method is that it uses a single sensor array and all color information is recorded at the same time. However, unlike the other two methods, this method captures a single color at each pixel location. To fill in the missing color information, digital cameras use a demosaicking algorithm, designed for the CFA pattern, to convert the raw captured image/video to a full color signal.
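The CFA sampling described above can be sketched by sub-sampling a full-color image on an RGGB grid (the common Bayer layout); a real camera of course samples on the sensor itself, so this merely simulates the mosaicked raw output:

```python
import numpy as np

def bayer_mosaic(rgb):
    # RGGB pattern: red at (even, even), green at (even, odd) and
    # (odd, even), blue at (odd, odd) - half of all samples are green.
    mosaic = np.empty(rgb.shape[:2])
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B
    return mosaic
```

A demosaicking algorithm then interpolates the two missing colors at each location from the neighboring samples.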

Ph.D. Thesis - Reza Pournaghi

Figure 6.4: Capturing color information with a single sensor array using the Bayer pattern (panels: captured mosaicked image; demosaicked image).

The most commonly used CFA pattern is the Bayer filter (usually called the RGGB filter), which alternates between a row of red and green filters and a row of green and blue filters (see Fig. 6.4). The number of green filters is twice that of the red and blue filters. The reason is that human eyes are more sensitive to green; including more information for the green color therefore creates an image that will be perceived as true color by the human eye.

Red, green and blue filters are not the only filters used in CFAs. Other light filters include the cyan (C) filter, which passes green and blue; the yellow (Y) filter, which passes red and green; the magenta (M) filter, which passes red and blue; and the white (W) filter, a transparent filter that passes all colors. Cameras use different combinations of the above light filters in their CFAs, such as CYGM, RGBW and CYYM filters (see Fig. 6.5). The advantage of using color filters other than the primary color filters (red, green and blue) is that more of the light reaches the sensor array rather than being absorbed by the filter. This advantage, however, comes at the expense of color accuracy. This is because, when using CYM color filters, it is not possible to determine the individual intensities of the primary colors for each pixel. For example, assume that the light reaching an individual pixel has passed through a yellow filter. Although we know


Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

Signal Reconstruction from Sparse Representations: An Introdu. Sensing

Signal Reconstruction from Sparse Representations: An Introdu. Sensing Signal Reconstruction from Sparse Representations: An Introduction to Compressed Sensing December 18, 2009 Digital Data Acquisition Suppose we want to acquire some real world signal digitally. Applications

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Yuejie Chi Departments of ECE and BMI The Ohio State University September 24, 2015 Time, location, and office hours Time: Tue/Thu

More information

Compressive Sensing: Theory and Practice

Compressive Sensing: Theory and Practice Compressive Sensing: Theory and Practice Mark Davenport Rice University ECE Department Sensor Explosion Digital Revolution If we sample a signal at twice its highest frequency, then we can recover it exactly.

More information

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation , pp.162-167 http://dx.doi.org/10.14257/astl.2016.138.33 A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation Liqiang Hu, Chaofeng He Shijiazhuang Tiedao University,

More information

Structurally Random Matrices

Structurally Random Matrices Fast Compressive Sampling Using Structurally Random Matrices Presented by: Thong Do (thongdo@jhu.edu) The Johns Hopkins University A joint work with Prof. Trac Tran, The Johns Hopkins University it Dr.

More information

Compressive Sensing. A New Framework for Sparse Signal Acquisition and Processing. Richard Baraniuk. Rice University

Compressive Sensing. A New Framework for Sparse Signal Acquisition and Processing. Richard Baraniuk. Rice University Compressive Sensing A New Framework for Sparse Signal Acquisition and Processing Richard Baraniuk Rice University Better, Stronger, Faster Accelerating Data Deluge 1250 billion gigabytes generated in 2010

More information

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude A. Migukin *, V. atkovnik and J. Astola Department of Signal Processing, Tampere University of Technology,

More information

P257 Transform-domain Sparsity Regularization in Reconstruction of Channelized Facies

P257 Transform-domain Sparsity Regularization in Reconstruction of Channelized Facies P257 Transform-domain Sparsity Regularization in Reconstruction of Channelized Facies. azemi* (University of Alberta) & H.R. Siahkoohi (University of Tehran) SUMMARY Petrophysical reservoir properties,

More information

Compressive. Graphical Models. Volkan Cevher. Rice University ELEC 633 / STAT 631 Class

Compressive. Graphical Models. Volkan Cevher. Rice University ELEC 633 / STAT 631 Class Compressive Sensing and Graphical Models Volkan Cevher volkan@rice edu volkan@rice.edu Rice University ELEC 633 / STAT 631 Class http://www.ece.rice.edu/~vc3/elec633/ Digital Revolution Pressure is on

More information

Outline Introduction Problem Formulation Proposed Solution Applications Conclusion. Compressed Sensing. David L Donoho Presented by: Nitesh Shroff

Outline Introduction Problem Formulation Proposed Solution Applications Conclusion. Compressed Sensing. David L Donoho Presented by: Nitesh Shroff Compressed Sensing David L Donoho Presented by: Nitesh Shroff University of Maryland Outline 1 Introduction Compressed Sensing 2 Problem Formulation Sparse Signal Problem Statement 3 Proposed Solution

More information

Lecture 17 Sparse Convex Optimization

Lecture 17 Sparse Convex Optimization Lecture 17 Sparse Convex Optimization Compressed sensing A short introduction to Compressed Sensing An imaging perspective 10 Mega Pixels Scene Image compression Picture Why do we compress images? Introduction

More information

Non-Differentiable Image Manifolds

Non-Differentiable Image Manifolds The Multiscale Structure of Non-Differentiable Image Manifolds Michael Wakin Electrical l Engineering i Colorado School of Mines Joint work with Richard Baraniuk, Hyeokho Choi, David Donoho Models for

More information

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006,

EXAM SOLUTIONS. Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, School of Computer Science and Communication, KTH Danica Kragic EXAM SOLUTIONS Image Processing and Computer Vision Course 2D1421 Monday, 13 th of March 2006, 14.00 19.00 Grade table 0-25 U 26-35 3 36-45

More information

Combinatorial Selection and Least Absolute Shrinkage via The CLASH Operator

Combinatorial Selection and Least Absolute Shrinkage via The CLASH Operator Combinatorial Selection and Least Absolute Shrinkage via The CLASH Operator Volkan Cevher Laboratory for Information and Inference Systems LIONS / EPFL http://lions.epfl.ch & Idiap Research Institute joint

More information

Iterative CT Reconstruction Using Curvelet-Based Regularization

Iterative CT Reconstruction Using Curvelet-Based Regularization Iterative CT Reconstruction Using Curvelet-Based Regularization Haibo Wu 1,2, Andreas Maier 1, Joachim Hornegger 1,2 1 Pattern Recognition Lab (LME), Department of Computer Science, 2 Graduate School in

More information

Compressive Sensing of High-Dimensional Visual Signals. Aswin C Sankaranarayanan Rice University

Compressive Sensing of High-Dimensional Visual Signals. Aswin C Sankaranarayanan Rice University Compressive Sensing of High-Dimensional Visual Signals Aswin C Sankaranarayanan Rice University Interaction of light with objects Reflection Fog Volumetric scattering Human skin Sub-surface scattering

More information

Lecture 19. Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer

Lecture 19. Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer CS-621 Theory Gems November 21, 2012 Lecture 19 Lecturer: Aleksander Mądry Scribes: Chidambaram Annamalai and Carsten Moldenhauer 1 Introduction We continue our exploration of streaming algorithms. First,

More information

Randomized sampling strategies

Randomized sampling strategies Randomized sampling strategies Felix J. Herrmann SLIM Seismic Laboratory for Imaging and Modeling the University of British Columbia SLIM Drivers & Acquisition costs impediments Full-waveform inversion

More information

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Florin C. Ghesu 1, Thomas Köhler 1,2, Sven Haase 1, Joachim Hornegger 1,2 04.09.2014 1 Pattern

More information

Face Recognition via Sparse Representation

Face Recognition via Sparse Representation Face Recognition via Sparse Representation John Wright, Allen Y. Yang, Arvind, S. Shankar Sastry and Yi Ma IEEE Trans. PAMI, March 2008 Research About Face Face Detection Face Alignment Face Recognition

More information

Experiments with Edge Detection using One-dimensional Surface Fitting

Experiments with Edge Detection using One-dimensional Surface Fitting Experiments with Edge Detection using One-dimensional Surface Fitting Gabor Terei, Jorge Luis Nunes e Silva Brito The Ohio State University, Department of Geodetic Science and Surveying 1958 Neil Avenue,

More information

An Iteratively Reweighted Least Square Implementation for Face Recognition

An Iteratively Reweighted Least Square Implementation for Face Recognition Vol. 6: 26-32 THE UNIVERSITY OF CENTRAL FLORIDA Published May 15, 2012 An Iteratively Reweighted Least Square Implementation for Face Recognition By: Jie Liang Faculty Mentor: Dr. Xin Li ABSTRACT: We propose,

More information

Compressed Sensing for Rapid MR Imaging

Compressed Sensing for Rapid MR Imaging Compressed Sensing for Rapid Imaging Michael Lustig1, Juan Santos1, David Donoho2 and John Pauly1 1 Electrical Engineering Department, Stanford University 2 Statistics Department, Stanford University rapid

More information

Multi-frame blind deconvolution: Compact and multi-channel versions. Douglas A. Hope and Stuart M. Jefferies

Multi-frame blind deconvolution: Compact and multi-channel versions. Douglas A. Hope and Stuart M. Jefferies Multi-frame blind deconvolution: Compact and multi-channel versions Douglas A. Hope and Stuart M. Jefferies Institute for Astronomy, University of Hawaii, 34 Ohia Ku Street, Pualani, HI 96768, USA ABSTRACT

More information

Robust image recovery via total-variation minimization

Robust image recovery via total-variation minimization Robust image recovery via total-variation minimization Rachel Ward University of Texas at Austin (Joint work with Deanna Needell, Claremont McKenna College) February 16, 2012 2 Images are compressible

More information

Sparse Component Analysis (SCA) in Random-valued and Salt and Pepper Noise Removal

Sparse Component Analysis (SCA) in Random-valued and Salt and Pepper Noise Removal Sparse Component Analysis (SCA) in Random-valued and Salt and Pepper Noise Removal Hadi. Zayyani, Seyyedmajid. Valliollahzadeh Sharif University of Technology zayyani000@yahoo.com, valliollahzadeh@yahoo.com

More information

Deconvolution with curvelet-domain sparsity Vishal Kumar, EOS-UBC and Felix J. Herrmann, EOS-UBC

Deconvolution with curvelet-domain sparsity Vishal Kumar, EOS-UBC and Felix J. Herrmann, EOS-UBC Deconvolution with curvelet-domain sparsity Vishal Kumar, EOS-UBC and Felix J. Herrmann, EOS-UBC SUMMARY We use the recently introduced multiscale and multidirectional curvelet transform to exploit the

More information

COMPRESSIVE VIDEO SAMPLING

COMPRESSIVE VIDEO SAMPLING COMPRESSIVE VIDEO SAMPLING Vladimir Stanković and Lina Stanković Dept of Electronic and Electrical Engineering University of Strathclyde, Glasgow, UK phone: +44-141-548-2679 email: {vladimir,lina}.stankovic@eee.strath.ac.uk

More information

University of Alberta. Parallel Sampling and Reconstruction with Permutation in Multidimensional Compressed Sensing. Hao Fang

University of Alberta. Parallel Sampling and Reconstruction with Permutation in Multidimensional Compressed Sensing. Hao Fang University of Alberta Parallel Sampling and Reconstruction with Permutation in Multidimensional Compressed Sensing by Hao Fang A thesis submitted to the Faculty of Graduate Studies and Research in partial

More information

UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES

UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES UNLABELED SENSING: RECONSTRUCTION ALGORITHM AND THEORETICAL GUARANTEES Golnoosh Elhami, Adam Scholefield, Benjamín Béjar Haro and Martin Vetterli School of Computer and Communication Sciences École Polytechnique

More information

Local Image Registration: An Adaptive Filtering Framework

Local Image Registration: An Adaptive Filtering Framework Local Image Registration: An Adaptive Filtering Framework Gulcin Caner a,a.murattekalp a,b, Gaurav Sharma a and Wendi Heinzelman a a Electrical and Computer Engineering Dept.,University of Rochester, Rochester,

More information

2D and 3D Far-Field Radiation Patterns Reconstruction Based on Compressive Sensing

2D and 3D Far-Field Radiation Patterns Reconstruction Based on Compressive Sensing Progress In Electromagnetics Research M, Vol. 46, 47 56, 206 2D and 3D Far-Field Radiation Patterns Reconstruction Based on Compressive Sensing Berenice Verdin * and Patrick Debroux Abstract The measurement

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Photometric Stereo with Auto-Radiometric Calibration

Photometric Stereo with Auto-Radiometric Calibration Photometric Stereo with Auto-Radiometric Calibration Wiennat Mongkulmann Takahiro Okabe Yoichi Sato Institute of Industrial Science, The University of Tokyo {wiennat,takahiro,ysato} @iis.u-tokyo.ac.jp

More information

Recovery of Piecewise Smooth Images from Few Fourier Samples

Recovery of Piecewise Smooth Images from Few Fourier Samples Recovery of Piecewise Smooth Images from Few Fourier Samples Greg Ongie*, Mathews Jacob Computational Biomedical Imaging Group (CBIG) University of Iowa SampTA 2015 Washington, D.C. 1. Introduction 2.

More information

Image Sampling and Quantisation

Image Sampling and Quantisation Image Sampling and Quantisation Introduction to Signal and Image Processing Prof. Dr. Philippe Cattin MIAC, University of Basel 1 of 46 22.02.2016 09:17 Contents Contents 1 Motivation 2 Sampling Introduction

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Image Sampling & Quantisation

Image Sampling & Quantisation Image Sampling & Quantisation Biomedical Image Analysis Prof. Dr. Philippe Cattin MIAC, University of Basel Contents 1 Motivation 2 Sampling Introduction and Motivation Sampling Example Quantisation Example

More information

Compressive Sensing for High-Dimensional Data

Compressive Sensing for High-Dimensional Data Compressive Sensing for High-Dimensional Data Richard Baraniuk Rice University dsp.rice.edu/cs DIMACS Workshop on Recent Advances in Mathematics and Information Sciences for Analysis and Understanding

More information

Adaptive compressed image sensing based on wavelet-trees

Adaptive compressed image sensing based on wavelet-trees Adaptive compressed image sensing based on wavelet-trees S. Dekel GE Healthcare, 27 Hamaskit St., Herzelia 46733, Israel Abstract: We present an architecture for an image acquisition process that enables

More information

Sparse wavelet expansions for seismic tomography: Methods and algorithms

Sparse wavelet expansions for seismic tomography: Methods and algorithms Sparse wavelet expansions for seismic tomography: Methods and algorithms Ignace Loris Université Libre de Bruxelles International symposium on geophysical imaging with localized waves 24 28 July 2011 (Joint

More information

( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components

( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components Review Lecture 14 ! PRINCIPAL COMPONENT ANALYSIS Eigenvectors of the covariance matrix are the principal components 1. =cov X Top K principal components are the eigenvectors with K largest eigenvalues

More information

Short-course Compressive Sensing of Videos

Short-course Compressive Sensing of Videos Short-course Compressive Sensing of Videos Venue CVPR 2012, Providence, RI, USA June 16, 2012 Organizers: Richard G. Baraniuk Mohit Gupta Aswin C. Sankaranarayanan Ashok Veeraraghavan Part 2: Compressive

More information

Image reconstruction based on back propagation learning in Compressed Sensing theory

Image reconstruction based on back propagation learning in Compressed Sensing theory Image reconstruction based on back propagation learning in Compressed Sensing theory Gaoang Wang Project for ECE 539 Fall 2013 Abstract Over the past few years, a new framework known as compressive sampling

More information

Computer Graphics. Sampling Theory & Anti-Aliasing. Philipp Slusallek

Computer Graphics. Sampling Theory & Anti-Aliasing. Philipp Slusallek Computer Graphics Sampling Theory & Anti-Aliasing Philipp Slusallek Dirac Comb (1) Constant & δ-function flash Comb/Shah function 2 Dirac Comb (2) Constant & δ-function Duality f(x) = K F(ω) = K (ω) And

More information

WATERMARKING FOR LIGHT FIELD RENDERING 1

WATERMARKING FOR LIGHT FIELD RENDERING 1 ATERMARKING FOR LIGHT FIELD RENDERING 1 Alper Koz, Cevahir Çığla and A. Aydın Alatan Department of Electrical and Electronics Engineering, METU Balgat, 06531, Ankara, TURKEY. e-mail: koz@metu.edu.tr, cevahir@eee.metu.edu.tr,

More information

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari Laboratory for Advanced Brain Signal Processing Laboratory for Mathematical

More information

5LSH0 Advanced Topics Video & Analysis

5LSH0 Advanced Topics Video & Analysis 1 Multiview 3D video / Outline 2 Advanced Topics Multimedia Video (5LSH0), Module 02 3D Geometry, 3D Multiview Video Coding & Rendering Peter H.N. de With, Sveta Zinger & Y. Morvan ( p.h.n.de.with@tue.nl

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Tomographic reconstruction of shock layer flows

Tomographic reconstruction of shock layer flows Tomographic reconstruction of shock layer flows A thesis submitted for the degree of Doctor of Philosophy at The Australian National University May 2005 Rado Faletič Department of Physics Faculty of Science

More information

Capturing, Modeling, Rendering 3D Structures

Capturing, Modeling, Rendering 3D Structures Computer Vision Approach Capturing, Modeling, Rendering 3D Structures Calculate pixel correspondences and extract geometry Not robust Difficult to acquire illumination effects, e.g. specular highlights

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks

Multiple-View Object Recognition in Band-Limited Distributed Camera Networks in Band-Limited Distributed Camera Networks Allen Y. Yang, Subhransu Maji, Mario Christoudas, Kirak Hong, Posu Yan Trevor Darrell, Jitendra Malik, and Shankar Sastry Fusion, 2009 Classical Object Recognition

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

Optimal Segmentation and Understanding of Motion Capture Data

Optimal Segmentation and Understanding of Motion Capture Data Optimal Segmentation and Understanding of Motion Capture Data Xiang Huang, M.A.Sc Candidate Department of Electrical and Computer Engineering McMaster University Supervisor: Dr. Xiaolin Wu 7 Apr, 2005

More information

Coarse-to-fine image registration

Coarse-to-fine image registration Today we will look at a few important topics in scale space in computer vision, in particular, coarseto-fine approaches, and the SIFT feature descriptor. I will present only the main ideas here to give

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Image Restoration and Background Separation Using Sparse Representation Framework

Image Restoration and Background Separation Using Sparse Representation Framework Image Restoration and Background Separation Using Sparse Representation Framework Liu, Shikun Abstract In this paper, we introduce patch-based PCA denoising and k-svd dictionary learning method for the

More information

An Approach for Reduction of Rain Streaks from a Single Image

An Approach for Reduction of Rain Streaks from a Single Image An Approach for Reduction of Rain Streaks from a Single Image Vijayakumar Majjagi 1, Netravati U M 2 1 4 th Semester, M. Tech, Digital Electronics, Department of Electronics and Communication G M Institute

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003 CS 664 Slides #11 Image Segmentation Prof. Dan Huttenlocher Fall 2003 Image Segmentation Find regions of image that are coherent Dual of edge detection Regions vs. boundaries Related to clustering problems

More information

Synthetic Aperture Imaging Using a Randomly Steered Spotlight

Synthetic Aperture Imaging Using a Randomly Steered Spotlight MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Synthetic Aperture Imaging Using a Randomly Steered Spotlight Liu, D.; Boufounos, P.T. TR013-070 July 013 Abstract In this paper, we develop

More information

Super-Resolution from Image Sequences A Review

Super-Resolution from Image Sequences A Review Super-Resolution from Image Sequences A Review Sean Borman, Robert L. Stevenson Department of Electrical Engineering University of Notre Dame 1 Introduction Seminal work by Tsai and Huang 1984 More information

More information

Modified Iterative Method for Recovery of Sparse Multiple Measurement Problems

Modified Iterative Method for Recovery of Sparse Multiple Measurement Problems Journal of Electrical Engineering 6 (2018) 124-128 doi: 10.17265/2328-2223/2018.02.009 D DAVID PUBLISHING Modified Iterative Method for Recovery of Sparse Multiple Measurement Problems Sina Mortazavi and

More information

ITERATIVE COLLISION RESOLUTION IN WIRELESS NETWORKS

ITERATIVE COLLISION RESOLUTION IN WIRELESS NETWORKS ITERATIVE COLLISION RESOLUTION IN WIRELESS NETWORKS An Undergraduate Research Scholars Thesis by KATHERINE CHRISTINE STUCKMAN Submitted to Honors and Undergraduate Research Texas A&M University in partial

More information

Lecture 2 September 3

Lecture 2 September 3 EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

More information

SEQUENTIAL IMAGE COMPLETION FOR HIGH-SPEED LARGE-PIXEL NUMBER SENSING

SEQUENTIAL IMAGE COMPLETION FOR HIGH-SPEED LARGE-PIXEL NUMBER SENSING SEQUENTIAL IMAGE COMPLETION FOR HIGH-SPEED LARGE-PIXEL NUMBER SENSING Akira Hirabayashi Naoki Nogami Takashi Ijiri Laurent Condat Ritsumeikan University College Info. Science & Eng. Kusatsu, Shiga 525-8577,

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Ultrasonic Multi-Skip Tomography for Pipe Inspection

Ultrasonic Multi-Skip Tomography for Pipe Inspection 18 th World Conference on Non destructive Testing, 16-2 April 212, Durban, South Africa Ultrasonic Multi-Skip Tomography for Pipe Inspection Arno VOLKER 1, Rik VOS 1 Alan HUNTER 1 1 TNO, Stieltjesweg 1,

More information

Tomographic reconstruction: the challenge of dark information. S. Roux

Tomographic reconstruction: the challenge of dark information. S. Roux Tomographic reconstruction: the challenge of dark information S. Roux Meeting on Tomography and Applications, Politecnico di Milano, 20-22 April, 2015 Tomography A mature technique, providing an outstanding

More information

DIGITAL watermarking technology is emerging as a

DIGITAL watermarking technology is emerging as a 126 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 2, FEBRUARY 2004 Analysis and Design of Watermarking Algorithms for Improved Resistance to Compression Chuhong Fei, Deepa Kundur, Senior Member,

More information

Blind Compressed Sensing Using Sparsifying Transforms

Blind Compressed Sensing Using Sparsifying Transforms Blind Compressed Sensing Using Sparsifying Transforms Saiprasad Ravishankar and Yoram Bresler Department of Electrical and Computer Engineering and Coordinated Science Laboratory University of Illinois

More information

Compressed Sensing and Applications by using Dictionaries in Image Processing

Compressed Sensing and Applications by using Dictionaries in Image Processing Advances in Computational Sciences and Technology ISSN 0973-6107 Volume 10, Number 2 (2017) pp. 165-170 Research India Publications http://www.ripublication.com Compressed Sensing and Applications by using

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Single and Multi-view Video Super-resolution

Single and Multi-view Video Super-resolution Single and Multi-view Video Super-resolution SINGLE AND MULTI-VIEW VIDEO SUPER-RESOLUTION BY SEYEDREZA NAJAFI, M.Sc. a thesis submitted to the department of electrical & computer engineering and the school

More information

Motion Estimation and Optical Flow Tracking
