EFFICIENT H.264 VIDEO CODING WITH A WORKING MEMORY OF OBJECTS


EFFICIENT H.264 VIDEO CODING WITH A WORKING MEMORY OF OBJECTS

A Thesis presented to the Faculty of the Graduate School at the University of Missouri-Columbia

In Partial Fulfillment of the Requirements for the Degree Master of Science

by WENQING DAI

Prof. Zhihai (Henry) He, Thesis Supervisor

December 2009

The undersigned, appointed by the dean of the Graduate School, have examined the thesis entitled EFFICIENT H.264 VIDEO CODING WITH A WORKING MEMORY OF OBJECTS presented by Wenqing Dai, a candidate for the degree of Master of Science, and hereby certify that, in their opinion, it is worthy of acceptance.

Professor Zhihai (Henry) He
Professor Marjorie Skubic
Professor Ye Duan

To my parents, Jinfeng Dai and Li Zhang, for their everlasting love and support.

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my gratitude to my advisor, Professor Zhihai He. I thank him for offering me the precious opportunity to become a member of the video research lab. Without his excellent guidance, great patience, and consistent support, I would not have been able to complete my research. I am also extremely grateful to Professor Marjorie Skubic for her excellent guidance on my research and project, and for her support through all my course work. I thank Professor Ye Duan for his helpful suggestions during the review of this thesis. I would also like to thank Professor Tina Smilkstein for offering me teaching experience during my Master's study. I would like to thank Ms. Shirley Holdmeier for her assistance and great patience with my application. I thank the members of the Video Processing and Communication Lab: Xi Chen, York Chung, Jay Eggert, Xiwen Zhao, Zhongna Zhou, Xin Li, and visiting students Yongfei Zhang and Yunsheng Zhang. Their great support and help made my life and work at MU a precious experience. I thank all my friends who cheered me up when I was down, offered me great help, and shared great times with me. I especially express my most sincere thanks to Jia Yao for her great support and care, which made me never give up on my dream. Finally, I would like to give my deepest gratitude to my parents, Jinfeng Dai and Li Zhang, who from the day I was born have given me their everlasting love and support. I thank my entire family, on the other side of the earth, for their never-ending love and support.

ABSTRACT

Efficient spatiotemporal prediction to remove source redundancy is critical in video coding. The newest international standard, H.264 video coding, introduces several advanced features, such as multiple-frame motion prediction and spatial intra prediction [1], which significantly improve the overall coding efficiency. In this work, we focus on efficient H.264 video coding for video monitoring and surveillance. The video camera, mostly stationary, watches the surveillance scene continuously and compresses the video streams, which are then transmitted to a remote end for information analysis or archived in a storage device. In these video monitoring and surveillance scenarios, the video frame rate is often set relatively low, and the activities of persons in the scene often exhibit strong patterns which may repeat at different spatiotemporal scales. In this work, we aim to develop efficient methods to exploit this type of long-term source correlation to improve the overall video compression efficiency. We propose a working memory approach for efficient temporal prediction in H.264 video coding. After video frames are encoded, objects are extracted, analyzed, and indexed in a dynamic database which acts as a working memory for the H.264 video encoder. At the same time, silhouettes are evaluated under different compression configurations and compared with ground truth. During the encoding process, objects with similar spatial characteristics are retrieved from the working memory and used for motion prediction of objects in the current video frame. This approach extends multiple-frame motion estimation and provides a more generic framework for spatiotemporal prediction of video data. Our experimental results on indoor activity monitoring video data demonstrate that the proposed approach can reduce the coding bit rate by up to 35% with a small computational overhead.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ...
ABSTRACT ...
LIST OF TABLES ...
LIST OF FIGURES ...

Chapter
1. Introduction ... 1
   1.1. Overview ... 1
   1.2. Motivation ... 3
   1.3. Major Contributions ... 8
   1.4. Thesis Organization ... 9
2. Background and Related Works ... 11
   2.1. Background ... 11
   2.2. H.264 Video Coding ... 12
   2.3. H.264 Multiple Reference Frame Motion Prediction ... 13
   2.4. Image Retrieval ... 15
3. Effects on Silhouettes Introduced by Video Compression ... 20
   3.1. Overview ... 20
   3.2. Silhouette Extraction ... 21
   3.3. Silhouette Extraction in Working Memory ... 26
   3.4. Comparisons of Silhouettes with Different Video Compression Configurations ... 27
4. Feature-Based Fast and Accurate Object Retrieval ... 35
   4.1. Overview ... 35
   4.2. Feature-Based Object Retrieval ... 36
   4.3. Results and Analysis ... 40
5. Working Memory Management ... 51
   5.1. Overview ... 51
   5.2. H.264 Video Coding with Object Retrieval and Matching ... 53
   5.3. Results and Analysis ... 55
6. Conclusion and Future Work ... 56
   6.1. Conclusion ... 56
   6.2. Future Work ... 56
REFERENCES ... 58

LIST OF TABLES

Table                                                                  Page
1.1 The minimum SAD comparison between best matches and the previous frame as motion prediction reference ... 4
4.1 SAD comparison ... 42
4.2 Bit rate saving in H.264 video coding ... 42
4.3 Bit rate comparison in H.264 video coding ... 42

LIST OF FIGURES

Figure                                                                 Page
1.1 Overview of the proposed approach ... 2
1.2 Example video frames 306-309 from Sequence 2 and their best match in previously reconstructed video frames ... 4
1.3 Average residual SAD comparison on Video_1 with H.264 motion prediction and optimum search ... 6
1.4 Average residual SAD comparison on Video_2 with H.264 motion prediction and optimum search ... 7
1.5 Average residual SAD comparison on Video_3 with H.264 motion prediction and optimum search ... 8
2.1 Overview of multiple reference frame motion prediction ... 13
2.2 Overview of image retrieval ... 15
2.3 Framework of CBIR system ... 17
3.1 Illustration of brightness and chromaticity distortion ... 23
3.2 Silhouette extraction and human detection results for Video 1 ... 25
3.3 Silhouette extraction and human detection results for Video 2 ... 25
3.4 Silhouette extraction and human detection results for Video 3 ... 26
3.5 Test sequence 1 Frame 80 (original image) and its silhouette (ground truth) ... 29
3.6 Test sequence 1 Frame 80 decoded image and its silhouette with H.264 QP=24 ... 29
3.7 Test sequence 1 Frame 80 decoded image and its silhouette with H.264 QP=42 ... 29
3.8 Test sequence 1 Frame 80 decoded image and its silhouette with H.264 QP=51 ... 30
3.9 Test sequence 1 Frame 80 decoded image and its silhouette with H.264 QP=33 without deblocking filter ... 30
3.10 Test sequence 1 Frame 80 decoded image and its silhouette with H.264 QP=42 without deblocking filter ... 30
3.11 Test sequence 1 Frame 80 decoded image and its silhouette with H.264 QP=51 without deblocking filter ... 31
3.12 Test sequence 3 Frame 111 decoded image and its silhouette with H.264 QP=33 without deblocking filter ... 31
3.13 Test sequence 3 Frame 111 decoded image and its silhouette with H.264 QP=42 without deblocking filter ... 31
3.14 Test sequence 3 Frame 111 decoded image and its silhouette with H.264 QP=51 without deblocking filter ... 32
3.15 Error rate vs. PSNR comparison between H.264 and H.264 without deblocking filter for Sequence 1 ... 33
3.16 Error rate vs. PSNR comparison between H.264 and H.264 without deblocking filter for Sequence 2 ... 33
3.17 Error rate vs. PSNR comparison between H.264 and H.264 without deblocking filter for Sequence 3 ... 34
4.1 Body centroid and dimensions, and histogram of dimensions ... 37
4.2 (a) Silhouette image from Sequence 2. (b) Distribution of vertical dimensions of image (a). (c) Distribution of horizontal dimensions of image (a) ... 37
4.3 Best matching objects for video frames 266-260 of test video 1 ... 39
4.4 Best matching objects for video frames 254-257 of test video 2 ... 39
4.5 Best matching objects for video frames 337-340 of test video 3 ... 40
4.6 Average residual SAD comparison on Video_1 with H.264 motion prediction, optimum search, and the proposed algorithm ... 43
4.7 Average residual SAD comparison on Video_2 with H.264 motion prediction, optimum search, and the proposed algorithm ... 44
4.8 Average residual SAD comparison on Video_3 with H.264 motion prediction, optimum search, and the proposed algorithm ... 45
4.9 H.264 encoding bit rate comparison on Video_1 with H.264 motion prediction, optimum search, and the proposed algorithm ... 46
4.10 H.264 encoding bit rate comparison on Video_2 with H.264 motion prediction, optimum search, and the proposed algorithm ... 47
4.11 H.264 encoding bit rate comparison on Video_3 with H.264 motion prediction, optimum search, and the proposed algorithm ... 48
4.12 Rate-distortion performance comparison with conventional H.264 video coding on Video_1 ... 49
4.13 Rate-distortion performance comparison with conventional H.264 video coding on Video_2 ... 49
4.14 Rate-distortion performance comparison with conventional H.264 video coding on Video_3 ... 50
5.1 Illustration of working memory ... 52
5.2 Overview of H.264 video encoding with working memory prediction ... 54
5.3 Experimental results with different sizes of working memory ... 55

Chapter 1
Introduction

1.1 Overview

Efficient spatiotemporal prediction to remove source redundancy is the key component in video coding. H.264 video coding introduces several advanced features, such as multiple-frame motion prediction and spatial intra prediction [1], which significantly improve the overall coding efficiency. In this work, we propose a working memory approach to exploit long-term source redundancy for efficient H.264 video coding of activity monitoring and surveillance videos. We assume that the video at each moment contains only a few persons in the surveillance scene and that these persons may appear repeatedly in the scene at different spatiotemporal scales. This assumption holds in most activity monitoring and surveillance videos. The proposed approach builds upon object detection, content analysis, and image retrieval. More specifically, as shown in Figure 1.1, after a video

frame is encoded and reconstructed at the encoder side, we detect and extract persons from the frame using the adaptive background modeling and silhouette extraction techniques developed in our previous work [2]. We extract shape and texture features to describe each object, then index the object and its features in a database, called a working memory. During H.264 video encoding, we extract features from the input frame and use these features to retrieve objects from the working memory. We expect these retrieved objects to have the highest similarity to objects in the current frame. We then use these objects to construct a reference picture for motion prediction of the input frame. The motion-compensated residual picture is then encoded with conventional H.264 video coding. We also develop a memory management module to determine which subset of objects should be maintained in the working memory. Our experimental results demonstrate that the proposed method can reduce the coding bit rate by up to 35% with a small computational overhead.

Figure 1.1: Overview of the proposed approach.
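The control flow of Figure 1.1 can be summarized in a short sketch. This is an illustration of the working memory concept only, not the thesis's encoder implementation: the `WorkingMemory` class, its feature-distance retrieval, and the oldest-first eviction policy are all simplifying assumptions (the actual memory management is described in Chapter 5).

```python
import math

class WorkingMemory:
    """Simplified sketch of the working memory: objects extracted from
    reconstructed frames are indexed by a feature vector, and retrieval
    returns the stored object whose features are closest to the query.
    The eviction policy here (drop the oldest entry) is a placeholder,
    not the management scheme developed in the thesis."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self.entries = []  # list of (feature_vector, object_data) pairs

    def index(self, features, obj):
        """Add an extracted object and its features to the memory."""
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # evict the oldest entry (placeholder)
        self.entries.append((features, obj))

    def retrieve(self, query):
        """Return the stored object with the smallest feature distance."""
        if not self.entries:
            return None
        return min(self.entries,
                   key=lambda e: math.dist(e[0], query))[1]
```

In the full pipeline, the retrieved objects would be pasted into a reference picture for motion prediction, and the residual encoded by the conventional H.264 path.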

1.2 Motivation

In this work, we attempt to develop efficient methods to exploit this type of long-term source correlation to improve the overall video compression efficiency. Multiple-frame motion estimation has been introduced in H.264 video coding to exploit repeated motion in videos [3]. However, its performance degrades significantly at lower frame rates: the immediately preceding frames may not be the best prediction references for the current video frame. For example, Figure 1.2 shows an example of an in-home activity monitoring video at a frame rate of 2 frames per second (fps). The top row shows four consecutive video frames, Frames 306, 307, 308, and 309. For each of these four frames, denoted by F_n with n = 306, 307, 308, 309, we use the following brute-force approach to find its best match among the previously reconstructed frames. Let {F̂_1, F̂_2, ..., F̂_{n-1}} be the set of previously reconstructed frames. We use each reconstructed frame F̂_k as a reference to perform motion prediction of frame F_n, and let S_k be the total SAD (sum of absolute differences) of the corresponding residual picture after motion compensation. The reconstructed frame which yields the minimum S_k is considered the best match to frame F_n. The bottom row of Figure 1.2 shows the best matches for Frames 306, 307, 308, and 309, which are reconstructed frames 244, 82, 83, and 222, respectively. The corresponding minimum SAD values are shown in the second column of Table 1.1. The third column shows the minimum SAD if we use the immediately preceding frame as the reference for motion prediction. We can see that, if we use the best match from the past history, the motion prediction residual can be significantly reduced, which costs far fewer bits during compression. This suggests that it is possible to significantly improve H.264 video coding efficiency by exploiting this type of long-term source redundancy for

activity monitoring and surveillance videos.

Figure 1.2: Example video frames 306-309 from Sequence 2 and their best match in previously reconstructed video frames.

Table 1.1: The minimum SAD comparison between best matches and the previous frame as motion prediction reference.

Frame   Minimum SAD (Best Match)   Minimum SAD (Previous Frame)
306     5.157                      13.235
307     6.882                      15.091
308     6.372                      18.308
309     8.013                      21.308
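The brute-force best-match search described above can be sketched as follows. This is a minimal illustration: it compares whole frames directly, whereas the thesis computes the SAD after block-wise motion compensation, so the numbers it produces are not comparable to Table 1.1.

```python
def sad(frame_a, frame_b):
    """Sum of absolute differences between two equally sized frames,
    represented here as flat lists of pixel values."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))

def best_match(current, reconstructed):
    """Exhaustively scan every previously reconstructed frame and
    return (index, sad) of the one minimizing the residual SAD.
    A real implementation would run motion compensation against each
    candidate before measuring the residual; this sketch omits it."""
    scores = [sad(current, ref) for ref in reconstructed]
    k = min(range(len(scores)), key=scores.__getitem__)
    return k, scores[k]
```

The cost of this exhaustive scan grows linearly with the number of reconstructed frames, which is exactly the complexity problem the rest of the thesis addresses.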

To verify our observation and assumption, we compare the SAD (sum of absolute differences) of the motion-compensated difference picture after motion prediction. Figures 1.3 to 1.5 show the SAD of each frame obtained by these methods for Video_1, Video_2, and Video_3, respectively. (To show the results clearly, we split each figure into two parts, each showing one half of the video frames.) We can see that the SAD obtained by the optimum prediction is much smaller than that of the conventional H.264 motion prediction.

Figure 1.3: Average residual SAD comparison on Video_1 with H.264 motion prediction and optimum search.

Figure 1.4: Average residual SAD comparison on Video_2 with H.264 motion prediction and optimum search.

Figure 1.5: Average residual SAD comparison on Video_3 with H.264 motion prediction and optimum search.

Now, the challenge is how to find the best match for each frame to be encoded. As we know, motion estimation is computationally intensive, especially with multiple reference frames as in H.264 [3-5]. In our case, we would need to search all previous frames to find the best match, and the computational complexity becomes prohibitive as more and more frames are encoded and reconstructed as motion prediction references. Additionally, how to select the frames to be stored in the working memory is another challenge in this research.

1.3 Major Contributions

The main contributions of this thesis are as follows:

Analysis of the impact of video compression on silhouette extraction.

Development of an efficient feature-based object retrieval and matching algorithm. We demonstrate that by using silhouette extraction, centroid, histogram correlation, and histogram of dimensions, we can find the best matching frame without performing a brute-force full search, achieving better coding efficiency than conventional H.264.

Development of a working memory design applied to the H.264 encoder. With a limited buffer size, the results show that higher coding efficiency can be achieved compared to H.264 encoding, with a small overhead.

1.4 Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 reviews the background and practical applications of this work. Advanced features related to H.264 coding are summarized. After the background review, some typical algorithms for improving multiple reference frame coding efficiency in H.264 are introduced. Some image retrieval techniques used for object matching are explained as well. Chapter 3 explains how video compression has a significant impact on silhouette extraction. We then explain how silhouette extraction is used in our H.264 working

memory approach. Finally, a set of results is provided and analyzed. Chapter 4 presents the main techniques used in this research for object retrieval and matching. The algorithm, based on centroid, histogram correlation, and histogram of dimension techniques, is formally introduced, and extensive experimental results and comparisons are provided. Chapter 5 reviews H.264 encoding and introduces the working memory management design. The working memory is applied to the H.264 encoder to improve coding efficiency, and the integrated design is further explained. Comparisons among different working memory sizes are discussed as well. Chapter 6 summarizes the studies presented in this thesis. Conclusions are provided, and future work and directions are discussed.
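As a preview of the features that Chapter 4 builds on (centroid, histogram of dimensions, and histogram correlation), they can be sketched as follows. This is a hypothetical simplification: the per-row and per-column foreground counts stand in for the "histogram of dimensions" feature, and the actual feature definitions and thresholds appear in Chapter 4.

```python
def centroid(mask):
    """Centroid (row, col) of a binary silhouette given as a 2D list."""
    pts = [(r, c) for r, row in enumerate(mask)
           for c, v in enumerate(row) if v]
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def dimension_histograms(mask):
    """Per-row (vertical) and per-column (horizontal) foreground counts,
    a simplified stand-in for the histogram-of-dimensions feature."""
    rows = [sum(row) for row in mask]
    cols = [sum(col) for col in zip(*mask)]
    return rows, cols

def correlation(h1, h2):
    """Pearson correlation between two equal-length histograms; values
    near 1.0 indicate similarly shaped silhouettes."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    d1 = sum((a - m1) ** 2 for a in h1) ** 0.5
    d2 = sum((b - m2) ** 2 for b in h2) ** 0.5
    return num / (d1 * d2) if d1 and d2 else 0.0
```

Because these features are cheap to compute and compare, candidate objects can be ranked without running motion estimation against every stored frame.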

Chapter 2
Background and Related Work

2.1 Background

In this work, we focus on efficient H.264 video coding for video monitoring and surveillance. The video camera, mostly stationary, watches the surveillance scene continuously and compresses the video streams, which are then transmitted to a remote end for information analysis or archived in a storage device. For example, in our ongoing research project on eldercare [2], we deploy a video camera to monitor elderly (often aged over 85) people's activities at home continuously for automated functional assessment and safety enhancement, such as detecting falls and abnormal situations which might indicate changes in health conditions. In these types of video monitoring and surveillance scenarios, the video frame rate is often set very low, such as 2-5 frames per second, which is sufficient for human tracking, activity analysis, and scene understanding. The activities of persons in the scene often exhibit strong patterns which might repeat at different spatiotemporal scales. For example, every morning about

6:30 am, the person gets up and walks to the bathroom. Ten minutes later, he walks out of the bathroom towards the kitchen for breakfast. During the morning, the person is often more active: walking around the home, doing exercises, cleaning, preparing meals, etc. Therefore, our goal is to develop efficient methods to exploit this type of long-term source correlation to improve the overall video compression efficiency.

2.2 H.264 Video Coding

The key to efficient video compression is efficient spatiotemporal prediction to remove data redundancy in the spatiotemporal domain. The H.264 video coding standard introduces several advanced features, such as multiple-frame motion prediction, variable block size motion compensation, and sub-pixel motion estimation [1], which have significantly improved coding efficiency. One of our central tasks in this work is to find the best reference frame for the current frame at a low computational complexity; the best reference frame is selected from the previous N decoded frames. To further reduce spatiotemporal redundancy, H.264 uses short-term and long-term reference frames for more accurate motion prediction. In H.264, pictures that are encoded or decoded and available for reference are stored in the Decoded Picture Buffer (DPB). All available reference frames are marked as either short-term or long-term reference pictures. Short-term reference pictures are removed from the DPB by an explicit command or when the DPB is full. A frame marked as a long-term reference is removed only by an explicit command, which means long-term reference pictures can serve as reference frames that are not restricted to the

small search window. Notably, an innovation in H.264/AVC allows the motion-compensated prediction signal to be weighted and offset by amounts specified by the encoder [1], which can dramatically improve coding efficiency for scenes with gradual transitions such as fades. In prior MPEG standards, a P frame uses a single reference picture and the prediction is not scaled, while a B frame uses two reference pictures and forms the prediction by averaging them with equal weights. In contrast, H.264 associates the weighting factor with the reference picture index, which is efficient for multiple reference frame management. In H.264 explicit mode, a weighting factor and offset may be coded in the slice header for each allowable reference picture index; in H.264 implicit mode, the weighting factors are derived from the relative picture order count (POC) distances of the two reference pictures [6].

2.3 H.264 Multiple Reference Frame Motion Estimation

Figure 2.1: Overview of multiple reference frame motion prediction

H.264 uses multiple reference frames to achieve better prediction in many conditions.

Figure 2.1 gives an overview of H.264 multiple reference frame motion estimation. However, multi-frame motion prediction dramatically increases the computational complexity of the encoder [7]. Several methods have been proposed to reduce this complexity. For example, the unsymmetrical-cross multi-hexagon-grid search (UMHexagonS) has been proposed in [7]. Su and Sun [3], Hsiao et al. [8], and Duanmu et al. [9] attempted to reduce the complexity using continuous tracking techniques to provide a good starting point for motion search; methods based on tracking will likely fail in the case of occlusions. Wiegand et al. [10] introduced a new motion search order based on the triangle inequality for long-term motion prediction. An adaptive motion search scheme with early termination and zero-block detection has been developed in [11]. Huang et al. [12] utilized information from the previous frame to determine whether motion search on the remaining reference frames is needed. [13] examined available MV and SAD information to terminate the multi-frame motion estimation procedure early. Wang et al. [14] exploited the spatial correlation between neighboring blocks to choose the best reference frame for the current block. Sohn and Kim [15] determined the number of reference frames using the correlation between the block of the current frame and that of the previous frame. Kapotas and Skodras [16] considered the Lagrangian cost for reference frame selection. We can see that existing methods have been exploiting the source correlation between neighboring frames for fast and efficient motion prediction. Because of the dramatic increase in computational complexity of multi-frame motion prediction, typically up to 5 frames are used for motion prediction in practical H.264 video encoding [7]. This small window of reference frames limits our capability to exploit long-term source correlation in video data, especially in surveillance videos.
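A structural idea shared by several of these fast methods is to stop scanning further reference frames once a good enough match has been found. A minimal sketch of such early termination follows; the threshold, the nearest-frame-first scan order, and the function names are illustrative, not taken from any of the cited schemes.

```python
def multi_ref_search(block, candidates, good_enough=8):
    """Scan candidate reference blocks (nearest frame first) and stop
    early once the SAD falls below a threshold. Each entry in
    `candidates` stands in for the best candidate block from one
    reference frame; a real encoder would run a full motion search
    within each frame to obtain it."""
    best_idx, best_sad = None, float("inf")
    for i, ref in enumerate(candidates):
        s = sum(abs(a - b) for a, b in zip(block, ref))
        if s < best_sad:
            best_idx, best_sad = i, s
        if best_sad <= good_enough:
            break  # early termination: remaining frames are skipped
    return best_idx, best_sad
```

The threshold trades a small loss in prediction quality for skipping most of the per-frame motion searches, which is why schemes of this family scale better than exhaustive multi-frame search.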

2.4 Image Retrieval

In this work, we treat all of the previously decoded frames as a database. This is in nature an image retrieval problem: finding the image in the database with the highest similarity. Figure 2.2 provides an overview of image retrieval.

Figure 2.2: Overview of image retrieval

With the advances in computer technologies, there has been an explosion in the amount and complexity of digital images being generated. How to access this vast amount of data so that people can browse, search, and retrieve efficiently is a key challenge. Image retrieval has therefore been a very active topic since the 1970s. Back then, in traditional databases, people annotated images with a set of predefined keywords, which were stored with the corresponding images and matched against users' queries; examples include Chang's Query-by-Pictorial-Example [17] and Pictorial Database Systems [18]. The former used a relational query language introduced for manipulating queries regarding pictorial relations as well as conventional relations [17]. The latter proposed a pictorial database, which is a collection of sharable pictorial data encoded in

various formats. A pictorial database system, or PDBS, provides an integrated collection of pictorial data for easy access by a large number of users. However, text-based image retrieval (TBIR) requires manual annotation, which is impractical for an extremely large number of images. Differences in human perception make the keywords vary even when they describe the same image, and some low-level features like color, texture, and shape cannot be well captured by textual keywords. All of this led to a new approach during the 1990s, called content-based image retrieval (CBIR). CBIR has been a very active research area for the past decades, with a large number of researchers currently working on it. Main research issues in CBIR include feature extraction, dimensionality reduction, relevance feedback, etc. Because CBIR is closest to this research, we focus on it in more detail. Figure 2.3 shows the basic framework of a CBIR system [19]. Color is one of the most widely used features in the great majority of content-based image retrieval systems. The color feature is relatively robust to background complications and independent of image size and orientation. [20] introduced Chabot, which is basically an image database that combines a color feature with manually annotated keywords to search for images. Swain [21] introduced a technique called Histogram Intersection, which matches model and image histograms and allows real-time indexing into a large database of stored models. Stricker [22] proposed two color indexing techniques. One uses cumulative color histograms, which perform slightly better than color histograms and are significantly more robust with respect to the quantization parameter of the histograms. In the other approach, the similarity

function used for retrieval is a weighted sum of the absolute differences between corresponding color moments. Smith and Chang [23, 24] also proposed to identify the regions within images that contain colors from predetermined color sets. By searching over a large number of color sets, a color index for the database is created in a fashion similar to file inversion, which allows very fast indexing of the image collection by the color content of the images.

Figure 2.3: Framework of a CBIR system

Texture is another feature commonly used by CBIR systems. It contains important information about the structural arrangement of surfaces and their relationship to the surrounding environment [25]. Haralick derived textural features from the

angular nearest-neighbor gray-tone spatial-dependence matrices in the early 1970s [25]. Gotlieb further extended this idea and derived a general model for the analysis and interpretation of experimental results in texture analysis when individual classifiers and groups of classifiers are used. They proposed six representative classifiers: the second angular moment f1, contrast f2, inverse difference moment f5, entropy f9, and the information measures of correlation I and II, f12 and f13, enabling a systematic study of the discrimination power of all 63 combinations of these classifiers on 13 samples of Brodatz textures [26]. Later, Tamura proposed to represent texture by six visual texture properties: coarseness, contrast, directionality, line-likeness, regularity, and roughness [27], all of which are visually meaningful. The Query by Image Content (QBIC) project studied methods to extend and complement text-based retrieval by querying and retrieving images and videos by content; queries can be performed using attributes such as colors, textures, shapes, and object positions [28]. MARS (Multimedia Analysis and Retrieval System) supports similarity and content-based retrieval of images based on a combination of their color, texture, shape, and layout properties [29, 30]. CBVQ, developed by J. R. Smith, focused on color and texture regions and used binary set representations of color and texture, respectively [31, 32].

Shape representations, depending on the application, may require transformation invariance. A great deal of research has been done in this area. The Fourier descriptor is one major achievement; it uses the Fourier-transformed boundary as the shape feature. Rui proposed a modified Fourier descriptor and a new distance metric for

describing and comparing closed planar curves for shape matching in a content-based image retrieval system. Their method accounts for the effects of spatial discretization of shapes [33]. [34] generated an image signature for each database picture with respect to key objects. The WebSeer system is based on statistical observations about the image content of the two types; it uses image contents such as colors and shapes to index images [35].

Although the features above provide reasonable discriminating power in image retrieval, false positives become more frequent as image collections grow. This motivated another class of methods, called color layout, which exploits both color features and spatial relations. Rickman presented a novel image coding scheme that captures some of this locally correlated color information and improves the selectivity of the retrieval mechanism. Their technique uses a histogram of features representing frequently occurring local combinations of color tuples throughout the image [36]. Huang proposed a new image feature called the color correlogram and used it for image indexing and comparison. This feature distills the spatial correlation of colors and is both effective and inexpensive for content-based image retrieval [37].

It should be noted that the objective of content-based image retrieval is to find images in the database that are perceptually, conceptually, or semantically similar to the query image; it does not necessarily minimize the pixel-wise difference between the two images. Therefore, existing image retrieval features cannot be directly used in this research for finding the best motion match.
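The color-histogram matching surveyed above can be illustrated with Swain and Ballard's histogram intersection [21]. The sketch below is a minimal illustration, not code from this thesis; the toy bin counts are hypothetical.

```python
import numpy as np

def histogram_intersection(h_image, h_model):
    """Swain-Ballard histogram intersection: sum the bin-wise minima of the
    two histograms and normalize by the model histogram's total count.
    A score of 1.0 means the model colors are fully present in the image."""
    h_image = np.asarray(h_image, dtype=float)
    h_model = np.asarray(h_model, dtype=float)
    return np.minimum(h_image, h_model).sum() / h_model.sum()

# Toy 4-bin color histograms (hypothetical pixel counts):
image_hist = [10, 0, 5, 5]
model_hist = [8, 2, 5, 5]
score = histogram_intersection(image_hist, model_hist)  # (8+0+5+5)/20 = 0.9
```

Because the numerator only counts pixels that match the model's color distribution, the measure is robust to background clutter, which is one reason it supports real-time indexing.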

Chapter 3

Effects on Silhouettes Introduced by Video Compression

3.1 Overview

Extracting features to differentiate foreground objects from the background is the first step of silhouette extraction. The silhouette extraction scheme is based on brightness distortion and chromaticity distortion, so the quality of silhouette extraction depends greatly on how much each pixel's value changes. Quantization, as used in image processing, is a lossy compression technique that maps a range of values to a single quantum value. Reducing the number of discrete symbols in a stream makes the stream more compressible, but the more an image is compressed, the more information is lost to the quantization process. This chapter gives a detailed explanation of this aspect. Finally, some comparisons are made between conventional H.264 and H.264 without the deblocking filter.

3.2 Silhouette Extraction

In this work, we use the silhouette extraction method developed in our previous work [2] to extract persons from the video scene. For completeness of presentation, we provide a brief review of this algorithm. Silhouette extraction, namely segmenting a human body or other objects from the background, is the first and enabling step for many high-level vision analysis tasks, such as video surveillance, people tracking, and activity recognition [38-41]. We consider silhouette extraction as an adaptive classification problem. We utilize image features that are invariant to changes in lighting conditions, and high-level knowledge is fused with low-level feature-based classification results to handle time-varying background changes.

Extracting features to differentiate foreground objects from the background is the first step of silhouette extraction. A basic requirement is that the features be invariant under brightness changes; further, they should be effective in differentiating shadow from background. In this work, we use two features: brightness distortion and chromaticity distortion. More specifically, we extract features in the RGB color space [42]. For adaptive background update, we use the past Δ frames for background modeling. At each pixel location, we compute the average values of its RGB components over the past Δ frames and denote them by the vector E. We also calculate the standard deviations of the color components at each pixel. Let I be the pixel in the

current frame. As shown in Figure 3.1, we project the vector I onto the vector E. We define the brightness distortion as

\alpha = \arg\min_{\alpha}\,(I - \alpha E)^2 = \frac{\dfrac{I_R\,\mu_R}{\sigma_R^2} + \dfrac{I_G\,\mu_G}{\sigma_G^2} + \dfrac{I_B\,\mu_B}{\sigma_B^2}}{\left(\dfrac{\mu_R}{\sigma_R}\right)^2 + \left(\dfrac{\mu_G}{\sigma_G}\right)^2 + \left(\dfrac{\mu_B}{\sigma_B}\right)^2},  (1)

and the chromaticity distortion as

CD = \|I - \alpha E\| = \sqrt{\left(\frac{I_R - \alpha\mu_R}{\sigma_R}\right)^2 + \left(\frac{I_G - \alpha\mu_G}{\sigma_G}\right)^2 + \left(\frac{I_B - \alpha\mu_B}{\sigma_B}\right)^2},  (2)

where [I_R, I_G, I_B] are the values of the red, green, and blue components of the pixel I in the RGB color space, and [μ_R, μ_G, μ_B] and [σ_R, σ_G, σ_B] are the means and standard deviations of these color components, all evaluated at the given pixel location. This color model separates brightness from the chromaticity components, as shown in Figure 3.1. It has been found that the chromaticity distortion is invariant under brightness changes [42]. Our foreground-background classification is based on the following two observations: (1) image pixels in the background often have little change in their chromaticity distortion; and (2) shadow often causes brightness distortion but little chromaticity distortion. Based on these two observations, we establish the following decision rules for foreground, background, and shadow detection: (1) if the chromaticity distortion is large, the pixel is a foreground pixel; (2) if the chromaticity distortion is small and the brightness distortion is about 1.0, it is a background pixel; (3) if the chromaticity distortion is small and the brightness distortion is smaller than 1.0, it is a shadow pixel.
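The decision rules above can be sketched per pixel as follows. This is an illustrative reimplementation of Eqs. (1)-(2), not the thesis code; the thresholds `cd_thresh` and `tol` are hypothetical values that would need tuning for a real background model.

```python
import numpy as np

def classify_pixel(I, mu, sigma, cd_thresh=6.0, tol=0.2):
    """Classify one RGB pixel as 'foreground', 'background', or 'shadow'
    using the brightness distortion alpha (Eq. 1) and the chromaticity
    distortion CD (Eq. 2) against the per-pixel background model (mu, sigma)."""
    I, mu, sigma = (np.asarray(v, dtype=float) for v in (I, mu, sigma))
    # Eq. (1): alpha is the least-squares scaling of the background mean onto I.
    alpha = np.sum(I * mu / sigma**2) / np.sum((mu / sigma)**2)
    # Eq. (2): CD is the distance from the brightness-scaled mean, in std units.
    cd = np.sqrt(np.sum(((I - alpha * mu) / sigma)**2))
    if cd > cd_thresh:        # rule (1): large chromaticity distortion
        return 'foreground'
    if alpha < 1.0 - tol:     # rule (3): darker but same chromaticity -> shadow
        return 'shadow'
    return 'background'       # rule (2): small CD, alpha about 1.0
```

A pixel exactly at the background mean gives alpha = 1 and CD = 0 (background); a uniformly darkened copy of the mean gives alpha < 1 and CD ≈ 0 (shadow).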

Figure 3.1: Illustration of brightness and chromaticity distortion.

In silhouette extraction within a dynamic video scene, we need to continuously update the background model to incorporate background changes. A commonly used update rule is that, if an object or image area remains stationary for a certain period of time, it is considered background. Here, we use the past Δ frames to update the background model. For accurate silhouette extraction, we want Δ to be small so that the background can be updated quickly. However, when Δ is small, the human body could easily be updated as background if the person does not move for a while, for example, when sitting still on a chair for a few minutes. To solve this problem, we propose to utilize high-level knowledge about human motion as a guideline to perform adaptive update of the background model. Many sophisticated human tracking algorithms have been developed in the literature [43, 44, 45], but they often have high computational complexity. Here, to achieve low complexity, we use simple block-based motion estimation, which has been used extensively in video coding [7]. More specifically, suppose that we have obtained the silhouette for frame t. We find a bounding box for the silhouette such that 95% of the foreground

pixels are included. For each image block within the bounding box in the current frame, we find its best match in the next frame t+1 using the SAD (sum of absolute differences) as the distance metric. To speed up the motion estimation process, we use a fast algorithm called diamond search [46]. Once the motion vectors of all blocks have been obtained, we take their average to predict the human body position (or the center of its bounding box) in the next frame t+1. Image blocks that contain the human body should be updated very slowly so that the human body is not absorbed into the background, while blocks outside the predicted body region can be updated much faster to make sure that new objects are quickly absorbed into the background. After background update and silhouette extraction, we update the dimensions, height and width, of the bounding box in frame t+1. Figure 3.2 to Figure 3.4 show the silhouette extraction and human tracking results for sample frames of three indoor activity monitoring videos. It can be seen that this algorithm is able to obtain high-quality human silhouettes and track persons accurately.
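Block-based motion estimation with the SAD metric can be sketched as below. For clarity this uses an exhaustive search over a small window rather than the diamond search [46] that the thesis actually uses; the function and parameter names are illustrative.

```python
import numpy as np

def best_match_sad(block, frame, top, left, search=7):
    """Exhaustively search a +/-`search`-pixel window around (top, left) in
    `frame` for the candidate block minimizing the sum of absolute
    differences (SAD) with `block`. Returns the motion vector and its SAD."""
    h, w = block.shape
    best_mv, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue  # candidate block falls outside the frame
            cand = frame[y:y + h, x:x + w]
            sad = int(np.abs(block.astype(np.int32) - cand.astype(np.int32)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Averaging the motion vectors of all blocks inside the bounding box, as described above, then predicts the body position in the next frame.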

Figure 3.2: Silhouette extraction and human detection results for Video 1

Figure 3.3: Silhouette extraction and human detection results for Video 2

Figure 3.4: Silhouette extraction and human detection results for Video 3.

3.3 Silhouette Extraction in Working Memory

In this research, we apply the silhouette extraction algorithm to the original video frame F_N and obtain the binary silhouette image. Using this binary image as a mask, we segment the human object from the background and denote it by O_N. Let B_N be the corresponding background image constructed by the silhouette extraction algorithm. We then extract a set of visual features, denoted by f_N, from O_N to characterize the human object. As illustrated in Figure 1.1, after frame F_N is encoded by H.264, we also apply the same

silhouette extraction algorithm to the encoder reconstruction F'_N and obtain the reconstructed human object O'_N. Let B'_N be the corresponding background image. We also extract a set of features, denoted by f'_N, to characterize O'_N. Both O'_N and its feature vector f'_N are indexed and stored in a database, called the working memory in this work. For now, let us assume that all reconstructed objects Ω = {O'_k, 1 ≤ k ≤ N−1} are available in the working memory when frame F_N is being encoded. We use the feature vector f_N as the query input to the working memory to retrieve the object that best matches the current object O_N. We denote this best match by O'_k*, where k* is a frame number less than N. We overlay O'_k* on the most recent reconstructed background image to form a reference frame F* for motion prediction. We expect that, using F* as the motion prediction reference, the motion-compensated difference will be minimized. For convenience, we refer to this type of motion prediction as working memory prediction (WMP). If the best match happens to be O'_{N−1}, then F* will be exactly the previous reconstructed frame. Therefore, conventional H.264 motion prediction (P-frame) is a special case of the proposed WMP scheme.

3.4 Comparison of Silhouettes under Different Video Compression Configurations

In practice, due to limited transmission bandwidth or storage space, videos are often compressed with a JPEG, MPEG, or H.264 coding scheme. Therefore, it is necessary to understand the performance of silhouette extraction on compressed video frames and investigate how compression artifacts impact the performance of silhouette

extraction. Depending on the configuration of the video compression scheme, the degree of performance degradation in silhouette extraction varies. In this section, we conduct extensive experiments to evaluate the impact of different image/video compression schemes on silhouette extraction performance.

First, we configure H.264 with its default settings. We tested three sets of sequences, each with ground truth. To obtain decoded images of different quality, we encoded each sequence with different quantization parameters (QP). The QP is a video coding setting that controls the quality of the compression: in H.264, as the QP increases, video quality decreases. We then extracted silhouettes from the decoded images and compared them with the ground truth. Selected results for sequence 1 are summarized in Figure 3.5 to Figure 3.8.

Second, we turned off the deblocking filter in H.264, which may introduce more errors due to the edge effects caused by block-based motion estimation. As in the previous step, after decoding the images compressed at different QP levels, we extracted silhouettes from these sequences and compared the results with the ground truth. Figure 3.9 to Figure 3.14 show selected sample results.

Figure 3.5: Test sequence 1, frame 80: original image and its silhouette (ground truth).

Figure 3.6: Test sequence 1, frame 80: decoded image and its silhouette with H.264, QP=24.

Figure 3.7: Test sequence 1, frame 80: decoded image and its silhouette with H.264, QP=42.

Figure 3.8: Test sequence 1, frame 80: decoded image and its silhouette with H.264, QP=51.

Figure 3.9: Test sequence 1, frame 80: decoded image and its silhouette with H.264, QP=33, without deblocking filter.

Figure 3.10: Test sequence 1, frame 80: decoded image and its silhouette with H.264, QP=42, without deblocking filter.

Figure 3.11: Test sequence 1, frame 80: decoded image and its silhouette with H.264, QP=51, without deblocking filter.

Figure 3.12: Test sequence 3, frame 111: decoded image and its silhouette with H.264, QP=33, without deblocking filter.

Figure 3.13: Test sequence 3, frame 111: decoded image and its silhouette with H.264, QP=42, without deblocking filter.

Figure 3.14: Test sequence 3, frame 111: decoded image and its silhouette with H.264, QP=51, without deblocking filter.

After decoding each video and extracting the silhouettes, we compared each frame of each sequence to the ground truth. Let T_i denote the total number of foreground pixels in the ground truth for frame i, and let E_i denote the total number of error pixels in frame i. We then calculate the average error rate R for each video sequence with n frames as

R = \frac{1}{n}\sum_{i=1}^{n}\frac{E_i}{T_i}\times 100\%,  (3)

where the error count E_i is composed of two types of errors: background pixels falsely detected as foreground, and foreground pixels falsely detected as background. Figure 3.15 to Figure 3.17 plot the error rate versus picture PSNR.
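The error-rate computation of Eq. (3) can be sketched as below. This is a minimal illustration, assuming the per-frame error count (false positives plus false negatives) is normalized by the number of ground-truth foreground pixels; the function names are not from the thesis.

```python
import numpy as np

def frame_error_rate(extracted, ground_truth):
    """Per-frame silhouette error rate: false positives (background detected
    as foreground) plus false negatives (foreground detected as background),
    as a percentage of the ground-truth foreground pixel count."""
    ex = np.asarray(extracted, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    false_pos = np.count_nonzero(ex & ~gt)
    false_neg = np.count_nonzero(~ex & gt)
    return 100.0 * (false_pos + false_neg) / np.count_nonzero(gt)

def average_error_rate(mask_pairs):
    """Average error rate R over (extracted, ground_truth) pairs of a sequence."""
    return sum(frame_error_rate(e, g) for e, g in mask_pairs) / len(mask_pairs)
```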

Figure 3.15: Error rate vs. PSNR comparison between H.264 and H.264 without deblocking filter for sequence 1

Figure 3.16: Error rate vs. PSNR comparison between H.264 and H.264 without deblocking filter for sequence 2

Figure 3.17: Error rate vs. PSNR comparison between H.264 and H.264 without deblocking filter for sequence 3

From the results, we can see that as PSNR decreases (i.e., as QP increases), the quality of the silhouettes decreases and the error rate of silhouette extraction increases; beyond a certain level, the error rate increases dramatically. Comparing the silhouettes obtained with the default H.264 settings to those obtained with H.264 without the deblocking filter, the latter introduces more errors at the same PSNR, and its silhouettes are therefore worse. This study clarifies the relationship between video compression techniques and silhouette extraction, and is fundamental for achieving better coding efficiency with the proposed algorithm.

Chapter 4

Feature-Based Fast and Accurate Object Retrieval

4.1 Overview

As discussed in the previous chapter, we need to find the best object O'_k from the working memory such that the motion-compensated difference between the current frame F_N and the reference frame F_k constructed from O'_k is minimized. We denote this difference by d(F_N, F_k). Note that the background models at frames N−1 and N will typically be very close to each other, except under sudden lighting condition changes; in that case, an effective encoding option is to terminate the motion prediction chain and use an INTRA frame. Therefore, the major motion-compensated difference will come from the foreground objects, e.g., persons. We have

d(F_N, F_k) \approx d(O_N, O'_k).  (4)

The process of finding the best match can then be summarized as

O'_{k^*} = \arg\min_{O'_k \in \Omega} d(O_N, O'_k).  (5)

To compute d(O_N, O'_k), we need to perform motion prediction and compensation between O_N and O'_k. Motion estimation, as is well known, is computationally intensive. Furthermore, the size of the working memory, i.e., the total number of candidate objects in it, increases with the frame number. Therefore, the computational complexity of finding the best match in (5) becomes prohibitive when the working memory is large. The question, then, is how to find the best match in (5) with low computational complexity. In this research, we propose to explore a content-based image retrieval method: we extract a set of features to describe each reference object in the working memory and the current object, and then attempt to find the best match in the feature space. Feature-based matching has much lower computational complexity than direct motion prediction. Essentially, the problem is to find the best motion match without performing motion prediction, which is a non-trivial task.

4.2 Feature-Based Object Retrieval

When defining the features, we need to make sure that the best match in the feature space yields the minimum motion-compensated difference. Based on our extensive simulation experience, we find that the following features are sufficient for our purpose: (1) the body centroid; (2) histograms of the horizontal and vertical dimensions; and (3) the color histogram. More specifically, after silhouette extraction, we compute the centroid of the foreground pixels. We also scan the foreground image horizontally and vertically and record the

number of pixels in each row and column, as illustrated in Figure 4.1. We refer to this information as the histograms of the horizontal and vertical dimensions. Figure 4.2 (b) and (c) show these histograms for the silhouette image in Figure 4.2 (a). This provides a simple yet efficient characterization of the body shape and size of the silhouette, which in turn reflects the distance between the person and the camera. The third feature, the color histogram, describes the content inside the object.

Figure 4.1: Body centroid and dimensions, and histograms of dimensions.

Figure 4.2: (a) Silhouette image from sequence 2. (b) Distribution of vertical dimensions of image (a). (c) Distribution of horizontal dimensions of image (a).
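The centroid and dimension-histogram features can be computed from a binary silhouette mask as below. This is a minimal sketch with illustrative names; the thesis's third feature, the color histogram, is omitted here.

```python
import numpy as np

def silhouette_features(mask):
    """From a binary silhouette mask, return (1) the body centroid and the
    normalized histograms of the (2) horizontal and (3) vertical dimensions,
    i.e. foreground-pixel counts per column and per row."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    centroid = (float(ys.mean()), float(xs.mean()))
    h_dim = mask.sum(axis=0).astype(float)  # pixels per column (horizontal)
    v_dim = mask.sum(axis=1).astype(float)  # pixels per row (vertical)
    return centroid, h_dim / h_dim.sum(), v_dim / v_dim.sum()
```

Normalizing the two dimension histograms lets them be compared as probability distributions, which is what the distance metric of the next subsection requires.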

Let O_N and O'_k be the object from the current frame and the object to be matched in the working memory, respectively. Let (x_N, y_N) and (x_k, y_k) be their centroids, and let h_N(x), h_k(x) and v_N(y), v_k(y) be the distributions (normalized histograms) of their horizontal and vertical dimensions, where x and y index horizontal and vertical pixel positions. For the centroid feature, we use the Euclidean distance, denoted by d_c. For the distributions of horizontal and vertical dimensions, we treat them as probability distributions and use the Kullback-Leibler distance [47], defined as

D_h(O_N, O'_k) = \sum_{x} h_N(x)\log\frac{h_N(x)}{h_k(x)},  (6)

D_v(O_N, O'_k) = \sum_{y} v_N(y)\log\frac{v_N(y)}{v_k(y)}.  (7)

We then form the following distance metric for object retrieval from the working memory:

d(O_N, O'_k) = \lambda\, d_c(O_N, O'_k) + D_h(O_N, O'_k) + D_v(O_N, O'_k),  (8)

where λ is a normalization factor on the centroid distance. The object in the working memory with the minimum distance from the current object is the best match and is used for motion prediction of the current frame.

Using this distance metric, we find the best-matching object in the working memory and overlay it on the background image to form the reference frame for motion

prediction of the current frame. Figure 4.3 to Figure 4.5 show the best-matching objects for frames of the three test videos. The top row shows the frames to be encoded, and the bottom row shows the corresponding frames from which the best-matching objects were extracted.

Figure 4.3: Best-matching objects for video frames 266-260 of test video 1

Figure 4.4: Best-matching objects for video frames 254-257 of test video 2
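The retrieval metric of Eqs. (6)-(8) can be sketched as follows, assuming the centroid and normalized dimension histograms of Section 4.2. The smoothing constant `eps` and the weight `lam` are illustrative values, not the thesis's tuned parameters.

```python
import numpy as np

def kl_distance(p, q, eps=1e-12):
    """Kullback-Leibler distance D(p||q) between two normalized histograms;
    `eps` avoids log(0) on empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def object_distance(cent_a, cent_b, h_a, h_b, v_a, v_b, lam=0.01):
    """Eq. (8): weighted Euclidean centroid distance plus the KL distances
    of the horizontal and vertical dimension distributions."""
    d_c = float(np.hypot(cent_a[0] - cent_b[0], cent_a[1] - cent_b[1]))
    return lam * d_c + kl_distance(h_a, h_b) + kl_distance(v_a, v_b)
```

Retrieval then amounts to taking the argmin of `object_distance` over all objects currently held in the working memory.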

Figure 4.5: Best-matching objects for video frames 337-340 of test video 3

4.3 Results and Analysis

We have implemented the proposed working memory prediction scheme in the H.264 JM 15.1 reference software [48]. We test the performance of the video encoder in the context of indoor activity monitoring. The three test videos, labeled Video_1, Video_2, and Video_3, are shown in Figure 4.3, Figure 4.4, and Figure 4.5, respectively. We use I and P frames with a GOP (group of pictures) size of 32. We turn off rate control and use a constant quantization parameter for all I and P frames to achieve near-constant video quality. We compare the performance of the following three encoding schemes: (A) conventional H.264 video encoding; (B) H.264 video encoding with optimum prediction, which finds the best match among all previously reconstructed frames using brute-force motion search; and (C) H.264 video coding with working memory prediction.

First, we compare the SAD (sum of absolute differences) of the motion-compensated difference picture after motion prediction. Figure 4.6 to Figure 4.8 show the SAD of each

frame obtained by these three methods for Video_1, Video_2, and Video_3, respectively. (To show the results clearly, we split each figure into two parts, each showing one half of the video frames.) We can see that the SAD obtained by the optimum prediction is much smaller than that of conventional H.264 motion prediction, and the SAD obtained by our working memory prediction is very close to the optimum. This implies that the proposed low-complexity feature-based object retrieval scheme is able to accurately find a near-optimum motion match. These results are summarized in Table 4.1.

Next, we compare the encoding bit rates. We set the target video quality to 35 dB by choosing a similar quantization parameter. Figure 4.9 to Figure 4.11 show the encoding bits of each frame when the three prediction methods are applied. The results are summarized in Table 4.2 and Table 4.3. Using the proposed working memory prediction scheme, we achieve an average bit rate saving of 25-27%; the maximum bit saving on individual video frames goes up to 77.5%. The proposed feature-based object retrieval scheme approaches the optimum performance, with only about 0.1-5% loss in bit saving. Figure 4.12 to Figure 4.14 show the rate-distortion (PSNR) comparison between conventional H.264 video coding and this work on the three test videos. We can see that working memory prediction achieves about a 1.2-1.5 dB improvement in average PSNR.

Table 4.1: SAD comparison (SAD/pixel)

Average SAD saving:

  Video | H.264 SAD | Optimum SAD | Saving (%) | This work SAD | Saving (%)
  ------|-----------|-------------|------------|---------------|-----------
  1     | 8.79      | 3.61        | 59.94      | 4.39          | 50.08
  2     | 8.46      | 4.65        | 45.01      | 5.16          | 39.02
  3     | 4.67      | 1.84        | 60.67      | 3.73          | 20.10

Maximal SAD saving:

  Video | H.264 SAD | Optimum SAD | Saving (%) | This work SAD | Saving (%)
  ------|-----------|-------------|------------|---------------|-----------
  1     | 10.80     | 1.48        | 86.29      | 1.48          | 86.29
  2     | 15.89     | 4.27        | 73.15      | 4.27          | 73.15
  3     | 8.09      | 0.80        | 90.08      | 0.80          | 90.08

Table 4.2: Bit rate saving (%) from H.264 in H.264 video coding

  Video | Optimum Avg | Optimum Max | Feature-based Avg | Feature-based Max
  ------|-------------|-------------|-------------------|------------------
  1     | 32.5        | 77.5        | 27.2              | 77.5
  2     | 30.8        | 71.4        | 25.7              | 71.4
  3     | 16.0        | 73.6        | 25.9              | 72.0

Table 4.3: Bit rate comparison from H.264 (kbits/s at 30 Hz)

  Video | Optimum | Feature-based
  ------|---------|--------------
  1     | 265.99  | 182.70
  2     | 462.68  | 326.84
  3     | 163.40  | 126.50

Figure 4.6: Average residual SAD comparison on Video_1 with H.264 motion prediction, optimum search, and the proposed algorithm in this work.

Figure 4.7: Average residual SAD comparison on Video_2 with H.264 motion prediction, optimum search, and the proposed algorithm in this work.

Figure 4.8: Average residual SAD comparison on Video_3 with H.264 motion prediction, optimum search, and the proposed algorithm in this work.

Figure 4.9: H.264 encoding bit rate comparison on Video_1 with H.264 motion prediction, optimum search, and the proposed algorithm in this research.

Figure 4.10: H.264 encoding bit rate comparison on Video_2 with H.264 motion prediction, optimum search, and the proposed algorithm in this research.

[Plot: bits/frame vs. frame number for H.264 prediction, optimum prediction, and this work.]

Figure 4.11: H.264 encoding bit rate comparison on Video_3 with H.264 motion prediction, optimum search, and the proposed algorithm in this research.

[Plot: PSNR vs. bit rate (kbits/s) for H.264 and this work.]

Figure 4.12: Rate-distortion performance comparison with conventional H.264 video coding on Video_1

[Plot: PSNR vs. bit rate (kbits/s) for H.264 and this work.]

Figure 4.13: Rate-distortion performance comparison with conventional H.264 video coding on Video_2

[Plot: PSNR vs. bit rate (kbits/s) for H.264 and this work.]

Figure 4.14: Rate-distortion performance comparison with conventional H.264 video coding on Video_3

Chapter 5

Working Memory Management

5.1 Overview

As more video frames are encoded, more objects are added to the working memory. This requires more memory and higher implementation cost, and it also increases the computational complexity of object retrieval. Therefore, an efficient working memory management scheme is needed to control the memory cost and complexity of the working memory; more specifically, we need to control the total number of objects it holds.

Our proposed scheme for working memory management is based on the following observations. First, there is no need to store objects that are very similar to each other. Second, the objects maintained in the memory should cover as many different human actions, poses, and appearances as possible; they should be quite different from each other. Therefore, we propose to use the distance metric in (8) for dynamic working memory management. More specifically, when a new frame is being encoded, we retrieve the best

Figure 5.1: Illustration of the working memory

matching object from the memory for motion prediction. In doing so, we compute the distance between the current object and each object in the working memory, so we always have the distance between any two objects in the working memory. When the current object joins the working memory, we remove the object that has the smallest distance from the other objects. Figure 5.1 shows an example where a maximum of 50 objects are maintained in the working memory at 3 different time instances.

5.2 H.264 Video Coding with Object Retrieval and Matching

In this section, we describe the H.264 video encoding scheme based on working memory prediction. Figure 5.2 shows a block diagram of the modified H.264 video encoder. The proposed encoding scheme has the following major steps. Let F_N be the current video frame.

Step 1. Silhouette extraction and foreground object detection. Apply the silhouette extraction algorithm described in Chapter 3 to the current frame and obtain the foreground object.

Step 2. Object retrieval from the working memory. From the foreground object, extract its features, namely its centroid and the histograms of its horizontal and vertical dimensions. Using the distance metric defined in (8), retrieve the best-matching object from the working memory Ω.

Step 3. Constructing the motion prediction reference. Overlay the retrieved object on top of the background image obtained from silhouette extraction of the

reconstructed frames, to construct the working memory prediction reference.

Step 4. H.264 encoding. Using this reference frame for motion prediction, encode the residual with the H.264 encoder and reconstruct the current frame.

Step 5. Working memory management. Use the silhouette as a mask to extract the reconstructed foreground object, extract its features, and store them in the working memory. Update the working memory using the procedure described in Section 5.1.

Figure 5.2: Overview of H.264 video encoding with working memory prediction

The major computational complexity of the proposed algorithm lies in silhouette extraction. Our current silhouette extraction algorithm is able to run at 10-15 frames per second on 640×480 images. The feature-based matching process has low computational complexity, especially when the number of objects stored in the working memory is small. For example, in this work, we set the number in the range of [50, 300],