applications with SIMD and Hyper-Threading Technology by Chew Yean Yam Intel Corporation

Similar documents
Inside Intel Core Microarchitecture

Intel 64 and IA-32 Architectures Software Developer s Manual

Next Generation Technology from Intel Intel Pentium 4 Processor

Computer System Architecture

Intel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor

Intel s MMX. Why MMX?

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Intel 64 and IA-32 Architectures Software Developer s Manual

Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

INTEL PENTIUM Gold AND CELERON PROCESSORS

IA-32 Intel Architecture Software Developer s Manual

Supra-linear Packet Processing Performance with Intel Multi-core Processors

Using the Intel Math Kernel Library (Intel MKL) and Intel Compilers to Obtain Run-to-Run Numerical Reproducible Results

Installation Guide and Release Notes

IA-32 Intel Architecture Software Developer s Manual

MICROPROCESSOR TECHNOLOGY

Multi-core Programming Evolution

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

Intel C++ Compiler Professional Edition 11.0 for Windows* In-Depth

Best Practices for Setting BIOS Parameters for Performance

Core. for Embedded Computing. Processors T9400, P8400, SL9400, SL9380, SP9300, SU9300, T7500, T7400, L7500, L7400 and U7500.

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth

Intel Compute Card Slot Design Overview

Bitonic Sorting. Intel SDK for OpenCL* Applications Sample Documentation. Copyright Intel Corporation. All Rights Reserved

Case Study. Optimizing an Illegal Image Filter System. Software. Intel Integrated Performance Primitives. High-Performance Computing

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

9 GENERATION INTEL CORE DESKTOP PROCESSORS

Desktop 4th Generation Intel Core, Intel Pentium, and Intel Celeron Processor Families and Intel Xeon Processor E3-1268L v3

Moorestown Platform: Based on Lincroft SoC Designed for Next Generation Smartphones

Age nda. Intel PXA27x Processor Family: An Applications Processor for Phone and PDA applications

Intel Atom Processor Based Platform Technologies. Intelligent Systems Group Intel Corporation

Expand Your HPC Market Reach and Grow Your Sales with Intel Cluster Ready

Intel Xeon Processor E v3 Family

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture

INTEL HPC DEVELOPER CONFERENCE FUEL YOUR INSIGHT

Chapter 06: Instruction Pipelining and Parallel Processing. Lesson 14: Example of the Pipelined CISC and RISC Processors

Dan Stafford, Justine Bonnot

Moving from 32 to 64 bits while maintaining compatibility. Orlando Ricardo Nunes Rocha

Bitonic Sorting Intel OpenCL SDK Sample Documentation

Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth

Processing Unit CS206T

Parallelized Progressive Network Coding with Hardware Acceleration

MpegRepair Software Encoding and Repair Utility

Повышение энергоэффективности мобильных приложений путем их распараллеливания. Примеры. Владимир Полин

A Simple Path to Parallelism with Intel Cilk Plus

Legal Notices and Important Information

Performance Optimization of an MPEG-2 to MPEG-4 Video Transcoder

Benchmark Performance Results for Pervasive PSQL v11. A Pervasive PSQL White Paper September 2010

Intel Xeon Phi Coprocessor Performance Analysis

SWAR: MMX, SSE, SSE 2 Multiplatform Programming

MPEG Decoding Workload Characterization

Fundamentals of Computer Design

Parallelism and Performance Instructor: Steven Ho

Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes

IBM Power Systems solution for SugarCRM

Intel released new technology call P6P

This Material Was All Drawn From Intel Documents

More performance options

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Intel Stress Bitstreams and Encoder (Intel SBE) 2017 AVS2 Release Notes (Version 2.3)

Intel Enterprise Processors Technology

Intel Core X-Series. The Ultimate platform with unprecedented scalability. The Intel Core i9 Extreme Edition processor

POWER YOUR CREATIVITY WITH THE INTEL CORE X-SERIES PROCESSOR FAMILY

FAST FORWARD TO YOUR <NEXT> CREATION

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

IA-32 Intel Architecture Optimization

Installation Guide and Release Notes

Software Occlusion Culling

Characterization of Native Signal Processing Extensions

Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany

Is Intel s Hyper-Threading Technology Worth the Extra Money to the Average User?

Simultaneous Multithreading on Pentium 4

Intel Many Integrated Core (MIC) Architecture

Online Course Evaluation. What we will do in the last week?

Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

Intel Parallel Studio XE 2011 SP1 for Linux* Installation Guide and Release Notes

Intel Visual Fortran Compiler Professional Edition 11.0 for Windows* In-Depth

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions

Virtuozzo Hyperconverged Platform Uses Intel Optane SSDs to Accelerate Performance for Containers and VMs

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

WHITE PAPER ON2 TECHNOLOGIES, INC. TrueMotion VP7 Video Codec. January 10, 2005 Document Version: 1.0

Nehalem Hochleistungsrechnen für reale Anwendungen

Intel Hyper-Threading technology

Intel NetStructure DMN160TEC ISDN Call Control Performance Testing

Intel 64 and IA-32 Architectures Optimization Reference Manual

Graphics Performance Analyzer for Android

Using Intel VTune Amplifier XE and Inspector XE in.net environment

Intel 64 and IA-32 Architectures Optimization Reference Manual

Review of Last Lecture. CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions. Great Idea #4: Parallelism.

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

Fundamentals of Computers Design

Video Compression An Introduction

Spring 2010 Prof. Hyesoon Kim. Xbox 360 System Architecture, Anderews, Baker

CS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Intel SIMD Instructions

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Hyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.

Transcription:

White Paper Intel Digital Security Surveillance Optimizing Video ompression for Intel Digital Security Surveillance applications with SIMD and Hyper-Threading Technology by hew Yean Yam Intel orporation Abstract The growing consciousness of the general public, government, and private industry drives the demand for tighter security. With the advent of new technologies, Digital Security Surveillance (DSS) is becoming an essential part of daily life. Today s sophisticated surveillance solutions require scalability to support high numbers of video channels, higher throughput at the input/output (I/O) interface, processor performance to handle sophisticated algorithms, and reliable high-volume data storage. Video compression is one of the most computationally demanding modules in DSS applications. Optimization in software can reduce the video compression burden on the processor, freeing processor resources for other important applications, including video analysis and graphical user interfaces. This article provides an introduction to the video compression performance improvements that can be achieved by features enabled in current Intel Architecture Processors.

Introduction The current generation of IA-32 processors, including Intel Pentium 4 processor, Intel Xeon TM processor, and Intel Pentium M processor, includes multiple features designed to enhance performance. These include Single Instruction Multiple Data (SIMD) instruction set extensions, Hyper-Threading Technology (HT Technology), 1 microarchitectures that enable instruction execution at high clock rates, and a high-speed cache hierarchy. Single Instruction Multiple Data (SIMD) The SIMD computational technique was introduced in the IA-32 architecture with MMX technology and then further enhanced with Intel s introduction of Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (), and Streaming SIMD Extensions 3 (SSE3). SSE,, and SSE3 extend MMX technology by providing additional processor instructions for single-precision and double-precision floating-point operations, in addition to adding 128-bit registers and instructions to improve synchronization between multi-threaded agents, Table 1. SIMD extensions improve the performance of applications characterized by i) inherent parallelism, ii) recurring memory access patterns, iii) localized recurring operations performed on the data, and iv) data-independent control flow. SIMD addresses the needs of multimedia, communications, and graphics/video applications used in DSS. These applications often use sophisticated algorithms to perform the same operations repetitively on large amounts of data. Such applications include 3-D graphics rendering, 3-D geometry, image processing, and video encoding and decoding. omputing without SIMD is known as scalar processing, in which one instruction is performed on one data point at a time. omputation with SIMD enables the processor to execute one instruction on multiple data points concurrently, which increases the amount of data that can be processed in a given time interval, Figure 1. MMX TM Technology SSE SSE3 64-bit MMX TM Technology Registers 128-bit XMM Registers 128-bit XMM Registers 128-bit XMM Registers Packed byte, word, double word integers Packed single-precision floating-point operands Packed double-precision floating-point operands Floating-point instructions for asymmetric and horizontal computation and for thread synchronization. Table 1 The next generations of Simple Instruction Multiple Data (SIMD), Streaming SIMD Extensions (SSE),, and SSE3 are extensions of Intel s original MMX technology.they provide additional instructions, enabling single-precision and double-precision floating-point operations, in addition to adding 128-bit registers and instructions to improve synchronization between multi-threaded agents. 2

Scalar processing Single Instruction Multiple Data (SIMD) Processing X X X3 X2 X1 X + + Y Y Y3 Y2 Y1 Y X + Y X+Y X3+Y3 X2+Y2 X1+Y1 X+Y Figure 1 omparison between computing with and without Single Instruction Multiple Data (SIMD) technique. SIMD improves performance over scalar processing by enabling the concurrent execution of one instruction on multiple data points. Hyper-Threading Technology (HT Technology) HT Technology enables software to take advantage of tasklevel or thread-level parallelism by providing multiple logical processor resources within a physical processor package. The operating system and applications perceive the single physical processor as two logical processors by maintaining the architectural state of these two processors. HT Technology provides more efficient multitasking and system responsiveness by allocating processor resources to applications as needed, enabling the processor to complete more tasks at a given time, and by efficiently sharing resources that would otherwise be idle, Figure 2. HT Technology can provide a significant performance gain in many current DSS applications, such as those that involve multiple video data streaming. This technology is supported by specific members of Intel Pentium 4 processor and Intel Xeon processor product lines. How Hyper-Threading Technology Works Without HT Technology Physical processors Thread 2 Thread 1 Logical processors visible to OS Resource 1 Resource 2 Resource 3 Time Physical processor resource allocation Throughput Thread 1 With HT Technology Thread 2 Resource 1 Resource 2 Resource 3 + Figure 2 Hyper-Threading Technology boosts performance by sharing otherwise idle resources. 3

Video Encoding Figure 3 illustrates the video encoding process, which is an essential function in many DSS applications. The highlighted modules are computationally intensive functions that can be optimized with the SIMD technique using MMX technology, SSE, and technology. Image Video Encoding Decoded previous Decoded previous or next image(s) image Motion Motion Estimation ompensation Motion vector Image Difference image Discrete osine Transform (DT) DT coefficient Quantization Quantization coefficient oding Bitstream Figure 3 The video encoding process, with highlighted areas that show where SIMD-based computation can accelerate application performance. Measuring Performance Gains Motion Estimation Motion ompensation 3 2 3 17 PU ycle 2 2 1 PU ycle 1 12 1 7 1 2 GetSADHalfXY GetSADHalfY GetSADHalfX SubBlockHalfXY SubBlockHalfY SubBlockHalfX SubBlock GetSAD GetSumDev MovBlock 1 MovBlock 2 AddlipBlock DT/IDT Quantization/De-quantization 2 3 2 3 2 PU ycle 1 1 PU ycle 2 1 1 IDT DT DequantIntra DequantInter QuantlipIntra QuantInter Figure 4 Graphs illustrate processor cycles required for motion estimation, motion compensation, Discrete osine Transform (DT)/ Inverse Discrete osine Transform (IDT), and quantization/de-quantization.the colored bars provide a comparison of processor cycles required when the functions are implemented in language and when they are optimized using Intel MMX TM technology, SSE, and. Source: Huper Laboratory o., Ltd. 4

All the tests are run on an Intel Pentium 4 processor 2.4 GHz with Intel 91G Express chipset-based platform. The first experiment observed the number of clock cycles required by executing the functions implemented in, MMX technology, SSE, and within the 4 major modules in video compression. The test is conducted by executing the functions for 1,, times and the average is obtained to ensure a precise value. As shown in Figure 4, the number of processor cycles needed to perform video encoding functions drops dramatically with the use of later generations of SIMD instruction sets. More operations can be carried out in the same amount of time, resulting in higher performance. The significant drop in processor cycles correlates with a significant performance boost compared to -based computation. The SUM function provides a general indication of the performance gain for each module. Results from motion estimation, motion compensation, DT/IDT, and quantization/de-quantization all exhibit this general trend. Motion Estimation Motion ompensation 2 1 1 3 3 2 2 1 1 GetSumDev GetSAD GetSADHalfX GetSADHalfY GetSADHalfXY SUM AddlipBlock MovBlock 1 MovBlock 2 SubBlock SubBlockHalfX SubBlockHalfY SubBlockHalfXY SUM SSE MMX Technology SSE MMX Technology DT/IDT Quantization / De-quantization 8 7 6 4 3 2 1 DT IDT 2 18 16 14 12 1 8 6 4 2 QuantInter QuantlipIntra DequantInter DequantIntra SUM SSE MMX Technology SSE MMX Technology Figure These graphs illustrate the optimization factor for motion estimation, motion compensation, DT/IDT, and quantization/de-quantization when optimized with MMX technology, SSE, and, compared to implementation in language. As shown in Figure, the optimization factor is the ratio of processor cycles needed to carry out the same task in a -based implementation compared to an implementation with MMX technology, SSE, and. The performance improvement is proportional to the optimization factor. For example, the function GetSAD in motion estimation, when optimized by MMX technology, shows a x improvement compared to the language implementation. SSE provides a 1x improvement, and the improvement for is 12x. The SUM function provides a general indication of the performance gain for a particular module.

In the second section, graphs shown describe the number of frames per second the test program can handle. Figure 6 shows optimization techniques applied to real-life video samples. Total frames per second increase with the application of later generations of SIMD instruction set extensions, providing a significant performance improvement compared to a -based implementation. MMX technology, SSE, and provide an optimization factor of approximately 12, 17, and 2 respectively. Frames Per Second 16 14 12 1 8 6 4 Video Sequence Video Sequence 2 ar (32x24) Grandma (176x144) Akiyo (32x288) laire (176x144) Foreman (32x288) Terry run (32x24) Football (32x24) a. Frames per second achieved by and SIMD instruction sets implementation on video encoding. Video Sequence 2 2 1 1 Football (32x24) Terry run (32x24) Akiyo (32x288) Foreman (32x288) Video Sample laire (176x144) Grandma (176x144) ar (32x24) SSE MMX Technology b. Improvement achieved by various samples of video clips when implemented in and SIMD technique. Figure 6 Effects of SIMD technique when applied to real-life video (resolutions indicated in parentheses). This section illustrates the improvement achieved by implementing the test program with or without Hyper-Threading Technology feature. The application of HT Technology to a real video sample significantly increased the frame rate. The actual number of frames processed depends largely on scene complexity and the amount of motion involved in the scene. These examples show a performance improvement of 7 2 percent (Table 2). 6

Video Sequence HT Technology HT Technology (resolution) Disabled Enabled = Enabled/Disabled Football (32x24) 332.44 381.9 1.1 Terry run (32x24). 4.12 1.8 Akiyo (32x288) 386.3 41.78 1.8 Foreman (32x288) 32.9 333.11 1.1 laire (176x144) 1,44.19 1,74.31 1.2 Grandma (176x144) 1,62.4 1,78.22 1.14 ar (32x24) 471.6 6.6 1.7 Table 2 Effects of HT Technology on a real video sample, expressed in frames per second. onclusion These results show that the features available in the current generation of IA-32 processors enhance the processing efficiency of the major functional modules involved in video encoding. The application of MMX technology, SSE, and instruction set extensions results in significant performance improvements compared to -based implementations. The greatest gains were achieved by. When applied to real-life video samples, the application of MMX technology and later SIMD computational enhancements resulted in respective performance increments of 12, 17, and 2 times, while HT Technology resulted in a 7 2 percent improvement. For developers of DSS solutions, these performance gains translate to improved processor utilization and the benefit of significantly enhanced application performance. 1 Hyper-Threading Technology requires a computer system with an Intel Pentium 4 processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/info/ hyperthreading/ for more information, including details on which processors support HT Technology. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm. 7

Intel, the Intel logo, Intel Pentium 4, Intel Pentium M, Intel Xeon, and MMX are trademarks or registered trademarks of Intel orporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. 9/SA/MRM/PP/1K 39629-1 opyright 2, Intel orporation. All rights reserved.