Intel CoFluent Studio in Digital Imaging

Sensata Technologies Use Case

Sensata Technologies
www.sensatatechnologies.com

Formerly Texas Instruments Sensors & Controls, Sensata Technologies is the world's leading designer and supplier of sensors, controls and protectors across a broad range of markets and applications, including transportation, industrial, and appliances. Today, Sensata Technologies is comprised of three major global business units with sales offices worldwide and business and manufacturing centers in 10 different countries. Their innovative solutions in sensors and controls improve safety, efficiency and comfort for millions of people every day. Sensata, which currently employs approximately 9,500 people, manufactures over 20,000 different highly engineered and application-specific products. Over one billion units are shipped each year.

When designing a new vision camera system aimed at automotive and security applications, Sensata used Intel CoFluent Studio to select and optimize its next-generation image sensing architecture. The team created a model of a camera system and simulated its behavior and time properties. Architecture choices were studied and hardware/software partitioning alternatives were explored. For each architecture option, local memory requirements, potential traffic bottlenecks, execution times and the complexity of functions were studied and analyzed. This paper illustrates how the early design effort effectively used various Intel CoFluent Studio features to create, simulate, and analyze seven different architecture models in approximately four weeks.

Table of Contents
1 Camera System Application Modeling
1.1 Application Description
1.2 Behavioral Modeling
1.3 Model Characterization: Time Attributes & Design Parameters
2 Execution Platform Modeling
2.1 Application Description
2.2 Performance Characterization
3 Mapping and Architecture Modeling
3.1 Architecture Description
3.2 Architecture Characterization
3.3 Obtained Results
4 Conclusion

1 Camera System Application Modeling

1.1 Application Description

Sensata studied a simple camera system application. The main functions of the system are the following:
- Image sensing
- Image quality control
- Color processing
- Display handling (LCD and NTSC)
- Monitoring (power control and various diagnostics)
- Communication interfacing (I2C or SPI)

Image sensing is handled by a dedicated hardware component, the imager, which captures 642x482 12-bit monochrome or color images at 60 frames per second. The image quality control function determines the image sensor's mode (monochrome versus color), shutter speed, and compression level. Environmental monitoring includes voltage and temperature monitoring for potential error reporting. Display handling involves reformatting the image for LCD or NTSC display. The communication interfacing includes I2C or SPI interfaces for communicating with a host controller. In its initial stage, the application model was limited to color processing, while monochrome image quality control (input control, noise removal, defective pixel removal, image enhancement, output control) was reduced to a simple auto-control model.

1.2 Behavioral Modeling

The camera system application is reduced to a basic color processing function. This function takes raw data frames as input and outputs frames in RGB format (8-bit Red, 8-bit Green, 8-bit Blue). A test case for the camera system simulates image sensing and display as a simple video data source and sink. The VideoSource function reads test files on the simulation PC's hard disk and sends the data to the ColorProcessing function, while the VideoSink simply displays the received data as an image in RGB format. While actual image data is sent pixel by pixel for color processing, it is not necessary to model the camera system at the pixel level, since macroscopic latency and throughput performance results are expected. For simplification, the Sensata model assumes that the ColorProcessing function receives a complete frame at once. ColorProcessing includes four concurrent sub-functions that can be pipelined in a certain order: DefectivePixelRemoval, WhiteBalance, Demosaic, and Sharpen. Intel CoFluent Studio message queues are used to model FIFO channels between stages of the pipeline to enable independent and asynchronous communications between stages.
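
To make the dataflow concrete, the sketch below is a plain C++ analogue of this structure; it is not Intel CoFluent Studio code. It assumes a hypothetical Frame type carrying one complete image and a minimal blocking FrameQueue standing in for the tool's message queues, wired as VideoSource -> four color processing stages -> VideoSink.

    // Plain C++ analogue of the modeled dataflow (hypothetical types, not the tool's API).
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct Frame {                        // one complete image, as in the Sensata model
        std::vector<std::uint16_t> pixels;  // raw 12-bit samples stored in 16-bit words
        double enterTimeNs = 0.0;           // timestamp used later for latency measurement
    };

    class FrameQueue {                    // blocking FIFO channel between two pipeline stages
    public:
        void send(Frame f) {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(f));
            cv_.notify_one();
        }
        Frame receive() {                 // blocks until a frame is available
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            Frame f = std::move(q_.front());
            q_.pop();
            return f;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Frame> q_;
    };

    // Channels wiring VideoSource -> 4 color-processing stages -> VideoSink.
    FrameQueue videoIn, q12, q23, q34, videoOut;

Because the model works on whole frames rather than pixels, one queue entry corresponds to one frame, matching the simplification described above.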

Each pipeline function follows the same behavioral design pattern: after an initialization sequence (Init operation), an infinite loop waits to receive a frame from stage N-1 through ChannelIn, processes the frame (Algorithm operation), and sends the result to stage N+1 through ChannelOut. A first token-based simulation without data type or data processing algorithm definitions is run to verify and analyze the complete system control and data flow. Next, C algorithms are added for each color processing pipeline function. They can be copied and pasted in the definition area of the Algorithm operation, or they can be declared as external C routines, inserted into the project as external files, and called from within Algorithm.

As the model of the ColorProcessing system is repetitive, a possible solution is to copy and paste the four sub-functions and change the code in each Init and Algorithm operation. A more efficient way of duplicating functions is to make them reusable IP models in libraries. In this case, a single ColorProcessingStage IP is created that includes all four possible algorithms. An external parameter is defined to select which algorithm to use when reusing the IP. This offers the ability to test the application with different pipeline orders, provided that all pipeline stages have the same input and output data formats (which is not the case for the Sensata model). A further simplification is to model the ColorProcessing function as a single-stage function. It is defined as a vector of functions that can be instantiated from one to four times in multiple instance mode. This does not require copy-paste or multiple stage drawings and, if applicable, has the advantage of testing the pipeline in any desired order for any number of stages (this also requires compatible input/output data formats for all stages).

1.3 Model Characterization: Time Attributes & Design Parameters

Since Intel CoFluent Studio models are timed, durations of computations (operations) and communications (inputs/outputs) have to be defined. The image capture duration is defined at 16 ms for outputting 60 frames per second. VideoIn and VideoOut message queues are set to 10 ns (non-significant times, as they are not important in the scope of this study) for send and receive times. Pipeline channels are set to complete one 16-bit pixel transfer in a single cycle: send time is set to a very short non-significant value (10 ns), and receive time is set to a number of cycles corresponding to the number of pixels per frame: 642 * 482 = 309444. This creates a realistic total transfer time (send time + receive time). In order to make the model independent of the frame size, a specific keyword, USERDATASIZE, is used. This corresponds to the value of a specific field in the model data structure set to represent the size of the data, and removes the need for the real data. To give greater flexibility to the simulation, the user-customizable data size is defined as a tunable generic parameter (a sort of simulation knob) that can be set at simulation time, called FrameLength. FrameLength ranges from 100 to 642x482 pixels. A number of cycles per pixel is defined for the duration of each algorithm. It is determined by existing profiling data or estimations for each color processing stage. Therefore, the duration of the Algorithm operation corresponds to the number of cycles per pixel (specific to each algorithm) multiplied by the number of pixels per frame (FrameLength).
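
Building on the Frame and FrameQueue types sketched earlier, the common stage pattern (Init once, then receive / Algorithm / send forever) and the single reusable stage selected by an external algorithm parameter might look as follows. The enum and function names are illustrative and are not the CoFluent Studio IP interface.

    // Hypothetical reusable pipeline stage: one body, algorithm chosen by a parameter.
    enum class Algo { DefectivePixelRemoval, WhiteBalance, Demosaic, Sharpen };

    // Placeholder for the real C routines that would be pasted into, or called from,
    // the Algorithm operation; the bodies are omitted in this sketch.
    void runAlgorithm(Algo algo, Frame& frame) {
        (void)frame;  // unused here because the algorithm bodies are elided
        switch (algo) {
            case Algo::DefectivePixelRemoval: /* ... */ break;
            case Algo::WhiteBalance:          /* ... */ break;
            case Algo::Demosaic:              /* ... */ break;
            case Algo::Sharpen:               /* ... */ break;
        }
    }

    void colorProcessingStage(Algo algo, FrameQueue& channelIn, FrameQueue& channelOut) {
        // Init operation: one-time setup (tables, coefficients) would go here.
        for (;;) {
            Frame frame = channelIn.receive();   // wait for a frame from stage N-1
            runAlgorithm(algo, frame);           // Algorithm operation
            channelOut.send(std::move(frame));   // forward the result to stage N+1
        }
    }

Instantiating colorProcessingStage four times with different Algo values reproduces the pipeline while keeping a single stage body, which is the reuse idea described above.
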
The conversion from a number of cycles to a time value is based upon the definition of the cycle period of the execution target. For example, on an FPGA at 50 MHz, 1 cycle = 1/50 us = 20 ns. In order to calculate the pipeline latency for each frame, an additional timestamp field was added to the frame data structure. The timestamp field is used to save the time when the frame enters the pipeline, enabling the latency calculation when it exits. The Intel CoFluent Studio simulation API provides access to the simulation time.
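
The timing bookkeeping described above reduces to a few multiplications. The small standalone program below illustrates it under an assumed cycles-per-pixel figure (the value 3 is only a placeholder, not Sensata's profiling data): an Algorithm duration derived from FrameLength and the 20 ns cycle period, and a frame latency computed from an entry timestamp.

    #include <cstdio>

    int main() {
        // Execution-target cycle period: on an FPGA at 50 MHz, 1 cycle = 1/50 us = 20 ns.
        const double cyclePeriodNs  = 20.0;
        const long   frameLength    = 642L * 482L;  // 309444 pixels per frame
        const double cyclesPerPixel = 3.0;          // assumed figure for one stage

        // Duration of one Algorithm operation = cycles/pixel * pixels/frame * cycle period.
        const double algoDurationNs = cyclesPerPixel * frameLength * cyclePeriodNs;

        // Latency of one frame: the timestamp written when the frame enters the pipeline
        // is subtracted from the simulation time when it leaves the last stage.
        const double enterTimeNs = 0.0;                                 // stamped at entry
        const double exitTimeNs  = enterTimeNs + 4.0 * algoDurationNs;  // 4 stages, no overlap
        const double latencyMs   = (exitTimeNs - enterTimeNs) / 1.0e6;

        std::printf("algorithm duration = %.2f ms per frame, pipeline latency = %.2f ms\n",
                    algoDurationNs / 1.0e6, latencyMs);
        return 0;
    }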

The simulation of the application model offers visual verification of the effectiveness of algorithms within Intel CoFluent Studio's image display tool. Results are shown below (input images on the left, displayed with an off-the-shelf tool; RGB output images on the right, displayed within Intel CoFluent Studio). In addition, to analyze and validate the pipeline effect and its time properties, a timeline chart (a sequence diagram similar to a Gantt chart) is automatically produced during simulation.

Sensata had access to real data processing algorithms. This is not always the case. If the actual data processing algorithms are unavailable, Intel CoFluent Studio can run token-based simulations with empty algorithms, using only their time characterization. For the Sensata case, profiling of algorithm execution within Intel CoFluent Studio can be used as an indication of the duration of each algorithm execution (min, max, average). This includes comparing the dynamic execution profile against simulation time or its number of executions, which is helpful in analyzing the complexity of functions.

2 Execution Platform Modeling

2.1 Application Description

For its new vision camera system, Sensata wanted to study how two separate components from the previous design could be merged into a single system-on-chip, or what the proper partition between separate components would be. The execution platform provides various software and hardware execution resources: a DSP with its coprocessor and RAM, and an FPGA, with no soft or hard core, including RAM for data buffering, connected to the imager. Various pixel busses, 12 or 24 bits wide, link the different elements.

2.2 Performance Characterization

Intel CoFluent Studio's platform models are created by assembling generic hardware components to provide computing, communication or storage resources. Hardware (ASIC, FPGA, co-processor, accelerator, etc.) or software (DSP, CPU, MCU) computing units are called processors. Communication links are called nodes, and can be characterized as bus, routing network or point-to-point. Storage units are called shared memories. Universal behavioral and performance attributes characterize the elements of a platform model. Sensata created and characterized three different platform configurations representing potential execution structures:
- Platform1: two hardware processors and a bus
- Platform2: three hardware processors and a bus
- Platform3: one hardware processor, one software processor, and a bus

The cycle period of hardware processors is defined as a generic parameter ranging from 10 to 100 ns with a default value of 20 ns. The software processor is characterized through a relative speed ratio, which is a multiplicative factor applied to the initial time values given at the application level to simulate a faster or slower processor. The transfer time of the bus was also modeled as a generic parameter with a varying value.

3 Mapping and Architecture Modeling

3.1 Architecture Description

Sensata explored multiple mapping alternatives:
- ConfigurationA: image quality control and color processing running on the FPGA
- ConfigurationB: image quality control and color processing running on the DSP
- ConfigurationC: image quality control running on the FPGA and color processing running on the DSP

The display, communication and monitoring functions run on the DSP. Image sensing runs on the FPGA. For each configuration (A, B, and C), the study objectives were the size and cost of the FPGA, RAM sizes, latencies (Imager > Format > Display), bottlenecks, and DSP load. Simplified models of ConfigurationA and ConfigurationB were completed. ConfigurationC was found to be similar to ConfigurationB when using color processing at frame synchronization, but differed when using monochrome processing or row synchronization.
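
The platform parameters of section 2.2 and the three mappings of section 3.1 can be condensed into a small configuration record, as sketched below. The parameter ranges and the FPGA/DSP allocations come from the text above; the type and field names, and the concrete function names, are assumptions made for illustration.

    #include <map>
    #include <string>

    // Hypothetical summary of the generic platform parameters from section 2.2.
    struct ProcessorCfg {
        std::string name;
        bool        isSoftware    = false;  // the DSP is modeled as a software processor
        double      cyclePeriodNs = 20.0;   // hardware generic parameter, swept over 10-100 ns
        double      speedRatio    = 1.0;    // software multiplicative factor on application times
    };

    const ProcessorCfg fpga{"FPGA"};
    const ProcessorCfg dsp{"DSP", true, 20.0, 1.0};

    // Function-to-processor allocations for the three mappings of section 3.1.
    using Mapping = std::map<std::string, std::string>;  // function name -> processor name

    const Mapping configurationA = {  // image quality control and color processing on the FPGA
        {"ImageSensing", "FPGA"}, {"ImageQualityControl", "FPGA"}, {"ColorProcessing", "FPGA"},
        {"DisplayHandling", "DSP"}, {"Communication", "DSP"}, {"Monitoring", "DSP"}};

    const Mapping configurationB = {  // image quality control and color processing on the DSP
        {"ImageSensing", "FPGA"}, {"ImageQualityControl", "DSP"}, {"ColorProcessing", "DSP"},
        {"DisplayHandling", "DSP"}, {"Communication", "DSP"}, {"Monitoring", "DSP"}};

    const Mapping configurationC = {  // image quality control on the FPGA, color processing on the DSP
        {"ImageSensing", "FPGA"}, {"ImageQualityControl", "FPGA"}, {"ColorProcessing", "DSP"},
        {"DisplayHandling", "DSP"}, {"Communication", "DSP"}, {"Monitoring", "DSP"}};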

3.2 Architecture Characterization

A total of seven simulation models are created for comparison: five simulation models for ConfigurationA and two for ConfigurationB. The seven models are obtained in minutes using Intel CoFluent Studio's drag-and-drop mapping feature, which allows functions to be allocated to processors in one mouse click. The resulting architecture models are automatically generated in SystemC by the tool. Memory sizes, power consumptions and cost values are defined for processors, functions, operations, and FIFO channels following certain rules. The utilization of each component at any level of the hierarchy in the model is evaluated as a load ratio (%) or in cycles per second (Cyps, KCyps, MCyps, GCyps). For example, the duration of color processing algorithms and data inputs/outputs depends on the image size, characterized by the FrameLength generic parameter. The cost attribute is used to represent the silicon area.

3.3 Obtained Results

For each simulation, a table of performance results can be obtained and exported to an Excel spreadsheet. The following useful findings are extracted from the simulation and used as system architecture guidelines.

Finding the maximum number of image lines to be processed in a frame timeline
When FrameLength = 32100, the image capture load is 98.81%. There is a stall in the frame sent to the camera system. The capture load reaches 100% when FrameLength = 31500. This shows that 49 lines of 642 pixels (642 * 49 = 31458) is the optimal number of lines to be processed in a frame timeline. Therefore, to meet the timing requirements, about 10 (= 482 / 49) pipeline stages need to be considered for each algorithm. This information is useful when the color processing algorithms are implemented in an FPGA or ASIC.

Finding potential bottlenecks
With FrameLength = 31500: stage 1 load = 38.59%, stage 2 load = 96.05%, stage 3 load = 57.96%, stage 4 load = 36.04%. This illustrates that stage 2 is the potential bottleneck. Stage 2 can be analyzed and implemented as a multiple-stage pipeline.

Providing the utilization of each function for different pipeline stage counts
With FrameLength = 309500 (642 * 482 = 309444): stage 1 load = 96.49%, stage number = 6; stage 2 load = 93.46%, stage number = 17; stage 3 load = 92.64%, stage number = 10; stage 4 load = 90.71%, stage number = 6.

Comparing dynamic memory utilization
When FrameLength = 309500 and processor relative speed = 1: memory min = 39.02 Kbytes, max = 2305.87 Kbytes, average = 1099.12 Kbytes.

When FrameLength = 309500 and processor relative speed = 2 (2 times faster): memory min = 39.02 Kbytes, max = 1399.13 Kbytes, average = 485.87 Kbytes. This illustrates that as the processor speed increases, the memory size decreases due to less processing time and a decrease in parallel activities.

Comparing dynamic power consumption
When FrameLength = 309500 and processor relative speed = 1: power min = 0 mW, max = 54 mW, average = 31.47 mW. When FrameLength = 309500 and processor relative speed = 2 (2 times faster): power min = 0 mW, max = 50.59 mW, average = 25.35 mW. Simulation results reflect the combined effects of a reduction in power due to the decrease in algorithm execution times, and a power increase related to the processor speed increase. Dynamic profiles can be obtained to make precise observations over time.

Finding the minimum number of redundant algorithm processing engines for given parameters, while still meeting timing requirements
When FrameLength = 309500 and the number of redundant algorithm processing engines = 13, the resulting image capture load = 95.05%. When FrameLength = 309500 and the number of redundant algorithm processing engines = 14, the resulting image capture load = 100%. When the image capture function load does not reach 100%, it indicates a stall in the frame sent to the camera system, and the processing is incomplete within one row timeline. The above results show that 14 is the minimum number of blocks to be processed to meet timing requirements.

Optimizing the tradeoff between the number of redundant algorithm processing engines and the memory size and power consumption
The larger the number of blocks, the more memory is required due to the increase in processing engines. This, in turn, increases the power due to the combined effects of more processing engines and less processing time. When processor speed increases, less memory is required due to fewer parallel activities. This decreases the power due to the combined effects of higher processing speed and less processing time.

Optimizing the tradeoff between memory size, power consumption, and cost
The higher the number of columns per image, the more memory is required due to the increased capacity of the message queues. When processor speed increases, less memory is required because of the decrease in parallel activities. This decreases power requirements due to the combined effects of higher processing speed and less processing time.

Finding whether a frame can be processed within a frame timeline in software
A frame can be processed within a frame timeline if the color processing algorithms are implemented on the software processor. The results show 96.81% total utilization for the software processor with FrameLength = 309500.

The following table summarizes Sensata's findings for each of the seven simulations, which serve to identify the optimal architecture by providing guidelines for performance/memory/power/cost tradeoffs.

From this table, Sensata deduced that:
- Models 1, 4, 5A, 6, and 6A are preferred candidates compared to models 2 and 5
- Models 1, 4, 6, and 6A are hardware implementations; model 5A is a software implementation
- Models 1, 6, and 6A are parallel processing implementations
- Model 1 has lower power, but a small additional cost compared to models 6 and 6A
- Model 6A has approximately 10% lower cost compared to model 1

4 Conclusion

Sensata's experience with Intel CoFluent Studio was largely positive, as the results obtained went far beyond what could be obtained with just spreadsheets. Spreadsheets provide theoretical static best- and worst-case figures, whereas Intel CoFluent Studio allows observing and analyzing dynamic profiles of the system's significant properties in the context of realistic use cases. Sensata identified the following Intel CoFluent Studio benefits:
- Application decomposition and application-to-platform mapping prepare efficiently for implementation
- System-level modeling and graphical notations help better master complexity and improve productivity
- Short design-space exploration and performance-analysis iterations allow validating architectural choices
- Architectural exploration and performance analysis help optimize architectures

"The optimal architecture was achieved through iterating on various model configurations, mappings and characterizations. However, the research and data gathering work was key," says Qing Song, DSP Systems-On-Chip Architect, Sensata Technologies.

Copyright 2012 Intel Corporation. All rights reserved. Intel and Intel CoFluent are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.