The Use Of Virtual Platforms In MP-SoC Design Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006 1
MPSoC Is MP SoC design happening? Why? Consumer Electronics Complexity Cost of ASIC Increased SW Content Power consumption MP SoC..and it comes in many flavors 2
Agenda MPSoC Solution Space Virtual HW Platforms What are Virtual Platforms? Virtual Platform usage for Architectural Exploration Virtual Platform usage for SW development Requirements from Virtual Platforms The ESL Solution Pyramid 3
Solution Space: HW Implementation Standard 1 Task 7 Standard 2 Task 7 Standard 3 Task 7 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 8 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 8 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 8 High performance Low power HW Mapping Limited reuse No flexibility separate design for each platform Processor HW accelerators Memory 4 ASIC + Processor for global control
Solution Space: Programmable Accelerators Standard 1 Task 7 Standard 2 Task 7 Standard 3 Task Task 7 7 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task Task 1 Task Task Task 11 Task Task 1 2 2 Task Task 13 3 Task Task 4 4 Task Task 5 5 Task Task 6 6 Task Task 8 Task 81 Task 8 high performance high flexibility SW Mapping high design effort Low SW development productivity high differentiation Task 1 Task 2/3 Task 3/5/6/7 Accelerator Accelerator Accelerator Task 8 Processor + Ctrl µc IP 5 Mem Mem Mem Heterogeneous Custom Processor SoC
Homogeneous MP-SoC Mesh Standard 1 Task 7 Standard 2 Task 7 Standard 3 Task 7 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 8 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 8 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 8 Design flexibility Reuse for multiple apps Redundancy Processor SW Mapping Interconnect Difficult to program Communication becomes bottleneck Load Balancing High chip/royalty cost Memory 6 Homogeneous Multi-Processor Mesh
Solution Space: Heterogeneous SoC Standard 1 Task 7 Standard 2 Task 7 Standard 3 Task Task 7 7 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task Task 1 Task Task Task 11 Task Task 1 2 2 Task Task 13 3 Task Task 4 4 Task Task 5 5 Task Task 6 6 Task Task 8 Task 81 Task 8 Reuse of legacy design Reuse of legacy SW SW Mapping Load balancing Memory sub-system design Task 1 Task 6/7 HW Accelerator Task 3/4/5 Task 8 Accelerator Task 2 Accelerator Processor Accelerator + Ctrl Mem Mem Mem 7 Heterogeneous Multi-Processor SoC
The ESW View User Interface, User level functions Large ESW stack >10 Million lines of code C, C++, Java Soft real time - Defined by user perception Cache, MMU Use generic components for data movement Generic bus arbitration + peripherals MOPS GOPS KOPS User-Level Functions (Applications, UI) Middleware (Algorithm) Board Support Package Algorithm rich Small ESW stack ~1 Million lines of code C, C++ (no Java), or assembly, or Matlab Hard Real Time Cache off, MMU off Smart DMA, guaranteed Quality of Service Data locality NoC,, guaranteed QoS,, predictable latency Dedicated communication channels Hardware abstraction Layer + OS Support Thin layer ~100K lines of code C with inline Assembly, maybe C++ Accurate Response Time 8
What is a Virtual HW Platform? A SW model of the SoC HW Processors, accelerators, peripherals and interconnect HDS HAL, Drivers, O.S., Enables architecture exploration and optimization Enables ESW development, debugging and optimization Virtual HW Platform Custom DSP CorXTend Different abstraction levels for different use models FPGA FPGA Periph emem MEM 9
Virtual HW Platform Value Fast Allows accuracy-speed trade offs Flexible Cost effective Scaleable Observable and Controllable Available early in the design cycle 10
Virtual HW Platform Value SW Arch SW SW Arch HW/SW SW Arch Implementation Verification Implementation HW/SW Verification Implementation HW/SW Verification Virtual Platform Virtual Platform Emulation OR FPGA board Development board HW Arch HW (RTL) Design & RTL2GDSII Fab 4m 6m 6m 3m Early SW Development Real HW and SW Co-Design 11
Virtual HW Platform for Architecture Exploration Product Specification Traffic Generator Instruction Accurate Processor Memory Memory Controller DMA Baseband Accelerator accelerator (SPD) Custom Peripheral Processor (PD) 12
Platform Architecture Cycle Accurate TLM RAM RAM ROM ROM ARM926 ARM926 INT INT AHB AHB Bus Bus LCD LCD Loop Loop Filter Filter Input Input H.264 H.264 Decoder Decoder CABAC CABAC OCP OCP Bus Bus OCP OCP Bus Bus FRAME FRAME BUFFER BUFFER AV abstraction OCP TL2 13
Virtual Platforms for Architecture Exploration I need to constantly read and decode H.264 bitstreams. Virtual Platform helped me to find out that pixel data can be best transferred via DMAs from SRAMs. Control data can be directly put into the Entropy Encoder SRAM shared memory via my bus interface. Input Stream DMA I need fast an parallel access to reference and predicted frame data. Early architecture exploration helped me to find out that two SRAMs with DMA removes DMA this data bottleneck. SRAM Reference Pixels Motion Estimation Accelerator SRAM Output Pixel Data DMA DMA SRAM Predicted Pixels I am responsible for the control and synchronization of all cores. I do not require fast access to the pixel data. The access to the shared e.g. control and parameter data eases programming of this accelerator. Controller DMA SRAM Input Macroblock Deblocking Filter Accelerator 14 SDRAM Memory Controller DMA SRAM Output Macroblock I need to process the entire uncompressed 6MB frame 30 times per second! Architecture exploration helped me find out that DMA assisted input and output Macroblock memories are the best option.
Virtual HW Platform for ESW Development Product Specification Virtual Platform Instruction Traffic Accurate Generator ISS Instruction Accurate Instruction Baseband Processor Accurate Sub System DSP ISS Memory Memory Controller DMA Baseband Accelerator Peripheral (SPD) Custom Peripheral Processor (PD) 15
Virtual Platforms for ESW Development Virtual Cell Phone App s SW App s SW App s SW App s SW GUI App s SW App s SW App s SW App s SW SWAPI SWAPI SW Development Tools gdb DDD Ext SWAPI AP-RTOS Int SWAPI BP OS Int Ext AP-RTOS Int BP OS Int IP Library HAL System Simulation HAL HAL HAL Display Key Board Chip Set Base Station Platform Architect Simulator Display Key Board USB Host HW (PC, EWS) JTAG Cell Phone HW JTAG USB RF 16
17 Virtual HW Platform for ESW development
Virtual Platform What is Required? Ultra fast processor simulator ISS Accuracy-speed trade-offs Library of fast processor models Explore different communication paradigms Strong analysis solutions Library of communication protocols Integrate DSP algorithm Library of popular Wireless and MM standards Ultra fast platform Accuracy-speed trade-offs Multi core and platform debugging Standard based 18
Why Standards Based? Device Manufacturer Develop your own Processor Designer Signal Processor Designer Platform Creator Custom DSP CorXTend Virtual Platforms Model Designer IP Providers FPGA FPGA Periph emem MEM Platform Architect 19
Solutions Pyramid ESL Assessment ESL Methodology Transition Vertical Markets Video Office Wireless Product Platform Capture & Analysis for Architecture Development Product Platform Capture & Distribution for SW Development Processor Modeling Interconnect Modeling & Design Peripheral Modeling Custom Processor Design Algorithm Design Block IP Design and Verification 20
Summary ESL Assessment ESL Methodology Transition Vertical Markets Video Office Wireless Custom processor design Complex algorithm design SoC architecture exploration SoC verification Modeling for speed Virtual Platforms for ESW Integrated solution Pyramid Product Platform Capture & Analysis for Architecture Development Product Platform Capture & Distribution for SW Development Processor Modeling Interconnect Modeling & Design Peripheral Modeling Custom Processor Design Algorithm Design Block IP Design and Verification 21
So what s the next level of abstraction? Results (transistors per engineering effort) Processor is the nand gate of the future 1999 Level 1985 1992 Register transfer level 1978 a 0 d q Gate/logic level b 1 s clk Transistor level 10-100X on Simulation 10-100X on design productivity Impose design restrictions Effort 22
The Next Level Challenge The next level is (maybe): Processors and HW accelerators SW tasks running on these processors Transaction level modeling to model communication What are the design restrictions we need to impose to gain 100X productivity?...and what are the benefits to the users? Processors Simplify? No Cache? No MMU? Speculative execution? Branch prediction? SW tasks Threads? Locks? R/W to/from ports only? Programming models to abstract the SoC (message passing, SMP, streaming)? What do we do with all the legacy code? Communication Memory sub system design, NoC design 23