Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

1 Nema: An OpenGL & OpenCL Embedded Programmable Engine. Georgios Keramidas & Iakovos Stamoulis, Think Silicon (mobile graphics).

2 Overview. Think Silicon is a privately held company founded in 2007 by the core team of the Atmel MMC IC group. It develops Intellectual Property (IP) semiconductor cores in the field of computer graphics for mobile/embedded devices.

3 Mobile Graphics: Past, Present, Future. [Timeline figure: past, present, future.] Visualize data from various sources (Internet of Things paradigm).

4 Moore's Law in mobile GPUs. [Chart, log scale: GPU performance, GPU power and display resolution, from 10 years ago to 10 years from now.] Applications in the next 5 years will need 3-4x the current GPU performance. Display resolution is increasing exponentially (4K displays are here), but power must stay roughly within the same budget (a few hundred milliwatts).

5 Application Processor: now. Multicore CPU and multicore GPU (fully programmable); video encoder, video decoder, image processors, camera processors and vision processors (ASIC/ASIP, (semi-)programmable). The market needs SMARTER mobile GPUs.

6 Application Processor: future. Multicore CPU and multicore GPU (fully programmable); video encoder, video decoder, image processors, camera processors, vision processors. Multicore, multipurpose, multidomain GPUs: mobile GPUs, media processors & GPGPUs.

7 Mobile GPU challenges: performance; power consumption; low die area; latency wall; memory wall; frequency wall; adaptability to new standards; time to market; performance per mm²/Watt/$.

8 Think3D. [Block diagram: processing array, local caches, shared cache/MMU, Nema cluster (4 cores/cluster), router.] Multicore modular design; NoC-based scalable design; SoC AXI fabric (internal signalling); disruptive architecture.

9 Nema Graphics Cluster (4 cores). [Block diagram: DMA, Triangle Setup, Triangle Culling, Rasterizer, Early-Z, Interpolators, Nema Cores, Stencil/Blender, Raster Operations, Z-Buffer, SIMD, Multipoint Texture Mapping, I$, D$, AXI, Aggregator / MMU.] Unified shading architecture: vertex, fragment and GPGPU tasks. GPU front-end: fixed hardware. GPU back-end: programmable engine. GPU texture unit: multipoint, mipmap and texture compression support, separate or system cache.

10 Nema Graphics Cluster (4 cores). [Block diagram: DMA, Triangle Setup, Triangle Culling, Rasterizer, Early-Z, Interpolators, Nema Cores, Stencil/Blender, Raster Operations, Z-Buffer, SIMD, 2D Composition Engine, Entropy Encoder, Multipoint Texture Mapping, I$, D$, AXI, Aggregator / MMU.] 1. Hardware entropy encoder for video processing. 2. Back-end programmable engine: Image Signal Processing (ISP). 3. 2D/Composition Engine (Think2D): turns off FP cores in simple GPU tasks.

11 Nema ID. Hybrid SIMD/MIMD machine: a hardware/software scheduler balances SIMD/MIMD threads in the cluster. In-house ISA: 32-bit integer registers, 128-bit vector registers (4x32 IEEE single, 8x16 half floats). Ultra-threaded. Single or VLIW-like dual-issue core. Concurrent threads per core. High utilization of EX resources.
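The two vector layouts named on this slide (4x32-bit IEEE single, 8x16-bit half floats) both fill one 128-bit register. The following is a minimal sketch of that packing using Python's standard `struct` module. The function names and the little-endian lane order are assumptions for illustration only; the slides do not describe the actual Nema register encoding.

```python
import struct

VECTOR_REGISTER_BITS = 128  # register width stated on the slide


def pack_singles(values):
    """Pack four IEEE single-precision floats into 16 bytes (assumed little-endian lanes)."""
    if len(values) != 4:
        raise ValueError("a 128-bit register holds exactly 4 single-precision lanes")
    return struct.pack("<4f", *values)


def pack_halves(values):
    """Pack eight IEEE half-precision floats into 16 bytes (assumed little-endian lanes)."""
    if len(values) != 8:
        raise ValueError("a 128-bit register holds exactly 8 half-precision lanes")
    return struct.pack("<8e", *values)


def unpack_singles(raw):
    """Read the register contents back as four single-precision lanes."""
    return list(struct.unpack("<4f", raw))


def unpack_halves(raw):
    """Read the register contents back as eight half-precision lanes."""
    return list(struct.unpack("<8e", raw))
```

Either view occupies the same 16 bytes, which is why one register file can serve both precisions.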

12 Nema ID. Configurable number of pipeline stages (up to 16 stages) for high MHz frequencies. Each thread occupies a different pipe stage: hibernated threads wait, ready-to-run threads are sent for execution. No hardware feedbacks/hazards/interlocks: less complexity, smaller cores, low-power cores, more cores. Non-blocking: latency is hidden at all levels. No stalling on a miss or on long-latency arithmetic operations: other threads enter the pipeline.
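The latency-hiding idea on this slide (a thread that waits on a long operation is hibernated while other ready threads enter the pipeline) can be illustrated with a toy issue-slot model. This is only a sketch under simplifying assumptions: one issue slot per cycle, round-robin selection among ready threads, and each instruction described solely by the number of cycles before its thread may issue again. The function `simulate` and its data layout are hypothetical and do not describe the real Nema scheduler.

```python
from collections import deque


def simulate(programs):
    """Toy model of interleaved multithreading.

    programs: one list per thread; each entry is the latency (in cycles)
    after which that thread is ready to issue its next instruction.
    Returns (total_cycles, idle_issue_slots).
    """
    pcs = [0] * len(programs)          # next instruction index per thread
    wake = [0] * len(programs)         # cycle at which each thread is ready again
    order = deque(range(len(programs)))  # round-robin order of threads
    remaining = sum(len(p) for p in programs)
    cycle = 0
    idle = 0
    while remaining:
        issued = False
        for _ in range(len(order)):
            t = order[0]
            order.rotate(-1)           # move the inspected thread to the back
            if pcs[t] < len(programs[t]) and wake[t] <= cycle:
                wake[t] = cycle + programs[t][pcs[t]]  # hibernate until the result is due
                pcs[t] += 1
                remaining -= 1
                issued = True
                break
        if not issued:
            idle += 1                  # no thread was ready: the slot is wasted
        cycle += 1
    return cycle, idle
```

With a single thread running `[1, 4, 1]` (the middle instruction standing in for a miss), the model wastes three issue slots; with four such threads the other threads fill those slots and none are wasted.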

13 Low Power Features. Power is a design parameter at all design stages. A power emulator estimates power consumption (GPGPUSim-Pow from LPGPU). Core level: cores can be turned on and off, and the hardware scheduler will not use the off cores; core-level DVFS. Memory level: sophisticated way-selection techniques reduce memory cell accesses. Compression minimizes external access: framebuffer compression, Z-buffer compression, various texture compression algorithms. A separate 2D/Composition Engine unit (Think2D) keeps the energy-hungry floating-point processors turned off.
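Why core-level DVFS and core gating save power can be seen from the standard first-order model of dynamic CMOS power, P = C · V² · f. The sketch below applies that textbook formula; it is not taken from the slides, it ignores leakage, and the function names are hypothetical.

```python
def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to nominal under P = C * V^2 * f (leakage ignored)."""
    return voltage_scale ** 2 * frequency_scale


def cluster_relative_power(core_settings):
    """Sum relative dynamic power over the cores of a cluster.

    core_settings: list of (is_on, voltage_scale, frequency_scale) tuples.
    A core that is switched off is assumed to draw no dynamic power.
    """
    return sum(
        relative_dynamic_power(v, f) for is_on, v, f in core_settings if is_on
    )
```

Under this model, running a core at half frequency and 0.8x voltage costs about 0.32x its nominal dynamic power, and a cluster with two of four cores gated off draws about half the dynamic power of the full cluster at nominal settings.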

14 GPU Special Features. Fully programmable: GLSL/OpenGL ES. On-the-fly compression: Z-buffer (patent in progress) and framebuffer (65% less framebuffer bandwidth). A hardware value memoization mechanism reduces computation effort: 30% (avg) fewer executed instructions. Different number of threads per core (heterogeneous system): up to 50% power reduction via run-time load balancing + DVFS in an 8-core CMP.
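Value memoization reuses the result of an operation when the same operation is seen again with the same operands, so the work is not repeated. The following software sketch shows the general idea only. The class `MemoTable`, its capacity and its oldest-entry-first eviction are assumptions for illustration; the slides do not describe how the hardware mechanism is organized.

```python
class MemoTable:
    """Small table mapping (operation, operands) to a previously computed result."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}   # dicts keep insertion order, used here for eviction
        self.hits = 0       # operations answered from the table
        self.misses = 0     # operations that had to be executed

    def execute(self, op_name, func, *operands):
        key = (op_name, operands)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        result = func(*operands)
        if len(self.entries) >= self.capacity:
            # Evict the oldest entry (assumed policy, chosen for simplicity).
            self.entries.pop(next(iter(self.entries)))
        self.entries[key] = result
        return result
```

The ratio of `hits` to total calls is the fraction of operations that did not need to be executed again.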

15 GPGPU Special Features. Fully programmable: C/C++/OpenMP/OpenCL. Faster divergent-thread performance (inherent by design): up to 30% in complex conditional programs. Faster single-thread performance. Compiler-driven interlocks & prefetching: up to 100% increase in speed (valuable in GPGPU tasks). The compiler detects dependencies and embeds scheduling info in instructions, which maintains pipeline simplicity without hazard detection. Easily extendable ISA for new-generation apps & standards. Currently: instructions for accelerating video processing, namely DCT/iDCT, SAD (Sum of Absolute Differences) and color space conversion on various formats.
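For reference, SAD (Sum of Absolute Differences) is the block-matching metric commonly used in video motion estimation: the element-wise absolute differences of two equally sized pixel blocks are summed. A plain scalar sketch follows. The function name and the list-of-rows block layout are illustrative assumptions, and this says nothing about how the Nema instruction itself is defined.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2D pixel blocks."""
    if len(block_a) != len(block_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(block_a, block_b)
    ):
        raise ValueError("blocks must have the same dimensions")
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )
```

A lower SAD means the two blocks are more alike; identical blocks give 0.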

16 Think3D Ecosystem. Assembler based on BinUtils 2.22 (in-house). Compiler based on LLVM-3.2 (in-house): Clang front-end for C/C++/OpenCL, in-house GLSL to LLVM IR, polyhedral optimizations (Polly), link-time optimizations (LTO). Libraries & APIs: OpenCL API (based on pocl), OpenGL ES API (based on Mesa3D). A ported Pthreads-like library allows multithreading to be used immediately. Ported libc / libm / libcl.

17 Shorter Time-to-Market. Short time-to-market for designs is the target. Detailed processor description and exact specification of each ISA instruction: hardware registers, pipeline stages, system registers. [Flow diagram: ISA specification manual; automated: documentation, LLVM compiler, BinUtils assembler, linker, SystemC emulator, Verilog RTL; regression suite (pass/fail).]

18 Diverse applications supported. Design once, use many: scalability, adaptability. Threads per core; estimate power; cores per cluster; NoC-based for a high number of cores (8+); change pipe stages; modify instruction set.

19 Status. Timeline: today; 14Q1 first release; 14Q3 multicluster. Nema Processor: passes regression tests; benchmarking ISA choices; power management algorithms; 4 cores/1 cluster; Nema multi-cluster, NoC-based. Graphics modules: rasterizer, interpolators, texture mapping, memory subsystem, blending unit, MMU. API supported: OpenGL ES 2.0, OpenCL. Testing: FPGA prototype; OpenGL ES 2.0; OpenCL. System-level optimizations: H264/5, image processing, OpenVX.

20 Contact Info. HQ: Patras Science Park, Rion Achaias, Greece. NL: High Tech Campus 1-E, 5656 AE Eindhoven, The Netherlands, Tel: + 31 (0). Sales inquiries: sales@think-silicon.com. Info: info@think-silicon.com.
