April 4-7, 2016 | Silicon Valley
PERFORMANCE OPTIMIZATIONS FOR AUTOMOTIVE SOFTWARE
  Pradeep Chandrahasshenoy, Automotive Solutions Architect, NVIDIA
  Stefan Schoenefeld, ProViz DevTech, NVIDIA
  4th April 2016
SESSION OVERVIEW
  Overview of the methodologies to optimize automotive HMI applications
  Introduction to Tegra profiler tools
  Case study: Qt5 OSS samples
WHY OPTIMIZE? Performance is User Experience
  SOFTWARE DEFINED CAR
    [Chart: lines of source code (in millions, axis 0 to 100) for ~luxury car**, ~IVI system**, Linux kernel 4.x*, Boeing 787** and NASA Mars Rover#]
  IDEAL VS REALITY
    [Chart: used processing vs. available headroom (0% to 100%) for ideal car computers and today's car computer]
  * Source: Linux kernel Wikipedia page: https://en.wikipedia.org/wiki/linux_kernel#lines_of_code
  # Monitoring the Execution of Space Craft Flight Software, NASA
  ** IEEE: Automotive Designline
WHY OPTIMIZE? COMPLEXITY
  MULTI-TASKING & MULTI-RENDERING CONTEXTS
  PIXEL EXPLOSION & MULTI-DISPLAY SYNCHRONIZATION
HOW TO OPTIMIZE? METHODOLOGY
  IDENTIFYING BOTTLENECKS
    CPU
    GPU
    Memory bandwidth
    HW accelerators
    Other
  WHAT'S NEEDED
    System level
    Your application & libraries
      Instrumentation: how much time is spent in every module?
    Third-party libraries, drivers
      Tools: what do they do?
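The instrumentation point above ("how much time is spent in every module?") can be sketched with a scoped timer. This is a minimal illustration under stated assumptions, not the presenters' tooling: `ModuleProfile` and `ScopedTimer` are hypothetical names, and wall-clock time from `std::chrono::steady_clock` stands in for whatever counters a real profiler would use.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: accumulates elapsed time and call counts per module name.
class ModuleProfile {
public:
    struct Entry {
        std::chrono::nanoseconds total{0};  // summed wall-clock time
        long calls = 0;                     // number of timed scopes
    };

    void add(const std::string &module, std::chrono::nanoseconds elapsed) {
        Entry &e = entries_[module];
        e.total += elapsed;
        ++e.calls;
    }

    // Returns the entry for a module (a zeroed entry if it was never timed).
    const Entry &entry(const std::string &module) { return entries_[module]; }

private:
    std::map<std::string, Entry> entries_;
};

// Hypothetical RAII timer: measures from construction to destruction
// and books the result under the given module name.
class ScopedTimer {
public:
    ScopedTimer(ModuleProfile &profile, std::string module)
        : profile_(profile),
          module_(std::move(module)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        profile_.add(module_,
                     std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed));
    }

    ScopedTimer(const ScopedTimer &) = delete;
    ScopedTimer &operator=(const ScopedTimer &) = delete;

private:
    ModuleProfile &profile_;
    std::string module_;
    std::chrono::steady_clock::time_point start_;
};
```

Placing a `ScopedTimer` at the top of each module entry point gives a coarse per-module breakdown, which is usually enough to decide where a system profiler should look more closely.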
HOW TO OPTIMIZE? KEY TEGRA TOOLS OVERVIEW
  TEGRA SYSTEM PROFILER
  TEGRA GRAPHICS DEBUGGER
TEGRA SYSTEM PROFILER (TSP)
  Multi-core CPU profiler for Tegra
  Easily prepare a device and deploy an application for profiling
  Quickly identify CPU hot spots, hot paths and L1/L2 cache issues
  Visualize multi-core CPU activities with a new timeline view
  Maximize multi-core CPU utilization
  Visualize CPU, GPU and EMC frequencies
  Visualize thread state
TEGRA GRAPHICS DEBUGGER (TGD)
  A console-grade tool to debug & profile OpenGL ES
  TGD enables graphics development, debugging & optimization on Tegra devices for OpenGL ES 2.0, 3.0 & 3.1 applications.
  Identify performance bottlenecks and GPU utilization
  Interactive examination of GPU pipeline state
  Real-time examination of draw calls
PROFILING SETUP OVERVIEW: DRIVE CX WITH LINUX
  [Diagram: HOST PC connected to DRIVE CX over SSH; DRIVE CX display output connected to DISPLAY]
QT5: CASE STUDY
  With Qt5 samples: BIG SCENE (qt3d) and PLANETS (qt3d, qml, quick)
  Lots of small geometry
  Many draw calls
  Scene graph usage
  GPU intensive
  Optimizing GL call stack
  Tools showcase, BIG SCENE: TSP, NVTX
  Tools showcase, PLANETS: TGD
QT3D RENDER.CPP
  Attribute *Renderer::updateBuffersAndAttributes(Geometry *geometry, RenderCommand *command,
                                                  GLsizei &count, bool forceUpdate)
  {
      Attribute *indexAttribute = Q_NULLPTR;
      uint estimatedCount = 0;
      m_dirtyAttributes.reserve(m_dirtyAttributes.size() + geometry->attributes().size());
      Q_FOREACH (const QNodeId &attributeId, geometry->attributes()) {
          Attribute *attribute = m_nodesManager->attributeManager() ;
          if (attribute == Q_NULLPTR)
              continue;
          // ...
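NVIDIA TOOLKIT EXTENSION (NVTX)
  #include "nvToolsExt.h"

  void hotspotfunc()
  {
      // Drop a single marker on the profiler timeline
      nvtxMarkA("hotspot reached");
  }

  void render()
  {
      // Open a named range, do the work, then close the range
      nvtxRangeId_t r = nvtxRangeStartA("rendering scene");
      // render everything
      nvtxRangeEnd(r);
  }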
GL STATE CACHING
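As a rough illustration of GL state caching, the sketch below remembers the last value set for each capability and skips calls that would not change anything. It assumes a stand-in `FakeDriver` that only counts calls, so it runs without a GL context. `StateCache`, `FakeDriver`, `kDepthTest` and `kBlend` are hypothetical names; in a real renderer the two driver methods would forward to `glEnable()` and `glDisable()`.

```cpp
#include <cstdint>
#include <unordered_map>

// Numeric values of GL_DEPTH_TEST and GL_BLEND, repeated here so the
// sketch compiles without GL headers.
constexpr std::uint32_t kDepthTest = 0x0B71;
constexpr std::uint32_t kBlend = 0x0BE2;

// Stand-in for the GL driver (assumption): it only counts the calls it receives.
struct FakeDriver {
    int enableCalls = 0;
    int disableCalls = 0;
    void enable(std::uint32_t /*capability*/) { ++enableCalls; }
    void disable(std::uint32_t /*capability*/) { ++disableCalls; }
};

// Remembers the last known value of each capability and forwards a call
// to the driver only when the requested value differs from it.
class StateCache {
public:
    explicit StateCache(FakeDriver &driver) : driver_(driver) {}

    void setEnabled(std::uint32_t capability, bool enabled) {
        const auto it = known_.find(capability);
        if (it != known_.end() && it->second == enabled)
            return;  // redundant state change, nothing reaches the driver
        if (enabled)
            driver_.enable(capability);
        else
            driver_.disable(capability);
        known_[capability] = enabled;
    }

    // Forget everything, e.g. after code outside the cache touched GL state.
    void invalidate() { known_.clear(); }

private:
    FakeDriver &driver_;
    std::unordered_map<std::uint32_t, bool> known_;
};
```

The cache is only correct while every state change goes through it. When third-party code issues GL calls directly, `invalidate()` has to be called before the cache is trusted again.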
UNIFORM CACHING
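Uniform caching follows the same idea for shader parameters: keep the last value uploaded per uniform location and skip the upload when the value is unchanged. The sketch assumes one cache per shader program, since uniform values are stored per program. `FakeProgram` is a hypothetical stand-in that counts uploads, where a real implementation would call `glUniform4f()`.

```cpp
#include <array>
#include <unordered_map>

// Stand-in for a linked shader program (assumption): counts uniform uploads.
struct FakeProgram {
    int uploads = 0;
    void uniform4f(int /*location*/, float, float, float, float) { ++uploads; }
};

// Caches the last vec4 uploaded to each uniform location of one program.
class UniformCache {
public:
    explicit UniformCache(FakeProgram &program) : program_(program) {}

    void setVec4(int location, const std::array<float, 4> &value) {
        const auto it = last_.find(location);
        if (it != last_.end() && it->second == value)
            return;  // same value as last time, skip the upload
        program_.uniform4f(location, value[0], value[1], value[2], value[3]);
        last_[location] = value;
    }

private:
    FakeProgram &program_;
    std::unordered_map<int, std::array<float, 4>> last_;
};
```

Comparing the stored copy against the new value costs a few float comparisons on the CPU, which is typically far cheaper than a redundant call into the driver.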
EFFICIENT GPU PROGRAMMING BEST PRACTICES
  STATES
    Do not set states redundantly
    Try to sort draw calls according to common states
    Disable unused vertex arrays
  GEOMETRY
    Use buffer objects
    Pack small buffers into a single one and use one draw call
    Use indexed primitives
    Pack vertex attributes
    Use uniform winding (clockwise or counter-clockwise) for geometry
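Sorting draw calls according to common states can be sketched as follows. The `DrawCall` record and its integer state ids are assumptions made for illustration; a real renderer would key on whatever states are expensive to switch in its pipeline. `countStateChanges()` models how many shader and texture binds a naive submission loop would issue.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical draw call record: ids of the states it needs plus its mesh.
struct DrawCall {
    int shader;
    int texture;
    int mesh;
};

// Counts the shader and texture binds needed when the calls are submitted
// in the given order and a bind is issued only when the id changes.
int countStateChanges(const std::vector<DrawCall> &calls) {
    int changes = 0;
    int shader = -1;
    int texture = -1;
    for (const DrawCall &c : calls) {
        if (c.shader != shader) { ++changes; shader = c.shader; }
        if (c.texture != texture) { ++changes; texture = c.texture; }
    }
    return changes;
}

// Groups calls by shader first, then by texture. stable_sort keeps the
// submission order of calls that share the same state.
void sortByState(std::vector<DrawCall> &calls) {
    std::stable_sort(calls.begin(), calls.end(),
                     [](const DrawCall &a, const DrawCall &b) {
                         return std::tie(a.shader, a.texture) <
                                std::tie(b.shader, b.texture);
                     });
}
```

Reordering like this is only safe for draws whose result does not depend on submission order, which in practice means opaque geometry. Blended geometry usually has to keep its back-to-front order.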
EFFICIENT GPU PROGRAMMING BEST PRACTICES
  TEXTURES
    Use texture compression when possible
    Prefer immutable textures created with glTexStorage[23]D()
    Use mipmaps
    Consider using texture atlases/maps
    Avoid random access
    Update textures with glTexSubImage[23]D()
    Update dynamically generated textures through FBOs
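Immutable texture storage and mipmaps meet in one small calculation: glTexStorage2D() takes the number of mip levels up front, so the application has to compute it. The helper below is a sketch with a hypothetical name, `fullMipLevelCount()`, that returns the length of a full mip chain, i.e. floor(log2(max(width, height))) + 1.

```cpp
#include <algorithm>

// Number of levels in a full mip chain for a width x height texture:
// halve the larger dimension (rounding down) until it reaches 1.
int fullMipLevelCount(int width, int height) {
    int levels = 1;
    int size = std::max(width, height);
    while (size > 1) {
        size /= 2;
        ++levels;
    }
    return levels;
}
```

The result would be passed as the `levels` argument of glTexStorage2D(), after which the individual levels are filled with glTexSubImage2D() or generated with glGenerateMipmap().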
EFFICIENT GPU PROGRAMMING BEST PRACTICES
  RENDERING
    If possible, render front to back
    Avoid reading back from the GPU
    Disable modes/tests that you do not need
    Clear buffers only if you need to
    Avoid memory management during runtime
    Update data only when needed
    Cull early and often
    Do computations as early as possible
    Use a shader cache for faster application start
    Use instancing
    Use indirect draw calls
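Rendering front to back lets the depth test reject hidden fragments before they are shaded. A minimal sketch, assuming each opaque object is represented by a center point and the camera by a position (`Vec3`, `Renderable` and `sortFrontToBack()` are hypothetical names):

```cpp
#include <algorithm>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical opaque object, reduced to an id and the center of its bounds.
struct Renderable {
    int id;
    Vec3 center;
};

// Squared distance is enough for ordering and avoids a square root.
float squaredDistance(const Vec3 &a, const Vec3 &b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Orders opaque objects so the ones nearest to the camera are drawn first.
void sortFrontToBack(std::vector<Renderable> &opaque, const Vec3 &camera) {
    std::sort(opaque.begin(), opaque.end(),
              [&camera](const Renderable &a, const Renderable &b) {
                  return squaredDistance(a.center, camera) <
                         squaredDistance(b.center, camera);
              });
}
```

Depth ordering competes with ordering by state, so engines often combine the two, for example by sorting on state inside coarse depth buckets. Which trade-off wins depends on whether the application is bound by fragment work or by driver overhead, and that is a question for the profiler.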
CONCLUSION
  Optimize as you develop
  Identify your use cases
  Get an overview of the application modules
    How much time is spent in every module?
  Profile the modules for hot spots
    Invest the most time in reducing the big hot spots
    Get the low-hanging fruit first
  Use Tegra Graphics Debugger to analyze your GPU usage
    Optimize your GL stream and minimize driver overhead
RECOMMENDED SESSIONS
  TALK, TUTORIAL, HANDS ON LAB, HANGOUTS
  S6181 - Memory Bandwidth Bootcamp: Collaborative Access Patterns
  S6710 - Developer Tools for Next Generation Graphics APIs
  S6131 - Nvpro-Pipeline: Handling Massive Transform Updates in a SceneGraph
  S6810 - Optimizing Application Performance with CUDA Profiling Tools
  S6111, S6112 - NVIDIA CUDA Optimization with NVIDIA Nsight Eclipse Edition
  L6135A, L6135B - Jetson Developer Tools Lab
  H6122, H6157 - Performance Optimization & Analysis
THANK YOU
  JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join