LINUX KERNEL UPDATES FOR AUTOMOTIVE: LESSONS LEARNED
TOM MCREYNOLDS, VLAD BUZOV
AUTOMOTIVE SOFTWARE
OCTOBER 15TH, 2013
Why kernel upgrades: the problem
- Linux kernel cadence doesn't match automotive's
  - Kernel incremented ~5 times/year
  - Cars: up to 3 years development time for a single project (or more)
- Can't ignore changes; newer is (usually) better: bug fixes, perf optimizations, better h/w support
Why kernel upgrades: the problem (cont)
- Need Linux kernel at start of development
  - Kernel will be old when the s/w goes into cars
- Stick with old kernel?
  - May miss bug fixes, perf. optimizations, better h/w support
- What about back-porting improvements?
  - Can be hard, risky if software architecture changes (common)
  - New kernel has all components tested together
- Strategy
  - Pick kernel version at start of development
  - Update, port and test before shipping
  - Do it with as few problems as possible!
Kernel upgrade: why we did it
- Upgraded from 2.6.36 to 3.1
- What the new kernel bought us
  - Feature improvements and fixes
    - GPU drivers moved to kernel, Linux style; more efficient
    - Change page attribute feature
  - Stability and reliability
    - The newer kernel is extensively tested within NVIDIA
  - Performance improvements
    - VFS scalability with multi-threaded workloads
    - Slab allocator speedups
Kernel upgrades: why we did it (cont)
- Benefits outside the kernel
  - New architecture of NVIDIA drivers and proprietary software
    - Helped avoid a class of potential performance problems
    - [Chart: MFrag/sec and MTriangle/sec, 2.6.36 vs. 3.1.10]
  - HW floating point
    - The newer kernel runs libs and apps built with hard float
    - [Chart: FPU FFT (sec) and FPU Raytracing (sec), 2.6.36 vs. 3.1.10]
Kernel upgrades: strategies
- Maximize longevity of kernel version
  - Minimize the need for full kernel updates
- Plan for kernel updates during development cycle
  - Don't wait until circumstances force an unplanned upgrade
  - Infrastructure, analysis and coordination can make it much less painful
- Consider supporting field upgrades
  - Most useful for critical fixes
  - Last resort for bad bugs
Kernel upgrade strategies: Maximize longevity
Maximize longevity
- LTS kernel versions
  - LTS: Long Term Support
  - Maybe higher stability
  - Longer support
- Where possible, choose software features with mature support
  - Fewer API changes, fewer bugs, etc.
  - Good example: filesystem EXT3/4
  - Bad example: filesystem BTRFS (buggy in 2.6.36)
Maximize longevity (cont)
- Monitor upstream kernel for fixes
  - Important fixes will need to be back-ported
  - The LTS maintainer will back-port selected fixes
    - But they may not be the back-ports you want
- Choose hardware with mature support
  - Not always possible, especially for hardware (mature == older, obsolescent)
  - Example: DLINK USB/Eth dongle
    - Driver is mature but device no longer available
  - Not an issue for proprietary hardware support (you have to write it anyway)
Maximize longevity (cont): what actually happened
- Choosing an LTS kernel may not always be possible
- Better to pick a kernel version that has been well tested on the SoC
  - For NVIDIA, that was 3.1
  - Leverages the work done for Google Android
- [Chart: # of tests per area: X11; flash, bootldr, kernel; system stress; networking; graphics; multimedia; interfaces; storage/filesystem]
Maximize longevity (cont)
- Not all required features are stable or mature
  - Cgroup functionality has gone through considerable churn and bugs
  - Switched to using ext4 over btrfs
- LTS maintainer may not back-port all needed fixes
  - May still require back-ports from a later kernel version
  - Cgroup threadgroup lock race is fixed in 3.4 and not in 3.2
Kernel upgrade strategies: Update kernel version
Update kernel version: risks
- Bugs
  - Significant changes all over the kernel tree
  - Kernel quality doesn't always improve in every area
    - Can introduce new bugs
    - Testing not rigorous, uneven
  - May expose bugs previously hidden in drivers for proprietary h/w
  - May require significant time and resources to bring all features to the same level of stability
Update kernel version: risks (cont)
- Driver interface changes
  - In-house drivers will likely need changes
  - May need to change driver significantly to take advantage of new/improved features
    - E.g. sleep modes, power management
  - GPL-only API may be a problem for proprietary modules
- Behavior changes: no promises from Linux
  - Scheduler, interrupt handling, etc. can change or be redesigned
  - In general, expect changes in performance, code size, etc.
Update kernel version: dependencies
- Not enough to just update the kernel
- May need user space changes as well for maximum advantage
  - E.g.: enable hardfp
- Library versions can change
  - API changes: extensions, obsoleting old APIs
Update kernel version: dependencies (cont)
- Tool chain can change
  - Forces revalidation of generated code
  - Must review and update compiler options
    - Newer compilers are stricter: warnings can become errors, require code changes
    - Enable hardfp ABI, use tested build configurations of NVIDIA libraries
- System behavior change
  - Performance may change (not always up!)
  - Code footprint changes
  - Boot time increased after the change and needed work to get back to the required limit
Kernel upgrade strategies: Updating the kernel (what actually happened)
Updating the kernel: what happened
- Newer compiler had bugs
  - gcc 4.6.1 broke NVIDIA graphics SW
  - Fixed in 4.6.2
  - Ended up using 4.5.3, which is stable
- So did the C library
  - eglibc malloc deadlock bug in 2.15, fixed in 2.17
  - eglibc ld-linux loader bug, fixed in 2.16
Updating the kernel: what happened (cont)
- Rebuilt all user space libs and apps with hardfp
  - No single build environment, every vendor must rebuild
  - Didn't get full support from all vendors
  - Extra effort was needed to ensure all builds were fixed
- RPC deprecated in eglibc
  - RPC was being used by one of the vendors
  - Back-ported patch to re-enable in eglibc
- Kernel size increased
  - Resolved with more aggressive reduction in built-in features
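One way to confirm that every vendor binary really was rebuilt with hardfp is to inspect the ELF header of each delivered library. The following is a minimal sketch, not the method used in this project: it assumes 32-bit little-endian ARM binaries and relies on the `EF_ARM_ABI_FLOAT_HARD` / `EF_ARM_ABI_FLOAT_SOFT` bits in `e_flags`, which older toolchains may leave unset (reported here as "unknown"). The function name `arm_float_abi` is made up for illustration.

```python
import struct

EM_ARM = 40                      # e_machine value for ARM
EF_ARM_ABI_FLOAT_SOFT = 0x200    # e_flags bit: soft-float calling convention
EF_ARM_ABI_FLOAT_HARD = 0x400    # e_flags bit: hard-float calling convention
ELF32_HEADER_SIZE = 52


def arm_float_abi(header: bytes) -> str:
    """Classify the float ABI from the first 52 bytes of an ELF file."""
    if len(header) < ELF32_HEADER_SIZE or header[:4] != b"\x7fELF":
        return "not-elf"
    # e_ident[4] == 1 means 32-bit, e_ident[5] == 1 means little-endian.
    if header[4] != 1 or header[5] != 1:
        return "unsupported"
    (e_machine,) = struct.unpack_from("<H", header, 18)
    if e_machine != EM_ARM:
        return "not-arm"
    (e_flags,) = struct.unpack_from("<I", header, 36)
    if e_flags & EF_ARM_ABI_FLOAT_HARD:
        return "hard"
    if e_flags & EF_ARM_ABI_FLOAT_SOFT:
        return "soft"
    return "unknown"             # toolchain did not record the float ABI
```

In use, the header would come from something like `open(path, "rb").read(52)` for each file in a vendor drop, with any result other than "hard" flagged for follow-up.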
Kernel upgrade strategies: Update kernel version. Minimizing the pain (or lessons learnt)
Minimizing the pain: analysis of new kernel (1)
- Plan for change
  - Give yourself time to prepare before upgrading
- Validate new kernel and kernel environment
  - Benchmark performance on new vs. old
  - Benchmark should exercise use cases similar to actual system (you do have these benchmarks, don't you??)
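The new-vs-old benchmark comparison can be automated so that regressions are flagged rather than spotted by eye. The sketch below is one possible shape for such a check; the function name, the 5% tolerance and the data layout (benchmark name mapped to a list of repeated measurements) are all assumptions for illustration, not part of the original project.

```python
from statistics import median


def find_regressions(old_runs, new_runs, tolerance=0.05, higher_is_better=True):
    """Return {benchmark: relative change} for results that got worse
    by more than `tolerance` between the old and the new kernel.

    old_runs / new_runs map a benchmark name to a list of repeated
    measurements; medians are compared to damp run-to-run noise.
    """
    regressions = {}
    for name, old_samples in old_runs.items():
        if name not in new_runs:
            continue                      # not measured on the new kernel
        old = median(old_samples)
        new = median(new_runs[name])
        change = (new - old) / old        # positive means the number went up
        if not higher_is_better:
            change = -change              # for timings, lower is better
        if change < -tolerance:
            regressions[name] = change
    return regressions
```

Throughput-style metrics (such as MTriangle/sec) would be checked with `higher_is_better=True`, and timing metrics (such as FFT seconds) with `higher_is_better=False`.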
Minimizing the pain: analysis of new kernel (2)
- Analyze functionality changes:
  - Scheduler
  - Driver interface
  - System interface
  - Toolchain changes
  - System lib changes
Minimizing the pain: analysis of new kernel (3)
- Standardize build environment for all vendors
  - Compiler, key compiler flags, headers, libs the same
  - Configure so that compiler, compiler flags and libs can be changed independently
  - Need to be able to configure for old version; regression testing
- Do dependency check on all components
  - Check eglibc, other low level libs
  - New (additional) APIs ok, removed ones are not
  - E.g.: RPC support disabled in eglibc-2.15
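The rule "new APIs ok, removed ones are not" maps naturally onto a set comparison. The sketch below assumes the exported symbols of the old and new library, and the symbols each component imports, have already been extracted by some other tool; the function name, component names and data layout are hypothetical.

```python
def check_api_compat(old_exports, new_exports, component_imports):
    """Compare two library versions and report which components break.

    old_exports / new_exports: iterables of symbol names exported by the
    old and new library version.
    component_imports: {component name: iterable of symbols it uses}.
    Returns (removed, added, broken), where broken maps each affected
    component to the sorted list of removed symbols it depends on.
    """
    old_set, new_set = set(old_exports), set(new_exports)
    removed = old_set - new_set          # not acceptable if anyone uses them
    added = new_set - old_set            # additions are harmless
    broken = {}
    for component, symbols in component_imports.items():
        missing = sorted(set(symbols) & removed)
        if missing:
            broken[component] = missing
    return removed, added, broken
```

For the RPC case, a vendor component importing an RPC entry point such as `clnt_create` would show up in `broken` as soon as the new library stopped exporting it, before any integration build is attempted.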
Minimizing the pain: design for change
- Making robust code
  - If possible, fix code that depends on system behavior
    - Code that maxes out available performance to be functional
      - Ok to max out perf.; not ok to depend on it for acceptable behavior
    - Code that uses all allocated space
      - Leave margin: also good for later upgrades, bug fixes
- Eliminate kernel dependencies in application code
  - High level support libs (OpenGL-ES, gstreamer) hide kernel changes
  - Kernel interactions should be through POSIX
  - Dependency checking tools help
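"Leave margin" can be turned into a check that runs with every build. The sketch below is illustrative only: the resource names, the units and the 20% minimum headroom are assumptions, and each project would choose its own budgets.

```python
def low_margin(usage, min_margin=0.2):
    """Return {resource: headroom fraction} for budgets with too little slack.

    usage maps a resource name to (used, limit) in the same unit,
    e.g. bytes of flash or milliseconds of boot time.
    """
    flagged = {}
    for name, (used, limit) in usage.items():
        if limit <= 0:
            raise ValueError(f"limit for {name!r} must be positive")
        headroom = (limit - used) / limit   # negative when over budget
        if headroom < min_margin:
            flagged[name] = headroom
    return flagged
```

A kernel image that fills 90% of its partition would be flagged long before an upgrade pushes it over the limit, which is the point at which trimming built-in features is still a planned task rather than an emergency.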
Minimizing the pain: the bottom line
- Our important lesson: you own all of the Linux kernel code
  - Linux distribution vendors may not have the resources to own all the code; a very big job
  - Linux is only consumer grade quality; you will have to make it up somehow
  - There are initiatives to make it better, but economics are doubtful
- OSS is not free as in Free Beer
  - Ensuring quality costs $$
  - Somebody has to pay for it
Kernel upgrade strategies: Field upgrades
Field upgrades
- Truth in advertising: we don't support this
  - Caveat emptor!
- Critical bugs may be found late in the cycle
  - Code base may have subtle bugs that dev. testing doesn't catch
  - To guarantee bug counts requires a lot of work: ISO-26262, etc.
  - Expensive to guarantee prevention, but can allow for cure
- Feature enhancements become possible
  - E.g.: additional codec support
Field upgrades: issues to consider
- Upgrade notification
  - How will the end user know there is a waiting upgrade?
- Safe field upgrades
  - Ability to ensure security
    - Only allow authorized software
  - Always ensure previous version is usable, just in case
- Hands-free upgrade
  - OTA wireless connectivity can enable this
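The two safety requirements above (only authorized software, previous version always usable) are often combined in a two-slot scheme. The sketch below is a toy model under stated assumptions: images are plain byte strings, the class and method names are invented, and an HMAC with a shared key stands in for the asymmetric signature check a production system would more likely use.

```python
import hashlib
import hmac


class UpgradeSlots:
    """Toy A/B slot model: new images go to the inactive slot only."""

    def __init__(self, factory_image: bytes):
        self.slots = {"A": factory_image, "B": None}
        self.active = "A"

    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"

    def install(self, image: bytes, tag: bytes, key: bytes) -> bool:
        """Install and activate `image` only if its authentication tag is valid."""
        expected = hmac.new(key, image, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False                  # unauthorized: nothing is touched
        target = self.inactive()
        self.slots[target] = image        # running version stays intact
        self.active = target
        return True

    def rollback(self) -> bool:
        """Switch back to the other slot if it holds an image."""
        previous = self.inactive()
        if self.slots[previous] is None:
            return False
        self.active = previous
        return True
```

Because the active slot is never overwritten, a rejected or faulty upgrade leaves the previously running version available, and `rollback()` returns to it.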
Questions?
Thank you!