AOSP Mini-Conference Linaro

Welcome
ENGINEERS AND DEVICES WORKING TOGETHER
Main difference between the miniconference and regular Connect talks: let's be more interactive!
One additional purpose of the miniconference: bring together the various groups inside Linaro that work on the AOSP codebase:
- LMG: probably the most obvious use of AOSP
- LHG: Android TV
- Potentially LITE: Brillo
- Kernel, Toolchain, ...: need to support both regular Linux and AOSP use
Are there other groups (member engineering teams, maybe) here? What is your use of the AOSP code base?

Filesystem analysis Satish Patel <satish.patel@linaro.org>

File System analysis
Filesystems investigated: ext4, btrfs, f2fs, nilfs, squashfs
Variants: encryption enabled/disabled, compression off/zlib/lz4
File system analysis briefing (ongoing changes): https://docs.google.com/a/linaro.org/document/d/1jam-plv9iefnoqujzywzoy8u9d9gnmpwda 3MItxsPsU/edit?usp=sharing
Challenges
- Fixed build support for f2fs image generation (core.mk & image size alignment to 4096)
- Fixed sparse raw image generation issue
- Need to use for btrfs and nilfs
- Image generation for btrfs, nilfs, squashfs etc.
- Benchmark porting: bonnie, iozone
- Partition overload scripts and long run impact scripts

Filesystems - A Brief
Feature/FS | ext4 | f2fs | btrfs | nilfs | squashfs
Introduction | Most used in Linux based systems | Flash Friendly File System | B/Better/Butter File System | New implementation of LFS | Compressed read-only file system
I-node | Hashed B-Tree | Linear | B+ Tree | B-Tree |
Block Size | Extent | Fixed | Extent | Fixed | Fixed
Type | Unix like File Structure | Log File Structure | Copy On Write | Log File Structure | UFS
Allocation | Delayed | Immediate | Delayed | Immediate | NA
Journal | Ordered, WriteBack | NA | NA | NA | NA
Used in | Ubuntu, most mobiles | Moto series | Suse Enterprise | Ubuntu, NixOS | Live CDs, Android

Filesystems - A Traditional Layer
(Layer diagram, top to bottom; "Memory Management" and "OS" appear as side labels)
- Application: WebKit, Sqlite, Video/Image
- Logical File System (ext4, f2fs, btrfs etc.)
  - file access
  - dir operations
  - file indexing and management
  - security operations
- Basic File System
  - data operation on physical device
  - buffering if required
  - no management
- Device Driver 1, Device Driver 2, ... Device Driver n
- Storage 1, Storage 2, ... Storage n

Filesystems - Basic Types
- LFS: Log File Structure
- COW: Copy On Write
Image courtesy:
http://dboptimizer.com/wp-content/uploads/2013/06/screen-shot-2013-06-03-at-10.28.44-am.png
http://tinou.blogs.com/.a/6a00d83451ce5a69e2016302fe0458970d-500wi

Filesystems - Test Environment
HiKey (96Boards): 1GB RAM, Cortex-A53 octa core
- http://www.96boards.org/product/hikey/
eMMC
- Popular on embedded devices
- Cheap & flexible
- Fast read & random seek
- Domains: navigation, e-readers, smartphones, industrial loggers, entertainment devices etc.
AOSP + Linaro patchset (branch: r55, kernel 4.4)
- F2FS, Ext4, Squashfs, btrfs, nilfs
Benchmarks
- Vellamo, RL bench, androbench
- Bonnie (ported for Android)
- Iozone (ported for Android)
- Overload and long run test: in progress

Filesystems - Challenges
- Fixed build support for f2fs image generation (core.mk & image size alignment to 4096)
- Fixed sparse raw image generation issue
- Need to use for btrfs and nilfs
- Image generation for btrfs, nilfs, squashfs etc. (raw -> format -> sparse)
- Benchmark porting: bonnie, iozone
- Partition overload scripts and long run impact scripts

Filesystems - Results
- Each filesystem is given a rank based on its performance in each benchmark and test
- Average rank for the iozone test (spanning various record lengths)
- A few more points to consider:
  - Performance impact as the filesystem ages
  - CPU utilization
- O_SYNC (-+r option in iozone): requires that any write operation blocks until all data and all metadata have been written to persistent storage. This ensures file integrity (rather than only data integrity, as with the O_DSYNC flag).
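The ranking scheme described on this slide can be sketched roughly as follows. This is only an illustration: the throughput numbers are invented placeholders, not measurements from the study, and the function names are hypothetical.

```python
from statistics import mean

def rank_filesystems(throughput):
    """Rank filesystems for one test; rank 1 is the highest throughput."""
    ordered = sorted(throughput, key=throughput.get, reverse=True)
    return {fs: position for position, fs in enumerate(ordered, start=1)}

def average_rank(results_by_record_length):
    """Average each filesystem's rank over all record lengths."""
    ranks = [rank_filesystems(t) for t in results_by_record_length.values()]
    filesystems = ranks[0].keys()
    return {fs: mean(r[fs] for r in ranks) for fs in filesystems}

# Hypothetical write throughput in KB/s, keyed by iozone record length.
sample = {
    "4k":  {"ext4": 900, "f2fs": 1200, "btrfs": 700},
    "64k": {"ext4": 1500, "f2fs": 1400, "btrfs": 1600},
    "1m":  {"ext4": 2100, "f2fs": 2000, "btrfs": 2600},
}

averages = average_rank(sample)
for fs, rank in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{fs}: average rank {rank:.2f}")
```

Averaging ranks rather than raw throughput keeps one very large record length from dominating the comparison, which may be why the slides report ranks.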

Filesystems - iozone average (full test)
- Write: btrfs (lzo/zlib) wins
- Read: ext4 performance is comparable to btrfs
- Note: nilfs failed to complete the full iozone test

Filesystems - small read and write (64K)
- Small records/files: F2FS wins with the sync option
- For reads, NILFS has better performance on cached reads

Filesystems - 1MB file test
- Ext4 outperforms the others on all read operations
- F2FS has a good score (with sync flag)

Filesystems - 512MB, 4MB
- Write: btrfs (lzo)
- With the sync flag, zlib wins the race (not sure why)
- 4MB file read: ext4

Filesystems - bonnie results (lower is better)
- Btrfs (lzo, zlib) gives good numbers, but at the cost of heavy CPU usage: the number of kworker threads is higher (coming up next)
- F2FS/Ext4 have a fair amount of CPU usage on read/write
- F2FS outperforms on char operations: do we have a use case?

Filesystems - hdparm
- Squashfs is better (after btrfs)

Filesystems - speed variation (lower is better)
- Btrfs wins for average speed
- But the speed (read/write) deviation is very low for f2fs

Filesystems - disk access
- Disk reads are higher for f2fs (less use of buffered I/O)
- Nilfs disk reads are lower
- More writes for btrfs (might be due to background write activities for snapshot handling)
- High disk utilization in the case of nilfs
- NilFS: if we do not run GC, the system runs out of disk space after 1000 runs

Filesystems - btrfs low lights
Though BTRFS has good performance:
- High CPU utilization: more kernel threads
- For small data (<1MB), btrfs underperforms f2fs and ext4
- Not recommended where small I/O transactions with sync are expected, e.g. frequent calls to DB entries
- Btrfs does not force all dirty data to disk on every fsync or O_SYNC operation (risk on power loss/crash recovery)
- Effect in long run tests is yet to be tested

File System analysis - Summary
All relative rank graphs are available at https://docs.google.com/a/linaro.org/spreadsheets/d/1ctknbbvwujrizws8oqcb5l8gcd CLuzktJgx-K_CMgt0/edit?usp=sharing
F2FS/Ext4
- Win for small file access (4K-1MB) + DB access with disk data integrity
- Potential use cases: industrial monitoring system, consumer phone, health monitoring system
NilFS
- Outperforms for SQLite operations
- The only catch is that metadata/data gets updated later, once it has been written to the log file (a kind of extended version of fdatasync over fsync)
- Can be useful for power backed systems and continuous log recording of small data (up to 4K), but needs a good amount of storage
- It quickly fills up the space if GC is not called in between. On 5GB of space, it ran out of space after 1000 runs of the iozone test
- Not recommended for embedded systems
SquashFS
- Good buffered I/O read
- Can be used for read-only partitions (system libraries and read-only databases)

File System analysis - Summary
BTRFS: large file + large RAM
- LZO outperforms for block write/read operations (> 4MB)
- Potential use cases:
  - In-flight entertainment systems (mostly movies/songs/images etc.)
  - Portable streaming & recording devices (should be power backed up)
- Low lights:
  - High CPU utilization (more threads)
  - Not recommended where small I/O transactions with sync are expected
  - Risk on power failure recovery (not high, but it sometimes corrupts itself)
Hybrid use of different file systems on multiple partitions can improve overall performance, e.g.:
- Large read/write (movies, extra downloads) on a BTRFS partition
- All small read/write (docs, images) on an f2fs/ext4 partition
- All database access (insert/update/delete) on an f2fs/nilfs partition
Note: the impact on the file system as it ages is yet to be measured

Filesystems - Todo List
- Perform long run test (3-4 days, with various operations) and measure the impact
- Partition overload testing: impact on low disk availability
- Encryption impact
- Overhead of overlayfs etc. if we need to add drivers, HALs etc. for a specific piece of hardware to /system when otherwise using a common /system with HAL consolidation
- Any other?

Filesystems - Some points of discussion
- Any other filesystems (out-of-tree, perhaps) we should look into?
- Impact of storage technology (devices might start using NVMe)
- Best way to measure filesystem longevity

Thanks! Questions? <satish.patel@linaro.org>

HAL Consolidation Rob Herring <rob.herring@linaro.org>

HAL Consolidation - one build, many devices
- Goal is one Android build/filesystem per CPU architecture while maintaining configurability for device specific builds: http://tinyurl.com/zscbbrx
- A directory per feature for features more than just a config variable
- KConfig based configuration for features
- Supporting DB410c, HiKey, Nexus 7, QEMU, RaspberryPi 3
- Tablet/phone or TV targets
- Next platforms or targets to add?
- Possible next config features:
  - Anything the next device needs
  - Any feature Linaro is working on
  - Custom compiler and compiler flags
  - Kernel build integration
  - malloc selection
  - f2fs filesystem

HAL Consolidation - Graphics (Done)
- CI job for Mesa Android builds
- GBM based gralloc implementation
- GBM map/unmap support
- Mali (HiKey flavor) support in build
- YUV planar support
- GBM allocation and EGL import
- CSC conversion in GPU shader for gallium (thanks, Rob Clark)
- Initial vc4 support: still some issues
- Supporting running Android under Xen and KVM on arm64
- Various driver and build fixes

HAL Consolidation - Graphics (ToDo)
- drm_hwcomposer and HWC2
- WIP drm_composer HWC2 support
- gbm_gralloc
- Support scanout buffer alloc from KMS node
- Gralloc 1.0 support
- minigbm support from Google: https://chromium-review.googlesource.com/#/c/38550 5
- HWC2 necessary for upstream explicit sync support
- Being worked on by Collabora
- Overlay and YUV plane support
- Mesa
- DRM explicit fence and EGL_ANDROID_native_fence_sync support
- Video playback w/ V4L2 h/w codec
- Software rendering support
- Not many h/w choices with mainline (or mainline ABI) support
- Needs an OpenMax to V4L2 layer
- Probably some buffer allocation issues and more YUV formats
- Mali (blob) and Mesa co-existence: do we care?

HAL Consolidation - WiFi/BT
- Integrated generic and QCom WiFi into build
- Investigate moving QCom WiFi specifics into kernel
- Needs more testing with different devices (e.g. USB WiFi)
- How to handle firmware? Include linux-firmware?
- UART attached device kernel support (https://lwn.net/articles/700489/)
  - Treat UART attached devices the same as any other bus (USB, PCI, SDIO, SPI, etc.)
  - Moves the userspace device management (firmware load, serial config, PM, etc.) to kernel
  - Move serio framework out of drivers/input/
  - Extend serio from character at a time to buffer at a time API
  - Make tty_port usable for in-kernel drivers (i.e. serio host driver)
  - Help needed to test devices

New developments with AOSP and the kernel
- AOSP EAS (Energy Aware Scheduler) Integration
- Sync API Changes in 4.6+
- Arm64 KASLR and hardened user copy backport from upstream

AOSP Energy Aware Scheduler Integration John Stultz <john.stultz@linaro.org>

EAS: A common topic at Connect
- LAS16-TR04: Using tracing to tune and optimize EAS
- LAS16-410: Window Based Load Tracking (WALT) versus PELT utilization
- LAS16-307: Benchmarking Schedutil in Android
- LAS16-105: Walkthrough of the EAS kernel adaptation to the Android Common Kernel
- BKK16-317: How to generate power models for EAS and IPA (x2)
- BKK16-311: EAS core upstreaming strategy
- BKK16-208: EAS
- SFO15-411: Energy Aware Scheduling: Power vs. Performance policy (x2)
- SFO15-302: EAS Policy
- LCU14-507: Chromebook2 EAS Enablement
- LCU14-410: How to build an Energy Model for your SoC
- LCU14-406: A QuIC Take on Energy-Aware Scheduling
- LCU14-402: Energy Aware Scheduling: Kernel summit update
- LCA14-109: Path to Energy Efficient Scheduler
- LCU13: Power-efficient scheduling, and the latest news from the kernel summit
- LCE13: Why all this sudden attention on the Linux Scheduler?

We've heard quite a bit about EAS. Now, how does one use it with Android?

Kernel side
- Need the EAS patchset
  - Currently v5.2
  - Already in common/android-3.18 and common/android-4.4
  - Includes:
    - EAS core
    - Schedfreq (cpufreq gov)
    - Schedtune (boosting mechanism)
    - WALT (PELT load-tracking replacement)
- Need energy model for board
  - Not going to cover this

Kernel Config
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED=y
CONFIG_CPU_FREQ_GOV_SCHED=y
CONFIG_CGROUP_SCHEDTUNE=y
CONFIG_SCHED_TUNE=y
CONFIG_SCHED_WALT=y
CONFIG_WQ_POWER_EFFICIENT_DEFAULT=y
CONFIG_DEFAULT_USE_ENERGY_AWARE=y

AOSP Integration Components
Basic concepts
Three components:
- ActivityManager & Schedpolicy
- Init setup
- powerhal

Conceptual Android task types
- TOP_APP
- FOREGROUND
- BACKGROUND
- SYSTEM
- AUDIO_APP
- AUDIO_SYS

Activity Manager & Schedpolicy
Activity manager
- Tracks foreground and background tasks
- Adjusts things like timerslack
Schedpolicy
- Handles moving tasks between cgroups, lower-level interfaces

Multiple approaches used
- Base scheduler behavior
- Cpusets
- Cpuctl
- Schedtune boosting
- Interactive touch-boosting

Device BoardConfig.mk
ENABLE_CPUSETS := true
ENABLE_SCHEDBOOST := true
ENABLE_SCHED_BOOST := false (Deprecated foreground boosting for big.LITTLE HMP scheduler)

Cpusets: Limit what runs where
- Top-app
- Foreground
- Background
- System-background
- Foreground-boost (Deprecated!)

Cpusets (diagram, built up over several slides): two little cores and two big cores, overlaid step by step with the Background, Foreground, System-Background, Foreground-boost and Top-App cpusets.

Init cpuset config (init.hikey.rc)
# Foreground should contain most cores
write /dev/cpuset/foreground/cpus 0-6
# top-app gets all cores (7 is reserved for top-app)
write /dev/cpuset/top-app/cpus 0-7
# background contains a small subset (generally one little core)
write /dev/cpuset/background/cpus 0
# add system-background cpuset, a new cpuset for system services
# that should not run on larger cores
# system-background is for system tasks that should only run on
# little cores, not on bigs to be used only by init
write /dev/cpuset/system-background/cpus 0-3

Init cpuset config (init.bullhead.rc)
# foreground gets all CPUs except CPU 3
# CPU 3 is reserved for the top app
write /dev/cpuset/foreground/cpus 0-2,4-5
write /dev/cpuset/foreground/boost/cpus 4-5
write /dev/cpuset/background/cpus 0
write /dev/cpuset/system-background/cpus 0-2
write /dev/cpuset/top-app/cpus 0-5
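To check which CPU ends up reserved for the top app, the cpuset lists can be expanded and compared. The sketch below is an offline illustration only (it does not touch /dev/cpuset); parse_cpu_list is a hypothetical helper, and the values are copied from the init.bullhead.rc slide.

```python
def parse_cpu_list(spec):
    """Expand a cpuset list such as "0-2,4-5" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Values copied from the init.bullhead.rc slide.
cpusets = {
    "foreground": "0-2,4-5",
    "foreground/boost": "4-5",
    "background": "0",
    "system-background": "0-2",
    "top-app": "0-5",
}
parsed = {name: parse_cpu_list(spec) for name, spec in cpusets.items()}

# CPUs that only top-app tasks may use.
others = set().union(*(cpus for name, cpus in parsed.items() if name != "top-app"))
reserved = parsed["top-app"] - others
print("reserved for top-app:", sorted(reserved))
```

The same subtraction on the HiKey values (top-app 0-7 minus foreground 0-6) leaves CPU 7, matching the comment in init.hikey.rc.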

Cpuctl: Restrict cputime
- bg_non_interactive cgroup
- Keeps background tasks to only a small portion of a little core

Cpuctl: (system/core/rootdir/init.rc:)
# Create cgroup mount points for process groups
mkdir /dev/cpuctl
mount cgroup none /dev/cpuctl cpu
chown system system /dev/cpuctl
chown system system /dev/cpuctl/tasks
chmod 0666 /dev/cpuctl/tasks
write /dev/cpuctl/cpu.rt_runtime_us 800000
write /dev/cpuctl/cpu.rt_period_us 1000000

mkdir /dev/cpuctl/bg_non_interactive
chown system system /dev/cpuctl/bg_non_interactive/tasks
chmod 0666 /dev/cpuctl/bg_non_interactive/tasks
# 5.0 %
write /dev/cpuctl/bg_non_interactive/cpu.shares 52
write /dev/cpuctl/bg_non_interactive/cpu.rt_runtime_us 700000
write /dev/cpuctl/bg_non_interactive/cpu.rt_period_us 1000000
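The numbers in this init.rc fragment can be read as simple ratios. The sketch below assumes the usual cgroup v1 default of 1024 for cpu.shares as the reference weight; that default and the helper names are assumptions for illustration, and the actual CPU time a group receives depends on what else is runnable at the time.

```python
DEFAULT_CPU_SHARES = 1024  # assumed cgroup v1 default weight for cpu.shares

def rt_bandwidth(runtime_us, period_us):
    """Fraction of each period that real-time tasks may consume."""
    return runtime_us / period_us

def relative_weight(shares, reference=DEFAULT_CPU_SHARES):
    """Weight of a group relative to a group with the reference shares."""
    return shares / reference

root_rt = rt_bandwidth(800000, 1000000)
bg_rt = rt_bandwidth(700000, 1000000)
bg_weight = relative_weight(52)

print(f"root RT bandwidth: {root_rt:.0%}")
print(f"bg_non_interactive RT bandwidth: {bg_rt:.0%}")
print(f"bg_non_interactive weight vs. default: {bg_weight:.1%}")
```

Under that assumption, 52 / 1024 is roughly 5%, which is consistent with the "# 5.0 %" comment in the file. The weight only matters when CPUs are contended.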

Schedtune: Runtime Boost-Knob
- System wide: sched_cfs_boost
- Per-cgroup: schedtune.boost
- Adds a margin to the load-tracking accounting, making the scheduler think there is more work to be done, which likely raises the cpufreq
Image from: http://www.linaro.org/blog/core-dump/energy-aware-scheduling-eas-progress-update/
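As a rough illustration of "adding a margin", the sketch below assumes the margin is the boost percentage applied to the headroom between the current utilization and full capacity (1024 in kernel units). The slides do not spell out the formula, so treat both the formula and the names here as assumptions rather than the kernel's exact implementation.

```python
SCHED_CAPACITY_SCALE = 1024  # assumed full-capacity value in kernel units

def boosted_utilization(utilization, boost_percent):
    """Add a margin proportional to the headroom left above the signal."""
    margin = (SCHED_CAPACITY_SCALE - utilization) * boost_percent // 100
    return utilization + margin

for boost in (0, 10, 40, 100):
    boosted = boosted_utilization(200, boost)
    print(f"boost {boost:3d}% -> utilization 200 is seen as {boosted}")
```

With this model a boost of 0 leaves the signal untouched and a boost of 100 makes any task look like it needs a full CPU, so the knob interpolates between energy-biased and performance-biased behavior.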

Schedtune: Default Boosting
(diagram: Foreground vs. everything else)

Init stune config (init.hikey.rc)
#
# EAS stune boosting interfaces
#
chown system system /dev/stune/foreground/schedtune.boost
chown system system /dev/stune/foreground/schedtune.prefer_idle
chown system system /dev/stune/schedtune.boost
write /dev/stune/foreground/schedtune.boost 10
write /dev/stune/foreground/schedtune.prefer_idle 1
write /dev/stune/schedtune.boost 0

Android PowerHAL
Provides interactivity signals from userspace:
- POWER_HINT_INTERACTION
- POWER_HINT_VSYNC
- POWER_HINT_LOW_POWER
- POWER_HINT_SUSTAINED_PERFORMANCE
- POWER_HINT_VR_MODE
Deprecated?:
- POWER_HINT_VIDEO_ENCODE
- POWER_HINT_VIDEO_DECODE

For old interactive cpufreq gov
Set the boostpulse_duration on init:
# boost for 1sec
echo 1000000 > \
    /sys/devices/system/cpu/cpufreq/interactive/boostpulse_duration
On POWER_HINT_INTERACTION:
echo 1 > /sys/devices/system/cpu/cpufreq/interactive/boostpulse

For EAS w/ schedtune
The kernel doesn't do deboosting!
On POWER_HINT_INTERACTION:
echo 40 > /dev/stune/foreground/schedtune.boost
Wait some time, then:
echo 10 > /dev/stune/foreground/schedtune.boost

Example touch-boost implementation

static void schedtune_power_init(struct hikey_power_module *hikey)
{
    pthread_t tid;  /* deboost thread handle; not declared on the slide */

    hikey->deboost_time = 0;
    sem_init(&hikey->signal_lock, 0, 1);
    pthread_create(&tid, NULL, schedtune_deboost_thread, hikey);
}

static int schedtune_boost(struct hikey_power_module *hikey)
{
    long long now;

    pthread_mutex_lock(&hikey->lock);
    now = gettime_ns();
    if (!hikey->deboost_time) {
        /* Not boosted yet: raise the boost and wake the deboost thread. */
        schedtune_sysfs_boost(hikey, SCHEDTUNE_BOOST_INTERACTIVE);
        sem_post(&hikey->signal_lock);
    }
    /* Every interaction pushes the deboost deadline further out. */
    hikey->deboost_time = now + SCHEDTUNE_BOOST_TIME_NS;
    pthread_mutex_unlock(&hikey->lock);
    return 0;
}

static void *schedtune_deboost_thread(void *arg)
{
    struct hikey_power_module *hikey = (struct hikey_power_module *)arg;

    while (1) {
        /* Sleep until schedtune_boost() signals a new boost. */
        sem_wait(&hikey->signal_lock);
        while (1) {
            long long now, sleeptime = 0;

            pthread_mutex_lock(&hikey->lock);
            now = gettime_ns();
            if (hikey->deboost_time > now) {
                /* Deadline was extended: sleep for the remainder and re-check. */
                sleeptime = hikey->deboost_time - now;
                pthread_mutex_unlock(&hikey->lock);
                nanosleep_ns(sleeptime);
                continue;
            }
            schedtune_sysfs_boost(hikey, SCHEDTUNE_BOOST_NORM);
            hikey->deboost_time = 0;
            pthread_mutex_unlock(&hikey->lock);
            break;
        }
    }
    return NULL;
}

See full source here: https://android.googlesource.com/device/linaro/hikey/+/master/power/power_hikey.c
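The bookkeeping in this example (boost on the first hint, extend the deadline on later hints, deboost once the deadline passes) can be modelled without threads. The sketch below is a simplified single-threaded model, not the HAL code: the boost values 40 and 10 come from the echo commands on the earlier slide, while BOOST_TIME_NS and the class and method names are assumptions for illustration.

```python
SCHEDTUNE_BOOST_NORM = 10         # idle boost value, from the slides
SCHEDTUNE_BOOST_INTERACTIVE = 40  # boost value on interaction, from the slides
BOOST_TIME_NS = 1_000_000_000     # assumed hold time; the slides do not give it

class BoostModel:
    """Single-threaded model of the boost/deboost bookkeeping."""

    def __init__(self):
        self.boost = SCHEDTUNE_BOOST_NORM
        self.deboost_time = 0  # 0 means "not currently boosted"

    def on_interaction(self, now_ns):
        # Raise the boost only on the first hint; later hints just extend it.
        if not self.deboost_time:
            self.boost = SCHEDTUNE_BOOST_INTERACTIVE
        self.deboost_time = now_ns + BOOST_TIME_NS

    def on_timer(self, now_ns):
        # Drop back to the normal boost once the deadline has passed.
        if self.deboost_time and now_ns >= self.deboost_time:
            self.boost = SCHEDTUNE_BOOST_NORM
            self.deboost_time = 0

model = BoostModel()
model.on_interaction(now_ns=0)
model.on_interaction(now_ns=600_000_000)   # second touch extends the deadline
model.on_timer(now_ns=1_200_000_000)       # still boosted
boost_while_extended = model.boost
model.on_timer(now_ns=1_600_000_000)       # deadline reached, deboost
print(boost_while_extended, model.boost)
```

The point of the design is that repeated touches cost only a timestamp update; the sysfs write happens once when boosting starts and once when it ends.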

Other conceptual complications
- Negative boosting: use schedtune to further reduce cpufreq for background or other groups
- schedtune.prefer_idle: prefer to place tasks on idle CPUs. Gives a bit more responsiveness but costs some power. Consider for foreground tasks

Thanks! Questions? <john.stultz@linaro.org>

Sync API changes in 4.6+ John Stultz <john.stultz@linaro.org>

Sync API changes in 4.6+
The Android Sync API in staging has been refactored and pulled mostly out of staging into the DRM fences and sync_file code.
Major credit to Gustavo Padovan <gustavo.padovan@collabora.co.uk> for this work!

Good News! Proper sync/fence api in upstream kernel! One less Android specific kernel feature!

Bad News! Your out-of-tree vendor graphics driver is now terribly, terribly broken!

Old Android Sync Concept (diagram, built up over several slides)
- sync_timeline: several timelines of increasing values are shown (1 2 3 ..., 16 17 18 19 ..., 33 34 35 36 ...)
- sync_pt: a specific value on a timeline (e.g. 2, 8)
- sync_fence: groups a set of sync_pts

DRM Fences in Concept (diagram)
- Same picture with the new names: context in place of sync_timeline, fence in place of sync_pt, sync_file in place of sync_fence

Kernel transition from old API
Old API | New API
CONFIG_SYNC | CONFIG_SYNC_FILE
struct sync_fence | struct sync_file
struct sync_pt | struct fence
sync_fence_put() | fput(fence->file)
sync_fence_fdget() | sync_file_get_fence()
sync_fence_wait_async() | fence_add_callback()
sync_fence_cancel_async() | fence_remove_callback()
sync_timeline_create() | fence_context_alloc()
sync_timeline_signal() | fence_signal()
sync_pt_create() | fence_init()
sync_timeline_ops | fence_ops

No exact matches
- Async waits were previously done on sync_fences, which are closest to sync_files
- Now async callbacks are done on fences, which are analogous to sync_pts
- sync_timelines were objects; contexts are just a unique 64-bit id
  - Drivers have to manage their own context objects

In addition...
Most graphics drivers have their own higher-level meta-infrastructure that overlaps functionality:
- struct mali_timeline
- struct mali_timeline_point
- struct mali_timeline_fence
What do you do with something like mali_timeline_sync_fence_create_and_add_tracker()?

Bonus! Some changes for DRM fences are still in flight

Good luck rewriting your driver. It wouldn't be so bad if your driver was upstream.

Userspace libsync changes
- Gustavo's libsync tree: https://git.collabora.com/cgit/user/padovan/android-system-core.git
- Rob Herring's DRM HWC changes: https://github.com/robherring/drm_hwcomposer/commits/android-m

References
- Eric Gilling's LPC13 talk: https://www.youtube.com/watch?v=rhnritgn4-m
- Riley Andrew's LPC14 talk: https://linuxplumbersconf.org/2014/ocw/system/presentations/2355/original/03%20-%20sync%20&%20dma-fence.pdf
- Gustavo's LinuxCon16 talk: http://padovan.org/pub/gustavopadovan-explicit-fencing_talk.pdf
- Gustavo's blog post: http://padovan.org/blog/2016/09/mainline-explicit-fencing-part-1/

Thanks! Questions? <john.stultz@linaro.org>

ION
- Who is using Ion?
- Who wants to use Ion on mainline?
- What are you using Ion for?
- Do you need kernel APIs?
- What out-of-tree Ion features are you missing?
- What help do you need?
- Can you help with testing?

Reducing bootup time
AOSP is increasingly being used in non-phone environments, where boot times matter much more (e.g. automotive). What can we do to improve boot times?
Some measurements to help us decide, run on:
- HiKey 2GB RAM version from LeMaker
- Android Nougat 7.0.0_r6
- Kernel: AOSP android-hikey-linaro-4.4
- HDMI, Micro-USB, serial console connected
- Soft boot with reboot command for 2nd boot

Boot Time Percentage (Total)
- From surfaceflinger service started to UI displayed: 65%
- Kernel boot time: 26.5%
- From init started to surfaceflinger service started: 8.5%
Log markers used:
- init: Starting service 'surfaceflinger'... (from dmesg)
- Freeing unused kernel memory (from dmesg)
- Boot is finished (14907 ms) (from logcat)
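To turn the percentages into rough durations, one can multiply them by the reported boot time. The sketch below assumes that the 14907 ms "Boot is finished" figure is the total the percentages refer to, which the slide does not state explicitly; the dictionary keys are paraphrased labels.

```python
TOTAL_BOOT_MS = 14907  # "Boot is finished" value reported in logcat

phases = {
    "kernel boot": 26.5,
    "init start to surfaceflinger start": 8.5,
    "surfaceflinger start to UI displayed": 65.0,
}

def phase_durations(total_ms, percentages):
    """Convert percentage shares of the boot into milliseconds."""
    return {name: total_ms * share / 100 for name, share in percentages.items()}

durations = phase_durations(TOTAL_BOOT_MS, phases)
for name, ms in durations.items():
    print(f"{name}: {ms:.0f} ms")
```

Under that assumption, the Android userspace phase after surfaceflinger starts accounts for well over half of the total, so it is the most promising place to look for savings.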

Boot Time Percentage (Boot Progress)
- From preload_start to preload_end: 23.4%
- From pms_system_scan_start to pms_data_scan_start: 15%
- From boot_progress_start to preload_start: 14.9%
- From pms_ready to ams_ready: 13.2%
- From ams_ready to enable_screen: 11.2%
The top 5 take 77.7% in total

Measurements
Target | Method | Comments
Kernel boot time | Information from dmesg | Cannot measure time for bootloader automatically; need extra tools for accurate measurements
Android boot time before surfaceflinger service started | Information from dmesg, bootchart | Services like vold, debuggerd are started here
Android boot time from surfaceflinger service started to Launcher displayed | Information from logcat (including the events buffer), bootchart | Timestamps in dmesg and logcat are not the same for the same message
Others, like application start time, web site loading time, media app start time? | | What others do we want to check as well?
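Extracting the kernel-side numbers from dmesg amounts to finding the timestamps of the marker lines. The sketch below is a minimal illustration: the log excerpt and its timestamps are invented, and timestamp_of is a hypothetical helper, but the marker strings are the ones named on the earlier slide.

```python
import re

# Hypothetical dmesg excerpt; the timestamps are invented for illustration.
SAMPLE_DMESG = """\
[    0.000000] Booting Linux on physical CPU 0x0
[    3.912345] Freeing unused kernel memory: 1024K
[    5.204567] init: Starting service 'surfaceflinger'...
"""

LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def timestamp_of(log, marker):
    """Return the timestamp (seconds) of the first line containing marker."""
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match and marker in match.group(2):
            return float(match.group(1))
    return None

kernel_done = timestamp_of(SAMPLE_DMESG, "Freeing unused kernel memory")
sf_start = timestamp_of(SAMPLE_DMESG, "Starting service 'surfaceflinger'")
print(f"kernel boot: {kernel_done:.3f} s")
print(f"init to surfaceflinger: {sf_start - kernel_done:.3f} s")
```

Because dmesg and logcat timestamps differ for the same message, as noted in the table, intervals are best computed within a single log rather than across the two.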

Reducing bootup time
What can we do to improve boot times?
- Suspend to disk instead of complete shutdown?
- Parallel init?
- Launch extra services after UI is up?
- Better file system type for system/userdata/cache partitions?

Out of tree AOSP userspace patches
Keeping a number of out-of-tree patches can become more problematic than it already is: with the move to more frequent security updates and the appearance of Android One-style devices, maintaining extra patches becomes more work.
- Upstreaming is more important than ever
- Will try hard to upstream Linaro patches
- Do members need/want help upstreaming patches from their vendor trees/BSPs?
- Is licensing sorted out?
- Do we keep some patches Members-First?
- What can we do about patches getting stuck in the upstream review queue?
- How will we handle out-of-tree patches that can't go upstream (e.g. rejected patches that still matter to a member) in the future?
- Patchset scripts vs. committing to git repositories?

AOSP transition to clang
- As of AOSP N, AOSP's primary toolchain is clang, based on a recent 4.0 snapshot
- Being able to build all of AOSP with clang was largely Linaro's work
- gcc is still used to build some HALs for old devices and the kernel
- We can build the HiKey kernel with clang now, with a few patches and a few ugly workarounds that need to be fixed
- Resulting system works, but has some stability issues that need to be debugged
- Point of discussion: do we need to patch support for building with gcc back in?

AOSP with upstream clang (especially TOT)
Primary reasons for this work
- Clang in the AOSP toolchain is 5+ months behind ToT upstream clang
- Enable monitoring the impact of upstream clang on AOSP (mainly for performance)
- Enable safe landing of clang's latest code onto AOSP when the time comes
Linaro's current efforts
- Downstream patches of AOSP clang are now all upstreamed (thanks to Renato and Google folks)
- AOSP master can be built with upstream clang (as of July) successfully
- Monitoring compilation of AOSP master with ToT upstream clang
  - Not yet tested for boot-up (see Future work for CI)
  - 3 clang bugs reported (1 fixed, 2 open)
  - 1 AOSP bionic patch upstreamed
Future work
- CI for building AOSP master with upstream clang is in progress, for boot-up and benchmark tests
- Continuously finding and fixing problems that prevent successful compilation, e.g. a new warning (-Waddress-of-packed-member) added in ToT clang causes compilation failure

VIXL: A Programmatic Assembler and Disassembler for AArch32 Anton Kirilov Linaro ART team

Agenda
- What is VIXL?
- Assembler
- Disassembler
- VIXL in the Android Runtime

What is VIXL?
- A programmatic assembler and disassembler
  - Does not process text files
  - Originally designed for JIT compilers
- Supports AArch32 (both A32 and T32) and AArch64
- Written in C++
- Uses the modified BSD license
- Used by the Android Runtime, QEMU, HHVM, etc.
- Also, simulator and debugger for AArch64
- This presentation will concentrate on AArch32

Useful links
- Download: git clone https://review.linaro.org/arm/vixl
- For AArch64 refer to the SFO15-500 presentation, VIXL: http://connect.linaro.org/resource/sfo15/sfo15-500-vixl/

Assembler
- The basic low-level interface is the Assembler class
- Provides full control over code generation (e.g. the exact encoding used)
- Declared in aarch32/assembler-aarch32.h
- Generates A32 code by default, but can be changed:
  - By the constructor
  - On-the-fly, e.g. by the UseA32()/UseT32() methods
- Possible to mix A32 and T32 instructions in the generated code

The Assembler class
Let's start with a simple factorial:
unsigned factorial(unsigned x) {
    unsigned r = 1;
    while (x) {
        r *= x--;
    }
    return r;
}
In T32 assembly:
factorial:
    movs r1, r0
    mov r0, #1
    it eq
    bxeq lr
loop:
    mul r0, r0, r1
    subs r1, #1
    it ne
    bne loop
    bx lr

The Assembler class
With VIXL:
Assembler as(T32);

Label factorial;
Label loop;

as.bind(&factorial);
as.movs(r1, r0);
as.mov(r0, 1);
as.it(eq);
as.bx(lr);
as.bind(&loop);
as.mul(r0, r0, r1);
as.subs(r1, r1, 1);
as.it(ne);
as.b(&loop);
as.bx(lr);
as.FinalizeCode();

The Assembler class limitations
- No code buffer overflow check
  - The caller is responsible
- No automatic generation of large constants
  - Immediate operands of instructions such as MOV, etc.
  - Branch offsets
  - The Assembler methods will print an error message if a large constant is passed
- Consequence of being a low-level interface

Macro assembler
- Implemented by the MacroAssembler class
- Declared in aarch32/macro-assembler-aarch32.h
- Uses the assembler internally
- The interface is mostly the same
  - The macro assembler-specific method names are capitalized
- Provides some extra features that make programming easier and safer
  - Veneers for branch offsets that can't be encoded
  - Literal pools
  - Further examples follow
- It is the expected end-user interface

Macro assembler example
Source code:
MacroAssembler masm(T32);
masm.Add(r0, r0, 0x12345678);
masm.FinalizeCode();
Generated code:
mov ip, #22136
movt ip, #4660
add r0, ip
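The two immediates in the generated code are simply the low and high 16-bit halves of 0x12345678. The following sketch checks that arithmetic; split_for_mov_movt is a hypothetical helper name and is not part of VIXL.

```python
def split_for_mov_movt(value):
    """Split a 32-bit constant into the MOV (low) and MOVT (high) halves."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    return value & 0xFFFF, value >> 16

low, high = split_for_mov_movt(0x12345678)
print(f"mov  ip, #{low}")
print(f"movt ip, #{high}")
```

MOV writes the low half and clears the top bits, then MOVT fills in the top half without disturbing the bottom, which is why the pair can materialize any 32-bit constant.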

Further macro assembler example
Performs simple optimizations:
Source code:
MacroAssembler masm(T32);
masm.Mov(r0, 0xFFFFFF00);
masm.FinalizeCode();
Generated code:
mvn r0, #255
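The optimization works because the bitwise inverse of 0xFFFFFF00 is 255, which fits in a small immediate. The sketch below illustrates that decision in a deliberately simplified form: it only considers plain 8-bit immediates, whereas the real T32 "modified immediate" encoding accepts more patterns, and choose_move is a hypothetical name rather than VIXL's logic.

```python
MASK32 = 0xFFFFFFFF

def choose_move(value):
    """Pick MOV or MVN for a 32-bit constant, or None if neither fits.

    Simplification: only plain 8-bit immediates are considered here.
    """
    inverted = ~value & MASK32
    if value <= 0xFF:
        return ("mov", value)
    if inverted <= 0xFF:
        return ("mvn", inverted)
    return None

print(choose_move(0xFFFFFF00))
```

A constant for which this returns None would need another strategy, such as the MOV/MOVT pair or a literal pool load.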

The UseScratchRegisterScope class
- Structured way to deal with scratch registers
- The IP register (R12) in particular should not be used directly
- Follows a standard C++ idiom:
MacroAssembler as(T32);
{
    UseScratchRegisterScope temps(&as);
    Register temporary = temps.Acquire();
    as.Mov(temporary, 0x12345678);
    as.Add(r0, temporary, temporary);
}

Macro assembler pitfalls
Consider the following situation (assuming we could access the IT instruction in the macro assembler):
Source code:
MacroAssembler masm(T32);
masm.it(eq, 0x8);
masm.Add(r1, r0, 0x12345678);
masm.FinalizeCode();
Generated code:
it eq
moveq r1, #22136
movt r1, #4660
add r1, r0, r1

The AssemblerAccurateScope class
- Helps to control the number of generated instructions
- Prevents the assembler from emitting veneers and literal pools
- In fact, in situations like these the assembler must be used
- Provides bounds checking for the assembler
- Note that the constructor uses a size in bytes, not number of instructions

AssemblerAccurateScope example
MacroAssembler masm(T32);
{
    AssemblerAccurateScope aas(&masm,
                               4 * k32BitT32InstructionSizeInBytes,
                               CodeBufferCheckScope::kMaximumSize);
    masm.ittt(eq);
    masm.mov(r1, 22136);
    masm.movt(r1, 4660);
    masm.add(r1, r0, r1);
}
masm.FinalizeCode();

Assembler vs. MacroAssembler
The following table summarizes the differences:
 | Assembler | MacroAssembler
Control over the generated code | precise | relaxed
Code simplifications | no | yes
Convenience | no | yes

Disassembler
- Implemented in the Disassembler class
- Declared in aarch32/disasm-aarch32.h
- Strives for strict ARMv8 compliance
- The main entry points are the DecodeA32() and DecodeT32() methods
- A little bit low-level for most use cases, especially when dealing with the variable-length T32 instructions

The PrintDisassembler class
- Provides a more convenient interface
- Most applications will probably use it instead of using the disassembler directly
- Provides methods to disassemble a whole buffer of instructions:
  - DisassembleA32Buffer()
  - DisassembleT32Buffer()
- Also, a way to process a single instruction more conveniently (particularly for T32):
  - DecodeA32At()
  - DecodeT32At()

PrintDisassembler example
Continuing our assembler example:
Assembler as(T32);
as.bind(&factorial);
as.movs(r1, r0);
PrintDisassembler disasm(std::cout);
disasm.DisassembleT32Buffer(
    as.GetStartAddress<uint16_t *>(),
    as.GetSizeOfCodeGenerated());
Output:
0x00000000 0001     movs r1, r0
0x00000002 f04f0001 mov r0, #1
0x00000006 bf08     it eq
0x00000008 4770     bxeq lr
0x0000000a fb00f001 mul r0, r0, r1
0x0000000e 1e49     subs r1, #1
0x00000010 bf18     it ne
0x00000012 e7fa     bne 0x0000000a
0x00000014 4770     bx lr
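As a cross-check of the listing, the halfword e7fa at address 0x12 can be decoded by hand. The sketch below assumes it uses the 16-bit T32 branch encoding T2 (top five bits 11100, followed by an 11-bit immediate), with the condition supplied by the preceding IT instruction; decode_t32_b_t2 is a hypothetical helper, not a VIXL function.

```python
def decode_t32_b_t2(halfword, address):
    """Decode a 16-bit T32 unconditional branch (encoding T2) to its target."""
    if halfword >> 11 != 0b11100:
        raise ValueError("not a T2-encoded B instruction")
    offset = (halfword & 0x7FF) << 1  # imm11 followed by an implicit 0 bit
    if offset & 0x800:                # sign-extend the 12-bit offset
        offset -= 0x1000
    return address + 4 + offset       # PC reads as instruction address + 4

target = decode_t32_b_t2(0xE7FA, 0x12)
print(f"0x{target:08x}")
```

The decoded target is the address of the mul instruction at the loop label, which agrees with the disassembler output.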

The DisassemblerStream class The main approach to customize the disassembler output Used internally by the disassembler Each instruction is broken down into components by the disassembler, e.g.: Register MemOperand etc. The DisassemblerStream defines operators for processing each component Override the operator of interest to change the output 109

DisassemblerStream example Assigning a special name to a register: class RegisterPrettyPrinter : public DisassemblerStream { DisassemblerStream& operator<<(const Register reg) override { if (reg.is(r9)) { os() << "tr"; return *this; } else { return DisassemblerStream::operator<<(reg); } } }; 110

More examples and documentation Look into the examples/aarch32 directory in the VIXL source tree An excellent starting point for a beginner: doc/getting-started-aarch32.md 111

VIXL in the Android Runtime The ART team has been working on integrating VIXL into the AArch32 backend Lead to a safer and more extensible code base Mechanisms such as the UseScratchRegisterScope class provide better detection of mistakes The majority of the assembler and disassembler are automatically generated it should be much easier to support future ISA additions Much more extensive testing 112

Thank You #LAS16 For further information: www.linaro.org LAS16 keynotes and videos on: connect.linaro.org

Android Runtime Performance Analysis
Artem Serov, Linaro ART team

Agenda
- Introduction
- Performance measurement
- Performance analysis

Linaro ART Team
Android Runtime (ART)
- The managed runtime used by Java applications (Dex bytecode) and some system services on Android
- https://source.android.com/devices/tech/dalvik/
- Android 6.0 or 7.0 Hybrid Mode (AOT)
- ART JIT in Android N, by Xueliang ZHONG
Linaro ART team
- Working on Android Runtime: improving the performance and stability
- Members and assignees from ARM, Spreadtrum, Mediatek

Art-testing
Art-testing repository
Benchmarks
- Recognized benchmarks: CHECKED!
- Embeddable: CHECKED!
- Stable and reproducible: CHECKED!
- Recognized: CHECKED!
- Microbenchmarks, Caffeinemark, Benchmarksgame, Stanford, Richards, Deltablue, etc.
- Analyzable and flexible
- New features
- Catch regressions
Framework
- $ ./run.py --target --iterations 10
- Host and target
- Statistics
- Perf tools

Performance: Per-patch
What we're currently talking about
Patch delivery life cycle: Benchmarking, Investigation, Developing, Code Review, Testing, Merging, Perf Analysis

Performance Tracking
Per-patch
- Performance comparison before and after
- We want to make sure that patches improve performance and don't bring unexpected degradation
Continuous tracking
- Regressions and anomalies, whenever they happen
- Upstream changes
- Linaro patches tracking (double checking)


Performance: Continuous Tracking

Build-scripts
Automated process to run benchmarks: building, configuring, running
- Android root, chroot-like (/system -> /data/local/tmp/system)
- Does not depend on other AOSP projects or a GUI environment
- Device configuration: CPU frequencies and clusters (big/little)
Running the benchmarks
- ./scripts/benchmarks/benchmarks_run_target.sh --mode 32 --cpu little --iterations 10
Stable results
- CPU pinning
- Overheating

Performance: Per-patch: Automation

Performance: Manual Check
                        geomean diff (%)  geomean error 1 (%)  geomean error 2 (%)
...
caffeinemark/loopatom            -24.423                0.017                0.258
...
--- Summary ---
intrinsics                        -5.436                0.266                0.443
micro                             -3.536                0.122                0.145
benchmarksgame                    -2.745                0.495                0.616
algorithm                        -12.280                0.214                0.205
stanford                          -7.553                0.269                0.167
math                              -0.392                0.490                0.210
caffeinemark                      -5.639                0.315                0.372
OVERALL                           -4.762                0.099                0.116
We measure time; that's why negative numbers mean improvement.
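As a rough illustration of the "geomean diff (%)" column, the sketch below computes the relative change of the geometric mean of run times between a "before" and an "after" build. It is a minimal sketch, not the art-testing implementation: the function names and the sample timings are hypothetical, and the "geomean error" columns are not modelled.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed in log space."""
    if not values:
        raise ValueError("geomean of an empty sequence")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geomean_diff_percent(before, after):
    """Relative change of the geometric mean in percent.

    Times are measured, so a negative result means the 'after' build is faster.
    """
    g_before = geomean(before)
    g_after = geomean(after)
    return (g_after - g_before) / g_before * 100.0


# Hypothetical per-iteration timings (seconds) for one benchmark.
before = [2.00, 2.02, 1.98]
after = [1.50, 1.52, 1.49]
print(f"geomean diff: {geomean_diff_percent(before, after):+.3f}%")
```

The geometric mean is used here because it aggregates ratios sensibly when benchmarks have very different absolute run times.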


Performance Analysis: Example
caffeinemark/loopatom.java: 24.4% reduction of execution time (improvement)
Task: investigate the reason for the performance difference
Run perf-tools to collect data for the A (before) and B (after) builds for caffeinemark/loopatom.java

Performance Analysis: Hotspots
Hotspots: sections of code that get most of the execution time
Typically there are few hotspots which determine overall performance
Generic naive algorithm: Find hotspots, Analyze hotspots, ..., PROFIT!
Find hotspots: profiling
- Method level
- Loop level
- Instruction level
Analyze hotspots
- Source code
- Binary code
- Tools

Art-testing Perf Tools
Performance analysis of code generated by ART
- Not kernel
- Not native libraries
Scripts based on linux-perf-tools
- Linux profiling with performance counters
- Statistical profiling
Features
- Profiling: hotspots identification
- .cfg (IR + assembly) files generation
- Perf events collection
- All of the above in one click!

Identifying Hotspots: Methods
Java methods: benchmark.oat, boot.oat
Native methods: Kernel, Libart, Others
      %       Events  Command  DSO                                   k?   Method
+ 89.37%  10768209032  main     data@local@tmp@bench.apk@classes.dex  [.]  int benchmarks.caffeinemark.loopatom.execute()
+  3.30%    397223644  main     linker                                [.]  dl ZNK6soinfo10gnu_lookupER10SymbolName...
+  1.03%    123764181  main     [kernel.kallsyms]                     [k]  0xffffffc0002896c0
+  0.92%    110683861  main     linker                                [.]  dl ZNK6soinfo19find_symbol_by_nameER10S...
+  0.80%     96237748  main     linker                                [.]  dl ZN10SymbolName8gnu_hashEv

Hotspot: .CFG File
c1visualizer: tool to visualize the ART intermediate representation (IR)
- Control flow graph (CFG)
- IR
- Assembly
For each method, before and after each optimization

Identifying Hotspots
public int execute() {
...
 5.26  for(int j = 0; j < FIBCOUNT; j++) {
       for(int k = 1; k < FIBCOUNT; k++) {
 5.43  j1 += l;
 5.45  add.w ip, r3, r0, lsl #2
15.58  ldr.w r2, [ip, #12]
       add.w ip, r3, fp, lsl #2
14.74  ldr.w r4, [ip, #12]
 9.63  cmp r2,
       bge.n 32b6
 2.22  add.w ip, r3, r0, lsl #2
 6.45  str.w r4, [ip, #12]
       add.w ip, r3, fp, lsl #2
 6.72  str.w r2, [ip, #12]
 2.66  add.w fp, fp, #1
 2.50  ldr r4, [sp, #4]
 2.65  movs r2, #1
 2.42  movs r0, #0
 2.80  ldrh.w ip, [r9]
10.50  cmp.w ip, #0
       beq.n 3284
       b.n 3314
       if(fibs[k - 1] < fibs[k]) { int i1 = fibs[k - 1]; fibs[k - 1] = fibs[k]; fibs[k] = i1; } } }
       cmp fp,
       bge.n 32cc
       add sl,
       add.w r8, r8, #2
       add.w r0, fp, #4294967295 ;
       int l = FIBCOUNT + dummy; k1 += 2; }
 4.90  ...

IR: c1visualizer

Analyzing Hotspot: .CFG File
IR A (before the patch)
    i75 ArrayGet [l6,i71]
        add.w r12, r3, r0, lsl #2
        ldr.w r2, [r12, #12]
    i81 ArrayGet [l6,i51]
        add.w r12, r3, r11, lsl #2
        ldr.w r4, [r12, #12]
IR B (after the patch)
    l18 IntermediateAddress [l6,i184]
        add r4, r3, #12
    i75 ArrayGet [l185,i71]
        ldr.w r5, [r4, r2, lsl #2]
    i81 ArrayGet [l185,i51]
        ldr.w r6, [r4, r0, lsl #2]
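The difference between the two sequences can be modelled with plain address arithmetic. The sketch below is only an illustration under stated assumptions: the function names and the base address are hypothetical, and the 12-byte data offset and 4-byte element size are read off the slide's `[r12, #12]` and `lsl #2`. It checks that both forms address the same element, while the "after" form computes the `base + 12` part once and shares it between the loads.

```python
DATA_OFFSET = 12   # offset of the first element, as in [r12, #12]
ELEMENT_SHIFT = 2  # 4-byte ints, as in "lsl #2"


def address_before(array_base, index):
    """IR A: add.w r12, r3, rX, lsl #2 ; ldr.w rY, [r12, #12]"""
    tmp = array_base + (index << ELEMENT_SHIFT)  # recomputed for every access
    return tmp + DATA_OFFSET


def intermediate_address(array_base):
    """IR B: add r4, r3, #12 (computed once, shared by the array accesses)."""
    return array_base + DATA_OFFSET


def address_after(intermediate, index):
    """IR B: ldr.w rY, [r4, rX, lsl #2]"""
    return intermediate + (index << ELEMENT_SHIFT)


base = 0x1000  # hypothetical array object address
shared = intermediate_address(base)
for index in (0, 1, 7):
    assert address_before(base, index) == address_after(shared, index)

# Instruction counts for the two ArrayGet nodes shown on the slide.
instructions_before = 2 * 2  # (add.w + ldr.w) per access
instructions_after = 1 + 2   # one shared add, then one ldr.w per access
print(instructions_before, instructions_after)
```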

Perf Events
PMU counters: look into ARM Infocenter
EPI: events per 1000 instructions
IPC: instructions per cycle; increased from 0.97 to 1.23
Events are sorted by EPI A
Event | Description | Total events A | Total events B | Diff | EPI A | EPI B
cycles | Hardware event | 4565192341 | 3433910203 | -24.78% | 1024 | 812
instructions | Hardware event | 4456377599 | 4229550675 | -5.09% | 1000 | 1000
0x14 | L1 Instruction cache access | 2142888354 | 1698883460 | -20.72% | 481 | 402
0xE5 | load/store instruction waiting for data to calculate the address in the AGU | 1330740047 | 225172166 | -83.08% | 299 | 53
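The derived columns can be recomputed from the raw totals. The sketch below is a minimal illustration, not the art-testing perf scripts: the helper names are hypothetical, and the counter totals are copied from the table above.

```python
def epi(event_count, instructions):
    """Events per 1000 instructions."""
    return event_count / instructions * 1000


def ipc(instructions, cycles):
    """Instructions per cycle."""
    return instructions / cycles


def diff_percent(a, b):
    """Relative change from build A to build B, in percent."""
    return (b - a) / a * 100


# (total events A, total events B), copied from the slide.
counters = {
    "cycles": (4565192341, 3433910203),
    "instructions": (4456377599, 4229550675),
    "0x14": (2142888354, 1698883460),
    "0xE5": (1330740047, 225172166),
}

instr_a, instr_b = counters["instructions"]
for name, (a, b) in counters.items():
    print(f"{name:<13} diff {diff_percent(a, b):+7.2f}%  "
          f"EPI A {epi(a, instr_a):5.0f}  EPI B {epi(b, instr_b):5.0f}")

cycles_a, cycles_b = counters["cycles"]
print(f"IPC A {ipc(instr_a, cycles_a):.3f}, IPC B {ipc(instr_b, cycles_b):.3f}")
```

Normalizing by instruction count makes the A and B builds comparable even though they execute a different number of instructions; the large drop in the 0xE5 EPI is what points at address computation as the source of the speedup.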

DS-5: Streamline Performance Analyzer
ARM DS-5 Streamline: system-wide performance analysis tool
Features:
- PMU counters
- Timeline
- Filter by processes and threads
- Multicore, multicluster and big.LITTLE
- Add custom annotations
- Overlay charts and customize expressions
- Mali GPU optimization
Supports Linux and Android; for Android, use the tutorial

Streamline: Diagram Example

Useful Links
1. https://source.android.com/ - Android Open Source Project
2. https://source.android.com/devices/tech/dalvik/ - Android Runtime
3. https://android-git.linaro.org/gitweb/linaro/art-testing.git/tree - Linaro benchmarks and tools repository
4. https://java.net/projects/c1visualizer/ - tool to visualize the ART intermediate representation
5. https://perf.wiki.kernel.org/index.php/main_page - Linux profiling with performance counters
6. https://developer.arm.com/products/software-development-tools/ds-5-development-studio/streamline - ARM Streamline Performance Analyzer
7. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500f/biidbaFB.html - Cortex-A53 Performance Monitor Unit Events


Android Runtime: Metrics
Compilation
- How long it takes to compile the app
- How much RAM is consumed during app compilation
Memory footprint
- Static: how much storage is required for the app binary
- Dynamic: how much RAM is consumed when the app is running
Run-time performance
- The quality of the generated code
From this point on, performance = run-time performance

Backup: Identifying hotspots
1. Check performance difference (in %)
2. Skim over the bench sources
3. Run perf with cycles event
4. Identify the hotspots using perf report
   a. Single very hot java leaf method
   b. Non-leaf java method not from boot.oat
   c. Native method
   d. Java method from boot.oat
5. Examine the hotspot
   a. Validate that this particular hotspot determines the difference in total performance
   b. Split and alter big methods

Backup: Analyzing Hotspots
1. Get the .cfg file for the method
2. Identify the exact piece of hot code
   a. Loops
   b. Perf-annotate
3. Compare the corresponding IR (c1visualizer)
   a. Find the compiler phase where the difference occurs
4. Compare the corresponding assembly (c1visualizer)
5. Use a static binary built from assembly to validate the performance difference
6. Run perf scripts with all PMU events
   a. Total-period option
   b. CPI (cycles per instruction) reflects the performance
   c. ${Counter} / instructions * 1000 reflects the counter's impact

O and on: What's in AOSP's future, and how can we help?
- New partition layout, A/B updates
- What else would we LIKE to see in AOSP's future, and how can we help bring it about?

Anything else?
Did the topics of the miniconference bring up another topic we should be talking about? Did we omit an important topic? Feel free to talk about anything AOSP-related now...


Memory allocator analysis
Primary focus: reduce memory usage on low-memory devices
Malloc implementations investigated: jemalloc, dlmalloc, nedmalloc, tcmalloc, musl malloc, TLSF, lockless allocator
Memory analysis briefing: https://docs.google.com/document/d/15ycueuplwzs0lpxmvc6ykralonfly8qe2ma7kkeoa_o/edit#heading=h.z9n368sk0eai
Challenges
- Porting of atomic routines for the ARM 64-bit platform
- Name mangling issues
- C99 warnings
- Wrapper/dummy calls for bionic integration (e.g. malloc_usable_size, malloc_disable, mallinfo, etc.)
- Other runtime issues
- Benchmark porting (tlsf-test, t-test)
- Fragmentation analysis script

Memory allocator analysis: Summary
- tcmalloc and jemalloc win for multi-threaded apps and run-time performance (a good amount of small pages is available at runtime)
- Static size reduction for libc is better with nedmalloc and TLSF
- jemalloc-svelte does not compare well with jemalloc and tcmalloc
- Support issue with nedmalloc: it is no longer supported
- Lockless allocator: under a private license
Note: the rank graph is generated based on relative performance. For real numbers, kindly refer to the memory analysis document.