PERF performance-counter for Odroid XU3/XU4

This note covers Linux performance measurement on the Odroid XU3/XU4 using hardware counters, tracepoints, software performance counters, and dynamic probes.

perf is one of the two most commonly used performance-counter profiling tools on Linux. It is used to analyze bottlenecks from the CPU core right up to the driver level. Linux supports many profiling and tracing tools, such as perf, trace-cmd, blktrace, strace and oprofile.

Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots. perf provides rich, generalized abstractions over hardware-specific capabilities. Among others, it provides per-task, per-CPU and per-workload counters, sampling on top of these, and source code event annotation. With perf we can also monitor the performance of a device driver.

Build the perf tool

To build perf, first install the following packages:

    $ sudo apt-get install flex bison libdw-dev libnewt-dev binutils-dev \
        libaudit-dev libgtk2.0-dev libssl-dev python-dev systemtap-sdt-dev \
        libiberty-dev libperl-dev liblzma-dev libpython-dev libunwind-* \
        asciidoc xmlto

Check out the kernel source code and build the perf executable:

    $ git clone --depth 1 https://github.com/hardkernel/linux -b odroidxu4-4.14.y
    $ cd linux/tools/perf
    $ make
    $ sudo cp perf /usr/bin/perf

Note: PMU support for perf is already integrated in the kernel, so only the perf binary needs to be built in order to test it.

Check whether the kernel supports perf (kernel 4.14 or higher is required):

    root@odroid:~# dmesg | grep PMU
    [ 0.250870] EXYNOS5420 PMU initialized
    [ 0.749038] hw perfevents: enabled with armv7_cortex_a7 PMU driver, 5 counters available
    [ 0.750030] hw perfevents: enabled with armv7_cortex_a15 PMU driver, 7 counters available
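The dmesg check above can also be scripted. The following is a minimal sketch and not part of the original note: pmu_counters is a hypothetical helper name, and it assumes the "hw perfevents: enabled with ... PMU driver, N counters available" message format shown above. It reads kernel log text on standard input and prints each PMU name followed by its counter count.

```shell
#!/bin/sh
# pmu_counters (hypothetical helper): read kernel log text on stdin and
# print "<pmu-name> <counter-count>" for every perf PMU driver line.
# Assumes the "hw perfevents: enabled with ..." message format shown above.
pmu_counters() {
    sed -n 's/.*hw perfevents: enabled with \([^ ]*\) PMU driver, \([0-9][0-9]*\) counters available.*/\1 \2/p'
}

# Demonstration on the three log lines from this note; on the board itself,
# run `dmesg | pmu_counters` instead.
pmu_counters <<'EOF'
[ 0.250870] EXYNOS5420 PMU initialized
[ 0.749038] hw perfevents: enabled with armv7_cortex_a7 PMU driver, 5 counters available
[ 0.750030] hw perfevents: enabled with armv7_cortex_a15 PMU driver, 7 counters available
EOF
```

If the helper prints nothing on a real board, the running kernel has most likely not registered a perf PMU driver, and hardware events would not be available.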

Check the list of perf events we can monitor:

    root@odroid:~# perf list
    List of pre-defined events (to be used in -e):
    branch-instructions OR branches
    branch-misses
    bus-cycles
    cache-misses
    cache-references
    cpu-cycles OR cycles
    instructions
    alignment-faults
    bpf-output
    context-switches OR cs
    cpu-clock
    cpu-migrations OR migrations
    dummy
    emulation-faults
    major-faults
    minor-faults
    page-faults OR faults
    task-clock
    L1-dcache-load-misses
    L1-dcache-loads
    L1-dcache-store-misses
    L1-dcache-stores
    L1-icache-load-misses
    L1-icache-loads
    LLC-load-misses
    LLC-loads
    LLC-store-misses
    LLC-stores
    branch-load-misses
    branch-loads
    dtlb-load-misses
    dtlb-store-misses
    itlb-load-misses
    armv7_cortex_a15/br_immed_retired/
    armv7_cortex_a15/br_mis_pred/
    armv7_cortex_a15/br_pred/
    armv7_cortex_a15/br_return_retired/
    armv7_cortex_a15/bus_access/
    armv7_cortex_a15/bus_cycles/
    armv7_cortex_a15/cid_write_retired/
    armv7_cortex_a15/cpu_cycles/
    armv7_cortex_a15/exc_return/
    armv7_cortex_a15/exc_taken/
    armv7_cortex_a15/inst_retired/
    armv7_cortex_a15/inst_spec/
    armv7_cortex_a15/l1d_cache/
    armv7_cortex_a15/l1d_cache_refill/
    armv7_cortex_a15/l1d_cache_wb/
    armv7_cortex_a15/l1d_tlb_refill/
    armv7_cortex_a15/l1i_cache/
    armv7_cortex_a15/l1i_cache_refill/

Perf Examples

    root@odroid:~/perf-examples# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
    1000000+ records in
    1000000+ records out
    512000000 bytes (512 MB, 488 MiB) copied, 0.840694 s, 609 MB/s

    Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

         842.111288  task-clock (msec)   #   0.996 CPUs utilized
                  1  context-switches    #   0.001 K/sec
                  0  cpu-migrations      #   0.000 K/sec
                 42  page-faults         #   0.050 K/sec
         1684203841  cycles              #   2.000 GHz
         1435117503  instructions        #   0.85 insn per cycle
          311869004  branches            # 370.342 M/sec
           11924108  branch-misses       #   3.82% of all branches

        0.845417981 seconds time elapsed

Note: The Exynos5422 is a big.LITTLE SoC, so the counters should be collected separately for each cluster. The runs below pin the workload with taskset: CPU 0 is a Cortex-A7 (LITTLE) core and CPU 4 is a Cortex-A15 (big) core. In the CPU 0 run, the cycle and instruction counts are far lower than the 1.66 s of task-clock would imply, which suggests that the default hardware events were not being counted on that core for most of the run.

    root@odroid:~/perf-examples# perf stat -B taskset -c 0 dd if=/dev/zero of=/dev/null count=1000000
    1000000+ records in
    1000000+ records out
    512000000 bytes (512 MB, 488 MiB) copied, 1.65277 s, 310 MB/s

    Performance counter stats for 'taskset -c 0 dd if=/dev/zero of=/dev/null count=1000000':

        1655.839284  task-clock (msec)   #   0.999 CPUs utilized
                  7  context-switches    #   0.004 K/sec
                  1  cpu-migrations      #   0.001 K/sec
                 77  page-faults         #   0.047 K/sec
            1773536  cycles              #   0.001 GHz
             444207  instructions        #   0.25 insn per cycle
              93267  branches            #   0.056 M/sec
               9169  branch-misses       #   9.83% of all branches

        1.657392774 seconds time elapsed

    root@odroid:~/perf-examples# perf stat -B taskset -c 4 dd if=/dev/zero of=/dev/null count=1000000
    1000000+ records in
    1000000+ records out
    512000000 bytes (512 MB, 488 MiB) copied, 0.809315 s, 633 MB/s

    Performance counter stats for 'taskset -c 4 dd if=/dev/zero of=/dev/null count=1000000':

         811.520288  task-clock (msec)   #   0.998 CPUs utilized
                  6  context-switches    #   0.007 K/sec
                  1  cpu-migrations      #   0.001 K/sec
                 77  page-faults         #   0.095 K/sec
         1622986577  cycles              #   2.000 GHz
         1435747079  instructions        #   0.88 insn per cycle
          311780313  branches            # 384.193 M/sec
            8700181  branch-misses       #   2.79% of all branches

        0.812844283 seconds time elapsed

perf record/report

perf record: By default, perf record uses the cycles event as the sampling event. This is a generic hardware event that the kernel maps to a hardware-specific PMU event.

perf report: Samples collected by perf record are saved into a binary file, called perf.data by default. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by function, with the most frequently sampled functions first. The sorting order can be customized to view the data differently.
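The sorting customization mentioned above can be sketched as follows. This is a hedged example rather than part of the original note: record_and_report is a hypothetical wrapper, the PERF variable is an assumed hook so that the commands can be previewed without running them, and comm,dso is just one possible choice of keys for perf report --sort.

```shell
#!/bin/sh
# PERF is an assumed override hook (not a perf feature): set PERF=echo to
# print the commands instead of running them.
PERF="${PERF:-perf}"

# record_and_report (hypothetical wrapper): sample the cycles event
# system-wide while `sleep` runs, then print a text report sorted by the
# given keys.
#   $1 = sampling duration in seconds
#   $2 = sort keys for `perf report --sort`, e.g. comm,dso
record_and_report() {
    "$PERF" record -e cycles -a -o perf.data sleep "$1" &&
        "$PERF" report -i perf.data --sort "$2" --stdio
}

# Dry run: show the two perf invocations without executing them.
PERF=echo record_and_report 5 comm,dso
```

On the board, calling record_and_report 5 comm,dso as root without the PERF override would run the real commands and group the samples by command and shared object instead of by function. Other sort keys, such as sym, can be combined in the same comma-separated list.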

    root@odroid:~/perf-examples# perf record -a sleep 5
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.103 MB perf.data (289 samples) ]

    root@odroid:~/perf-examples# perf report
    Samples: 289 of event 'cycles:ppp', Event count (approx.): 28006656
    Overhead  Command          Shared Object     Symbol
      40.33%  swapper          [kernel.vmlinux]  [k] arch_cpu_idle
       7.23%  swapper          [kernel.vmlinux]  [k] tick_nohz_idle_exit
       5.40%  swapper          [kernel.vmlinux]  [k] tick_nohz_idle_enter
       3.83%  swapper          [kernel.vmlinux]  [k] _raw_spin_unlock_irq
       3.27%  sleep            [kernel.vmlinux]  [k] filemap_map_pages
       3.25%  perf             [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
       2.14%  sleep            [kernel.vmlinux]  [k] page_remove_rmap
       1.82%  perf             [kernel.vmlinux]  [k] perf_event_ctx_lock_nested
       1.78%  swapper          [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
       1.70%  ksoftirqd/4      [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
       1.68%  sleep            libc-2.23.so      [.] 0x00050840
       1.67%  perf             [kernel.vmlinux]  [k] page_remove_rmap
       1.61%  perf             [kernel.vmlinux]  [k] remove_vma
       1.51%  kworker/u16:     [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
       1.48%  perf             [kernel.vmlinux]  [k] ext4_da_write_begin
       1.44%  kworker/u16:     [kernel.vmlinux]  [k] _find_opp_table_unlocked
       1.35%  swapper          [kernel.vmlinux]  [k] exception_text_end
       1.33%  perf             [kernel.vmlinux]  [k] alloc_set_pte
       1.23%  kworker/:1       [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
       1.22%  perf             [kernel.vmlinux]  [k] _test_and_set_bit
       1.06%  perf             [kernel.vmlinux]  [k] _raw_spin_lock
       1.03%  kworker/u16:     [kernel.vmlinux]  [k] update_devfreq_passive
       0.83%  kworker/u16:     [kernel.vmlinux]  [k] _raw_spin_unlock_irq
       0.80%  kworker/:1       [kernel.vmlinux]  [k] memchr_inv
       0.79%  rs:main Q:Reg    [kernel.vmlinux]  [k] balance_dirty_pages_ratelimited
       0.79%  rs:main Q:Reg    rsyslogd          [.] 0x0002c8ae
       0.70%  sleep            [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
       0.68%  systemd-journal  systemd-journald  [.] 0x00015f1c
       0.61%  rs:main Q:Reg    [kernel.vmlinux]  [k] kmap_atomic
       0.54%  systemd-journal  systemd-journald  [.] 0x0002aeac

External Links

You can find more at the following links:

    https://perf.wiki.kernel.org/index.php/tutorial
    http://www.brendangregg.com/perf.html

From: http://wiki.odroid.com/ - ODROID Wiki
Permanent link: http://wiki.odroid.com/odroid-xu4/application_note/software/perf_perfomace_counter_for_odroid_xu3_xu4
Last update: 2017/11/21 08:28