HermitCore A Low Noise Unikernel for Extrem-Scale Systems

Size: px

Start display at page:

Download "HermitCore A Low Noise Unikernel for Extrem-Scale Systems"

Nora Bennett
5 years ago
Views:

1 HermitCore A Low Noise Unikernel for Extrem-Scale Systems Stefan Lankes 1, Simon Pickartz 1, Jens Breitbart 2 1 WTH Aachen University, Aachen, Germany 2 Bosch Chassis Systems Control, Stuttgart, Germany

2 Agenda Motivation Challenges for the Future Systems OS Architectures HermitCore Design Performance Evaluation Conclusion and Outlook 2 HermitCore Stefan Lankes et al. WTH Aachen University

3 Motivation Complexity of high-end HPC systems keeps growing Extreme degree of parallelism Heterogeneous core architectures Deep memory hierarchy Power constrains Need for scalable, reliable performance and capability to rapidly adapt to new HW Applications have also become complex In-situ analysis, workflows Sophisticated monitoring and tools support, etc... Isolated, consistent simulation performance Dependence on POSIX, MPI and OpenMP Seemingly contradictory requirements... 3 HermitCore Stefan Lankes et al. WTH Aachen University

4 Operating System = Overhead Every administration layer has its overhead e. g. Hourglass benchmark 10 6 Number of events Loop time (cycles) 4 HermitCore Stefan Lankes et al. WTH Aachen University

5 Operating System = Overhead Every administration layer has its overhead e. g. Hourglass benchmark 10 6 Number of events Loop time (cycles) How does the HPC / Cloud Community reduce overhead? 4 HermitCore Stefan Lankes et al. WTH Aachen University

6 How does the HPC community reduce the overhead? Light-weight Kernels Typically taken of an existing fat kernel (e. g. Linux) emovalof unneeded features to improve the scalability e. g. ZeptoOS Multi-Kernels A specialized kernel runs side-by-side to full-weight kernel (e. g. Linux) Applications of the full-weight kernels are able to run on the specialized kernel. The specialized kernel catch every system call and delegate them to the full-weight kernel Binary compatible to the full-weight kernel Examples: mos, McKernel 5 HermitCore Stefan Lankes et al. WTH Aachen University

7 OS Designs for Cloud Computing Usage of Common OS Application Application OS eth0 OS eth0 Hypervisor Software Virtual Switch Operating System eth0 Two operating systems to maintain a single computer? Double Management! Why does a Cloud base on a Multi-User-/Multi-Tasking OS? 6 HermitCore Stefan Lankes et al. WTH Aachen University

8 OS Designs for Cloud Computing LibraryOS Application Application libos eth0 libos eth0 Hypervisor Software Virtual Switch Operating System eth0 Now, every system call is a function call Low overhead Whole optimization of an image possible (including the library OS) Link Time Optimization (LTO) emoving unneeded code reduces the attack vector 7 HermitCore Stefan Lankes et al. WTH Aachen University

9 elated Unikernels ump kernels 1 Part of NetBSD e. g., NetBSD s TCP / IP stack is available as library Strong dependencies to the hypervisor Not directly bootable on a standard hypervisor (e. g., KVM) IncludeOS 2 uns natively on the hardware minimal Overhead Only 32 bit support to avoid the overhead of paging uncommon in HPC MirageOS 3 Designed for the high-level language OCaml uncommon in HPC 1 A. Kantee and J. Cormack. ump Kernels No OS? No Problem!. In: ; login: A. Bratterud et al. IncludeOS: A esource Efficient Unikernel for Cloud Services. In: 7 th Int. Conference on Cloud Computing Technology and Science A. Madhavapeddy et al. Unikernels: Library Operating Systems for the Cloud. In: 8 th Int. Conference on Architectural Support for Programming Languages and Operating Systems HermitCore Stefan Lankes et al. WTH Aachen University

10 HermitCore Design Combination of the Unikernel and Multi-Kernel to reduce the overhead The same binary is able to run in a VM (classical unikernel setup) or bare-metal side-by-side to Linux (multi-kernel setup) Support for dominant programming models (MPI, OpenMP) Single-address space operating system No TLB Shootdown 9 HermitCore Stefan Lankes et al. WTH Aachen University

11 HermitCore Design Classical Unikernel Setup Applications are able to boot directly within a VM Tested with Qemu / KVM Tested with uhyve (experimental KVM-based Hypervisor) Qemu emulates more than HermitCore needs large setup time uhyve reduce the boot time (from 2 s to ~30 ms) Amazon Web Services (available soon!) Multi-Kernel Setup One kernel per NUMA node Only local memory accesses (UMA) Message passing between NUMA nodes One FWK (Linux) in the system to get access to a broader driver support Only a backup for pre- / post-processing Critical path should be handled by HermitCore 10 HermitCore Stefan Lankes et al. WTH Aachen University

12 Booting HermitCore On the detection of a HermitCore app, a proxy will be started. Proxy libc Linux kernel Hardware 11 HermitCore Stefan Lankes et al. WTH Aachen University

13 Booting HermitCore Proxy On the detection of a HermitCore app, a proxy will be started. The proxy unplugs a set of cores. libc Linux kernel Hardware 11 HermitCore Stefan Lankes et al. WTH Aachen University

14 Booting HermitCore App OpenMP / MPI Newlib libos (LwIP, IQ, etc.) Proxy libc Linux kernel On the detection of a HermitCore app, a proxy will be started. The proxy unplugs a set of cores. Triggers Linux to boot HermitCore on the unused cores. Hardware 11 HermitCore Stefan Lankes et al. WTH Aachen University

15 Booting HermitCore App OpenMP / MPI Newlib libos (LwIP, IQ, etc.) Proxy libc Linux kernel On the detection of a HermitCore app, a proxy will be started. The proxy unplugs a set of cores. Triggers Linux to boot HermitCore on the unused cores. A reliable connection will be established. Hardware 11 HermitCore Stefan Lankes et al. WTH Aachen University

16 Booting HermitCore Hardware Proxy libc Linux kernel On the detection of a HermitCore app, a proxy will be started. The proxy unplugs a set of cores. Triggers Linux to boot HermitCore on the unused cores. A reliable connection will be established. By termination, the cores are set to the HALT state. 11 HermitCore Stefan Lankes et al. WTH Aachen University

17 Booting HermitCore Proxy libc Linux kernel Hardware On the detection of a HermitCore app, a proxy will be started. The proxy unplugs a set of cores. Triggers Linux to boot HermitCore on the unused cores. A reliable connection will be established. By termination, the cores are set to the HALT state. Finally, reregistering of the cores to Linux. 11 HermitCore Stefan Lankes et al. WTH Aachen University

18 untime Support SSE, AVX2, AVX512, FMA,... Full C-library support (newlib) HBM support similar to memkind IP interface & BSD sockets (LwIP) IP packets are forwarded to Linux Shared memory interface Pthreads Thread binding at start time No load balancing less housekeeping OpenMP via Intel s untime icce- & MPI (via SCC-MPICH) Full support for the Go runtime Tile Core 23 L2$ MC MC 3 MIU Core 22 MPB L2$ outer MC MC 2 FPGA 12 HermitCore Stefan Lankes et al. WTH Aachen University

19 Operating System Micro-Benchmarks Test systems Intel Haswell CPUs (E v3) clocked at 2.3 GHz Intel KNL (Phi 7210) clocked at 1.3 GHz, SNC mode with four NUMA nodes esults in CPU cycles System activity KNL Haswell HermitCore Linux HermitCore Linux getpid() sched_yield() malloc() first write access to a page HermitCore Stefan Lankes et al. WTH Aachen University

1 105 Standard Linux 1 105 Linux with enabled isolcpus & nohz 0.8 0.8 Gap in Ticks 0.6 0.

2 0 0 50 100 150 200 250 300 Time in s 1 105 HermitCore 0 0 50 100 150 200 250 300 Time in s 1

20 1 105 Standard Linux Linux with enabled isolcpus & nohz Gap in Ticks Gap in Ticks Time in s HermitCore Time in s HermitCore without IP thread Gap in Ticks Gap in Ticks Time in s Time in s

21 Overhead of VMs Determined via NAS Parallel Benchmarks (Class B) Native (Linux) HermitCore + uhyve 20 Speed in GMop/s CG BT SP EP IS 15 HermitCore Stefan Lankes et al. WTH Aachen University

22 Conclusions It works! Binary packages are available educe the OS noise significantly Try it out! Thank you for your kind attention! 16 HermitCore Stefan Lankes et al. WTH Aachen University

23 Backup slides

24 Outlook A fast direct access to the interconnect is required S-IOV simplifies the coordination between Linux & HermitCore HermitCore Core Core Memory Node 2 Linux Core Core Memory Node 0 NIC vnic Virtual IP Device / Message Passing Interafce HermitCore Core Core Memory Node 3 HermitCore Core Core Memory Node 1 vnic vnic 18 HermitCore Stefan Lankes et al. WTH Aachen University

25 OS Designs for Cloud Computing Container Application Application OS Container eth0 Container Building of virtual borders (namespaces) Containers and their processes doesn t see each other Fast access to OS services Less secure because an exploit for the container attacks also the host OS Doesn t reduce the OS noise of the host system 19 HermitCore Stefan Lankes et al. WTH Aachen University

26 EPCC OpenMP Micro-Benchmarks Overhead in µs Parallel on Linux (gcc) Parallel For on Linux (gcc) Parallel on Linux (icc) Parallel For on Linux (icc) Parallel on HermitCore Parallel For on HermitCore Number of Threads 20 HermitCore Stefan Lankes et al. WTH Aachen University

27 Throughput esults of the Inter-kernel Communication Layer Throughput in GiB/s PingPong via icce PingPong via SCC-MPICH PingPong via ParaStation MPI Ki 64 Ki 1 Mi Size in Byte 21 HermitCore Stefan Lankes et al. WTH Aachen University

28 Lack of programmability Non-Uniform Memory Access Costs for memory access may vary un processes where memory is allocated Allocate memory where the process resides Implications for the performance Where should the applications store the data? Who should decide the location? The operating system? The application developers? Socket 0 Socket 1 Interconnect Memory 0 Memory 1 22 HermitCore Stefan Lankes et al. WTH Aachen University

29 Lack of programmability Non-Uniform Memory Access Costs for memory access may vary un processes where memory is allocated Allocate memory where the process resides Implications for the performance Where should the applications store the data? Who should decide the location? The operating system? The application developers? Memory 2 Memory 3 Interconnect Memory 0 Memory 1 22 HermitCore Stefan Lankes et al. WTH Aachen University

30 Tuning Tricks Parallelization via Shared Memory (OpenMP) Many side-effects and error-prone Incremental parallelization Parallelization via Message Passing (MPI) estructuring of the sequential code Less side-effects Performance Tuning Bind MPI applications on one NUMA node No remote memory access Speed in GFlop/s LU-MZ.C.16 1 BT-MZ.C.16 SP-MZ.C x8 4x4 8x2 16x1 < threadcount > < proccount > 23 HermitCore Stefan Lankes et al. WTH Aachen University

31 OpenMP untime GCC includes a OpenMP untime (libgomp) euse synchronization primitives of the Pthread library Other OpenMP runtimes scales better In addition, our Pthread library was originally not designed for HPC Integration of Intel s OpenMP untime Include its own synchronization primitives Binary compatible to GCC s OpenMP untime Changes for the HermitCore support are small Mostly deactivation of function to define the thread affinity Transparent usage For the end-user, no changes in the build process 24 HermitCore Stefan Lankes et al. WTH Aachen University

32 Support of compilers beside GCC Just avoid the standard environment ( ffreestanding) Set include path to HermitCore s toolchain Be sure that the ELF file use HermitCore s ABI Patching object files via elfedit Use the GCC to link the binary LD = x86_64 - hermit - gcc #CC = x86_64 - hermit - gcc # CFLAGS = -O3 -mtune = native -march = native - fopenmp -mno -red - zone CC = icc - D hermit CFLAGS = -O3 -xhost -mno -red - zone - ffreestanding -I$( HEMIT_DI ) -openmp ELFEDIT = x86_64 - hermit - elfedit stream.o: stream.c $(CC) $( CFLAGS ) -c -o $@ $< $( ELFEDIT ) -- output - osabi HermitCore $@ stream : stream.o $(LD) -o $@ $< $( LDFLAGS ) $( CFLAGS ) 25 HermitCore Stefan Lankes et al. WTH Aachen University

33 Is the software stack difficult to maintain? Changes to the common software stack determined with cloc Software Stack LoC Changes binutils gcc Linux Newlib LwIP Pthread OpenMP T HermitCore HermitCore Stefan Lankes et al. WTH Aachen University

34 Number of events Linux Number of events Hermit w LwIP Loop time (cycles) Loop time (cycles) Number of events Linux (isolcpu) Number of events Hermit wo LwIP Loop time (cycles) Loop time (cycles)

35 Hydro (preliminary results) MFLOPS 10,000 5,000 Linux (1 process n threads) Linux (1 proc. n thr., bind-to 0 19) Linux (n proc. 5 thr., bind-to numa) HermitCore (n proc. 5 thr.) Number of Cores 28 HermitCore Stefan Lankes et al. WTH Aachen University

36 Thank you for your kind attention! Stefan Lankes et al. Institute for Automation of Complex Power Systems E.ON Energy esearch Center, WTH Aachen University Mathieustraße Aachen

High Performance Computing Systems

High Performance Computing Systems Multikernels Doug Shook Multikernels Two predominant approaches to OS: Full weight kernel Lightweight kernel Why not both? How does implementation affect usage and performance?