Bare Metal Library. Abstractions for modern hardware Cyprien Noel

1 Bare Metal Library Abstractions for modern hardware Cyprien Noel

2 Plan
Modern Hardware?
New challenges & opportunities
Three use cases
  Current solutions
  Leveraging hardware
  Simple abstraction

3 Myself
High performance trading systems
  Lock-free algos, distributed systems
H2O
  Distributed CPU machine learning, async SGD
Flickr
  Scaling deep learning on GPU
  Multi GPU Caffe
  RDMA, multicast, distributed Hogwild
  CaffeOnSpark
UC Berkeley
  NCCL Caffe, GPU cluster tooling
  Bare Metal

4 Modern Hardware?

5 Device-to-device networks

7 Moving from ms software to µs hardware
Number crunching -> GPU
FS, block io, virt mem -> Pmem
Network stack -> RDMA
RAID, replication -> Erasure codes
Device mem -> Coherent fabrics
And more: video, crypto etc.

8 OS abstractions replaced by
CUDA, OFED, Libpmem, DPDK, SPDK, Libfabric, UCX, VMA
More every week...
More powerful, but also more complex and non-interoperable

9 Summary So Far
Big changes coming!
  At least for high-performance applications
CPU should orchestrate
  Not in critical path
  Device-to-device networks
Retrofitting existing architectures difficult
  CPU-centric abstractions
  ms software on µs hardware (e.g. 100s of instructions per packet)
  OK in some cases, e.g. VMA (kernel bypass sockets), but much lower acceleration, most features inexpressible

10 What do we do?
Start from scratch?
  E.g. Google Fuchsia - no fs, block io, network etc.
  Very interesting but future work
Use already accelerated frameworks?
  E.g. PyTorch, BeeGFS
  Not general purpose, no interop, not device-to-device
Work incrementally from use cases
  Look for simplest hardware solution
  Hopefully useful abstractions will emerge

11 Use cases
Build datasets
  Add, update elements
  Apply functions to sets, map-reduce
  Data versioning
Training & inference
  Compute graphs, pipelines
  Deployment
  Model versioning

12 Datasets
Typical solution
  Protobuf messages
  KV store
  Dist. file system
Limitations
  Serialization granularity
  Copies: kv log, kernel1, replication, kernel2, fs
  Remote CPU involved, stragglers
  Cannot place data in device
[Figure label: (x12)]

13 Datasets
Simplest hardware implementation
  Write protobuf in arena, like FlatBuffers
  Pick an offset on disks, e.g. a namespace
  Call ibv_exp_ec_encode_async
Comments
  Management, coordination, crash resiliency
  Thin wrapper over HW: line rate perf.
User abstraction?
  Simple, familiar
  Efficient, device friendly
[Figure labels: EC shard (x12)]
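The erasure-coding step can be illustrated with a minimal Python sketch. It swaps in plain XOR parity, the simplest erasure code, and runs on the CPU; it is not a claim about what ibv_exp_ec_encode_async computes or how the library lays out shards. The function names (encode, recover) and the choice of two data shards are assumptions made for illustration only.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(payload, k=2):
    """Split payload into k equal data shards plus one XOR parity shard."""
    shard_len = -(-len(payload) // k)  # ceiling division
    padded = payload.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def recover(shards, length):
    """Rebuild the payload when at most one shard (marked None) is missing."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one lost shard")
    shards = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        # XOR of all surviving shards equals the lost one.
        shards[missing[0]] = reduce(xor_bytes, present)
    return b"".join(shards[:-1])[:length]


message = b"serialized protobuf bytes"  # stand-in for an arena-encoded message
shards = encode(message)
shards[0] = None  # simulate losing one disk or node
restored = recover(shards, len(message))
```

With two data shards and one parity shard, three shards are stored for every two shards of data, a 1.5x storage factor, while any single shard can be lost without losing the message.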

14 mmap
Extension to classic mmap
  Distributed
  Typed - Protobuf, other formats planned
Protobuf is amazing
  Forward and backward compatible
  Lattice

15 mmap
C++
    const Test& test = mmap<Test>("/test");
    int i = test.field();
Python
    test = Test()
    bm.mmap("/test", test)
    i = test.field()

16 mmap, recap
Simple abstraction for data storage
Fully accelerated, mechanically friendly
  Thin wrapper over HW, device-to-device, zero copy
  ~1.5x replication factor
  Network automatically balanced
  Solves straggler problem
  No memory pinning or TLB thrashing, NUMA aware
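The "typed mmap" idea can be sketched on a single machine with the Python standard library. This is a sketch under stated assumptions, not the library's implementation: it uses a fixed-layout ctypes record instead of Protobuf, a local file instead of distributed storage, and the names typed_mmap and the Test layout are hypothetical.

```python
import ctypes
import mmap
import os
import tempfile


class Test(ctypes.Structure):
    """Hypothetical fixed-layout record standing in for the slides' Test message."""
    _fields_ = [("field", ctypes.c_int32), ("score", ctypes.c_double)]


def typed_mmap(path, record_type):
    """Map `path` and return (record, mapping); the record aliases the mapped bytes."""
    size = ctypes.sizeof(record_type)
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)  # grow the file so the record fits
        mapping = mmap.mmap(fd, size)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    # from_buffer shares memory with the mapping: no copy, no deserialization.
    return record_type.from_buffer(mapping), mapping


path = os.path.join(tempfile.mkdtemp(), "test")

record, mapping = typed_mmap(path, Test)
record.field = 12  # writes straight into the mapped page
mapping.flush()
del record  # release the buffer export before closing the mapping
mapping.close()

reopened, mapping2 = typed_mmap(path, Test)
value = reopened.field  # read through a fresh mapping of the same file
```

The point of the sketch is that field access goes directly through the mapping, so there is no serialization step between storage and the typed object.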

17 Use cases
Compute
  Map-reduce, compute graphs, pipelines
Typical setup
  Spark, DL frameworks
  Distribution using Akka, gRPC, MPI
  Kubernetes or SLURM scheduling
Limitations
  No interop
  Placement difficult
  Inefficient resource allocation

18 Compute
Simplest hardware implementation
  Define a task, e.g. image resize, CUDA kernel, PyTorch graph
  Place tasks in queue
  Work stealing - RDMA atomics
  Device-to-device chaining - GPUDirect Async
User abstraction?

19 task
    def compute(x, y):
        return x * y

    # Runs locally
    compute(1, 2)

    # Might be rebalanced on cluster
    data = bm.list()
    bm.mmap("/data", data)
    compute(data, 2)

20 task, recap
Simple abstraction for CPU and device kernels
Work stealing instead of explicit schedule
  No GPU hoarding
  Better work balancing
  Dynamic placement, HA
Device-to-device chaining
  Data placed directly in device memory
  Efficient pipelines, even very short tasks
  E.g. model parallelism, low latency inference
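Work stealing can be illustrated with a small single-process simulation in Python. The slides describe queues shared across machines and claimed with RDMA atomics; the sketch below replaces that with ordinary in-memory deques and a round-robin loop, and the Worker class and run function are hypothetical names, not part of the library.

```python
from collections import deque


class Worker:
    """A worker with its own task queue and a list of results it produced."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()
        self.done = []

    def push(self, fn, *args):
        self.tasks.append((fn, args))

    def next_task(self, peers):
        """Pop own newest task, else steal the oldest task of the busiest peer."""
        if self.tasks:
            return self.tasks.pop()
        victim = max(peers, key=lambda w: len(w.tasks), default=None)
        if victim is not None and victim.tasks:
            return victim.tasks.popleft()
        return None


def run(workers):
    """Give each worker one turn per round until no queue has tasks left."""
    progressed = True
    while progressed:
        progressed = False
        for worker in workers:
            peers = [w for w in workers if w is not worker]
            task = worker.next_task(peers)
            if task is not None:
                fn, args = task
                worker.done.append(fn(*args))
                progressed = True


def compute(x, y):
    return x * y


a, b = Worker("a"), Worker("b")
for i in range(6):
    a.push(compute, i, 2)  # all work is submitted to worker a only
run([a, b])
```

Although every task was queued on worker a, worker b ends up running half of them, with no scheduler deciding placement in advance. Owners take from one end of the queue and thieves from the other, which keeps the two from contending for the same task.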

21 Use cases
Versioning
  Track datasets and models
  Deploy / rollback models
Typical setup
  Copy before update
  Symlinks as versions to data
  Staging / production environments split

22 Versioning
Simplest hardware implementation
  Keep multiple write ahead logs
    mmap updates
    tasks queues
User abstraction?

23 branch
Like a git branch
  But any size data
Simplifies collaboration, experimentation
  Generalized staging / production split
Simplifies HA
  File system fsync, msync (Very hard! Rajimwale et al. DSN 11)
  Replaces transactions, e.g. queues, persistent memory
  Allows duplicate work merge

24 branch
C++
    Test* test = mutable_mmap<Test>("/test");
    branch b;
    // Only visible in current branch
    test->set_field(12);
Similar in Python
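One way to picture a branch built on multiple write-ahead logs is the following Python sketch. It assumes a plain in-memory dictionary as the shared base and one append-only log per branch; the Store and Branch classes, their methods, and the "/test.field" key are hypothetical and only illustrate visibility, merge, and rollback.

```python
class Store:
    """Shared base data plus one append-only log per branch."""

    def __init__(self):
        self.base = {}
        self.logs = {}

    def branch(self, name):
        self.logs.setdefault(name, [])
        return Branch(self, name)


class Branch:
    def __init__(self, store, name):
        self.store = store
        self.name = name

    def set(self, key, value):
        # Updates are only appended to this branch's log; the base is untouched.
        self.store.logs[self.name].append((key, value))

    def get(self, key):
        # Latest entry in this branch's log wins; otherwise fall back to the base.
        for k, v in reversed(self.store.logs[self.name]):
            if k == key:
                return v
        return self.store.base.get(key)

    def merge(self):
        """Replay the log onto the base in order, then clear it."""
        for k, v in self.store.logs[self.name]:
            self.store.base[k] = v
        self.store.logs[self.name].clear()

    def rollback(self):
        """Discard this branch's pending updates."""
        self.store.logs[self.name].clear()


store = Store()
store.base["/test.field"] = 1
staging = store.branch("staging")
production = store.branch("production")

staging.set("/test.field", 12)
seen_in_staging = staging.get("/test.field")        # 12, visible in this branch
seen_in_production = production.get("/test.field")  # 1, base value unchanged
staging.merge()
after_merge = production.get("/test.field")         # 12 once merged
```

Because an update lives only in its branch's log until merge, a staging / production split, or a rollback, needs no copy of the underlying data.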

25 Summary
mmap, task, and branch simplify hardware acceleration
Helps build pipelines, manage cluster resources etc.
Early micro benchmarks suggest very high performance

26 Thank You!
Will be open sourced under a BSD license
Contact me if interested - cyprien.noel@berkeley.edu
Thanks to our sponsor
