RADU POPESCU
IMPROVING THE WRITE SCALABILITY OF THE CERNVM FILE SYSTEM WITH ERLANG/OTP

THE EUROPEAN ORGANISATION FOR PARTICLE PHYSICS RESEARCH (CERN)
THE LARGE HADRON COLLIDER

TUNNEL VISION
- 27 km circumference
- 100 m underground
- 180 MW power consumption
- 7 TeV per beam

ALICE, ATLAS, CMS AND LHCB DETECTORS

CMS DETECTOR INNER BARREL

SUPER COLLIDER
"Super Collider", Mustaine et al. (2013, Universal Records)

EXPERIMENT DATA CHALLENGE
- 100 million channels, bunch crossing every 25 ns
- 1 PB/s internal data rate
- 5 PB data / year recorded (plus derived data sets)
- 100 PB / year by 2025 (x20)
- 5 million lines of code / experiment

THE WORLDWIDE LHC COMPUTING GRID
GLOBALLY DISTRIBUTED
[Figure: Worldwide LHC Computing Grid live map]
- 42 countries, 170 computing centres, 2 million jobs run each day

LHC EXPERIMENT SOFTWARE STACKS
KEY FIGURES
- Hundreds of developers
- ~10^8 binaries
- ~1 TB / day of nightly builds
- ~100 000 machines world-wide
- Daily production releases, which remain available

THE CERNVM FILE SYSTEM

A FILE SYSTEM APPROACH TO DISTRIBUTING SOFTWARE
[Diagram: basic system utilities and OS kernel; CernVM-FS (FUSE); file system memory buffer (~100 MB); CernVM-FS persistent cache (~20 GB); global HTTP cache hierarchy; repository (HTTP or S3), ~1-10 TB; ~100 000 clients]
- FUSE based, independent mount points, e.g. /cvmfs/atlas.cern.ch
- Clients have a read-only view; single writer into the repository
- HTTP transport, access and caching on demand

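The on-demand read path through the cache layers named on this slide can be sketched roughly as below. This is a minimal Python illustration, not the CernVM-FS client code: the class `TieredCache`, the `fetch_remote` callable and the use of plain dictionaries for the memory buffer, the persistent cache and the repository are all assumptions made for the example.

```python
class TieredCache:
    """Toy read path: memory buffer -> persistent cache -> remote repository."""

    def __init__(self, fetch_remote):
        self.memory = {}                  # stands in for the in-memory buffer
        self.disk = {}                    # stands in for the persistent local cache
        self.fetch_remote = fetch_remote  # hypothetical callable doing the HTTP GET
        self.remote_fetches = 0

    def read(self, key):
        # Fast path: the object is already in memory.
        if key in self.memory:
            return self.memory[key]
        # Second tier: the local persistent cache.
        if key in self.disk:
            data = self.disk[key]
        else:
            # Last resort: download on demand and keep a local copy.
            data = self.fetch_remote(key)
            self.remote_fetches += 1
            self.disk[key] = data
        self.memory[key] = data
        return data


# A dictionary plays the role of the remote repository in this sketch.
repository = {"abc123": b"libfoo.so contents"}
cache = TieredCache(repository.__getitem__)
first = cache.read("abc123")   # goes to the "repository"
second = cache.read("abc123")  # served from the memory tier
```

Only the first read reaches the repository; later reads of the same object are served locally, which is the behaviour the "caching on demand" bullet describes.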
MAIN COMPONENTS
- Client: FUSE module (with cache plugins)
- Server tools (command line tools)
- Standard HTTP server
- HTTP caches

DESIGN
- Data store: immutable content-addressed blobs (*)
  - Compression, deduplication
- Metadata:
  - Catalogs: the state of the entire repository at a given moment in time is encoded in a Merkle tree
  - Digitally signed manifest
  - Versioning, snapshots etc.
- PULL based!

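To make the design bullets concrete, here is a small Python sketch of a content-addressed store with compression and deduplication, plus catalogs that reference their children by hash (a Merkle tree). It is an illustration under assumptions, not the CernVM-FS on-disk format: `ContentStore`, `put_catalog`, the text listing format, SHA-256 and hashing the uncompressed bytes are all choices made for this example.

```python
import hashlib
import zlib


class ContentStore:
    """Toy content-addressed store: the key of a blob is the hash of its bytes."""

    def __init__(self):
        self.blobs = {}  # hex digest -> compressed bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        # The same content always gives the same key, so a repeated put stores
        # nothing new (deduplication) and is safe to retry (idempotent).
        if key not in self.blobs:
            self.blobs[key] = zlib.compress(data)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.blobs[key])


def put_catalog(store: ContentStore, entries: dict) -> str:
    """Store a directory listing (name -> hash of a file or of a sub-catalog)."""
    listing = "\n".join(f"{name} {digest}" for name, digest in sorted(entries.items()))
    return store.put(listing.encode())


store = ContentStore()
file_v1 = store.put(b"print('hello')\n")
duplicate = store.put(b"print('hello')\n")  # same key, nothing new stored
root_v1 = put_catalog(store, {"lcg": put_catalog(store, {"hello.py": file_v1})})

# Changing one file changes every catalog hash on the path up to the root.
file_v2 = store.put(b"print('hello, world')\n")
root_v2 = put_catalog(store, {"lcg": put_catalog(store, {"hello.py": file_v2})})
```

Because the root hash summarises the whole tree, a signed manifest only needs to record that one hash to pin a full snapshot of the repository.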
CVMFS PUBLICATION WORKFLOW

PUBLISHING
- Single writer (stateless command line utilities)
- A read/write view is constructed with a union mount (OverlayFS, Aufs)
- Files are compressed and hashed, and written to repository storage
- New metadata catalogs are created and published
- Repository manifest is updated (atomic operation)

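The last publishing step, the atomic manifest update, can be illustrated with the usual write-then-rename idiom. The sketch below is in Python and assumes a local POSIX file system; the file name `manifest.json`, the JSON fields and the helper names are hypothetical and do not reflect the real CernVM-FS manifest, which is a signed file with its own format.

```python
import json
import os
import tempfile


def publish_manifest(repo_dir, root_catalog_hash, revision):
    """Replace the manifest in one step, so readers see the old or the new one."""
    manifest = {"root_catalog": root_catalog_hash, "revision": revision}
    # Write the new manifest to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=repo_dir, prefix=".manifest.")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(manifest, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace renames atomically when source and target share a file system.
        os.replace(tmp_path, os.path.join(repo_dir, "manifest.json"))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_manifest(repo_dir):
    with open(os.path.join(repo_dir, "manifest.json")) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as demo_dir:
    publish_manifest(demo_dir, "aaa", 1)
    publish_manifest(demo_dir, "bbb", 2)
    latest = read_manifest(demo_dir)
    leftovers = os.listdir(demo_dir)
```

All objects and catalogs can be uploaded before this step; only the final swap has to be serialised, since the new revision becomes visible to clients at that point.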
PUBLISHING TO CVMFS REPOSITORIES
EXISTING WORKFLOW
- Centralised release manager machine
- Direct interaction with the release manager:

    $ ssh my-cvmfs-server.cern.ch
    $ cvmfs_server transaction
    $ vim /cvmfs/my-cvmfs-server.cern.ch/some_file.org
    (Make changes to files in the R/W mount)
    $ cvmfs_server publish

PROS:
- Straightforward to use
- Good for scripting
- Somewhat hides the distributed nature of the system
CONS:
- No support for concurrent writing
- Can be unsafe (shell access to the machine with repository storage)
- Performance issues for large change-sets

PROPERTIES AND CONSTRAINTS
1. The system (repository + cache + clients) is eventually consistent
2. Concurrency can be further exploited due to:
   - Immutability of the content-addressed storage (CAS)
   - Pushing objects is idempotent
   - Directory tree structure
3. The critical section involves updating the metadata catalog and swapping the manifest

EXISTING ARCHITECTURE
[Diagram: user machine; release manager and gateway (CVMFS FUSE, CVMFS server); authoritative storage; Strata 1; connections labelled HTTP, SSH, and NFS/S3]

AN IMPROVED ARCHITECTURE
[Diagram labels: user machine, release manager and gateway (CVMFS FUSE, CVMFS server), shown twice; CVMFS services; storage gateway; replicas; authoritative storage; Strata 1]

AN IMPROVED WORKFLOW

    $ ssh my-cvmfs-1.cern.ch
    $ cvmfs_server transaction /lcg/58
    (Make changes to files in the R/W mount)
    $ vim /cvmfs/my-cvmfs.cern.ch/lcg/58/some_file.org
    $ cvmfs_server publish

    $ ssh my-cvmfs-2.cern.ch
    $ cvmfs_server transaction /lcg/60
    (Make changes to files in the R/W mount)
    $ vim /cvmfs/my-cvmfs.cern.ch/lcg/60/some_file.org
    $ cvmfs_server publish

CVMFS SERVICE ARCHITECTURE
CVMFS STORAGE GATEWAY
- Serves as a distributed lock manager
- Checks rights of clients to modify repositories
- Assigns exclusive leases to clients on repository subpaths
- Receives files (object packs) from clients, writes them to authoritative storage

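The idea of exclusive leases on repository subpaths can be sketched as follows: two leases conflict when one path equals or contains the other, so disjoint subtrees such as /lcg/58 and /lcg/60 can be edited at the same time. This is a Python illustration only, not the gateway's implementation. `LeaseManager`, its method names and the lease expiry time are assumptions, and the sketch is not thread-safe (it assumes calls are already serialised).

```python
import time


def _parts(path):
    """Split '/lcg/58/' into ('lcg', '58'), ignoring empty components."""
    return tuple(p for p in path.split("/") if p)


def _overlaps(a, b):
    """True if one path equals the other or is an ancestor of it."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]


class LeaseManager:
    def __init__(self, max_lease_seconds=60.0, clock=time.monotonic):
        self._max = max_lease_seconds  # expiry is an assumption of this sketch
        self._clock = clock
        self._leases = {}              # path parts -> (holder, expiry time)

    def acquire(self, holder, path):
        now = self._clock()
        wanted = _parts(path)
        # Forget leases that have expired.
        self._leases = {p: l for p, l in self._leases.items() if l[1] > now}
        # Refuse if the wanted subtree overlaps any live lease.
        if any(_overlaps(wanted, p) for p in self._leases):
            return False
        self._leases[wanted] = (holder, now + self._max)
        return True

    def release(self, holder, path):
        key = _parts(path)
        lease = self._leases.get(key)
        if lease is None or lease[0] != holder:
            return False
        del self._leases[key]
        return True


manager = LeaseManager()
got_58 = manager.acquire("publisher-1", "/lcg/58")          # granted
got_60 = manager.acquire("publisher-2", "/lcg/60")          # granted, disjoint subtree
got_nested = manager.acquire("publisher-3", "/lcg/58/sub")  # refused, inside /lcg/58
```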
CVMFS SERVICES IMPLEMENTATION
ERLANG/OTP: DISTRIBUTED GLUE
- Language (Erlang) and framework (OTP) designed for concurrent and distributed applications:
  - Actor model: lightweight processes with memory isolation
  - Immutability of values
  - Supervision trees
- Erlang/OTP/BEAM are battle-tested: 30+ years of use at Ericsson
- Excellent C/C++ interoperability

GATEWAY APPLICATION ARCHITECTURE
[Diagram: HTTP front-end (Cowboy); back-end (multiplexer); auth; lease; receiver (worker pool) with three C++ workers; persist (Mnesia)]

DEVELOPER EXPERIENCE WITH ERLANG/OTP
Great:
- OTP
- Tracing, inspection etc.
- Immutability, functional language
- Very simple to write concurrent programs
- Use Dialyzer, CommonTest, QuickCheck etc.
- Easy integration with C++
Less great:
- Dynamic typing is strange, coming from C++
- Deciphering Erlang errors is an acquired taste (use Lager for logging)
- Large APIs in OTP, some parts feel less clearly documented
Overall:
- Overall impression is very positive!
- Would definitely use it for other new components
- Looking forward to more operational experience

OTHER CERNVM-FS PROJECTS AND ACTIVITIES
DOCKER GRAPHDRIVER PLUGIN
- Docker Graphdriver plugin for CernVM-FS (Nikola Hardi): https://github.com/cvmfs/docker-graphdriver
- Store the contents of Docker image layers inside CernVM-FS repositories
- Instead of downloading entire layers, mount a CernVM-FS repository and download individual files on demand

CERNVM 10TH ANNIVERSARY!
- Next year, CernVM is turning 10
- Jan 30th to Feb 1st 2018: CernVM workshop @CERN
- Open to anyone
- Talks by users and developers of CernVM and related projects

THE CERNVM TEAM (LEFT TO RIGHT)
- Radu Popescu
- Jakob Blomer
- Gerardo Ganis
- Petr Jirout (former)
- Nikola Hardi (former)

THANK YOU
- CernVM-FS: https://github.com/cvmfs, http://cvmfs.readthedocs.io/en/stable/
- radu.popescu@cern.ch, https://github.com/radupopescu, @iradupopescu

ERLANG/OTP CONCURRENCY PATTERNS
CRITICAL SECTIONS
- Erlang provides only processes and message passing for concurrency
- No locks, semaphores, condition variables etc.
- What if exclusive access to a resource is needed?
- An OTP gen_server works as a critical section: it handles one message at a time

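The pattern on this slide, a single process that owns a resource and serves requests one at a time, can be imitated outside Erlang. The sketch below is a Python analogue built from a thread and a queue; it is not Erlang and not a gen_server, and the `Server` class with its `call` and `stop` methods is a hypothetical construction for illustration.

```python
import queue
import threading


class Server:
    """Owns `state`; only the server thread ever reads or writes it."""

    def __init__(self, state):
        self._state = state
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            func, reply_box = self._mailbox.get()
            if func is None:  # stop message
                break
            try:
                # One request at a time: this body is the critical section.
                self._state, result = func(self._state)
                reply_box.put((True, result))
            except Exception as exc:
                reply_box.put((False, exc))

    def call(self, func):
        """Send a request and block until the server replies."""
        reply_box = queue.Queue(maxsize=1)
        self._mailbox.put((func, reply_box))
        ok, value = reply_box.get()
        if not ok:
            raise value
        return value

    def stop(self):
        self._mailbox.put((None, None))
        self._thread.join()


def increment(count):
    # Returns (new state, reply), loosely like a handle_call result.
    return count + 1, count + 1


def client(server, n):
    for _ in range(n):
        server.call(increment)


server = Server(0)
threads = [threading.Thread(target=client, args=(server, 250)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = server.call(lambda count: (count, count))
server.stop()
```

No lock appears in the client code: mutual exclusion follows from the fact that every update goes through the one mailbox and is applied by the one owner.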
MULTIPLEXING REQUESTS/REPLIES ON GEN_SERVER
- OTP gen_server with concurrency?
- In gen_server:handle_call, spawn a process per request, and return {noreply, }
- The spawned process later returns a value with gen_server:reply
- Does not maintain the order of requests
- Concurrency adaptor between Cowboy and the C++ worker pool

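A rough Python analogue of this multiplexing pattern is sketched below: the dispatcher thread never does the work itself, it hands each request to a worker and goes straight back to its mailbox, and the worker delivers the reply to the waiting caller later. This is not Erlang and not the gateway code; `Multiplexer`, the thread pool standing in for spawned processes and the `slow_echo` handler are assumptions made for the example.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Multiplexer:
    def __init__(self, handler, max_workers=4):
        self._handler = handler
        self._mailbox = queue.Queue()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            item = self._mailbox.get()
            if item is None:  # stop message
                break
            request, reply_box = item
            # Analogue of spawning a process and returning without a reply:
            # hand the request off and go back to the mailbox at once.
            self._pool.submit(self._work, request, reply_box)

    def _work(self, request, reply_box):
        # Analogue of the spawned process replying to the caller later.
        try:
            reply_box.put((True, self._handler(request)))
        except Exception as exc:
            reply_box.put((False, exc))

    def call(self, request):
        reply_box = queue.Queue(maxsize=1)
        self._mailbox.put((request, reply_box))
        ok, value = reply_box.get()
        if not ok:
            raise value
        return value

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()
        self._pool.shutdown(wait=True)


def slow_echo(delay):
    time.sleep(delay)
    return delay


completed = []  # replies in the order in which they arrived
mux = Multiplexer(slow_echo)


def client(delay):
    completed.append(mux.call(delay))


slow = threading.Thread(target=client, args=(0.3,))
fast = threading.Thread(target=client, args=(0.01,))
slow.start()
time.sleep(0.05)  # make sure the slow request is sent first
fast.start()
slow.join()
fast.join()
mux.stop()
```

The fast request is sent second but answered first, which matches the slide's remark that this approach does not maintain the order of requests.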