Teraflops of Jupyter: A Notebook Based Analysis Portal at BNL


Teraflops of Jupyter: A Notebook Based Analysis Portal at BNL. Ofer Rind, Spring HEPiX, Madison, WI, May 17, 2018. In collaboration with: Doug Benjamin, Costin Caramarcu, Zhihua Dong, Will Strecker-Kellogg, Thomas Throwe

BNL SDCC: Serves an increasingly diverse, multi-disciplinary user community: RHIC Tier-0, US ATLAS Tier-1 and Tier-3, Belle II Tier-1, neutrino, astro, LQCD, CFN, and more. Large HTC infrastructure accessed via HTCondor (plus experiment-specific job management layers). Growing HPC infrastructure, currently with two production clusters accessed via Slurm. Limited interactive resources accessed via ssh gateways.

Interactive Data Analysis: A wish list for running effective, interactive data analysis in an era of large-scale computing with complex software stacks: Lower the barrier to entry for using data analysis resources at BNL. Minimize or eliminate software setup and installation. Provide flexible, easy-to-follow examples and tutorials. Offer a simple way to document and share results and code (reproducibility, adaptability). Provide a straightforward way to use software methods and ecosystems being developed in non-HEP communities (e.g. machine learning), and make our resources more easily available to non-HEP communities.

Data Analysis as a Service: Jupyter notebooks (IPython) provide a flexible, standardized, platform-independent interface through a web browser. No local software to install; many language extensions (kernels) and tools available; easy to share, reproduce, and document results, and to create tutorials. From the facility point of view: can we implement this by leveraging existing resources? We would prefer to avoid building new dedicated infrastructure, such as a specialized cluster (cf. CERN SWAN).

Some terminology. Jupyter notebook: a web-based application suitable for capturing the whole computational process: developing, documenting, and executing code, as well as communicating the results. JupyterLab: the next-generation web-based user interface.

Some terminology. JupyterHub: a multi-user hub that spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server.

Current Test Setup at BNL: JupyterHub servers deployed on RHEV with an Anaconda3 install, with varying environments/networks depending on function (HTC/HTCondor or HPC/Slurm). Access via ssh tunnel through the firewall to the JupyterHub HTTPS proxy. Kerberos authentication to the JupyterHub server; the transparent setup leverages the PAM stack, and an OAuth implementation is in progress. Jira, Confluence, etc. for documentation. A temporary two-node Slurm reservation on the Institutional Cluster (Broadwell/NVIDIA) is available for testing. How do we connect users to the batch resources?

One approach: slurm_magic. Execute the usual CLI batch commands through the notebook interface (https://github.com/nersc/slurm-magic); easily adapted for HTCondor as well. But this is not entirely satisfying: one could just open a terminal and do the same thing. A rough sketch of the notebook usage is shown below.
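In a notebook, the usage pattern is roughly the following (a sketch, assuming the slurm-magic extension is installed in the kernel's Python environment; per its documentation it wraps the Slurm CLI commands and returns their output as pandas DataFrames):

# Cell 1: load the extension
%load_ext slurm_magic

# Cell 2: line magics wrap the familiar CLI commands
%squeue

# Cell 3: the %%sbatch cell magic submits the cell body as a job script
%%sbatch
#!/bin/bash
#SBATCH --partition=long
#SBATCH --time=10:00
srun hostname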

More useful approach: HTCondor API. Provide access to distributed computing through familiar APIs (Python's threading, multiprocessing, asyncio, etc.): "I'd like to submit and manage a job or a cluster of jobs."
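A minimal sketch of submitting and monitoring a job from a notebook cell with the HTCondor Python bindings (the job description below is a placeholder, not the BNL configuration):

import htcondor

schedd = htcondor.Schedd()                      # the local schedd
sub = htcondor.Submit({                         # submit-file-style job description
    "executable": "/bin/sleep",
    "arguments": "60",
    "request_memory": "1GB",
})
with schedd.transaction() as txn:               # queue one job, remember its cluster id
    cluster_id = sub.queue(txn)

# Poll the queue for the job's status
for ad in schedd.xquery("ClusterId == {}".format(cluster_id), ["ProcId", "JobStatus"]):
    print(ad.get("ProcId"), ad.get("JobStatus"))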

More useful approach: HTCondor API (continued). At a higher level, abstract away the batch job layer entirely: "I'd like to run over a dataset." Serialize the function, ship it off to jobs, serialize the output, and gather the results. This is at an early stage of development; see Will Strecker-Kellogg for details.
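To illustrate the general pattern only (this is not the implementation being developed), the idea resembles a futures-style map in which the user's function and dataset chunks are serialized, executed remotely, and the serialized outputs are gathered back into the notebook:

import pickle

def analyze(chunk):
    # Placeholder user analysis function operating on one chunk of a dataset
    return sum(chunk)

dataset_chunks = [[1, 2, 3], [4, 5, 6]]          # stand-in for a real dataset split

# "Ship" each task: in a real system the pickled payload would travel to a batch job
payloads = [pickle.dumps((analyze, chunk)) for chunk in dataset_chunks]

# What each worker job would do on the other end
outputs = []
for payload in payloads:
    func, chunk = pickle.loads(payload)
    outputs.append(pickle.dumps(func(chunk)))    # serialize the output for the return trip

# Gather and deserialize the outputs back in the notebook
print([pickle.loads(o) for o in outputs])        # -> [6, 15]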

Yet another approach: batchspawner. Spawn the notebook server itself within a single-node batch job. The notebook can be spawned locally or onto the batch system, with the connection established back to the hub and to the browser through the HTTP proxy. batchspawner.py has hooks for Slurm, HTCondor, Torque, etc. (https://github.com/jupyterhub/batchspawner). There is also wrapspawner.py, which allows selection among multiple profile setups at startup (https://github.com/jupyterhub/wrapspawner). An Anaconda installation for batch users lives on a shared GPFS volume.

jupyterhub_config.py examples: Slurm, and HTCondor + Profiles

#------------------------------------------------------------------------------
# BatchSpawner(Spawner) configuration
# Using Slurm to spawn user to IC
#------------------------------------------------------------------------------
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'

#------------------------------------------------------------------------------
# BatchSpawnerBase configuration
# These simply set parameters used in the job script template below
#------------------------------------------------------------------------------
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_partition = 'long'
c.BatchSpawnerBase.req_runtime = '120:00'
c.BatchSpawnerBase.req_account = 'pq302951'

c.SlurmSpawner.batch_script = '''#!/bin/sh
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --workdir={homedir}
#SBATCH --export={keepvars}
#SBATCH --get-user-env=l
#SBATCH --account={account}
#SBATCH --reservation=racf_32
#SBATCH {options}
{cmd}
'''

#------------------------------------------------------------------------------
# ProfilesSpawner configuration
#------------------------------------------------------------------------------
# List of profiles to offer for selection. Signature is:
#   List(Tuple(Unicode, Unicode, Type(Spawner), Dict))
# corresponding to profile display name, unique key, Spawner class,
# dictionary of spawner config options.
#
# The first three values will be exposed in the input_template as {display},
# {key}, and {type}
#
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.ProfilesSpawner.profiles = [
    ("Local server", 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip': '0.0.0.0'}),
    ('Condor Shared Queue', 'CondorDefault', 'batchspawner.CondorSpawner',
     dict(req_nprocs='20', req_memory='18G', req_options='+job_type = "jupyter"')),
]

c.CondorSpawner.batch_script = '''
Executable = /bin/sh
RequestMemory = {memory}
RequestCpus = {nprocs}
Arguments = \"-c 'export PATH=/u0b/software/anaconda3/bin:$PATH; exec {cmd}'\"
Remote_Initialdir = {homedir}
Output = {homedir}/.jupyterhub.$(ClusterId).condor.out
Error = {homedir}/.jupyterhub.$(ClusterId).condor.err
ShouldTransferFiles = False
GetEnv = True
PeriodicRemove = (JobStatus == 1 && NumJobStarts > 1)
{options}
Queue
'''

Architecture (diagram, built up over several slides): Users authenticate (Kerberos, OAuth, SSO) to the JupyterHub server, which sits behind an HTTP proxy, with shared storage resources (GPFS, dCache, BNL Box) available throughout. Two spawning modes are shown: with a LocalProcessSpawner the notebook runs alongside the hub and reaches the batch scheduler of the compute cluster (Slurm, HTCondor) through slurm_magic or the HTCondor API (sbatch, condor_submit); with a BatchSpawner the spawner itself submits to the batch scheduler (sbatch, condor_submit) and the notebook runs on the compute cluster.

The Interface (screenshots of the notebook interface, shown over several slides)

Example: ML Applications for ATLAS (TensorFlow/Keras)

Example: ML Applications for ATLAS (TensorFlow/Keras). TensorFlow GPU device discovery output:

2018-05-08 17:00:47.493788: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
[...]
2018-05-08 17:00:47.884117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285 pciBusID: 0000:07:00.0 totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-05-08 17:00:48.180799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 1 with properties: name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285 pciBusID: 0000:81:00.0 totalMemory: 15.89GiB freeMemory: 15.60GiB
[...]
2018-05-08 17:00:48.689637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15128 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0)
2018-05-08 17:00:48.844497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15128 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:81:00.0, compute capability: 6.0)
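GPU visibility inside the notebook kernel can be checked with a snippet like this (a sketch using the TensorFlow 1.x API of the time):

from tensorflow.python.client import device_lib

# List the devices TensorFlow can see from the notebook kernel;
# on the IC GPU nodes this should include the two Tesla P100s above.
print([d.name for d in device_lib.list_local_devices()])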

Open Issues: Authentication (implementing OAuth) to eliminate the need for tunneling. Allocation of resources: interactive users won't be patient with batch system latency, so how do we handle scheduling on an oversubscribed cluster? What are appropriate time/resource limits on the notebooks? How do we handle idle notebooks taking up job slots (see the sketch below)? External connectivity requirements (especially on HPC). Management of the software environment (is Anaconda the way to go?). Who are the users? HEP inreach and outreach (e.g. notebooks are already heavily used at NSLS-II).
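One commonly used option for the idle-notebook question (not something deployed at BNL; the script path and timeout below are illustrative) is to run the cull_idle_servers.py script from the JupyterHub examples as a managed hub service that stops notebook servers after a period of inactivity:

# jupyterhub_config.py -- sketch of an idle-culling service
c.JupyterHub.services = [
    {
        'name': 'cull-idle',
        'admin': True,                                     # may stop other users' servers
        'command': ['python', '/opt/jupyterhub/cull_idle_servers.py',  # hypothetical path
                    '--timeout=3600'],                     # cull servers idle > 1 hour
    }
]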

Conclusions: We have begun to overlay a flexible, Jupyter-notebook-based analysis portal atop existing batch resources at BNL, providing the tools that users want for the new ways they work now. Technical and policy issues remain. We are looking for users from the diverse communities we serve, and for other interested admins/developers to collaborate with.

Let's discuss: Must a question have an answer? Can't there be another way? Would you like to talk about it?