CONCURRENT DISTRIBUTED TASK SYSTEM IN PYTHON. Created by Moritz Wundke


INTRODUCTION Concurrent aims to be a different type of task distribution system compared to what MPI-like systems do. It adds a simple but powerful application abstraction layer that distributes the logic of an entire application onto a swarm of clusters, holding similarities with volunteer computing systems.

INTRODUCTION CONT'D Traditional distributed task systems just push simple tasks onto the distributed system and wait for results. Concurrent goes one step further by letting the tasks and the application decide what to do. The programming paradigm is therefore totally asynchronous, without any waiting for results, and based on notifications once a computation has been performed.

WHY? Flexible Stable Rapid development Python

WHY? - FLEXIBLE Keystone for every framework Must be reusable in many types of projects Easy to integrate into existing code

WHY? - STABLE Fault tolerance Consistency We sacrifice availability

WHY? - RAPID DEVELOPMENT Fast prototyping Python is fast to write, very fast! Deliver soon and often principle

WHY? - PYTHON Huge community Used in scientific computation Imperative and functional Very organized Cross-platform Object serialization through pickle, thus dangerous if not used properly! Drawback: pure Python performance

THE DARK SIDE Pure Python performance: Cython, native modules. No real multithreading: the GIL issue; releasing the GIL manually; multiprocessing (fork).

THE DARK SIDE - PERF Python is slow, that's a fact. But we can boost it using native code. Cython: static C compiler combining Python flexibility with C performance. Native C modules: create Python modules directly in C.
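
Concurrent's own native path is Cython, but as a generic illustration of how Python can hand work to C, here is a minimal ctypes sketch (POSIX-style library lookup; not part of concurrent's API):

import ctypes
import ctypes.util

# Load the C math library and call into it directly: no Python
# bytecode runs inside sqrt. Library lookup here is POSIX-style.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(2.0))  # 1.4142135623730951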

THE DARK SIDE - GIL Global Interpreter Lock: only one thread executes Python bytecode or accesses a Python object at a time per process. We can release the GIL using native code such as Cython or directly in a native module. We can also use processes instead of threads, at the cost of needing IPC mechanisms. Shared vs. distributed memory / threads vs. processes.
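
A minimal sketch of the processes-instead-of-threads route using the standard multiprocessing package (not concurrent's own API): each worker below is a separate process with its own interpreter and its own GIL, so CPU-bound work truly runs in parallel.

from multiprocessing import Pool

def cpu_bound(n):
    # Pure-Python CPU work; threads would be serialized by the GIL here.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(pool.map(cpu_bound, [10**6] * 4))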

CONCURRENT Distributed task execution framework that tries to solve the GIL issue. Does not use threads for task execution; tasks run in processes instead. Features different ways to implement IPC calls. All nodes in the system communicate through RPC calls, using either HTTP or low-level TCP. Integrates Cython for performance tweaking.

OTHER FRAMEWORKS Dispy: fork-based system, not application- or cloud-oriented like concurrent; problems with TCP congestion. ParallelPython: thread-based system, not application- or cloud-oriented like concurrent; problems with TCP congestion. Superpy: similar to concurrent but does not feature a high-performance transport layer; Windows only. More libraries

ARCHITECTURE Concurrent is built upon a flexible, pluggable component framework. Most of the framework can be swapped out in a few steps and tweaked by installing plug-ins. Applications themselves are plug-ins that are loaded into the environment and executed.

COMPONENTS Components are singleton instances per ComponentManager. They implement the functionality of a given Interface and talk to each other through an ExtensionPoint.

COMPONENTS CONT'D Example setup of Components linked together with ExtensionPoints.
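
A self-contained sketch of what such a setup could look like. The class names mirror the slides (Component, ComponentManager, Interface, ExtensionPoint), but the API shape is assumed for illustration and may differ from concurrent's actual implementation.

class Interface:
    """Marker base class for interfaces."""

class ComponentMeta(type):
    registry = {}  # Interface subclass -> implementing Component classes

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        for iface in namespace.get("implements", ()):
            mcs.registry.setdefault(iface, []).append(cls)
        return cls

class Component(metaclass=ComponentMeta):
    implements = ()

class ComponentManager:
    """Holds one singleton instance of each Component class."""
    def __init__(self):
        self._instances = {}

    def __getitem__(self, cls):
        if cls not in self._instances:
            obj = cls.__new__(cls)
            obj.manager = self  # give the instance access to its manager
            obj.__init__()
            self._instances[cls] = obj
        return self._instances[cls]

class ExtensionPoint:
    """Attribute resolving to every Component implementing an Interface."""
    def __init__(self, interface):
        self.interface = interface

    def __get__(self, obj, owner):
        classes = ComponentMeta.registry.get(self.interface, [])
        return [obj.manager[c] for c in classes]

# Usage: a master component discovering all registered schedulers.
class ITaskScheduler(Interface):
    pass

class GenericTaskScheduler(Component):
    implements = (ITaskScheduler,)
    def schedule(self, task):
        print("scheduling", task)

class MasterNode(Component):
    schedulers = ExtensionPoint(ITaskScheduler)

manager = ComponentManager()
for scheduler in manager[MasterNode].schedulers:
    scheduler.schedule("task-1")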

COMPONENTS CONT'D The configuration system allows us to configure ExtensionPoints via config files.
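
A hedged illustration of driving an ExtensionPoint binding from a config file using the standard library configparser; the section and option names are made up and not concurrent's actual format.

import configparser

cfg = configparser.ConfigParser()
cfg.read_string("""
[components]
task_scheduler = GenericTaskScheduler
""")
# A ComponentManager could use this value to pick which implementation
# gets bound to an ExtensionPoint.
print(cfg["components"]["task_scheduler"])  # GenericTaskScheduler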

NODES Our distributed system is based on a 3-node setup, though more classes are involved for flexibility. MasterNode: our main master (or a layer within a multi-master setup); distributes the workload and maintains the distributed system. SlaveNode: a slave node is connected to a master (or a set of masters in a multi-master setup); executes the workload a master sends to it or requests work from the master. ApplicationNode: an application using the framework and sending work to it; usually connected to a single master (or multiple masters in a multi-master setup).

NODES CONT'D [Figure: master, slave, and application node topology]

TASK SCHEDULING Concurrent comes with two task scheduling policies, one optimized for heterogeneous systems and another for homogeneous systems. Generic: for heterogeneous systems; sends work to the best slave for the given work; comes with slightly more overhead. ZMQ: ZMQ push/pull scheduling, for homogeneous systems; slaves request tasks from a global socket; less overhead, but prone to stalls if the hardware is not the same on all slaves.

TASK SCHEDULING CONT'D [Figure] Generic scheduling execution flow: a GenericTaskScheduler uses a GenericTaskScheduleStrategy to send work to the GenericTaskManager of the target slave.

TASK SCHEDULING CONT'D [Figure] ZMQ push/pull scheduling execution flow: a ZMQTaskScheduler pushes work onto a global work socket. The ZMQTaskManagers of the slaves pull that work from it and perform the processing. The result is then pushed back to the master node.
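
A minimal pyzmq sketch of this PUSH/PULL flow, with the master and slave ends collapsed into one process for brevity; the addresses and message format are illustrative, not concurrent's actual wire protocol.

import zmq

ctx = zmq.Context()

# Master side: pushes work onto a global socket, pulls results back.
work_out = ctx.socket(zmq.PUSH)
work_out.bind("tcp://127.0.0.1:5557")
results_in = ctx.socket(zmq.PULL)
results_in.bind("tcp://127.0.0.1:5558")

# Slave side: pulls work, pushes results (normally a separate process
# or machine, with many slaves sharing the same work socket).
work_in = ctx.socket(zmq.PULL)
work_in.connect("tcp://127.0.0.1:5557")
results_out = ctx.socket(zmq.PUSH)
results_out.connect("tcp://127.0.0.1:5558")

work_out.send_json({"task": "square", "n": 7})
task = work_in.recv_json()
results_out.send_json({"result": task["n"] ** 2})
print(results_in.recv_json())  # {'result': 49}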

TRANSPORT Concurrent comes with a complex transport module that features TCP and ZMQ socket clients, servers, and proxies. TCPServer: multithreaded TCP socket server. TCPServerZMQ: multithreaded ZMQ server using a limited number of pooled workers. TCPClient: TCP client used to establish a connection to a TCPSocket. Proxies: proxies are used to implement RPC-like calling mechanisms between servers and clients. TCPHandler: container handling registration of RPC methods.

TRANSPORT CONT'D [Figure: transport module overview]

TRANSPORT CONT'D Registering RPC methods is straightforward in concurrent: just register a method with the given server or client instance.

@jsonremote(self.api_service_v1)
def register_slave(request, node_id, port, data):
    self.stats.add_avg('register_slave')
    return self.register_node(node_id, web.ctx['ip'], port, data, NodeType.slave)

@tcpremote(self.zmq_server, name='register_slave')
def register_slave_tcp(handler, request, node_id):
    self.stats.add_avg('register_slave_tcp')
    return self.register_node_tcp(handler, request, node_id, NodeType.slave)

MAIN FEATURES No-GIL: no GIL on our tasks. Balancing: tasks are distributed using load balancing. Nice to TCP: internal buffering to avoid TCP congestion. Deployment: easy to deploy applications onto concurrent. Fast development: easy application framework to build concurrent applications that work on a high number of machines in minutes. Batching: task batching for simple task schemes. ITaskSystem approach: autonomous systems control tasks; easy to implement concurrency in an organized fashion. Pluggable: all components are pluggable, flexible in development, and favor adding new features. API: RESTful JSON API and TCP/ZMQ API in the same fashion; from the programmer's point of view, calling a high-performance TCP method is the same as calling a web service.

FUTURE OF CONCURRENT GPU processing: enable GPU task processing. Optimize network congestion: enable data synchronization and optimize locality of required data. Sandboxing: include a sandboxing feature so that tasks from different applications do not collide. Security: add certificates and encryption layers on the ZMQ compute channel. Statistics and monitoring: include statistics and real-time task monitoring in the web interfaces of each node. Async I/O: optimize servers to use async I/O for optimal task distribution. Multi-master: implement a multi-master environment using a DHT and an NCS (Network Coordinate System).

SAMPLES Concurrent comes with a set of samples that outline the power of the framework. They also show how an application for concurrent should be created and what has to be avoided. Mandelbrot Benchmark MD5 hash reverse DNA curve analysis

MANDELBROT SAMPLE Sample implemented using plain tasks and an ITaskSystem. Comes in two flavors of tasks: an optimized and a non-optimized version on the data-usage side.
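
As a framework-free illustration of the per-task work, here is a sketch that renders one row of the Mandelbrot set per call; the actual task granularity and API of the sample are assumptions here.

def mandelbrot_row(y, width=80, height=40, max_iter=100):
    # One task's worth of work: iteration counts for a single image row.
    row = []
    for x in range(width):
        c = complex(-2.0 + 2.5 * x / width, -1.25 + 2.5 * y / height)
        z = 0j
        i = 0
        while abs(z) <= 2 and i < max_iter:
            z = z * z + c
            i += 1
        row.append(i)
    return row  # rows from all tasks are assembled into the final image

print(mandelbrot_row(20))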

MANDELBROT SAMPLE CONT'D Execution time comparison between an ITaskSystem and a plain-task implementation using the non-optimized tasks.

MANDELBROT SAMPLE CONT'D Execution time comparison between an ITaskSystem and a plain-task implementation using the optimized tasks. Both variants gain, but the plain tasks see the biggest boost.

BENCHMARK SAMPLE Sample using plain tasks and an ITaskSystem. A simple active-wait task is used to compare the framework and its setup against Amdahl's law.

BENCHMARK SAMPLE CONT'D Execution time comparison between an ITaskSystem and a plain-task implementation. Both are nearly identical.

BENCHMARK SAMPLE CONT'D CPU usage of the sample indicating that we are using all available CPU power for the benchmark.

MD5 HASH REVERSE SAMPLE Very simple sample with low data transmission. Uses brute force to reverse a numerical MD5 hash, knowing the range of numbers. The concurrent implementation is about 4 times faster than the sequential version.
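
A minimal sequential sketch of the brute-force idea using only the standard hashlib; in the concurrent version the search range would be split into tasks distributed across the slaves.

import hashlib

def reverse_md5(target_hex, limit):
    # Try every number in [0, limit) until its MD5 digest matches.
    for n in range(limit):
        if hashlib.md5(str(n).encode()).hexdigest() == target_hex:
            return n
    return None

target = hashlib.md5(b"123456").hexdigest()
print(reverse_md5(target, 10**6 + 1))  # 123456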

DNA CURVE ANALYSIS SAMPLE The DNA sample is simply a way to try to overload the system with over 10 thousand separate tasks. Each task requires a considerable amount of data, so sending it all at once has its drawbacks. The tests and benchmarks have been performed on a single machine, so we set up the worst possible distributed network, resulting in network overloads. Thus the sample is not intended for real use; it outlines the stability of the framework.

DNA CURVE ANALYSIS SAMPLE CONT'D Compared to the other samples, the amount of data sent through the network is considerable; the sample reaches a high memory load of up to 3 GB on the master. Sending huge amounts of data is the real bottleneck of any distributed task framework: we spend more time sending data than performing the actual computation. In some cases, sending fewer tasks with more data each is better than sending thousands of small tasks.

CONCLUSION Complexity: building a distributed task system is an extremely complex endeavor. Python: Python has proven to be a perfect choice for its flexibility, ease of development, and speed when using native modules. ITaskSystem vs plain tasks: depending on the problem, each approach has its place, while in most cases we experienced better results using the ITaskSystem approach. Fair network usage: fair network usage that avoids congestion is vital to maintain a reliable system; sending too much data over the same socket will result in packet loss, and so in turn in resending of the data. Data locality: sending data, especially above the MTU (maximum transmission unit) size, has a great impact on overall performance. Sending less data or performing local reads boosts processing considerably.

CONCLUSION CONT'D GIL: threading in Python is a no-go; the GIL impacts heavily on overall performance. Our master server implementations (TCPServer and TCPServerZMQ) are currently the bottleneck, though this can be addressed using async I/O and processes instead of threads. Sequential vs parallel: we achieved a very good parallelization proportion for our samples; the benchmark sample achieved a speedup of about 740%, and the Mandelbrot 175% and 240% depending on the implementation. See the next slide for details.

CONCLUSION CONT'D A central law when creating any concurrent computing system is Amdahl's law. Analyzing the performance of our samples, especially the Mandelbrot and the benchmark sample, gives a fairly good speedup compared to the maximum speedup that could be achieved according to the law. [Figure: Amdahl's law speedup curves, image courtesy of Wikipedia]

Mandelbrot: 2.4 = 1/((1-P) + (P/N)) with P = 0.67 and N = 8
Benchmark: 7.02 = 1/((1-P) + (P/N)) with P = 0.98 and N = 8
Max speedup: 8 = 1/((1-P) + (P/N)) with P = 1 and N = 8
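
These figures can be verified with a few lines of Python:

# Amdahl's law: speedup = 1 / ((1 - P) + P / N) for parallel fraction P
# and N processors.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(round(amdahl_speedup(0.67, 8), 2))  # 2.42 -> the ~2.4 Mandelbrot figure
print(round(amdahl_speedup(0.98, 8), 2))  # 7.02 -> the benchmark figure
print(amdahl_speedup(1.0, 8))             # 8.0  -> theoretical maximum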

STELLAR LINKS Source code on GitHub Project page API Documentation Video presentation

THE END BY MORITZ WUNDKE / MORITZWUNDKE.COM Built with reveal.js