Parallel Processing
Majid AlMeshari, John W. Conklin
Science Advisory Committee Meeting, September 3, 2010, Stanford University



Outline
- Challenge
- Requirements
- Resources
- Approach
- Status
- Tools for Processing

Challenge
- A computationally intensive algorithm is applied to a huge set of data.
- Verification of the filter's correctness takes a long time and hinders progress.

Requirements
- Achieve a speed-up of 5-10x over the serial version
- Minimize parallelization overhead
- Minimize time to parallelize new releases of the code
- Achieve reasonable scalability
[Chart from SAC 19: time (sec) vs. number of processors]

Resources
Hardware
- Computer cluster (Regelation) with 44 64-bit CPUs
  » 64-bit enables us to address a memory space beyond 4 GB
  » Using this cluster because: (a) we will likely not need more than 44 processors, and (b) it is the same platform as our Linux boxes
Software
- MatlabMPI
  » MATLAB parallel computing toolbox created by MIT Lincoln Laboratory
  » Eliminates the need to port our code into another language
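As an illustration of why no porting is needed, MatlabMPI exposes an MPI-style send/receive interface directly inside MATLAB. A minimal point-to-point sketch (the data and tag values are our own, not from the flight code):

```matlab
% Minimal MatlabMPI sketch: rank 0 sends a matrix to rank 1.
MPI_Init;
comm = MPI_COMM_WORLD;
rank = MPI_Comm_rank(comm);
tag  = 1;                        % arbitrary message tag

if rank == 0
    J = rand(1000, 50);          % e.g. a block of columns of the Jacobian
    MPI_Send(1, tag, comm, J);   % dest, tag, comm, variable(s)
elseif rank == 1
    J = MPI_Recv(0, tag, comm);  % source, tag, comm
end

MPI_Finalize;
```

Under the hood MatlabMPI passes each message through a file on shared storage, which is why communication cost matters so much in the techniques below.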

Approach - Serial Code Structure
Initialization
Loop over iterations
  Loop over Gyros
    Loop over Segments
      Loop over orbits
        Build h(x) (the right-hand side)
        Build Jacobian, J(x)
        (computationally intensive components - parallelize!)
      Load data (Z), perform fit
Bayesian estimation, display results

Approach - Parallel Code Structure
Initialization
Loop over iterations
  Loop over Gyros
    Loop over Segments
      Loop over orbits
        Build h(x)
        Build Jacobian, J(x)          <- in parallel, split up by columns of J(x)
      Distribute J(x) to all nodes    <- custom data distribution code required (overhead)
      Loop over partial orbits
        Load data (Z), perform fit    <- in parallel, split up by orbits
      Send information matrices to master & combine
Bayesian estimation, display results
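The column-wise split can be sketched in MATLAB as follows. This is a simplified illustration: `n_cols`, `rank`, and `build_jacobian_columns` are hypothetical placeholders, and the real custom data-distribution code is more involved.

```matlab
% Sketch: each of n_proc workers builds only its block of columns of J.
n_proc   = 16;                    % number of workers (assumed)
rank     = 3;                     % this worker's index, 0-based (assumed)
n_cols   = 205;                   % total columns of J(x) (assumed)

per_proc = ceil(n_cols / n_proc);
first    = rank * per_proc + 1;
last     = min((rank + 1) * per_proc, n_cols);

% Hypothetical helper: build only columns first..last of the Jacobian.
J_local = build_jacobian_columns(x, first:last);
% Each worker then ships its slice out in the distribution step,
% so every node holds the complete J before the fit.
```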

Code in Action
[Flow diagram: after Initialization, processors 1..n each compute h and a partial J, split by columns of J; at a sync point every processor obtains the complete J; processors 1..n then perform the fit, split by rows of J; at a second sync point the information matrices are combined, followed by Bayesian estimation; control then returns for the next Gyro/Segment or another iteration.]

Parallelization Techniques - 1
Minimize inter-processor communication
- Use MatlabMPI for one-to-one communication only
- Use a custom approach for one-to-many communication
[Chart: communication time (sec) vs. number of processors (1-16) with the custom MPI approach, broken down into Comp J, Dist J (I/O), and Est comp info (I/O)]
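Because MatlabMPI moves each message through a file on the shared file system, a point-to-point broadcast costs one write per receiver. A custom one-to-many approach in the same spirit writes the data once and lets every worker read the same file. A sketch only; the path, ready-flag convention, and polling are our simplifications:

```matlab
% One-to-many via shared storage: master writes once, all workers read.
if rank == 0
    save('/shared/bcast_J.mat', 'J');      % single write of the payload
    save('/shared/bcast_J.ready', 'tag');  % flag file: payload is complete
end

% Workers poll for the ready flag, then load the shared payload.
while ~exist('/shared/bcast_J.ready', 'file')
    pause(0.5);
end
load('/shared/bcast_J.mat', 'J');          % n reads, but only 1 write
```

Writing the flag file after the payload avoids workers reading a half-written .mat file.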

Parallelization Techniques - 2
Balance processing load among processors
- Load per processor = length(reduced_vector_k) * (computation_weight_k + disk_io_weight), for k in {RT, Tel, Cg, TFM, S0}
[Chart: time (sec) vs. number of processors (1-16), broken down into Comp J, Dist J (I/O), and Est comp info (I/O); CPUs have uniform load]
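The load formula can drive a simple greedy assignment: compute each group's load, then repeatedly give the heaviest unassigned group to the least-loaded processor. A sketch with illustrative weights (the numeric values are assumptions, not the mission's actual weights):

```matlab
% Greedy load balancing over the parameter groups k = {RT, Tel, Cg, TFM, S0}.
n_proc = 4;                        % number of processors (assumed)
len    = [120 80 40 200 60];       % length(reduced_vector_k), assumed
w_cpu  = [1.0 0.8 0.5 1.2 0.7];    % computation_weight_k, assumed
w_io   = 0.3;                      % disk_io_weight, assumed

load_k = len .* (w_cpu + w_io);    % load contributed by each group k

proc_load = zeros(1, n_proc);
owner     = zeros(1, numel(load_k));

[~, order] = sort(load_k, 'descend');  % place heaviest groups first
for k = order
    [~, p]       = min(proc_load);     % currently least-loaded processor
    owner(k)     = p;
    proc_load(p) = proc_load(p) + load_k(k);
end
```

The flatter `proc_load` ends up, the closer the chart above comes to "CPUs have uniform load."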

Parallelization Techniques - 3
Minimize memory footprint of slave code
- Tightly manage memory to prevent swapping to disk or out-of-memory errors
- We managed to save ~40% of RAM per processor and have not seen an out-of-memory error since
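In MATLAB the main levers for shrinking a worker's footprint are releasing large temporaries as soon as they are consumed, preallocating instead of growing arrays, and using single precision where the fit tolerates it. A sketch with hypothetical variable and function names:

```matlab
% Sketch of slave-side memory hygiene (illustrative names throughout).
Z = load_orbit_data(orbit_id);     % large raw data block
h = compute_h(x, Z);
clear Z                            % release raw data before the fit

% Preallocate the local Jacobian slice; single precision halves its RAM
% (only acceptable if the fit's numerics allow it).
J_local = zeros(n_rows, n_local_cols, 'single');
% ... fill and use J_local ...
clear J_local                      % drop the slice once sent to the master
```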

Parallelization Techniques - 4
Find contention points
- Cache what can be cached
- Optimize what can be optimized
- Example: "Compute h" optimization
[Chart: time (sec), old code vs. optimized code, for the Compute h optimization]
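One standard MATLAB pattern for "cache what can be cached" is a persistent variable that survives across calls, so a quantity like h is computed once per orbit and reused. A sketch; the cache key and the inner `compute_h` are hypothetical, not the actual mission code:

```matlab
function h = compute_h_cached(x, orbit_id)
% Memoize h per orbit across calls using a persistent map.
persistent cache
if isempty(cache)
    cache = containers.Map('KeyType', 'double', 'ValueType', 'any');
end
if isKey(cache, orbit_id)
    h = cache(orbit_id);           % reuse the previously computed result
else
    h = compute_h(x, orbit_id);    % expensive computation (hypothetical)
    cache(orbit_id) = h;           % store for subsequent calls
end
end
```

Note the cache must be invalidated (e.g. `clear compute_h_cached`) whenever x changes between iterations.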

Status - Current Speedup
[Bar chart: speedup vs. number of processors (1, 8, 16, and 20 CPUs) for the Sep. '09, May '10, and July '10 code versions; speedup is 1.00 at 1 CPU, and labeled values rise through 4.08, 4.26, 5.76, 6.50, 7.19, 7.22, 9.62, and 10.83 to a best of 14.50.]

Tools for Processing - 1
Run Manager/Scheduler/Version Controller
- Built by Ahmad Aljadaan
- Given a batch of options files, it schedules batches of runs, serial or parallel, on the cluster
- Facilitates version control by creating a unique directory for each run, containing its options file, code, and results
- Used with a large number of test cases
- Has helped us schedule 1048+ runs since January 2010, and counting

Tools for Processing - 2
Result Viewer
- Built by Ahmad Aljadaan
- Allows searching for runs, displays results, and plots related charts
- Facilitates comparison of results

Questions