Parallel Interval Analysis for Chemical Process Modeling

Parallel Interval Analysis for Chemical Process Modeling

Chao-Yang Gau and Mark A. Stadtherr*
Department of Chemical Engineering
University of Notre Dame
Notre Dame, IN 46556 USA

SIAM CSE 2000
Washington, D.C., September 21-24, 2000

*Fax: (219) 631-8366; E-mail: markst@nd.edu

Outline

• Motivation: Reliability in Computing
• Methodology: Interval Newton/Generalized Bisection
• Parallel Implementation on a Cluster of Workstations
• Some Performance Results

High Performance Computing

In chemical engineering and other areas of engineering and science, high performance computing is providing the capability to:
• Solve problems faster
• Solve larger problems
• Solve more complex problems
⇒ Solve problems more reliably

Motivation

• In process modeling and other applications, chemical engineers frequently need to solve nonlinear equation systems in which the variables are constrained physically within upper and lower bounds; that is, to solve:

    f(x) = 0,    x^L ≤ x ≤ x^U

• These problems may:
  – Have multiple solutions
  – Have no solution
  – Be difficult to converge to any solution

Motivation (cont'd)

• There is also frequent interest in globally minimizing a nonlinear function subject to nonlinear equality and/or inequality constraints; that is, to solve (globally):

    min_x φ(x)
    subject to  h(x) = 0
                g(x) ≥ 0
                x^L ≤ x ≤ x^U

• These problems may:
  – Have multiple local minima (in some cases, it may be desirable to find them all)
  – Have no solution (infeasible NLP)
  – Be difficult to converge to any local minimum

Interval Newton/Generalized Bisection

• Given initial bounds on each variable, IN/GB can:
  – Find (enclose) any and all solutions of a nonlinear equation system to a desired tolerance
  – Determine that there is no solution of a nonlinear equation system
  – Find the global optimum of a nonlinear objective function
• This methodology:
  – Provides a mathematical guarantee of reliability
  – Deals automatically with rounding error, and so also provides a computational guarantee of reliability
  – Represents a particular type of branch-and-prune algorithm (or branch-and-bound for optimization)

IN/GB (cont'd)

• For solving a nonlinear equation system f(x) = 0, the interval Newton method provides pruning conditions; IN/GB is a branch-and-prune scheme on a binary tree
• No strong assumptions about the function f(x) need be made
• The problem f(x) = 0 must have a finite number of real roots in the given initial interval
• The method is not suitable if f(x) is a black-box function
• If there is a solution at a singular point, then existence and uniqueness cannot be confirmed; the eventual result of the IN/GB approach will be a very narrow enclosure that may contain one or more solutions
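The pruning step can be sketched in one dimension. This is an illustrative Python toy, not the talk's Fortran-77 code: it omits the outward rounding a rigorous interval package would use, and all names are invented.

```python
# One-dimensional sketch of the interval Newton pruning step:
# N(X) = m - f(m)/F'(X), then intersect N(X) with X. An empty
# intersection proves X contains no root. (Toy code: no outward
# rounding, so the enclosure is not mathematically rigorous.)

def idiv(a, b):
    """Interval division [a]/[b], assuming 0 is not in [b]."""
    c = [a[0] / b[0], a[0] / b[1], a[1] / b[0], a[1] / b[1]]
    return (min(c), max(c))

def newton_step(f, fprime_enc, x):
    """One interval Newton step on box x = (lo, hi).
    fprime_enc(x) must return an interval enclosing f' over x."""
    lo, hi = x
    m = 0.5 * (lo + hi)
    fp = fprime_enc(x)
    if fp[0] <= 0.0 <= fp[1]:
        return x  # derivative enclosure contains 0: bisect instead (omitted)
    q = idiv((f(m), f(m)), fp)
    n = (m - q[1], m - q[0])
    new = (max(lo, n[0]), min(hi, n[1]))     # N(X) ∩ X
    return None if new[0] > new[1] else new  # None: no root in X

# Enclose the root of f(x) = x^2 - 2 (i.e., sqrt(2)) starting from [1, 2]
f = lambda x: x * x - 2.0
fp = lambda x: (2.0 * x[0], 2.0 * x[1])  # f'(x) = 2x, monotone on [1, 2]
box = (1.0, 2.0)
for _ in range(4):
    nxt = newton_step(f, fp, box)
    if nxt is None:
        break
    box = nxt
```

The enclosure tightens rapidly: after a few steps the box around sqrt(2) is narrower than 1e-6.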

IN/GB (cont'd)

• Can be extended to global optimization problems
• For unconstrained problems, solve for stationary points
• For constrained problems, solve for KKT points (or, more generally, for Fritz John points)
• Add an additional pruning condition:
  – Compute the interval extension (bounds on range) of the objective function
  – If its lower bound is greater than a known upper bound on the global minimum, prune this subinterval, since it cannot contain the global minimum
• This is a branch-and-bound scheme on a binary tree
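The additional pruning condition amounts to a few lines of code. A minimal Python sketch with invented names; `obj_range` stands in for whatever interval extension of the objective is available:

```python
# Branch-and-bound pruning: a box whose objective lower bound exceeds the
# best known upper bound on the global minimum cannot contain the minimum.

def prune_by_bound(boxes, obj_range, best_upper=float("inf")):
    """obj_range(box) -> (lo, hi), an interval enclosing the objective on box.
    Returns the surviving boxes and the tightened upper bound."""
    kept = []
    for box in boxes:
        lo, hi = obj_range(box)
        if lo <= best_upper:                  # box may still hold the minimum
            kept.append(box)
            best_upper = min(best_upper, hi)  # the minimum over this box is <= hi
    return kept, best_upper

# Toy objective phi(x) = x^2 with its natural interval extension
def sq_range(box):
    lo, hi = box
    vals = [lo * lo, hi * hi]
    return (0.0 if lo <= 0.0 <= hi else min(vals), max(vals))

kept, ub = prune_by_bound([(0.0, 1.0), (2.0, 3.0), (-1.0, 0.5)], sq_range)
# (2.0, 3.0) is pruned: its objective lower bound 4 exceeds the upper bound 1
```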

Some Types of Problems Solved

• Fluid phase stability and equilibrium (e.g., Hua et al., 1998)
• Location of azeotropes (Maier et al., 1998, 1999, 2000)
• Location of mixture critical points (Stradi et al., 2000)
• Solid-fluid equilibrium (Xu et al., 2000)
• Parameter estimation (Gau and Stadtherr, 1999, 2000)
• Phase behavior in porous materials (Maier and Stadtherr, 2000)
• General process modeling problems up to 163 equations (Schnepper and Stadtherr, 1996)

Parallel Branch-and-Bound Techniques

• B&B and B&P involve successive subdivision of the problem domain to create subproblems, thus requiring a tree search process
  – Applications are often computationally intense
  – Subproblems (tree nodes) are independent
  – A natural opportunity for use of parallel computing
• For practical problems, the binary tree that needs to be searched in parallel may be quite large
• The binary trees may be highly irregular, which can result in a highly uneven distribution of work among processors and thus poor overall performance (e.g., idle processors)

Parallel B&B (cont'd)

• Need an effective work scheduling and load balancing scheme to do parallel tree search efficiently
• Manager-worker schemes (centralized global stack management) are popular but may scale poorly due to communication expense and bottlenecks
• Many implementations of parallel B&B have been studied (Kumar et al., 1994; Gendron and Crainic, 1994) for various target architectures
• There are various B&B and B&P schemes; we use an interval Newton/generalized bisection (IN/GB) method

Work Scheduling and Load Balancing

• Objective: Schedule the workload among processors to minimize communication delays and execution time, and maximize computing resource utilization
• Use dynamic scheduling: Redistribute workload concurrently at runtime, transferring workload from a heavily loaded processor to a lightly loaded one (load balancing)
• Target architecture: Distributed computing on a networked cluster using message passing
  – Often relatively inexpensive
  – Uses widely available hardware
• Use distributed (multiple-pool) load balancing

Distributed Load Balancing

• Each processor locally makes the workload placement decision to maintain the local interval stack and prevent itself from becoming idle
• Alleviates bottleneck effects of a centralized load balancing policy (manager/worker)
• Reduced communication overhead can provide high scalability for the parallel computation
• Components of typical schemes:
  – Workload state measurement
  – State information exchange
  – Transfer initiation
  – Workload placement
  – Global termination

Components

• Workload state measurement
  – Evaluate local workload using some work index
  – Use stack length: number of intervals (boxes) remaining to be processed
• State information exchange
  – Communicate local workload state to other cooperating processors
  – Selection of cooperating processors defines a virtual network
  – Virtual network: global (all-to-all), 1-D torus, 2-D torus, etc.
• Transfer initiation
  – Sender-initiated
  – Receiver-initiated
  – Symmetric (sender- or receiver-initiated)

Components (cont'd)

• Workload placement
  – Work-adjusting rule: how to distribute work (boxes) among cooperating processors, and how much to transfer
    – Work stealing (e.g., Blumofe and Leiserson, 1994)
    – Diffusive propagation (e.g., Heirich and Taylor, 1995)
    – Etc.
  – Work-selection rule: which boxes should be transferred
    – Breadth first
    – Best first (based on the lower bound value)
    – Depth first
    – Various heuristics
• Global termination
  – Easy to detect with synchronous, all-to-all communication
  – For local and/or asynchronous communication, use Dijkstra's token algorithm
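The token-based termination test can be sketched as follows. This is a much-simplified simulation of one circulation of Dijkstra's token around a ring of already-idle processors; the real algorithm interleaves with computation, and the names and two-round usage here are illustrative only.

```python
# Dijkstra-style token termination on a ring: a processor is 'black' if it
# has sent work since the token last passed it. The token starts white; any
# black processor blackens the token and turns white as it forwards it.
# A white token completing the circuit on an idle ring means termination.

def circulate_token(colors):
    """One full circulation over an all-idle ring.
    colors[i] in {'white', 'black'}. Returns (terminated, new_colors)."""
    token = "white"
    new_colors = []
    for c in colors:
        if c == "black":
            token = "black"         # a stale work transfer poisons this round
        new_colors.append("white")  # forwarding the token whitens the processor
    return token == "white", new_colors

colors = ["white", "black", "white", "white"]
done1, colors = circulate_token(colors)  # one black processor: inconclusive
done2, colors = circulate_token(colors)  # all white now: termination detected
```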

Parallel Implementations

• Three types of strategies were implemented:
  – Synchronous Work Stealing (SWS)
  – Synchronous Diffusive Load Balancing (SDLB)
  – Asynchronous Diffusive Load Balancing (ADLB)
• These are listed in order of likely effectiveness
• All were implemented in Fortran-77 using LAM (Local Area Multicomputer) MPI (Laboratory for Scientific Computing, University of Notre Dame)

Synchronous Work Stealing

• Periodically exchange workload information (workflg) and any improved upper bound value (for optimization) using synchronous global (all-to-all) blocking communication
• Once idle, steal one interval (box) from the processor with the heaviest workload (receiver-initiated)
• Difficulties:
  – Large network overhead (global, all-to-all)
  – Idle time from process synchronism and blocking communication

[Diagram: after every T tests, processors 0-3 pause computation for an MPI_ALLGATHER of workflg (number of stack boxes), make placement decisions, and transfer boxes before resuming computation.]
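After the all-to-all exchange, the SWS placement decision reduces to picking a victim. A hedged Python sketch; the function name is invented, and the talk's actual implementation is Fortran-77/MPI:

```python
# After the all-to-all exchange, each processor knows workflg[r], the number
# of stack boxes on every processor r. An idle processor steals one box from
# the most heavily loaded peer; if no peer has work, there is nothing to steal.

def steal_target(my_rank, workflg):
    """Return the rank to steal one box from, or None if nothing to steal."""
    heaviest = max(range(len(workflg)), key=lambda r: workflg[r])
    if heaviest == my_rank or workflg[heaviest] == 0:
        return None
    return heaviest
```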

Synchronous Diffusive Load Balancing

• Use local communication: processors periodically exchange work state and units of work with their immediate neighbors to maintain their workload
• Typical workload-adjusting scheme (symmetric initiation):

    u(j) = 0.5 [workflg(i) - workflg(j)]    (i: local processor; j: neighbor processor)

  – If u(j) is positive and greater than some tolerance: send intervals (boxes)
  – If u(j) is negative and less than some tolerance: receive intervals (boxes)
• Messages have higher granularity
• Synchronism and blocking communication still cause inefficiencies
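The work-adjusting rule translates directly into code. A Python sketch; the tolerance value and the rounding to whole boxes are assumptions, not from the slides:

```python
# Symmetric diffusive rule: u(j) = 0.5 * (workflg(i) - workflg(j)).
# Positive u beyond a tolerance -> send boxes to neighbor j;
# negative u beyond it -> expect boxes from neighbor j.

def diffusive_adjust(my_load, neighbor_loads, tol=2.0):
    """neighbor_loads: {rank: stack length}. Returns {rank: boxes to send};
    a negative count means boxes expected from that neighbor."""
    plan = {}
    for j, load_j in neighbor_loads.items():
        u = 0.5 * (my_load - load_j)
        if u > tol:
            plan[j] = int(u)        # send about half the imbalance
        elif u < -tol:
            plan[j] = -int(-u)      # receive about half the imbalance
    return plan

# Processor holding 20 boxes, with neighbors holding 4, 20, and 30:
plan = diffusive_adjust(20, {1: 4, 2: 20, 3: 30})
```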

Synchronous Diffusive Load Balancing (cont'd)

[Diagram: after every T tests, processors 0-3 exchange workload state information with neighbors, make placement decisions, and transfer boxes; a companion sketch shows workload concentration before and after balancing.]

Asynchronous Diffusive Load Balancing

• Use asynchronous nonblocking communication to send workload information and transfer workload
• Overlaps communication and computation
• Receiver-initiated diffusive workload transfer scheme:
  – Send out work state information only if the local workload falls below some threshold
  – Donor processor follows the diffusive scheme to determine the amount of work to send (if any)
  – Recognizes that workload balance is less important than preventing idle states
• Dijkstra's token algorithm is used to detect global termination

Asynchronous Diffusive Load Balancing (cont'd)

[Diagram: processor i interleaves computation with a flexible sequence of nonblocking sends and receives of workflg values and boxes, so communication overlaps computation.]

Testing Environment

• Physical hardware: Sun Ultra workstations connected by switched Ethernet (100 Mbit)

[Diagram: cluster of workstations (processor, cache, and memory) attached to a switched Ethernet.]

• Virtual network:
  – Global communication: all-to-all network, used for SWS
  – Local communication: 1-D torus network, used for SDLB and ADLB

Test Problem

• Parameter estimation in a vapor-liquid equilibrium model
• Use the maximum likelihood estimator as the objective function to determine model parameters that give the best fit
• Problem data and characteristics chosen to make this a particularly difficult problem
• Can be formulated as a nonlinear equation solving problem (which has five solutions)
• Or can be formulated as a global optimization problem

Comparison of Algorithms on Equation-Solving Problem

Speedup vs. Number of Processors: ADLB vs. SDLB vs. SWS

[Plot: speedup for 2-16 processors for SWS, SDLB, and ADLB, with linear speedup shown for reference.]

Comparison of Algorithms on Equation-Solving Problem

Efficiency vs. Number of Processors: ADLB vs. SDLB vs. SWS

[Plot: efficiency (0-1) for 2-16 processors for SWS, SDLB, and ADLB.]

Using ADLB on Optimization Problem

Speedup vs. Number of Processors (three different runs of the same problem)

[Plot: speedup for 2-16 processors for three runs of the same problem.]

Using ADLB on Optimization Problem

• Speedups around 50 on 16 processors: superlinear speedup
• Superlinear speedup is possible because of the broadcast of least upper bounds, causing intervals to be discarded earlier than in the serial case; that is, there is less work to do in the parallel case than in the serial case
• Results vary from run to run because of different timing in finding and broadcasting improved upper bounds
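For reference, the quantities plotted on these slides are just S = T_serial/T_parallel and E = S/p. The timings below are made-up numbers for illustration only:

```python
# Speedup and efficiency as used in the performance plots. A speedup of
# about 50 on 16 processors, as reported for ADLB, implies efficiency > 1,
# i.e., superlinear behavior from earlier bound-based pruning.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

s = speedup(800.0, 16.0)         # hypothetical timings
e = efficiency(800.0, 16.0, 16)  # > 1, so superlinear
```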

Effect of Virtual Network

• We have also considered performance in a 2-D torus virtual network

[Diagrams: 1-D torus network and 2-D torus network.]

• 1-D vs. 2-D torus:
  – 2-D has higher communication overhead (more neighbors)
  – 2-D has smaller network diameter (shorter message diffusion distance): 2⌊√p/2⌋ vs. ⌊p/2⌋
  – The trade-off may favor 2-D for a large number of processors

Effect of Virtual Network (cont'd)

• The ADLB algorithm was tested using both 1-D and 2-D virtual connectivity
• The test problem is an equation-solving problem: computation of critical points of mixtures
• Comparisons made using isoefficiency analysis: as the number of processors is increased, determine the problem size needed to maintain constant efficiency relative to the best sequential algorithm
• Isoefficiency curves at 92% were determined up to 32 processors
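The isoefficiency procedure can be illustrated with a toy cost model. Everything here is an assumption for illustration: the overhead term c·p stands in for the real measured run times, which the talk obtained experimentally:

```python
# Isoefficiency sketch: for each processor count p, find (by doubling) the
# smallest problem size W at which efficiency recovers the target. With the
# toy parallel time W/p + c*p, the required W grows roughly like p^2.

def efficiency(W, p, c=1.0):
    t_serial = W                 # toy model: serial time equals problem size
    t_parallel = W / p + c * p   # computation plus a made-up overhead term
    return (t_serial / t_parallel) / p

def isoefficiency_size(p, target=0.92, c=1.0):
    """Smallest W (by doubling search) with efficiency >= target."""
    W = 1.0
    while efficiency(W, p, c) < target:
        W *= 2.0
    return W
```

Under this model the 92% isoefficiency problem size quadruples each time p doubles, which is the kind of growth an isoefficiency curve makes visible.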

Isoefficiency Curves (92%) for Equation-Solving Problem

2-D Torus vs. 1-D Torus (lower is better)

[Plot: log2(problem size) vs. log2(number of processors) for the 1-D and 2-D torus networks.]

Stack Management for Workload Placement

• Especially for optimization problems, the selection rule for workload transfer can have a significant effect on performance
• With the goal of maintaining consistently high (superlinear) speedups on optimization (B&B) problems, we have used a dual stack management scheme
• Each processor maintains two workload stacks, a local stack and a global stack:
  – The processor draws work from the local stack in the order in which it is generated (depth-first pattern)
  – The global stack provides work for transmission to other processors
  – The global stack is created by randomly removing boxes from the local stack, contributing breadth to the tree search process
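The dual-stack scheme might look like the following Python sketch. The class name, the refill policy, and the transfer counts are all illustrative assumptions, not the talk's Fortran-77 implementation:

```python
import random

# Dual stacks per processor: the local stack is consumed depth-first
# (newest box first), while the global stack, filled by randomly removing
# boxes from the local stack, holds work earmarked for other processors
# and so adds breadth to the overall tree search.

class DualStack:
    def __init__(self, seed=0):
        self.local = []            # depth-first working stack
        self.global_ = []          # boxes reserved for transfer
        self.rng = random.Random(seed)

    def push(self, box):
        self.local.append(box)

    def next_local(self):
        """Depth-first: process the most recently generated box."""
        return self.local.pop() if self.local else None

    def refill_global(self, n=1):
        """Randomly move up to n boxes local -> global."""
        for _ in range(min(n, len(self.local))):
            i = self.rng.randrange(len(self.local))
            self.global_.append(self.local.pop(i))

    def boxes_for_transfer(self, n=1):
        """Boxes to ship to another processor."""
        return [self.global_.pop() for _ in range(min(n, len(self.global_)))]

ds = DualStack()
for b in range(5):
    ds.push(b)
first = ds.next_local()          # depth-first: box 4 comes off first
ds.refill_global(2)              # reserve two random boxes for transfer
sent = ds.boxes_for_transfer(3)  # only two are available to send
```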

Workload Placement (cont'd)

• The dual stack strategy was tested using a 2-D torus virtual network, up to 32 processors
• The test problem was an optimization problem: parameter estimation using an error-in-variable approach
• For comparison, an ultimate speedup was determined by initially setting the best upper bound to the value of the global minimum
• Results indicate that the dual stack strategy leads to higher speedups and less variability from run to run (based on 10 runs of each case)

Workload Placement (cont'd)

Speedup vs. Number of Processors: Dual Stack vs. Single Stack vs. Ultimate

[Plot: speedup for up to 32 processors for single stack, dual stack, and ultimate speedup, with linear speedup shown for reference.]

Concluding Remarks

• IN/GB is a powerful general-purpose and model-independent approach for solving a variety of process modeling problems, providing a mathematical and computational guarantee of reliability
• Continuing advances in computing hardware and software (e.g., compiler support for interval arithmetic, parallel computing) will make this approach even more attractive
• With effective load management strategies, parallel B&B and B&P problems (using IN/GB or other approaches) can be solved very efficiently using MPI on a networked cluster of workstations:
  – Good scalability
  – Exploits the potential for superlinear speedup in B&B
• Parallel computing technology can be used not only to solve problems faster, but also to solve problems more reliably

Acknowledgments

• ACS PRF 30421-AC9 and 35979-AC9
• NSF DMI96-96110 and EEC97-00537-CRCD
• US ARO DAAG55-98-1-0091
• Sun Microsystems, Inc.

For more information:
• Contact Prof. Stadtherr at markst@nd.edu
• See also http://www.nd.edu/~markst