USING HYBRID PARALLEL PROGRAMMING TECHNIQUES FOR THE COMPUTATION, ASSEMBLY AND SOLUTION STAGES IN FINITE ELEMENT CODES


Latin American Applied Research 41 (2011)

USING HYBRID PARALLEL PROGRAMMING TECHNIQUES FOR THE COMPUTATION, ASSEMBLY AND SOLUTION STAGES IN FINITE ELEMENT CODES

R.R. PAZ, M.A. STORTI, H.G. CASTRO and L.D. DALCÍN

Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC), Instituto de Desarrollo Tecnológico para la Industria Química (INTEC), CONICET, Universidad Nacional del Litoral (UNL). Santa Fe, Argentina. {rodrigop mstorti
Grupo de Investigación en Mecánica de Fluidos, Universidad Tecnológica Nacional, Facultad Regional Resistencia, Chaco, Argentina.

Abstract— The so-called hybrid parallelism paradigm, which combines programming techniques for architectures with distributed and shared memory using the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing) standards, is currently adopted to exploit the growing use of multi-core computers, thus improving the efficiency of codes on such architectures (several multi-core nodes or clustered symmetric multi-processors (SMP) connected by a fast network for exhaustive computations). In this paper a parallel hybrid finite element code is developed and its performance is evaluated, using MPI for communication between cluster nodes and OpenMP for parallelism within the SMP nodes. An efficient thread-safe matrix library for computing element/cell residuals (or right-hand sides) and Jacobians (or matrices) in FEM-like codes is introduced and fully described. The cluster on which the code was tested is CIMEC's Coyote cluster, which consists of eight-core computing nodes connected through Gigabit Ethernet.

Keywords— Finite Elements, MPI, OpenMP, PETSc, hybrid programming, Matrix Library.

I. INTRODUCTION

A variety of engineering applications and scientific problems in the Computational Mechanics (CM) area, and particularly in the Computational Fluid Dynamics (CFD) field, demand high computational resources (Sonzogni et al., 2002). A great effort has been made over the years to obtain high quality solutions (Paz et al., 2006) for large-scale problems in realistic times (Behara and Mittal, 2009) using many different computing architectures (e.g., vector processors, distributed and shared memory machines, or general-purpose graphics processing units, GPGPUs).

Symmetric multi-processors (SMP) involve a hardware architecture with two or more identical processors connected to a single shared main memory. Other recent computing systems might use Non-Uniform Memory Access (NUMA). NUMA dedicates different memory banks to different processors, so that processors may access local memory quickly. Despite the differences with NUMA architectures, this work will use the term SMP in a broader sense, to refer to a general many-processor, many-core, single-memory computing machine. Since SMPs have spread widely in conjunction with high-speed network hardware, using SMP clusters has become attractive for high-performance computing. To exploit such computing systems the tendency is to use the so-called hybrid parallelism paradigm, which combines programming techniques for architectures with distributed and shared memory, often using the MPI and OpenMP standards (Jost and Jin, 2003).

The hybrid MPI/OpenMP programming technique is based on using message passing for coarse-grained parallelism and multi-threading for fine-grained parallelism. The MPI programming paradigm defines a high-level abstraction for fast and portable inter-process communication and assumes a local/private address space for each process. Applications can run on clusters of (possibly heterogeneous) workstations or dedicated nodes, on SMP machines, or even on a mixture of both.
MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability without sacrificing performance (see Section V.B). Although message passing is the natural way to communicate between nodes, it may not be an efficient mechanism within an SMP node. In shared memory architectures, parallelization strategies based on the OpenMP standard may provide better performance and efficiency in parallel applications. A combination of both paradigms within an application running on hybrid clusters may provide a more efficient parallelization strategy than applications that exploit the features of pure MPI.

This paper focuses on a mixed MPI and OpenMP implementation of a finite element code for scalar PDEs and discusses the benefits of developing mixed mode MPI/OpenMP codes running on Beowulf clusters of SMPs. To address these objectives, the remainder of this paper is organized as follows. Section II provides a short description and comparison of different characteristics of the OpenMP and MPI paradigms. Section III introduces and describes an efficient thread-safe matrix library called FastMat for computing element residuals and Jacobians in the context of multi-threaded finite element codes. Section IV discusses the implementation of mixed (hybrid) mode applications and describes a number of situations where mixed mode programming is potentially beneficial. Section V presents the implementation of a hybrid application, namely an advective-

diffusive partial differential equation. Several tests are performed on a cluster of SMPs and on single SMP workstations (Intel Xeon E54xx-series and i7 architectures), comparing and contrasting the performance of the FEM code. The results demonstrate that this style of programming may increase the code performance. This improvement can be achieved by taking into account a few rules of thumb depending on the application at hand. Concluding remarks are given in Section VI.

II. AN OVERVIEW OF MPI AND OPENMP

A. MPI

MPI, the Message Passing Interface (MPI, 2010), is a standardized and portable message-passing system designed to function on a wide variety of parallel computers. The standard defines the syntax and semantics of library routines (MPI is not a programming language extension) and allows users to write portable programs in the main scientific programming languages (Fortran, C and C++).

The MPI programming model is a distributed memory model with explicit control of parallelism. MPI is portable to both distributed and shared memory architectures. The explicit parallelism often provides better performance, and a number of optimized collective communication routines are available for optimal efficiency. MPI defines a high-level abstraction for fast and portable inter-process communication (Snir et al., 1998; Gropp et al., 1998). MPI applications can run on clusters of (possibly heterogeneous) workstations or dedicated nodes, on (symmetric) multi-processor machines, or even on a mixture of both. MPI hides all the low-level details, like networking or shared memory management, simplifying development and maintaining portability without sacrificing performance. The MPI specification is nowadays the leading standard for message passing libraries in the world of parallel computers. At the time of this writing, clarifications to MPI-2 are being actively discussed and new working groups are being established for generating a future MPI-3 specification. MPI provides the following functionality:

Communication Domains and Process Groups: MPI communication operations occur within a specific communication domain through an abstraction called a communicator. Communicators are built from groups of participating processes and provide a communication context for the members of those groups. Process groups enable parallel applications to assign processing resources to sets of cooperating processes in order to perform independent work.

Point-to-Point Communication: This fundamental mechanism enables the transmission of data between a pair of processes, one side sending, the other receiving.

Collective Communications: They allow the transmission of data among multiple processes of a group simultaneously.

Dynamic Process Management: MPI provides mechanisms to create or connect a set of processes and establish communication between them and the existing MPI application.

One-Sided Operations: One-sided communications supplement the traditional two-sided, cooperative send/receive MPI communication model with one-sided, remote put/get operations on specified regions of process memory that have been made available for read and write operations.

Parallel Input/Output.
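As a minimal illustration of the point-to-point and collective primitives listed above, consider the following sketch (illustrative only; it is not part of the FEM code discussed in this paper):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Point-to-point: process 0 sends one value to process 1.
    if (rank == 0 && size > 1) {
      double val = 3.14;
      MPI_Send(&val, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      double val;
      MPI_Recv(&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Collective: sum a per-process contribution over all processes.
    double local = rank + 1.0, total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %g\n", total);

    MPI_Finalize();
    return 0;
  }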
B. OpenMP

OpenMP, the Open specifications for Multi-Processing (OpenMP, 2010), defines an application program interface that supports concurrent programming employing a shared memory model. OpenMP is available for several platforms and languages; there are extensions for the best known languages, like Fortran (77, 90 and 95) and C/C++. OpenMP is the result of joint work between companies and educational institutions involved in the research and development of hardware and software.

OpenMP is based on the fork-join model (see Fig. 1), a paradigm implemented early on UNIX systems, where a task is divided into several processes (fork) with less weight than the initial task, whose results are collected at the end and merged into a single result (join).

Figure 1: The master thread creates a team of parallel threads.

Using OpenMP in existing codes implies the insertion of special compiler directives (beginning with #pragma omp in C/C++) and runtime routine calls. Additionally, environment variables can be defined in order to control some functionality at execution time.

A code parallelized using OpenMP directives initially runs as a single process, the master thread or main process. Upon entering a region to be parallelized, the main process creates a set of parallel processes or parallel threads. Including an OpenMP directive implies a mandatory synchronization across the parallel block. That is, the code block is marked as parallel and threads are launched according to the characteristics of the directive. At the end of the parallel block, the working threads are synchronized (unless an additional directive is inserted to remove this implicit synchronization). By default, in the parallelized region all variables except those used in loops are shared by the threads. This can be changed by specifying the variable type (private or shared) before entering the parallel region. This model is also applicable to nested parallelism.
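A minimal fork-join example of the kind just described is sketched next (illustrative only). The team of threads is created at the parallel directive, the loop iterations are divided among the threads, and the implicit barrier at the end of the block performs the join:

  #include <omp.h>
  #include <cstdio>

  int main() {
    const int n = 1000000;
    double sum = 0.0;
    // Fork: the master thread spawns a team and the iterations are split
    // among threads. "sum" is a reduction variable; the loop index is
    // private by default, everything else is shared.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
      sum += 1.0 / ((i + 1.0) * (i + 1.0));
    // Join: the implicit barrier at the end of the parallel region
    // synchronizes the team before the master thread continues.
    printf("sum = %.12f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
  }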

C. PETSc

PETSc, the Portable, Extensible Toolkit for Scientific Computation (Balay et al., 2010a,b), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It employs the MPI standard for all message-passing communication. Being written in C and based on MPI, PETSc is a highly portable software library. PETSc-based applications can run in almost all modern parallel environments, ranging from distributed memory architectures (Balay et al., 1997) (with standard networks as well as specialized communication hardware) to multi-processor (and multi-core) shared memory machines.

Figure 2: Hierarchical structure of the PETSc library (taken from the PETSc Manual, Balay et al., 2010a).

The PETSc library provides its users a platform to develop applications fully exploiting parallelism, and the flexibility to experiment with many different models and with solution methods for large linear and nonlinear systems, avoiding explicit calls to the MPI library. It is a freely available library usable from C/C++, Fortran 77/90 and Python (Dalcin, 2010). An overview of some of the components of PETSc can be seen in Fig. 2. An important feature of the package is the possibility to write applications at a high level and work the way down in level of abstraction (including explicit calls to MPI).

As PETSc employs the distributed memory model, each process has its own address space, and data is communicated using MPI when required. For instance, in a linear (or nonlinear) system solution stage (a common case in FEM applications) each process will own a contiguous subset of rows of the system matrix (in the C implementation) and will primarily work on this subset, sending (or receiving) information to (or from) other processes. The PETSc interface allows users an agile development of parallel applications. PETSc provides sequential/distributed matrix and vector data structures and efficient parallel matrix/vector assembly operations, using an object oriented style. Also, several iterative methods for linear/nonlinear solvers are designed in the same way.
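As an illustration of this style of use (a sketch, not code from this paper; it assumes a recent PETSc, where e.g. KSPSetOperators takes three arguments), the following program assembles a distributed tridiagonal matrix by rows and solves a linear system with a run-time configurable Krylov solver:

  #include <petscksp.h>

  int main(int argc, char **argv) {
    PetscInitialize(&argc, &argv, NULL, NULL);
    const PetscInt n = 100;
    Mat A; Vec x, b; KSP ksp;

    // Distributed sparse matrix: each process owns a contiguous block of rows.
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A); MatSetUp(A);
    PetscInt rstart, rend;
    MatGetOwnershipRange(A, &rstart, &rend);
    for (PetscInt i = rstart; i < rend; i++) {
      if (i > 0)   MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);
      if (i < n-1) MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);
      MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    VecCreate(PETSC_COMM_WORLD, &x);
    VecSetSizes(x, PETSC_DECIDE, n);
    VecSetFromOptions(x);
    VecDuplicate(x, &b);
    VecSet(b, 1.0);

    // Krylov solver configured at run time (-ksp_type bcgs -pc_type jacobi, ...)
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
  }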
III. THE FastMat MATRIX CLASS

A. Preliminaries

Finite element codes usually have two levels of programming. In the outer level a large vector describes the state of the physical system. Usually the size of this vector is the number of nodes times the number of fields minus the number of constraints (e.g. Dirichlet boundary conditions), so that the state vector size is N_nod n_dof − n_constr. This vector can be computed at once by assembling the right hand side (RHS) and the stiffness matrix in a linear problem, iterated in a non-linear problem, or updated at each time step through the solution of a linear or non-linear system. The point is that at this outer level all global assembly operations, which build the residual vector and matrices, are performed. At the inner level, one performs a loop over all the elements in the mesh, computes the RHS vector and matrix contributions of each element and assembles them into the global vector/matrix. From one application to another, the strategy at the outer level (linear/non-linear, steady/time dependent, etc.) and the physics of the problem that defines the FEM matrices and vectors may vary.

The FastMat matrix class has been designed to perform matrix computations efficiently at the element level. One of the key points in the design of the matrix library is that it is thread-safe, so that it can be used in an SMP environment within OpenMP parallel blocks. In view of efficiency there is an operation caching mechanism, which will be described later. Caching is also thread-safe, provided that independent cache contexts are used in each thread.

It is assumed that the code has an outer loop (usually the loop over elements) that is executed many times, and at each execution of the loop a series of operations is performed with a rather reduced set of local (or element) vectors and matrices. In many cases, FEM-like algorithms need to operate on sub-matrices, i.e., columns, rows or sets of them. In general, performance is degraded for such operations because there is a certain amount of work needed to extract or set the sub-matrix. Alternatively, a copy of the row or column can be made in an intermediate object, but some overhead is expected due to the copy operations. The particularity of FastMat is that at the first execution of the loop the addresses of the elements used in the operation are cached in an internal object, so that in the second and subsequent executions of the loop the addresses are retrieved from the cache. The library is in the public domain and can be accessed from Storti et al. (2010).

A.1 Example

Consider the following simple example: a 2D finite element mesh composed of triangles, i.e. an array xnod of 2 N_nod doubles with the node coordinates and an array icone with 3 n_elem entries with the node connectivities. For each element 0 ≤ j < n_elem its nodes are stored at icone[3*j+k] for 0 ≤ k < 3. It is required, for instance, to compute the maximum and minimum value of the area of the triangles. This is a computation quite similar to those found in FEM analysis. For each element in the mesh two basic operations are needed: i) loading the node coordinates in local vectors x1, x2 and x3; ii) computing the vectors along the sides of the element, a = x2 − x1 and b = x3 − x1. The area of the element is then half the determinant of the 2×2 matrix J formed by putting a and b as rows. The FastMat code for the proposed computations is shown in Listing 1.
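Listing 1 below is a sketch reconstructed from the operations described in this section; the cache-context plumbing and the auxiliary names (nelem, xnod, icone) follow the text, while the exact constructor signatures are assumptions. Line numbers are shown because the following subsections refer to them.

Listing 1: Simple FEM-like code

   1: FastMat::CacheCtx ctx;
   2: FastMat::CacheCtx::Branch b1;
   3: FastMat x(2,3,2), a(1,2), b(1,2), J(2,2,2);
   4: double min_area = 1e30, max_area = 0.0;
   5: for (int ie = 0; ie < nelem; ie++) {
   6:   ctx.jump(b1);
   7:   // load the three node coordinates into the rows of x
   8:   for (int k = 1; k <= 3; k++)
   9:     x.ir(1,k).set(&xnod[2*icone[3*ie+k-1]]);  // 0-based node numbers assumed
  10:   x.rs();
  11:   // side vectors a = x2 - x1 and b = x3 - x1
  12:   a.set(x.ir(1,2)).rest(x.ir(1,1));
  13:   b.set(x.ir(1,3)).rest(x.ir(1,1));
  14:   x.rs();
  15:   J.ir(1,1).set(a).rs().ir(1,2).set(b);
  16:   double area = 0.5*J.rs().det();
  17:   if (area < min_area) min_area = area;
  18:   if (area > max_area) max_area = area;
  19: }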

Calls to the FastMat::CacheCtx ctx object are related to cache manipulation and will be discussed later. Matrices are dimensioned in line 3; the first argument is the rank of the matrix, and then follow the dimensions for each index, i.e., the shape. For instance FastMat x(2,3,2) defines a matrix of rank 2 and shape (3,2), i.e., with 2 indices ranging from 1 to 3 and from 1 to 2, respectively. The rows of this matrix will store the coordinates of the nodes local to the element. FastMat matrices may have any number of indices, i.e., any rank. They can also have zero rank, which stands for scalars.

A.2 Current matrix views (the so-called masks)

In lines 7 to 10 of Listing 1 the coordinates of the nodes are loaded into matrix x. The underlying philosophy of FastMat is that views (or masks) of the matrix can be made without making any copies of the underlying values. For instance the operation x.ir(1,k) (for index restriction) sets a view of x so that index 1 is restricted to take the value k, reducing the rank of the matrix by one. As x has two indices, the operation x.ir(1,k) gives a matrix of rank one consisting of the k-th row of x. A call without arguments like x.ir() cancels the restriction. Also, the function rs() (for reset) cancels the current view. Please refer to Appendix A for a synopsis of the methods/operations available in the FastMat class.

A.3 Set operations

The operation a.set(x.ir(1,2)) copies the contents of the argument x.ir(1,2) into a. Also, x.set(xp) can be used, xp being an array of doubles (double *xp).

A.4 Dimension matching

The x.set(y) operation, where y is another FastMat object, requires that x and y have the same masked dimensions. As the .ir(1,2) operation restricts index 1 to the value 2, x.ir(1,2) is seen as a row vector of size 2 and can then be copied to a. If the masked dimensions do not fit, an error is issued.

A.5 Automatic dimensioning

In the example, a has been dimensioned at line 3, but most operations perform the dimensioning if the matrix has not already been dimensioned. For instance, if at line 3 a FastMat a is declared without specifying dimensions, then at line 12 the matrix is dimensioned taking the dimensions from the argument. The same applies to set(FastMat &) but not to set(double *), since in this last case the argument (double *) does not give information about its dimensions. Other operations that define dimensions are products and contraction operations.

A.6 Concatenation of operations

Many operations return a reference to the matrix (return value FastMat &) so that operations may be concatenated, as in A.ir(1,k).ir(2,j).

A.7 Underlying implementation with BLAS/LAPACK

Some functions are implemented at the low level using BLAS (BLAS, 2010) and LAPACK (LAPACK, 2010). Notably, prod() uses BLAS's dgemm, so that the amortized cost of a prod() call is the same as for dgemm(). As a matter of fact, a profiling study of FastMat efficiency in a typical FEM code has determined that the largest CPU consumption in the residual/Jacobian computation stage corresponds to prod() calls. Another notable case is eig(), which uses LAPACK's dgeev. The eig() method is not commonly used, but when it is, its cost may be significant, so that the fast implementation proposed here with dgeev is mandatory.

A.8 The FastMat operation cache concept

The idea behind caches is that they are objects (class FastMatCache) that store the addresses and any other information that can be computed in advance for the current operation.
In the first pass through the body of the loop (i.e., ie=0 in the example of Listing 1) a cache object is created for each of the operations and stored in a list. This list is basically a doubly linked list (list< >) of cache objects. When the body of the loop is executed the second time (i.e., ie≥1 in the example) and thereafter, the addresses of the matrix elements need not be recomputed; they are read from the cache instead.

The use of the cache is rather automatic and requires little intervention from the user, but in some cases the position in the cache list can get out of synchronization with respect to the execution of the operations, and severe errors may occur. The basic use of caching is to create the cache structure FastMat::CacheCtx ctx and keep the position in the cache structure synchronized with the position in the code. The process is very simple when the code consists of a linear sequence of FastMat operations that are always executed in the same order. In this case the CacheCtx object stores a list of the cache objects (one for each FastMat operation). As the operations are executed, the internal FastMat code is in charge of advancing the cache position in the cache list automatically.

A linear sequence of cache operations that are always executed in the same order is called a branch. Looking at the previous code, it has one branch, starting at the x.ir(1,k).set(...) line and running through the J.rs().det() line. This sequence is repeated many times (once for each element), so it is interesting to reuse the cache list. For this, a branch object b1 (class FastMat::CacheCtx::Branch) is created, and a jump to this branch is executed each time the loop body starts. In the first loop iteration the cache list is created and stored in the first position of the cache structure. In the next and subsequent executions of the loop the cache is reused, avoiding the recomputation of much administrative work related to the matrices.

The problem arises when the sequence of operations is not always the same. In that case several jump() commands must be issued, each one to the start of a sequence of FastMat operations. Consider for instance the following code (the first part, without the else-if block, of the listing shown after subsection B.3 below): a vector x of size 3 is randomly generated in a loop (the line x.fun(rnd);). Then its length is computed, and if it is shorter than 1.0 it is scaled by 1.0/len, so that its final length is one. In this case two branches are defined and two jumps are executed; branch b1: operations x.fun() and x.norm_p_all(); branch b2: operation x.scale().

B. Caching the addresses used in the operations

If caching is not used the performance of the library is poor, while the cached version is very fast, in the sense that almost all the CPU time is spent in performing multiplications and additions, and negligible CPU time is spent in auxiliary operations.

B.1 Branching is not always needed

However, branching is needed only if the instruction sequence changes during the same execution of the code. For instance, if a method flag is determined at the moment of reading the data and is then left unchanged for the whole execution of the code, it is not necessary to jump(), since the instruction sequence will always be the same.

B.2 Cache mismatch

The cache process may fail if a cache mismatch is produced. Consider the following variation of the previous code (the full listing after subsection B.3): there is an additional block in the conditional; if the length of the vector is greater than 1.1, the vector is set to the null vector. Every time a branch is opened in a program block, a ctx.jump() must be called, using different arguments for the different branches (i.e., b1, b2, etc.). In this code there are three branches. The code shown is correct, but assume that the user forgets the jump() calls at lines 10 and 13 (sentences ctx.jump(b2) and ctx.jump(b3)); then, when reaching the x.set(0.0) operation in line 14, the corresponding cache would be the cache of the x.scale() operation (line 11), and an incorrect computation would occur. Each time the retrieved cache does not match the operation to be computed, or when it does not exist, a cache mismatch exception is produced.

B.3 Causes for a cache mismatch error

Basically, the information stored in the cache (and then retrieved from the objects that were passed at the moment of creating the cache) must be the same as that needed for performing the current FastMat operation, that is:

The FastMat matrices involved must be the same (i.e., their pointers must be the same).

The indices passed in the operation must coincide (for instance for the prod(), ctr() and sum() operations).

The masks (see Section III.A.2) applied to each of the matrix arguments must be the same.
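The listing referred to in the two preceding subsections is sketched below (a reconstruction under assumptions: the names N and rnd, and the exact signature of norm_p_all(), are not confirmed by the text). Line numbers are shown because the discussion in B.2 refers to them; without the else-if block (lines 12 to 15) it reduces to the two-branch version discussed first:

   1: FastMat::CacheCtx ctx;
   2: FastMat::CacheCtx::Branch b1, b2, b3;
   3: FastMat x(1,3);
   4: for (int j = 0; j < N; j++) {
   5:   ctx.jump(b1);
   6:   x.fun(rnd);                    // fill x with random values
   7:   double len = x.norm_p_all(2);  // Euclidean length (signature assumed)
   8:   if (len < 1.0) {
   9:     // scale x so that its final length is one
  10:     ctx.jump(b2);
  11:     x.scale(1.0/len);
  12:   } else if (len > 1.1) {
  13:     ctx.jump(b3);
  14:     x.set(0.0);
  15:   }
  16: }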
B.4 Multi-threading and reentrancy

If caching is not enabled, FastMat is thread-safe. If caching is enabled, then it is thread-safe in the following sense: a context ctx must be created for each thread, and the matrices used in each thread must be associated with the context of that thread. If creating the cache structures each time is too costly, then the context and the matrices may be used in a parallel region, stored in variables, and reused in a subsequent parallel region.
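A sketch of this usage pattern in an OpenMP parallel loop follows (how matrices are associated with a given context is an implementation detail assumed here; the element computation itself is elided):

  #pragma omp parallel
  {
    // one cache context and one set of element matrices per thread
    FastMat::CacheCtx ctx;
    FastMat::CacheCtx::Branch b1;
    FastMat x(2,3,2), a(1,2), b(1,2), J(2,2,2);
    #pragma omp for
    for (int ie = 0; ie < nelem; ie++) {
      ctx.jump(b1);
      // ... element computations as in Listing 1 ...
    }
  }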

Figure 3: Efficiency comparison on an Intel Core 2 Duo T5450 (1.66 GHz) processor.

C. Efficiency

This benchmark computes a=b*c in a loop for a large number N of distinct square matrices a, b, c of varying size n. As mentioned before, the amortized cost is the same as for the underlying dgemm() call. The processing rate in Gflops is computed on the basis of an operation count of 2n^3 per matrix product, i.e.

rate [Gflops] = (2 n^3 N / elapsed time [secs]) × 10^{-9}.   (1)

The number of loop executions needed to reach a 50% amortization (n_{1/2}) is on the order of 15 to 30. With respect to the dgemm() implementation for the matrix product, several options were tested on an Intel Core 2 Duo T5450 (1.66 GHz) processor using one core (see Fig. 3). The options tested were: i) the ATLAS (Whaley et al., 2001) version, both with the default setup and self-configured; ii) the Intel Math Kernel Library (MKL) with the GNU/GCC g++ compiler; and iii) MKL with the Intel icc compiler. A significant improvement is obtained if the library is linked with MKL, combined with either the icc or g++ compiler. These combinations peak at 5 to 6 Gflops for 100 < n < 200. With operation caching activated the overhead of the mask computation is avoided, and the amortized cost is similar to using dgemm() on plain C matrices.

D. FastMat multi-product operation

A common operation in FEM codes (see Section V) and many other applications is the product of several matrices or tensors. In addition, this kind of operation usually consumes the largest part of the CPU time in a typical FEM computation of residuals and Jacobians. The number of operations (and consequently the CPU time) can be largely reduced by choosing the order in which the products are performed. For instance, consider the following operation

A_{ij} = B_{ik} C_{kl} D_{lj}   (2)

(Einstein's convention on repeated indices is assumed). The same operation in matrix notation is

A = BCD,   (3)

where A, B, C, D are rectangular (rank 2) matrices of shape (m_1,m_4), (m_1,m_2), (m_2,m_3), (m_3,m_4), respectively. As the matrix product is associative, the order in which the computations are performed can be chosen at will, so that the operation can be performed in the following two ways:

A = (BC)D   (computation tree CT1),
A = B(CD)   (computation tree CT2).   (4)

The order in which the computations are performed can be represented by a complete binary tree; in this paper the order will be described using parentheses. The number of operations (op. count) for the different trees, and in consequence the CPU time, can be very different. The cost of performing the first product BC in the first row of (4) (and of a product of two rectangular matrices in general) is

op. count = 2 m_1 m_2 m_3,   (5)

which can be put as

op. count = 2 (m_1 m_3) m_2
          = 2 (prod. of dims for B and C free indices) × (prod. of dims for contracted indices),   (6)

or alternatively,

op. count = 2 (m_1 m_2)(m_2 m_3)/m_2
          = 2 (prod. of B dims)(prod. of C dims)/(prod. of dims for contracted indices).   (7)

The cost of the second product, (BC)D, is 2 m_1 m_3 m_4, so that the total cost for the first computation tree CT1 is 2 m_1 m_3 (m_2 + m_4). If the second computation tree CT2 is used, then the number of operations is 2 m_2 m_4 (m_1 + m_3). These numbers may be very different, for instance when B and C are square matrices and D is a vector, i.e., m_1 = m_2 = m_3 = m > 1, m_4 = 1. In this case the operation count is 2m^2(m+1) = O(m^3) for CT1 and 4m^2 = O(m^2) for CT2, so that CT2 is much more convenient.

D.1 Algorithms for the determination of the computation tree

There is a simple algorithm that exploits this heuristic rule in a general case (Aho et al., 1983).
If the multi-product is

R = A_1 A_2 ... A_n,   (8)

with A_k of shape (m_k, m_{k+1}), then the operation count c_k for each of the possible products A_k A_{k+1} is computed, namely c_k = m_k m_{k+1} m_{k+2} for k = 1 to n−1. Let c_{k*} be the minimum operation count; then the corresponding product A_{k*} A_{k*+1} is performed, and the pair A_{k*}, A_{k*+1} is replaced by this

product. Then, the list of matrices in the multi-product is shortened by one. The algorithm proceeds recursively until the number of matrices is reduced to only one. The cost of this algorithm is O(n^2) (note that this refers to the number of operations needed to determine the computation tree, not to actually computing the matrix product). For a small number of matrices the optimal tree may be found by performing an exhaustive search over all possible orders; the cost in this case is O(n!). In Fig. 4 the computing times of the exhaustive (optimal) and heuristic algorithms are shown for up to 8 matrices. Of course the exhaustive approach is prohibitive for a large number of matrices, but it can be afforded for up to 6 or 7 matrices, which is by far the most common case.

Figure 4: Cost of the determination of the optimal order for computing the product of matrices with the heuristic and exhaustive (optimal) algorithms.

The situation is basically the same, but more complex, in the full tensorial case, as implemented in the FastMat library. First consider a product of two tensors like

C_{il} = A_{ijk} B_{klj},   (9)

where tensors A, B, C have shape (m_1,m_2,m_3), (m_3,m_4,m_2), (m_1,m_4), respectively; i and l are free indices, while j and k are contracted indices. The cost of this product is (cf. Eqs. 6 and 7)

op. count = 2 m_1 m_2 m_3 m_4.   (10)

On one hand, the modification with respect to the case of rectangular (rank 2) matrices is that every tensor can now be contracted with any other in the list, so that the heuristic algorithm must check all pairs of distinct tensors, which is ñ(ñ−1)/2, where 1 ≤ ñ ≤ n is the number of tensors remaining in the list. This must be added over ñ, so that the algorithm is O(n^3). On the other hand, regarding the optimal order, it turns out that the complexity of its computation is

O( prod_{ñ=2..n} ñ(ñ−1)/2 ) = O( n! (n−1)! / 2^{n−1} ).   (11)

In the FastMat library the strategy is to use the exhaustive approach for n ≤ n_max and the heuristic one otherwise, with n_max = 5 by default but dynamically configurable by the user.

D.2 Example: Computation of the SUPG stabilization term. Element residual and Jacobian

As an example, consider the computation of the stabilization term (see Section V) for general advective-diffusive systems. The following product must be computed:

R^{SUPG,e}_{pμ} = ω_{p,k} A_{kμν} τ_{να} R^{SUPG,gp}_{α},   (12)

where R^{SUPG,e}_{pμ} (shape (n_el, n_dof), identifier res) is the SUPG residual contribution from element e (see Section V and Eq. 15), ω_{p,k} (shape (n_dim, n_el), identifier gN) are the spatial gradients of the interpolation functions ω_p, A_{kμν} = ∂F_{kμ}/∂U_ν (shape (n_dim, n_dof, n_dof), identifier A) are the Jacobians of the advective fluxes F_{kμ} with respect to the state variables U_ν, τ_{να} (shape (n_dof, n_dof), identifier tau) is the matrix of intrinsic times, and R^{SUPG,gp}_{α} (shape (n_dof), identifier R) is the vector of residuals per field at the Gauss point. These tensor products arise, for instance, in the context of FEM-Galerkin SUPG stabilized methods (see Donea and Huerta, 2003; Tezduyar and Osawa, 2000). This multi-product is just an example of the typical computations performed in a FEM based CFD code. The operation can be computed with a FastMat call like this:

res.prod(gN, A, tau, R, -1, 1, -1, 2, -2, -2, -3, -3);

where, if the j-th integer argument is positive, it represents the position of the index in the resulting matrix; otherwise, if the j-th argument is negative, a contraction operation is performed over all the indices sharing that negative value (see Appendix A.C).
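For reference, the greedy heuristic of Section III.D.1, specialized to rank-2 chains, can be sketched as follows (an illustrative stand-alone version; the FastMat implementation also handles the general tensorial case):

  #include <vector>
  #include <utility>

  // Greedy ordering of the chain product A1*A2*...*An, where Ak has shape
  // (dims[k-1], dims[k]). At each step the cheapest adjacent product is
  // chosen, with cost c_k = m_k*m_{k+1}*m_{k+2}; O(n^2) overall.
  std::vector<std::pair<int,int> > chain_order(std::vector<long> dims) {
    std::vector<std::pair<int,int> > order;
    while (dims.size() > 2) {
      int kbest = 0;
      long cbest = dims[0]*dims[1]*dims[2];
      for (int k = 1; k + 2 < (int)dims.size(); k++) {
        long c = dims[k]*dims[k+1]*dims[k+2];
        if (c < cbest) { cbest = c; kbest = k; }
      }
      // schedule the contraction of the pair at positions (kbest, kbest+1)
      // in the *current* list, then collapse the pair into one matrix
      order.push_back(std::make_pair(kbest, kbest+1));
      dims.erase(dims.begin() + kbest + 1);
    }
    return order;
  }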
The FastMat::prod() method then implements several possibilities for the computing tree:

natural: the products are performed in the order the user entered them, i.e., (((A_1 A_2) A_3) A_4);

heuristic: uses the heuristic algorithm described in Section III.D.1;

optimal: an exhaustive brute-force approach is applied in order to determine the computation tree with the lowest operation count.

In Table 1 the operation counts for these three strategies are reported. The first three columns show the relevant dimension parameters. n_dim may be 1, 2 or 3, and n_el may be 2 (segments), 3 (triangles), 4 (quads in 2D, tetras in 3D) or 8 (hexahedra). The values explored for n_dof are: n_dof = 1 (scalar advection-diffusion), n_dof = n_dim + 2 (compressible flow), and n_dof = 10 (advection-diffusion for 10 species).

Operation counts for the computation of element residuals. The costs of the tensor operation defined in Eq. (12) (where the involved tensors are as described above) are evaluated in terms of the gains (%) (see Table 1). The gains are relative to the cost of the products performed in the natural order, and the computation trees are:

CT1: ((gN*R)*tau)*A
CT2: (gN*(tau*R))*A
CT3: gN*((A*tau)*R)
CT4: gN*(A*(tau*R))

Operation counts for the computation of element Jacobians. This product is similar to the one described above, but now the Jacobian of the residual term is computed, so that the last tensor is not a vector but a rank 3 tensor:

J^{SUPG,e}_{pμqν} = ω_{p,k} A_{kμα} τ_{αβ} J^{SUPG,gp}_{βqν}.   (13)

This is computed with the call

J.prod(gN, A, tau, JR, -1, 1, -1, 2, -2, -2, -3, -3, 3, 4);

where J^{SUPG,gp}_{βqν} (shape (n_dof, n_el, n_dof), identifier JR) is the Jacobian of the residuals per field at the Gauss point. The possible orders (or computation trees) are:

CT5: ((gN*A)*tau)*JR
CT6: (gN*(A*tau))*JR
CT7: gN*((A*tau)*JR)

Discussion of the influence of the computation tree. From the previous examples it is noticed that:

In many cases the use of the CT determined with the heuristic or optimal orders yields a significant gain in operation count. The gain may be 90% or even higher in a realistic case like the computations of Eqs. (12) and (13).

In the presented cases the heuristic algorithm always yielded a reduction in operation count, though in general this is not guaranteed. In some cases the heuristic approach yielded the optimal CT; in others it gave a gain, but far from the optimal one.

It is very interesting that neither the heuristic nor the optimal computation trees are the same for all combinations of the parameters (n_dim, n_el, n_dof). For instance, in the computation of the Jacobian the optimal order is (gN*(A*tau))*JR in some cases and gN*((A*tau)*JR) in others (designated as CT6 and CT7 in the tables). This means that it would be impossible for the user to choose an order that is optimal for all cases; it must be computed automatically at run-time, as proposed here in the FastMat library.

In some cases the optimal or heuristically determined orders involve contractions that are not in the natural order; for instance, the CT (gN*(tau*R))*A (designated as CT2) is obtained with the heuristic algorithm for the computation of the element residual for some sets of parameters. Note that the second scheduled contraction involves the gN and tau*R tensors, even though they do not share any contracted indices.

Note that, in order to compute an efficient CT for the multi-product (with either the heuristic or the optimal algorithm), the library must implement the operation in functional form, as the FastMat library does. Operator overloading is not friendly to the implementation of an algorithm like this, because there is no way to capture the whole set of matrices and the contraction indices. The computation of the optimal or heuristic order is done only once and stored in the cache; in fact, this is a very good example of the utility of using caches.

Using more elaborate estimations of computing time. In the present work it is assumed that the computing time is directly proportional to the number of operations. This may not be true, but note that the computation tree could be determined with a more direct approach, for instance by benchmarking the products and then determining the CT that results in the lowest computing time, not in the lowest operation count.

IV. COMBINING MPI WITH OPENMP

To exploit the benefits of both parallel approaches, the so-called hybrid programming paradigm came into use when multi-core processors became available and Massively Parallel Processing (MPP) systems were abandoned in favor of clusters of SMP nodes. In the hybrid model each SMP node executes one multi-threaded MPI process (here this is done by means of OpenMP directives), while in pure MPI programming each processor executes a single-threaded MPI process.

The hybrid model approach does not always provide many benefits. For example, Henty (2000) observed that although in some cases OpenMP was more efficient than MPI on a single SMP node (flat MPI), hybrid parallelism did not outperform pure message-passing on an SMP cluster.
Also, according to Smith and Bull (2001), this style of programming cannot be regarded as the ideal programming model for all codes. Nevertheless, a significant benefit may be obtained if the parallel MPI code suffers from poor scaling due to load imbalance or too fine grained tasks. On one hand, if the code scales poorly with an increasing number of SMP nodes while a multi-threaded implementation of a specific part of it scales well, an improvement in performance may be expected in a mixed code, because intra-node communication disappears. On the other hand, if an MPI code scales poorly due to load imbalance, a hybrid programming version will create a coarser grained problem, thus improving the code performance inasmuch as MPI will only be used for communication between nodes.

Table 1: Operation count for the stabilization term in the SUPG formulation of advection-diffusion for n_dof fields. Other relevant dimensions are the space dimension n_dim and the number of nodes per element n_el. Columns: n_dim, n_el, n_dof, #ops (natural); heuristic #ops, gain (%); optimal #ops, gain (%); selected CT.

Table 2: Operation count for the Jacobian of the SUPG term for n_dof fields. Other relevant dimensions are the space dimension n_dim and the number of nodes per element n_el. Columns as in Table 1.

Typically, a mixed mode code involves a hierarchical model where MPI parallelization occurs at the top level (coarse grain) and multi-threading parallelization below (fine grain). This implies that MPI routines are called outside parallel regions, while all threads but the master are sleeping. This approach provides a portable parallelization and is the scheme used in this work. Thread-safe objects, like the FastMat ones, are also needed in this technique.
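A minimal skeleton of this hierarchical scheme is sketched below (illustrative only: the element computation is replaced by a placeholder and the mesh partition is a simple block split). MPI_THREAD_FUNNELED requests that only the master thread perform MPI calls, which matches the model just described:

  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Coarse grain: each MPI process owns a contiguous chunk of elements.
    int nelem = 1000000;                  // global element count (example value)
    int n0 = rank*nelem/size, n1 = (rank+1)*nelem/size;
    double local = 0.0;

    // Fine grain: OpenMP threads split the owned chunk; no MPI calls here.
    #pragma omp parallel for reduction(+:local)
    for (int ie = n0; ie < n1; ie++)
      local += 1.0;                       // element computation placeholder

    // Back to single-threaded execution: only the master communicates.
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
  }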

Hybrid MPI/OpenMP parallelization

In this work hybrid parallelism is evaluated with a C++ code for the solution of the scalar advection-diffusion partial differential equation by means of a stabilized finite element method. The parallel code was implemented using MPI, PETSc and OpenMP, and has four main parts:

i. Problem partition across SMP nodes: each node stores a portion of the mesh elements and the associated degrees of freedom. Besides, a parallel matrix and work vectors are created (MatCreateMPIAIJ, VecCreateMPI).

ii. Element FEM matrices and RHS computation: each SMP node performs a loop over the owned partition of the elements in the mesh and computes element matrices and right hand side contributions using OpenMP threads. Each thread performs the computations for its assigned group of elements according to a selected OpenMP scheduling (static, dynamic, guided).

iii. Matrix/vector values assembly: vector and matrix element contributions are stored in PETSc matrix and vector objects (VecSetValue, MatSetValue) on each SMP node. After that, {Vec|Mat}AssemblyBegin and {Vec|Mat}AssemblyEnd are called in order to communicate the off-process vector/matrix contributions and to prepare the internal matrix data structures for subsequent operations like the parallel matrix-vector product.

iv. Linear system solving: a suitable iterative Krylov space-based linear solver (KSP objects) is used. The matrix-vector product is a highly demanding operation in most iterative Krylov methods. The MatMult operation, which is employed in all PETSc Krylov solvers, is also parallelized using OpenMP threads inside an SMP node. For this purpose, OpenMP directives are introduced at the row loop of the sparse matrix-vector product (a sketch is given below).
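The threaded row loop of part iv might look as follows for a CSR-format matrix (a sketch; PETSc's actual MatMult kernel differs in detail):

  // Thread-parallel sparse matrix-vector product y = A*x for a CSR matrix,
  // with the OpenMP directive at the row loop.
  void spmv_csr(int nrows, const int *rowptr, const int *colidx,
                const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
      double sum = 0.0;
      for (int j = rowptr[i]; j < rowptr[i+1]; j++)
        sum += val[j]*x[colidx[j]];
      y[i] = sum;
    }
  }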
V. TEST CASE: ADVECTION-DIFFUSION SKEW TO THE MESH

The scalar advection-diffusion equation to be solved is

a · ∇u − ∇ · (ν ∇u) = s   in Ω,
u = u_D   on Γ_D,   (14)
ν ∇u · n = q   on Γ_N,

where u is the unknown scalar, a is the advection velocity vector, ν > 0 is the diffusion coefficient and s(x) is a volumetric source term. For the sake of simplicity only Dirichlet and Neumann boundary conditions are considered. The prescribed values for the Dirichlet and Neumann conditions are u_D and q, and n is the unit exterior normal to the boundary Γ_N. The discrete variational formulation of (14) with the added SUPG stabilizing term (Tezduyar and Osawa, 2000) is written as follows: find u^h ∈ S^h such that, for all v^h ∈ V^h,

∫_Ω v^h (a · ∇u^h) dΩ + ∫_Ω ∇v^h · (ν ∇u^h) dΩ
+ Σ_{e=1}^{n_el} ∫_{Ω^e} τ_supg (a · ∇v^h)(a · ∇u^h − ∇ · (ν ∇u^h) − s) dΩ
= ∫_Ω v^h s dΩ + ∫_{Γ_N} v^h q dΓ,   (15)

with the function spaces

S^h = { u^h : u^h ∈ H^1(Ω), u^h|_{Ω^e} ∈ P_m(Ω^e) for all e, u^h = u_D on Γ_D },
V^h = { v^h : v^h ∈ H^1(Ω), v^h|_{Ω^e} ∈ P_m(Ω^e) for all e, v^h = 0 on Γ_D },   (16)

where P_m is the finite element interpolating space. According to Tezduyar and Osawa (2000) the stabilization parameter τ_supg is computed as

τ_supg = h_supg / (2 ||a||),   (17)

with

h_supg = 2 ( Σ_{i=1}^{n_en} | s · ∇ω_i | )^{-1},   (18)

where s is a unit vector pointing in the streamline direction. The problem statement is depicted in Fig. 5, where the computational domain is a unit square, i.e., Ω = {(x,y) in [0,1] × [0,1]}.

This 2D test case has been widely used to illustrate the effectiveness of stabilized finite element methods in the modeling of advection-dominated flows (Donea and Huerta, 2003). The flow is unidirectional and constant, |a| = 1, but the advection velocity is skew to the mesh, with an angle of 30°. The diffusion coefficient is 10^{-4}, thus obtaining an advection-dominated system. As shown in Fig. 5, Dirichlet boundary conditions are imposed at the inlet walls (the x=0 and y=0 planes), while at the outlet walls (the x=1 and y=1 planes) homogeneous Neumann conditions are considered. A solution obtained on a coarse mesh is plotted in Fig. 6.

A. Results

The equipment used to perform the tests was the Coyote cluster at the CIMEC laboratory. This is a Beowulf-class cluster, consisting of a server and 6 nodes, each with two quad-core Intel Xeon E5420 2.5 GHz processors (1333 MHz Front Side Bus, FSB) and 8 GB RAM per node (see Fig. 11a), interconnected via a Gigabit Ethernet network.

In order to assess the performance of the proposed algorithm a mesh of 4.5 million linear triangles (with linear refinement toward the walls) is used. As the system of linear equations arising from (15) is non-symmetric, the stabilized bi-conjugate gradient (BiCGStab) method with diagonal preconditioning is employed as solver (Saad, 2000). The results for the consumed wall-clock time while sweeping the number of SMP nodes (n_p) and the number of threads (n_c) are shown in Fig. 7 for the RHS and matrix computation, and in Fig. 8 for the MatMult operation.

Let T* be the execution time needed to solve the problem on 1 node and 1 thread (n_p = 1 and n_c = 1) and let T denote the execution time of the parallel algorithm when there are n_p MPI processes, each running n_c OpenMP threads. Then, the speedup S and the efficiency E are defined as

S = T*/T,   (19)
E = S/(n_p n_c) = T*/(T n_p n_c),   (20)

respectively. The speedup obtained for the RHS and matrix computation is illustrated in Fig. 9, and for the MatMult operation in Fig. 10.

As shown in Figs. 7 to 9, the matrix and RHS computation stage scales linearly for the whole range of SMP nodes and numbers of cores/threads considered, as expected. This stage is highly parallelizable (i.e., it does not require inter-node communication) and its performance is only slightly degraded. The memory access pattern of this stage consists of loading a few element data from memory (like nodal coordinates and a few physical parameters) and computing several quantities with them (element area, interpolation functions, gradients, element contributions to the global FEM matrix, etc.). Thus, the ratio between floating point operations and memory accesses is high and, as expected, this stage is not appreciably influenced by the memory bandwidth.

Figure 5: Advection of discontinuous inlet data skew to the mesh: problem statement.
Figure 6: SUPG solution for the 2D convection-diffusion problem with downwind natural conditions on a coarse mesh.
Figure 7: Elapsed time for the RHS and matrix computation.
Figure 8: Elapsed time for the MatMult solver operation.
Figure 9: Speedup for the RHS and matrix computation.

On one hand, when considering the left plot of Fig. 9, it can be seen from the blue line labeled "1 thread" that the pure MPI runs scale linearly. On the other hand, if the axis line n_p = 1 is considered (or the blue line labeled "1 node" in the right plot of the same figure), the OpenMP threaded runs also show linearity. Beyond that, all hybrid runs exhibit the same behavior, with a maximum loss in efficiency of about 10% (for the case of n_p = 6 and n_c = 8 threads).

The situation is clearly different for the MatMult operation (Figs. 8 and 10). In this case, although the pure MPI/PETSc runs, represented by the blue line labeled "1 thread" in the left part of Fig. 10, show linear scaling when varying the number of SMP processors (or nodes), the hybrid runs show an appreciable degradation of the speedup. A speedup of 10 is obtained for 6 nodes with 8 cores, whereas the theoretical value is 48. This loss in efficiency can be better appreciated in the right plot of the figure, where the speedup stagnates beyond four threads. In sparse

matrix-vector products the ratio between floating point operations and memory accesses is low. Thus, the overall performance of this stage is mainly controlled by the memory bandwidth.

B. OpenMP and flat MPI on one multi-processor machine

A speedup comparison between pure OpenMP and pure MPI versions of the FEM code running on one multi-processor machine was made using a mesh of 2 million linear triangles. Two different architectures were used separately: a quad-core Intel i7-950 (3.07 GHz) machine and a dual quad-core Intel Xeon E5420 (2.5 GHz, each processor being dual-die dual-core) with a 1333 MHz Front Side Bus (FSB).

Common features of all Nehalem-based processors (and hence of the i7 processor) include an integrated DDR3 memory controller as well as the QuickPath Interconnect (QPI) on the processor, replacing the Front Side Bus used in earlier Core processors like the Xeon architecture. Besides, Nehalem processors have 256 KB of on-die L2 cache per core, plus up to 12 MB of shared on-die L3 cache. QPI is much faster than the FSB and hence improves the overall performance. The Core i7 has an on-die memory controller, which means that it can access memory much faster than the dual quad-core Xeon processors (which have an external memory controller), therefore improving the overall performance.

The hierarchical topology, including memory nodes, sockets, shared caches and cores of the processors involved in this test, is sketched in Fig. 11. It was obtained using hwloc (hwloc, 2010).

Figure 10: Speedup for the MatMult iterative solver operation.
Figure 11: Dual Xeon and i7 architectures.
Figure 12: Speedup for the RHS and matrix computation (left) and MatMult operation (right) stages on the Xeon architecture.
Figure 13: Speedup for the RHS and matrix computation (left) and MatMult operation (right) stages on the i7 architecture.

Figures 12 and 13 show the speedups obtained on the Xeon and i7 architectures, respectively, for the matrix and RHS computation and for the iterative system solution (MatMult) stages. The performance when computing the FEM matrix and the RHS is comparable to that presented in the previous tests. Regarding the linear system solution stage, as shown in the right plots of Figs. 12 and 13, the i7 processor performs slightly better than the Xeon processor for runs with a varying number of threads (OpenMP) or processes (flat MPI) from one to

four. Beyond that, when running the test using five to eight threads/processes on the Xeon processor (recall that the i7 processor has only 4 cores while the Xeon has 8 cores), the memory bandwidth saturates and there is no improvement in the performance of the MatMult operation.

VI. CONCLUSIONS

The computation, assembly and solution stages in typical finite element codes have been discussed and studied in order to exploit the advantages provided by new hybrid architectures, like clusters of multi-core processors. For this purpose an efficient thread-safe tensor library was presented and rigorously evaluated. Bearing in mind that for large scale meshes all FEM computations inside the loop over elements must be evaluated millions of times, the process of caching all FEM operations at the element level (provided by the FastMat library) is a key point in the performance of the computing/assembling stages. The most important features of the FastMat tensor library can be summarized as follows:

Highly repeated tensorial operations at the element level are cached. Hence, element routines are very fast, in the sense that almost all the CPU time is spent in performing multiplications and additions, and negligible CPU time is spent in auxiliary operations. This feature is the key point when dealing with large/fine FEM meshes.

For the very common multi-product tensor operation, the order in which the successive tensor contractions are computed is always chosen at the minimum operation count cost, depending on the number of tensors/matrices in the FEM terms (exhaustive or heuristic approaches).

The FastMat library is fully tensorial, in the sense that contractions are implemented in a general way. The number of contracted tensors and the range of their indices can be variable. Also, a complete set of classical tensorial functions is available in the library.

The FastMat library is thread-safe, i.e., element residuals and Jacobians can be computed in the context of multi-threaded FEM codes.

Also, in this article a stabilized finite element formulation of the linear scalar advection-diffusion equation has been implemented. This formulation was used to analyze the performance of computing the element contributions and of the assembly process using the FastMat library, and of the multi-threaded version of PETSc's linear equation solver, on a cluster of SMP nodes (hybrid architecture).

Extending pure MPI FEM codes to be used on hybrid clusters with OpenMP is not a difficult task; it can be based solely on the different levels of parallelism of each part of the code. Implementing MPI/OpenMP hybrid codes starting from threaded codes generally requires much more effort, especially when considering the linear and non-linear solution stages. FEM code developers and researchers can take advantage of the spreading hybrid architectures by combining the MPI and OpenMP standards. Nonetheless, as shown in the tests, there seems to be a limitation on the performance of architectures implementing the FSB model, which is not appreciably improved on i7 architectures. Thus, numerical algorithms with low ratios between floating point operations and memory accesses will perform poorly on hybrid environments.
ACKNOWLEDGMENTS

This work has received financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET, Argentina, grant PIP 5271/05), Universidad Nacional del Litoral (UNL, Argentina, grants CAI+D /334), Agencia Nacional de Promoción Científica y Tecnológica (ANPCyT, Argentina, grants PICT 01141/2007, PICT Jóvenes Investigadores, PICT-1506/2006) and Secretaría de Ciencia y Tecnología de la Universidad Tecnológica Nacional, Facultad Regional Resistencia (Chaco).

APPENDIX A: SYNOPSIS OF FASTMAT OPERATIONS

A. One-to-one operations

These operations go from one element of A to the corresponding element of *this. The one-to-one operations implemented so far are:

FastMat& set(const FastMat &A): Copy matrix.
FastMat& add(const FastMat &A): Add matrix.
FastMat& rest(const FastMat &A): Subtract a matrix.
FastMat& mult(const FastMat &A): Multiply (element by element, like Matlab's .*).
FastMat& div(const FastMat &A): Divide matrix (element by element, like Matlab's ./).
FastMat& axpy(const FastMat &A, const double alpha): Axpy operation (element by element): (*this) += alpha * A.

B. In-place operations

These operations perform an action on all the elements of a matrix.

FastMat& set(const double val=0.): Sets all the elements of a matrix to a constant value.
FastMat& scale(const double val): Scale by a constant value.
FastMat& add(const double val): Adds the constant val.
FastMat& fun(double (*f)(double)): Apply a function to all elements.

C. Generic sum operations (sum over indices)

These methods perform some associative reduction operation over all the values of a given index, resulting in a matrix of lower rank. It is a generalization of the sum/max/min operations in Matlab, which return the specified operation per column, resulting in a row vector (one element per column). Here a number of integer arguments is specified, in such a way that if the j-th integer argument is positive it represents the position of the index in the resulting matrix; otherwise, if the j-th argument is -1, the specified operation (sum/max/min, etc.) is performed over all the values of that index. For instance, if a FastMat A(4,2,2,3,3) is declared, then B.sum(A,-1,2,1,-1) means

B_{ij} = Σ_{k=1..2} Σ_{l=1..3} A_{kjil},  for i = 1..3, j = 1..2.   (21)

These operations can be extended to any binary associative operation. So far the following operations are implemented:

FastMat& sum(const FastMat &A, const int m=0, ...): Sum over all selected indices.


More information

More on Functions and Their Graphs

More on Functions and Their Graphs More on Functions and Teir Graps Difference Quotient ( + ) ( ) f a f a is known as te difference quotient and is used exclusively wit functions. Te objective to keep in mind is to factor te appearing in

More information

Two Modifications of Weight Calculation of the Non-Local Means Denoising Method

Two Modifications of Weight Calculation of the Non-Local Means Denoising Method Engineering, 2013, 5, 522-526 ttp://dx.doi.org/10.4236/eng.2013.510b107 Publised Online October 2013 (ttp://www.scirp.org/journal/eng) Two Modifications of Weigt Calculation of te Non-Local Means Denoising

More information

19.2 Surface Area of Prisms and Cylinders

19.2 Surface Area of Prisms and Cylinders Name Class Date 19 Surface Area of Prisms and Cylinders Essential Question: How can you find te surface area of a prism or cylinder? Resource Locker Explore Developing a Surface Area Formula Surface area

More information

Unsupervised Learning for Hierarchical Clustering Using Statistical Information

Unsupervised Learning for Hierarchical Clustering Using Statistical Information Unsupervised Learning for Hierarcical Clustering Using Statistical Information Masaru Okamoto, Nan Bu, and Tosio Tsuji Department of Artificial Complex System Engineering Hirosima University Kagamiyama

More information

CESILA: Communication Circle External Square Intersection-Based WSN Localization Algorithm

CESILA: Communication Circle External Square Intersection-Based WSN Localization Algorithm Sensors & Transducers 2013 by IFSA ttp://www.sensorsportal.com CESILA: Communication Circle External Square Intersection-Based WSN Localization Algoritm Sun Hongyu, Fang Ziyi, Qu Guannan College of Computer

More information

1.4 RATIONAL EXPRESSIONS

1.4 RATIONAL EXPRESSIONS 6 CHAPTER Fundamentals.4 RATIONAL EXPRESSIONS Te Domain of an Algebraic Epression Simplifying Rational Epressions Multiplying and Dividing Rational Epressions Adding and Subtracting Rational Epressions

More information

Tuning MAX MIN Ant System with off-line and on-line methods

Tuning MAX MIN Ant System with off-line and on-line methods Université Libre de Bruxelles Institut de Recerces Interdisciplinaires et de Développements en Intelligence Artificielle Tuning MAX MIN Ant System wit off-line and on-line metods Paola Pellegrini, Tomas

More information

Multi-Stack Boundary Labeling Problems

Multi-Stack Boundary Labeling Problems Multi-Stack Boundary Labeling Problems Micael A. Bekos 1, Micael Kaufmann 2, Katerina Potika 1 Antonios Symvonis 1 1 National Tecnical University of Atens, Scool of Applied Matematical & Pysical Sciences,

More information

PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION

PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION PYRAMID FILTERS BASED ON BILINEAR INTERPOLATION Martin Kraus Computer Grapics and Visualization Group, Tecnisce Universität Müncen, Germany krausma@in.tum.de Magnus Strengert Visualization and Interactive

More information

Asynchronous Power Flow on Graphic Processing Units

Asynchronous Power Flow on Graphic Processing Units Asyncronous Power Flow on Grapic Processing Units Manuel Marin, David Defour, Federico Milano To cite tis version: Manuel Marin, David Defour, Federico Milano. Asyncronous Power Flow on Grapic Processing

More information

ICES REPORT Isogeometric Analysis of Boundary Integral Equations

ICES REPORT Isogeometric Analysis of Boundary Integral Equations ICES REPORT 5-2 April 205 Isogeometric Analysis of Boundary Integral Equations by Mattias Taus, Gregory J. Rodin and Tomas J. R. Huges Te Institute for Computational Engineering and Sciences Te University

More information

Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid Flows

Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid Flows Journal of Computational Pysics 176, 40 69 (2002) doi:10.1006/jcp.2001.6967, available online at ttp://www.idealibrary.com on Grid Adaptation for Functional Outputs: Application to Two-Dimensional Inviscid

More information

MAPI Computer Vision

MAPI Computer Vision MAPI Computer Vision Multiple View Geometry In tis module we intend to present several tecniques in te domain of te 3D vision Manuel Joao University of Mino Dep Industrial Electronics - Applications -

More information

Alternating Direction Implicit Methods for FDTD Using the Dey-Mittra Embedded Boundary Method

Alternating Direction Implicit Methods for FDTD Using the Dey-Mittra Embedded Boundary Method Te Open Plasma Pysics Journal, 2010, 3, 29-35 29 Open Access Alternating Direction Implicit Metods for FDTD Using te Dey-Mittra Embedded Boundary Metod T.M. Austin *, J.R. Cary, D.N. Smite C. Nieter Tec-X

More information

4.1 Tangent Lines. y 2 y 1 = y 2 y 1

4.1 Tangent Lines. y 2 y 1 = y 2 y 1 41 Tangent Lines Introduction Recall tat te slope of a line tells us ow fast te line rises or falls Given distinct points (x 1, y 1 ) and (x 2, y 2 ), te slope of te line troug tese two points is cange

More information

An Algorithm for Loopless Deflection in Photonic Packet-Switched Networks

An Algorithm for Loopless Deflection in Photonic Packet-Switched Networks An Algoritm for Loopless Deflection in Potonic Packet-Switced Networks Jason P. Jue Center for Advanced Telecommunications Systems and Services Te University of Texas at Dallas Ricardson, TX 75083-0688

More information

Intra- and Inter-Session Network Coding in Wireless Networks

Intra- and Inter-Session Network Coding in Wireless Networks Intra- and Inter-Session Network Coding in Wireless Networks Hulya Seferoglu, Member, IEEE, Atina Markopoulou, Member, IEEE, K K Ramakrisnan, Fellow, IEEE arxiv:857v [csni] 3 Feb Abstract In tis paper,

More information

Coarticulation: An Approach for Generating Concurrent Plans in Markov Decision Processes

Coarticulation: An Approach for Generating Concurrent Plans in Markov Decision Processes Coarticulation: An Approac for Generating Concurrent Plans in Markov Decision Processes Kasayar Roanimanes kas@cs.umass.edu Sridar Maadevan maadeva@cs.umass.edu Department of Computer Science, University

More information

12.2 Techniques for Evaluating Limits

12.2 Techniques for Evaluating Limits 335_qd /4/5 :5 PM Page 863 Section Tecniques for Evaluating Limits 863 Tecniques for Evaluating Limits Wat ou sould learn Use te dividing out tecnique to evaluate its of functions Use te rationalizing

More information

CS 234. Module 6. October 16, CS 234 Module 6 ADT Dictionary 1 / 33

CS 234. Module 6. October 16, CS 234 Module 6 ADT Dictionary 1 / 33 CS 234 Module 6 October 16, 2018 CS 234 Module 6 ADT Dictionary 1 / 33 Idea for an ADT Te ADT Dictionary stores pairs (key, element), were keys are distinct and elements can be any data. Notes: Tis is

More information

Redundancy Awareness in SQL Queries

Redundancy Awareness in SQL Queries Redundancy Awareness in QL Queries Bin ao and Antonio Badia omputer Engineering and omputer cience Department University of Louisville bin.cao,abadia @louisville.edu Abstract In tis paper, we study QL

More information

An Anchor Chain Scheme for IP Mobility Management

An Anchor Chain Scheme for IP Mobility Management An Ancor Cain Sceme for IP Mobility Management Yigal Bejerano and Israel Cidon Department of Electrical Engineering Tecnion - Israel Institute of Tecnology Haifa 32000, Israel E-mail: bej@tx.tecnion.ac.il.

More information

Density Estimation Over Data Stream

Density Estimation Over Data Stream Density Estimation Over Data Stream Aoying Zou Dept. of Computer Science, Fudan University 22 Handan Rd. Sangai, 2433, P.R. Cina ayzou@fudan.edu.cn Ziyuan Cai Dept. of Computer Science, Fudan University

More information

4.2 The Derivative. f(x + h) f(x) lim

4.2 The Derivative. f(x + h) f(x) lim 4.2 Te Derivative Introduction In te previous section, it was sown tat if a function f as a nonvertical tangent line at a point (x, f(x)), ten its slope is given by te it f(x + ) f(x). (*) Tis is potentially

More information

Investigating an automated method for the sensitivity analysis of functions

Investigating an automated method for the sensitivity analysis of functions Investigating an automated metod for te sensitivity analysis of functions Sibel EKER s.eker@student.tudelft.nl Jill SLINGER j..slinger@tudelft.nl Delft University of Tecnology 2628 BX, Delft, te Neterlands

More information

THE POSSIBILITY OF ESTIMATING THE VOLUME OF A SQUARE FRUSTRUM USING THE KNOWN VOLUME OF A CONICAL FRUSTRUM

THE POSSIBILITY OF ESTIMATING THE VOLUME OF A SQUARE FRUSTRUM USING THE KNOWN VOLUME OF A CONICAL FRUSTRUM THE POSSIBILITY OF ESTIMATING THE VOLUME OF A SQUARE FRUSTRUM USING THE KNOWN VOLUME OF A CONICAL FRUSTRUM SAMUEL OLU OLAGUNJU Adeyemi College of Education NIGERIA Email: lagsam04@aceondo.edu.ng ABSTRACT

More information

Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method

Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method International Journal of Statistics and Applications 0, (): -0 DOI: 0.9/j.statistics.000.0 Comparison of te Efficiency of te Various Algoritms in Stratified Sampling wen te Initial Solutions are Determined

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE Data Structures and Algoritms Capter 4: Trees (AVL Trees) Text: Read Weiss, 4.4 Izmir University of Economics AVL Trees An AVL (Adelson-Velskii and Landis) tree is a binary searc tree wit a balance

More information

You Try: A. Dilate the following figure using a scale factor of 2 with center of dilation at the origin.

You Try: A. Dilate the following figure using a scale factor of 2 with center of dilation at the origin. 1 G.SRT.1-Some Tings To Know Dilations affect te size of te pre-image. Te pre-image will enlarge or reduce by te ratio given by te scale factor. A dilation wit a scale factor of 1> x >1enlarges it. A dilation

More information

Network Coding to Enhance Standard Routing Protocols in Wireless Mesh Networks

Network Coding to Enhance Standard Routing Protocols in Wireless Mesh Networks Downloaded from vbn.aau.dk on: April 7, 09 Aalborg Universitet etwork Coding to Enance Standard Routing Protocols in Wireless Mes etworks Palevani, Peyman; Roetter, Daniel Enrique Lucani; Fitzek, Frank;

More information

Fault Localization Using Tarantula

Fault Localization Using Tarantula Class 20 Fault localization (cont d) Test-data generation Exam review: Nov 3, after class to :30 Responsible for all material up troug Nov 3 (troug test-data generation) Send questions beforeand so all

More information

15-122: Principles of Imperative Computation, Summer 2011 Assignment 6: Trees and Secret Codes

15-122: Principles of Imperative Computation, Summer 2011 Assignment 6: Trees and Secret Codes 15-122: Principles of Imperative Computation, Summer 2011 Assignment 6: Trees and Secret Codes William Lovas (wlovas@cs) Karl Naden Out: Tuesday, Friday, June 10, 2011 Due: Monday, June 13, 2011 (Written

More information

Announcements SORTING. Prelim 1. Announcements. A3 Comments 9/26/17. This semester s event is on Saturday, November 4 Apply to be a teacher!

Announcements SORTING. Prelim 1. Announcements. A3 Comments 9/26/17. This semester s event is on Saturday, November 4 Apply to be a teacher! Announcements 2 "Organizing is wat you do efore you do someting, so tat wen you do it, it is not all mixed up." ~ A. A. Milne SORTING Lecture 11 CS2110 Fall 2017 is a program wit a teac anyting, learn

More information

UUV DEPTH MEASUREMENT USING CAMERA IMAGES

UUV DEPTH MEASUREMENT USING CAMERA IMAGES ABCM Symposium Series in Mecatronics - Vol. 3 - pp.292-299 Copyrigt c 2008 by ABCM UUV DEPTH MEASUREMENT USING CAMERA IMAGES Rogerio Yugo Takimoto Graduate Scool of Engineering Yokoama National University

More information

MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2

MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2 MATH 5a Spring 2018 READING ASSIGNMENTS FOR CHAPTER 2 Note: Tere will be a very sort online reading quiz (WebWork) on eac reading assignment due one our before class on its due date. Due dates can be found

More information

H-Adaptive Multiscale Schemes for the Compressible Navier-Stokes Equations Polyhedral Discretization, Data Compression and Mesh Generation

H-Adaptive Multiscale Schemes for the Compressible Navier-Stokes Equations Polyhedral Discretization, Data Compression and Mesh Generation H-Adaptive Multiscale Scemes for te Compressible Navier-Stokes Equations Polyedral Discretization, Data Compression and Mes Generation F. Bramkamp 1, B. Gottsclic-Müller 2, M. Hesse 1, P. Lamby 2, S. Müller

More information

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Fall Has-Based Indexes Capter 11 Comp 521 Files and Databases Fall 2012 1 Introduction Hasing maps a searc key directly to te pid of te containing page/page-overflow cain Doesn t require intermediate page fetces

More information

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring

Hash-Based Indexes. Chapter 11. Comp 521 Files and Databases Spring Has-Based Indexes Capter 11 Comp 521 Files and Databases Spring 2010 1 Introduction As for any index, 3 alternatives for data entries k*: Data record wit key value k

More information

Efficient Content-Based Indexing of Large Image Databases

Efficient Content-Based Indexing of Large Image Databases Efficient Content-Based Indexing of Large Image Databases ESSAM A. EL-KWAE University of Nort Carolina at Carlotte and MANSUR R. KABUKA University of Miami Large image databases ave emerged in various

More information

Energy efficient temporal load aware resource allocation in cloud computing datacenters

Energy efficient temporal load aware resource allocation in cloud computing datacenters Vakilinia Journal of Cloud Computing: Advances, Systems and Applications (2018) 7:2 DOI 10.1186/s13677-017-0103-2 Journal of Cloud Computing: Advances, Systems and Applications RESEARCH Energy efficient

More information

Image Registration via Particle Movement

Image Registration via Particle Movement Image Registration via Particle Movement Zao Yi and Justin Wan Abstract Toug fluid model offers a good approac to nonrigid registration wit large deformations, it suffers from te blurring artifacts introduced

More information

Section 2.3: Calculating Limits using the Limit Laws

Section 2.3: Calculating Limits using the Limit Laws Section 2.3: Calculating Limits using te Limit Laws In previous sections, we used graps and numerics to approimate te value of a it if it eists. Te problem wit tis owever is tat it does not always give

More information

ANTENNA SPHERICAL COORDINATE SYSTEMS AND THEIR APPLICATION IN COMBINING RESULTS FROM DIFFERENT ANTENNA ORIENTATIONS

ANTENNA SPHERICAL COORDINATE SYSTEMS AND THEIR APPLICATION IN COMBINING RESULTS FROM DIFFERENT ANTENNA ORIENTATIONS NTNN SPHRICL COORDINT SSTMS ND THIR PPLICTION IN COMBINING RSULTS FROM DIFFRNT NTNN ORINTTIONS llen C. Newell, Greg Hindman Nearfield Systems Incorporated 133. 223 rd St. Bldg. 524 Carson, C 9745 US BSTRCT

More information

Truncated Newton-based multigrid algorithm for centroidal Voronoi diagram calculation

Truncated Newton-based multigrid algorithm for centroidal Voronoi diagram calculation NUMERICAL MATHEMATICS: Teory, Metods and Applications Numer. Mat. Teor. Met. Appl., Vol. xx, No. x, pp. 1-18 (200x) Truncated Newton-based multigrid algoritm for centroidal Voronoi diagram calculation

More information

Excel based finite difference modeling of ground water flow

Excel based finite difference modeling of ground water flow Journal of Himalaan Eart Sciences 39(006) 49-53 Ecel based finite difference modeling of ground water flow M. Gulraiz Akter 1, Zulfiqar Amad 1 and Kalid Amin Kan 1 Department of Eart Sciences, Quaid-i-Azam

More information

Proceedings. Seventh ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2013) Palm Spring, CA

Proceedings. Seventh ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2013) Palm Spring, CA Proceedings Of te Sevent ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC ) Palm Spring, CA October 9 November st Parameter-Unaware Autocalibration for Occupancy Mapping David Van

More information

The Euler and trapezoidal stencils to solve d d x y x = f x, y x

The Euler and trapezoidal stencils to solve d d x y x = f x, y x restart; Te Euler and trapezoidal stencils to solve d d x y x = y x Te purpose of tis workseet is to derive te tree simplest numerical stencils to solve te first order d equation y x d x = y x, and study

More information

Some Handwritten Signature Parameters in Biometric Recognition Process

Some Handwritten Signature Parameters in Biometric Recognition Process Some Handwritten Signature Parameters in Biometric Recognition Process Piotr Porwik Institute of Informatics, Silesian Uniersity, Bdziska 39, 41- Sosnowiec, Poland porwik@us.edu.pl Tomasz Para Institute

More information

Vector Processing Contours

Vector Processing Contours Vector Processing Contours Andrey Kirsanov Department of Automation and Control Processes MAMI Moscow State Tecnical University Moscow, Russia AndKirsanov@yandex.ru A.Vavilin and K-H. Jo Department of

More information

MAC-CPTM Situations Project

MAC-CPTM Situations Project raft o not use witout permission -P ituations Project ituation 20: rea of Plane Figures Prompt teacer in a geometry class introduces formulas for te areas of parallelograms, trapezoids, and romi. e removes

More information

Integrating Multimedia Applications in Hard Real-Time Systems

Integrating Multimedia Applications in Hard Real-Time Systems Integrating Multimedia Applications in Hard Real-Time Systems Luca Abeni and Giorgio Buttazzo Scuola Superiore S. Anna, Pisa luca@arti.sssup.it, giorgio@sssup.it Abstract Tis paper focuses on te problem

More information

An Effective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Chiming Huang and Rei-Heng Cheng

An Effective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Chiming Huang and Rei-Heng Cheng An ffective Sensor Deployment Strategy by Linear Density Control in Wireless Sensor Networks Ciming Huang and ei-heng Ceng 5 De c e mbe r0 International Journal of Advanced Information Tecnologies (IJAIT),

More information

Minimizing Memory Access By Improving Register Usage Through High-level Transformations

Minimizing Memory Access By Improving Register Usage Through High-level Transformations Minimizing Memory Access By Improving Register Usage Troug Hig-level Transformations San Li Scool of Computer Engineering anyang Tecnological University anyang Avenue, SIGAPORE 639798 Email: p144102711@ntu.edu.sg

More information

12.2 TECHNIQUES FOR EVALUATING LIMITS

12.2 TECHNIQUES FOR EVALUATING LIMITS Section Tecniques for Evaluating Limits 86 TECHNIQUES FOR EVALUATING LIMITS Wat ou sould learn Use te dividing out tecnique to evaluate its of functions Use te rationalizing tecnique to evaluate its of

More information

RECONSTRUCTING OF A GIVEN PIXEL S THREE- DIMENSIONAL COORDINATES GIVEN BY A PERSPECTIVE DIGITAL AERIAL PHOTOS BY APPLYING DIGITAL TERRAIN MODEL

RECONSTRUCTING OF A GIVEN PIXEL S THREE- DIMENSIONAL COORDINATES GIVEN BY A PERSPECTIVE DIGITAL AERIAL PHOTOS BY APPLYING DIGITAL TERRAIN MODEL IV. Évfolyam 3. szám - 2009. szeptember Horvát Zoltán orvat.zoltan@zmne.u REONSTRUTING OF GIVEN PIXEL S THREE- DIMENSIONL OORDINTES GIVEN Y PERSPETIVE DIGITL ERIL PHOTOS Y PPLYING DIGITL TERRIN MODEL bsztrakt/bstract

More information

Computing geodesic paths on manifolds

Computing geodesic paths on manifolds Proc. Natl. Acad. Sci. USA Vol. 95, pp. 8431 8435, July 1998 Applied Matematics Computing geodesic pats on manifolds R. Kimmel* and J. A. Setian Department of Matematics and Lawrence Berkeley National

More information

, 1 1, A complex fraction is a quotient of rational expressions (including their sums) that result

, 1 1, A complex fraction is a quotient of rational expressions (including their sums) that result RT. Complex Fractions Wen working wit algebraic expressions, sometimes we come across needing to simplify expressions like tese: xx 9 xx +, xx + xx + xx, yy xx + xx + +, aa Simplifying Complex Fractions

More information

Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks

Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks Traffic Pattern-based Adaptive Routing for Intra-group Communication in Dragonfly Networks Peyman Faizian, Md Safayat Raman, Md Atiqul Molla, Xin Yuan Department of Computer Science Florida State University

More information

UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING

UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING UNSUPERVISED HIERARCHICAL IMAGE SEGMENTATION BASED ON THE TS-MRF MODEL AND FAST MEAN-SHIFT CLUSTERING Raffaele Gaetano, Giuseppe Scarpa, Giovanni Poggi, and Josiane Zerubia Dip. Ing. Elettronica e Telecomunicazioni,

More information

PROTOTYPE OF LOAD CELL APPLICATION IN TORQE MEASUREMENT

PROTOTYPE OF LOAD CELL APPLICATION IN TORQE MEASUREMENT IMEKO 21 TC3, TC5 and TC22 Conferences Metrology in Modern Context November 22 25, 21, Pattaya, Conburi, Tailand PROTOTYPE OF LOAD CELL APPLICATION IN TORQE MEASUREMENT Tassanai Sanponpute 1, Cokcai Wattong

More information

Section 3. Imaging With A Thin Lens

Section 3. Imaging With A Thin Lens Section 3 Imaging Wit A Tin Lens 3- at Ininity An object at ininity produces a set o collimated set o rays entering te optical system. Consider te rays rom a inite object located on te axis. Wen te object

More information

A Novel QC-LDPC Code with Flexible Construction and Low Error Floor

A Novel QC-LDPC Code with Flexible Construction and Low Error Floor A Novel QC-LDPC Code wit Flexile Construction and Low Error Floor Hanxin WANG,2, Saoping CHEN,2,CuitaoZHU,2 and Kaiyou SU Department of Electronics and Information Engineering, Sout-Central University

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

On the Use of Radio Resource Tests in Wireless ad hoc Networks

On the Use of Radio Resource Tests in Wireless ad hoc Networks Tecnical Report RT/29/2009 On te Use of Radio Resource Tests in Wireless ad oc Networks Diogo Mónica diogo.monica@gsd.inesc-id.pt João Leitão jleitao@gsd.inesc-id.pt Luis Rodrigues ler@ist.utl.pt Carlos

More information

A Finite Element Scheme for Calculating Inverse Dynamics of Link Mechanisms

A Finite Element Scheme for Calculating Inverse Dynamics of Link Mechanisms WCCM V Fift World Congress on Computational Mecanics July -1,, Vienna, Austria Eds.: H.A. Mang, F.G. Rammerstorfer, J. Eberardsteiner A Finite Element Sceme for Calculating Inverse Dynamics of Link Mecanisms

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

Lehrstuhl für Informatik 10 (Systemsimulation)

Lehrstuhl für Informatik 10 (Systemsimulation) FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lerstul für Informatik 10 (Systemsimulation) PDE based Video Compression in Real

More information

AVL Trees Outline and Required Reading: AVL Trees ( 11.2) CSE 2011, Winter 2017 Instructor: N. Vlajic

AVL Trees Outline and Required Reading: AVL Trees ( 11.2) CSE 2011, Winter 2017 Instructor: N. Vlajic 1 AVL Trees Outline and Required Reading: AVL Trees ( 11.2) CSE 2011, Winter 2017 Instructor: N. Vlajic AVL Trees 2 Binary Searc Trees better tan linear dictionaries; owever, te worst case performance

More information

PLK-B SERIES Technical Manual (USA Version) CLICK HERE FOR CONTENTS

PLK-B SERIES Technical Manual (USA Version) CLICK HERE FOR CONTENTS PLK-B SERIES Technical Manual (USA Version) CLICK ERE FOR CONTENTS CONTROL BOX PANEL MOST COMMONLY USED FUNCTIONS INITIAL READING OF SYSTEM SOFTWARE/PAGES 1-2 RE-INSTALLATION OF TE SYSTEM SOFTWARE/PAGES

More information

Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art

Multi-Objective Particle Swarm Optimizers: A Survey of the State-of-the-Art Multi-Objective Particle Swarm Optimizers: A Survey of te State-of-te-Art Margarita Reyes-Sierra and Carlos A. Coello Coello CINVESTAV-IPN (Evolutionary Computation Group) Electrical Engineering Department,

More information

Distributed and Optimal Rate Allocation in Application-Layer Multicast

Distributed and Optimal Rate Allocation in Application-Layer Multicast Distributed and Optimal Rate Allocation in Application-Layer Multicast Jinyao Yan, Martin May, Bernard Plattner, Wolfgang Mülbauer Computer Engineering and Networks Laboratory, ETH Zuric, CH-8092, Switzerland

More information

Overcomplete Steerable Pyramid Filters and Rotation Invariance

Overcomplete Steerable Pyramid Filters and Rotation Invariance vercomplete Steerable Pyramid Filters and Rotation Invariance H. Greenspan, S. Belongie R. Goodman and P. Perona S. Raksit and C. H. Anderson Department of Electrical Engineering Department of Anatomy

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks

Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks Utilizing Call Admission Control to Derive Optimal Pricing of Multiple Service Classes in Wireless Cellular Networks Okan Yilmaz and Ing-Ray Cen Computer Science Department Virginia Tec {oyilmaz, ircen}@vt.edu

More information

TREES. General Binary Trees The Search Tree ADT Binary Search Trees AVL Trees Threaded trees Splay Trees B-Trees. UNIT -II

TREES. General Binary Trees The Search Tree ADT Binary Search Trees AVL Trees Threaded trees Splay Trees B-Trees. UNIT -II UNIT -II TREES General Binary Trees Te Searc Tree DT Binary Searc Trees VL Trees Treaded trees Splay Trees B-Trees. 2MRKS Q& 1. Define Tree tree is a data structure, wic represents ierarcical relationsip

More information

Real-Time Wireless Routing for Industrial Internet of Things

Real-Time Wireless Routing for Industrial Internet of Things Real-Time Wireless Routing for Industrial Internet of Tings Cengjie Wu, Dolvara Gunatilaka, Mo Sa, Cenyang Lu Cyber-Pysical Systems Laboratory, Wasington University in St. Louis Department of Computer

More information

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.

More information

Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganographic Schemes

Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganographic Schemes Feature-Based Steganalysis for JPEG Images and its Implications for Future Design of Steganograpic Scemes Jessica Fridric Dept. of Electrical Engineering, SUNY Bingamton, Bingamton, NY 3902-6000, USA fridric@bingamton.edu

More information

Introduction to Computer Graphics 5. Clipping

Introduction to Computer Graphics 5. Clipping Introduction to Computer Grapics 5. Clipping I-Cen Lin, Assistant Professor National Ciao Tung Univ., Taiwan Textbook: E.Angel, Interactive Computer Grapics, 5 t Ed., Addison Wesley Ref:Hearn and Baker,

More information

A UPnP-based Decentralized Service Discovery Improved Algorithm

A UPnP-based Decentralized Service Discovery Improved Algorithm Indonesian Journal of Electrical Engineering and Informatics (IJEEI) Vol.1, No.1, Marc 2013, pp. 21~26 ISSN: 2089-3272 21 A UPnP-based Decentralized Service Discovery Improved Algoritm Yu Si-cai*, Wu Yan-zi,

More information