Research on the Implementation of MPI on Multicore Architectures


Pengqi Cheng
Department of Computer Science & Technology, Tsinghua University, Beijing, China

Yan Gu
Department of Computer Science & Technology, Tsinghua University, Beijing, China

Abstract

We first introduce the multicore-oriented optimization modules of two common MPI implementations, MPICH2 and OpenMPI, and then test their performance on a multicore computer. By enabling and disabling these modules, we report their performance, including bandwidth and latency, under different circumstances. Finally, we analyze the two MPI implementations and discuss the choice of MPI implementation and possible improvements.

Keywords: Message Passing Interface, Multicore, MPICH2, OpenMPI, Intranode Communication.

I. INTRODUCTION

As CPU frequency has stalled at about 3-4 GHz for a long time, multicore architectures are increasingly used to obtain better performance on a single computer. However, MPI, as a communication protocol, was originally designed for distributed-memory systems. Unlike OpenMP, it does not allow shared data; instead, MPI programs transfer data by message passing. Because of the memory wall, when there are thousands of cores on one computer, message passing will be more efficient than access to shared data. Thus, it is important for MPI implementations to speed up data communication.

There are many commonly used open-source implementations, such as MPICH2 and OpenMPI. To fully exploit multicore architectures, these implementations may use some new technologies. The purpose of this project is to study the multicore awareness of such implementations, analyze their performance, and make improvements if possible.

II. MULTICORE AWARENESS OF MPI IMPLEMENTATIONS

A. Thread-Level Parallelism

Because they were traditionally used on computer clusters, the major MPI implementations, including both MPICH2 and OpenMPI, focus on process-level parallelism more than on thread-level parallelism. For multicore nodes, lowering the parallelization level ("process fusion") can in theory improve performance. For better scalability, however, MPI does not provide a convenient way of communicating at thread level. Instead, thread-level control is left to the user: Pthreads, OpenMP, or other thread parallelization methods are available inside each MPI process. To enable multithreaded programming, MPI_Init_thread() [3] should be called in place of MPI_Init(). After that, thread parallelization such as OpenMP directives can be used. Both implementations support this kind of hybrid programming.

This approach leaves the programmer much freedom to decide the behavior of all processes and threads. The programmer can choose private or shared data as appropriate to reach peak performance. Nevertheless, it moves the programming closer to the hardware and away from the algorithm itself. Moreover, existing MPI code cannot easily be migrated to multicore architectures in this way.

B. MPICH2 Nemesis

Starting with the MPICH2 1.1 series, the default channel is ch3:nemesis, so in this article we mainly discuss Nemesis [6] for MPICH2. The Nemesis communication subsystem provides an efficient, scalable mechanism for both intranode and internode communication.

In Nemesis, each process has only one lock-free receive queue. When a process needs to send a message, it dequeues a free element from a lock-free free queue, fills this element with the message, and then enqueues it onto the receiver's receive queue. Receiving is just the reverse of sending: dequeuing from the receive queue, handling the message, and enqueuing the element back onto the original free queue.
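A receive queue of this kind has many potential enqueuers (the senders) but a single dequeuer (its owner). The following is a minimal sketch in C11 atomics of one way to build such a queue, using an atomic swap on the tail and a compare-and-swap when the last element is removed. The names cell_t, queue_t, queue_init, enqueue and dequeue are illustrative assumptions and are not taken from the MPICH2 sources; a real implementation would also place the queue in shared memory and tune memory ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative types: one queue element carrying a small payload. */
typedef struct cell {
    _Atomic(struct cell *) next;
    int payload;
} cell_t;

typedef struct {
    _Atomic(cell_t *) head;  /* read by the single dequeuer */
    _Atomic(cell_t *) tail;  /* swapped by every enqueuer */
} queue_t;

void queue_init(queue_t *q) {
    atomic_init(&q->head, NULL);
    atomic_init(&q->tail, NULL);
}

/* May be called concurrently by any number of enqueuers. */
void enqueue(queue_t *q, cell_t *c) {
    assert(q != NULL && c != NULL);
    atomic_store(&c->next, NULL);
    cell_t *prev = atomic_exchange(&q->tail, c);  /* atomic swap */
    if (prev == NULL)
        atomic_store(&q->head, c);      /* queue was empty */
    else
        atomic_store(&prev->next, c);   /* link behind the old tail */
}

/* Must only be called by the queue's owner. Returns NULL if empty. */
cell_t *dequeue(queue_t *q) {
    assert(q != NULL);
    cell_t *c = atomic_load(&q->head);
    if (c == NULL)
        return NULL;
    cell_t *next = atomic_load(&c->next);
    if (next != NULL) {
        atomic_store(&q->head, next);
    } else {
        atomic_store(&q->head, NULL);
        cell_t *expected = c;
        /* Compare-and-swap: if c is still the tail, the queue is now empty. */
        if (!atomic_compare_exchange_strong(&q->tail, &expected, NULL)) {
            /* An enqueuer has swapped the tail but not linked yet: wait. */
            while ((next = atomic_load(&c->next)) == NULL)
                ;
            atomic_store(&q->head, next);
        }
    }
    return c;
}
```

In this sketch a send would dequeue a cell from a free queue, set its payload, and enqueue it onto the receiver's queue; the receiver would dequeue it, handle it, and enqueue it back onto the free queue it came from.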
Here is the figure of the message-queue mechanism:

[Figure: message-queue mechanism, steps 1-6 between the sending process and the receiving process, involving the free queue, the receive queue, "fill packet" and "handle packet".]

There are three variations of the location of the free queue:

1) One global free queue
2) One free queue per process, dequeued by other processes when they send to it
3) One free queue per process, dequeued by the process itself when it sends messages out

The first variation is better for small-scale shared-memory architectures such as a multicore node. Since processes do not access free queues evenly, this method improves memory utilization. The other two are mainly designed for large-scale distributed-memory architectures like NUMA, where they reduce remote memory access latency: only the sender or only the receiver needs to access remote memory. The Nemesis implementation uses the third variation, for large-scale clusters. This variation can be implemented with multiple-enqueuer, single-dequeuer lock-free queues. Here is the pseudo-code of enqueuing and dequeuing, with atomic swap (SWAP) and compare-and-swap (CAS) operations:

    Enqueue(queue, element)
        prev = SWAP(queue->tail, element);
        if (prev == NULL)
            queue->head = element;
        else
            prev->next = element;

    Dequeue(queue, &element)
        element = queue->head;
        if (element->next != NULL)
            queue->head = element->next;
        else
            queue->head = NULL;
            old = CAS(queue->tail, element, NULL);
            if (old != element)
                while (element->next == NULL) SKIP;
                queue->head = element->next;

In addition, there are some optimizations for faster intranode communication:

1) Reducing L2 Cache Misses

Memory is much slower than cache on modern computers, so it is critical to reduce L2 cache misses. When enqueuing onto an empty queue or dequeuing the last element of a queue, a process has to access both the head and the tail of the queue. In these cases, there is only one L2 cache miss if the head and tail are in the same cache line. In all other cases, only one of the two is accessed (enqueuers touch the tail, the dequeuer touches the head), so placing them in the same cache line would cause L2 cache misses through false sharing. Neither placement is therefore best in every case.

Nemesis puts the head and tail in the same cache line and adds a shadow head pointer (initialized to NULL) in a different cache line. The dequeuer first checks the shadow head. If it is not NULL, the dequeuer uses it directly. Otherwise, it checks the real head. If the real head is not NULL, meaning that some elements have been enqueued, the dequeuer copies the real head to the shadow head and then sets the real head to NULL. In this way, L2 cache misses occur only when enqueuing onto an empty queue or dequeuing from a queue with only one element.

2) Bypassing Queues

Another technique, fastboxes, decreases latency. A fastbox is a single buffer, one per pair of processes. If the fastbox is empty, the sender puts the message into it instead of into the queue; similarly, the receiver takes the message from the fastbox if it is full. Fastboxes improve the performance of intranode communication. However, the method does not scale, so it cannot be applied to large global shared-memory systems. It also requires the receiver to check multiple fastboxes, which can be partly avoided by specifying the sender in Nemesis. In addition, this implementation may change the order in which messages are sent and received. To address the problem, Nemesis uses a sequence number to keep the original order.

3) Memory Copy

Nemesis uses assembly string-copy functions and MMX instructions, which are more efficient than the standard libc memcpy function.

4) Large Message Transfer

The shared-memory queue discussed above is efficient for transferring small messages, but it is not a good solution for large ones. Therefore, the Large Message Transfer (LMT) interface was added to CH3. It can increase the transfer bandwidth and decrease the impact on the application's data in the cache. The LMT interface uses the rendezvous protocol. Unlike the original eager protocol, it ensures that the receive is matched before sending, so the sender does not need extra memory for unsent messages. The protocol threshold is 32KB by default.
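The size-based choice between the two protocols can be sketched as follows. This is a minimal illustration in C, assuming that messages of at least the threshold take the rendezvous path (consistent with a 32KB default); the names protocol_t, RNDZ_THRESHOLD and choose_protocol are hypothetical and are not part of MPICH2.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

/* Default threshold from the text (32KB); the macro name is an assumption. */
#define RNDZ_THRESHOLD ((size_t)32 * 1024)

static_assert(RNDZ_THRESHOLD == 32768, "threshold is expected to be 32KB");

/* Small messages are sent eagerly; larger ones wait for a matched receive. */
protocol_t choose_protocol(size_t nbytes) {
    return nbytes >= RNDZ_THRESHOLD ? PROTO_RENDEZVOUS : PROTO_EAGER;
}
```

With this rule a 1KB message is sent eagerly, while a 32KB or 1MB message first performs the rendezvous handshake, so the sender never buffers a large unsent message.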
The two protocols are shown in this figure:

[Figure: the eager protocol (the sender issues a single Send to the receiver) and the rendezvous protocol (RNDZ_START, RNDZ_REPLY, DATA and FIN exchanged between sender and receiver).]

For intranode communication, LMT copies through a buffer in shared memory. With double-buffering, the copy from the sender to the buffer and the copy from the buffer to the receiver can proceed concurrently.

5) Bypassing the Posted Receive Queue

While receiving, the traditional CH3 implementation checks the whole message queue and waits if no message matches the receive being processed. With this optimization, CH3 does not wait when no message matches. Instead, it checks the other receive requests, and if it finds a matching pair, it can receive that message.

C. OpenMPI sm BTL

OpenMPI has an equivalent of Nemesis, the sm BTL (shared-memory Byte Transfer Layer), a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. It can only be used between processes on the same node. The sm BTL transfers fragments of messages broken up by the PML (Point-to-point Message Management Layer). The steps are: [4]

1) The sender takes a shared-memory fragment from one of its free lists. Each process has one free list for smaller fragments and another for larger fragments.
2) The sender packs the user-message fragment into this shared-memory fragment.
3) The sender posts a pointer to this shared fragment into the appropriate FIFO queue of the receiver.
4) The receiver polls its FIFO(s). When it finds a new fragment pointer, it unpacks the data from the shared-memory fragment and notifies the sender that the shared fragment is ready for reuse (to be returned to the sender's free list).

On each node where an MPI job has two or more processes running, the job creates a file that each process mmaps into its address space. Shared-memory resources that the job needs, such as FIFOs and fragment free lists, are allocated from this shared-memory area.

D. KNEM

KNEM (Kernel Nemesis) is a Linux kernel module enabling high-performance intranode MPI communication for large messages. The LMT module of MPICH2 (since 1.1.1) and the sm BTL of OpenMPI (since 1.5) use KNEM to improve intranode communication.

On a single node, both Nemesis and the sm BTL use a buffer for copying messages. This performs well for small messages when the number of cores is not large.
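The buffer-based scheme can be pictured as a copy-in by the sender followed by a copy-out by the receiver, so every byte is moved twice. A minimal single-process sketch in C, assuming that an ordinary static array stands in for the shared-memory segment and that the chunk size is 4096 bytes; the names SHM_CHUNK and copy_via_staging are hypothetical, and neither the chunk size nor the layout is taken from Nemesis or the sm BTL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHM_CHUNK ((size_t)4096)  /* hypothetical staging-buffer size */

/* Moves len bytes from src to dst through a bounded staging buffer.
 * Returns the total number of bytes moved by memcpy (twice len). */
size_t copy_via_staging(void *dst, const void *src, size_t len) {
    static unsigned char staging[SHM_CHUNK];  /* stand-in for shared memory */
    size_t moved = 0;
    assert(len == 0 || (dst != NULL && src != NULL));
    for (size_t off = 0; off < len; off += SHM_CHUNK) {
        size_t n = (len - off < SHM_CHUNK) ? len - off : SHM_CHUNK;
        memcpy(staging, (const unsigned char *)src + off, n);  /* sender: copy in */
        memcpy((unsigned char *)dst + off, staging, n);        /* receiver: copy out */
        moved += 2 * n;
    }
    return moved;
}
```

For a 10000-byte message the function reports 20000 bytes moved, which is the overhead a single direct copy would halve.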
As message sizes and the number of cores grow, this solution becomes too slow. Other potential problems include cache pollution and high CPU use. For better scalability and performance, KNEM moves the data sharing from user space to kernel space. Here is how KNEM works: [5]

1) The sender declares a send buffer to KNEM.
2) KNEM tells the sender the virtual segments contained in the buffer, together with a unique cookie.
3) The sender passes the cookie to the receiver.
4) The receiver gives KNEM the cookie and the location of its receive buffer.
5) KNEM performs the copy.

[Figure: KNEM transfer between the sender LMT and the receiver LMT. Labels: Send Buffer, Send Cmd (1), Send Cmd List, Cookie (2), Inter LMT Communication (3), Recv Cmd (4), Acquire Send Cmd (5), Copy (6), Recv Buffer.]

KNEM thus saves one copy, which is efficient for large messages and many-core systems. These operations are more complicated than double-buffered copying (the system call overhead is about 100ns [2]), so KNEM should not be applied to small messages. More details of KNEM can be found in [7]. KNEM can also improve communication with the Intel(R) I/O Acceleration Technology (I/OAT) [1], using DMA to transfer data.

III. EXPERIMENT

A. Platform

Hardware
    CPU        Intel Core i5 CPU 2.67GHz, 4 cores
    Cache      32+32KB L1 per core, 256KB L2 per core, 8MB L3 shared
    Memory     4GB 1333MHz
Software
    OS         Arch Linux x86-64 with Kernel
    Compiler   GCC
    MPICH      compiled with -O2; No LMT / LMT / LMT + KNEM
    OpenMPI    1.5.1, compiled with -O2; KNEM support enabled/disabled
    KNEM       0.9.4, compiled with -O2
    Benchmark  OSU Micro-Benchmarks 3.2, compiled with mpicc -O3
    Processes  2 processes for the one-to-one test, 4 for the others

Due to hardware limitations, DMA and I/OAT are not enabled in KNEM.

B. Results

1) Bandwidth Test

2) Latency Test

C. Analysis

From the figures above, we see that:

In most cases, Nemesis (without LMT/KNEM) is the best for small messages, while the sm BTL is the best for large messages. The watershed is about 16KB.

For messages between 16KB and 4MB, KNEM clearly accelerates the sm BTL. For messages over 4MB, on the contrary, KNEM makes the sm BTL slower. Beyond 4MB, the message size is larger than the L3 cache. Because the OSU Micro-Benchmark always sends and receives the same piece of memory, cache misses only occur when the message is larger than the cache. At the memory level, the sm BTL may be better because of implementation differences; for example, the sm BTL possibly uses assembly code where KNEM does not. If DMA were enabled, KNEM might perform better.

LMT gives better performance than the original Nemesis only when the message is large enough. For different tests, the threshold varies from 32KB to 256KB. In general, the more concurrent memory accesses there are, the smaller the threshold will be, since more accesses cost more memory and cache space for unsent messages. The steep slopes at 32KB in these figures arise because LMT is not enabled for messages smaller than 32KB.

For small messages, the combination of KNEM and LMT is the slowest. Unlike with the sm BTL, KNEM shows its advantages for messages larger than the cache, which demonstrates that the efficiency of copying is: sm BTL > KNEM > LMT.

Since there are only four cores on this computer, there is not much difference between the one-to-one and all-to-all tests. However, they will differ significantly with more cores.

IV. CONCLUSION AND POSSIBLE IMPROVEMENTS

Use MPICH2 for programs that frequently pass small messages; use OpenMPI when messages are large.

If the messages are large enough, use KNEM to accelerate message passing.

These conclusions are based on our platform. If the cache is not shared, if there are more cores, or if DMA can be enabled, the performance can be different. It is highly recommended to run a similar test before deciding which MPI implementation to use.

In conclusion, there are some possible improvements, as follows:

On a multicore architecture, each node actually executes multiple processes. Thus, it is unnecessary for every process to have its own free queue. All processes on one node can share one free queue, which is good for load balance. However, the prerequisite is that the location of the processes is known, as in process fusion. Therefore, this needs either some modification of the source code or dynamic queue allocation at runtime.

The rendezvous protocol in LMT is driven by the sender. The sender cannot send a message before the receiver replies that it is ready, and the latency of this double transfer can be very large. To avoid this, the receiver can tell the sender immediately when it needs a message. The sender then sends a message if it finds a matching receive for it. This improvement can save one message transfer, but it obviously increases the overhead on the sender, so its effectiveness depends highly on the hardware platform and network conditions.

REFERENCES

[1] Intel I/OAT webpage.
[2] KNEM website.
[3] Linux manual page of MPI_Init_thread.
[4] FAQ of sm BTL.
[5] D. Buntinas, B. Goglin, D. Goodell, G. Mercier and S. Moreaud. Cache-Efficient, Intranode Large-Message MPI Communication with MPICH2-Nemesis. Proceedings of ICPP.
[6] D. Buntinas, G. Mercier and W. Gropp. Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem. CCGRID.
[7] T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres and J. Dongarra. Kernel assisted collective intra-node communication among multicore and manycore CPUs. INRIA.


More information

Chapter 8: Virtual Memory. Operating System Concepts

Chapter 8: Virtual Memory. Operating System Concepts Chapter 8: Virtual Memory Silberschatz, Galvin and Gagne 2009 Chapter 8: Virtual Memory Background Demand Paging Copy-on-Write Page Replacement Allocation of Frames Thrashing Memory-Mapped Files Allocating

More information

System Design for a Million TPS

System Design for a Million TPS System Design for a Million TPS Hüsnü Sensoy Global Maksimum Data & Information Technologies Global Maksimum Data & Information Technologies Focused just on large scale data and information problems. Complex

More information

Grand Central Dispatch

Grand Central Dispatch A better way to do multicore. (GCD) is a revolutionary approach to multicore computing. Woven throughout the fabric of Mac OS X version 10.6 Snow Leopard, GCD combines an easy-to-use programming model

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

Designing Shared Address Space MPI libraries in the Many-core Era

Designing Shared Address Space MPI libraries in the Many-core Era Designing Shared Address Space MPI libraries in the Many-core Era Jahanzeb Hashmi hashmi.29@osu.edu (NBCL) The Ohio State University Outline Introduction and Motivation Background Shared-memory Communication

More information
