MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
1 MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
Hye-Churn Jang and Hyun-Wook (Jin) Jin
Department of Computer Science and Engineering, Konkuk University, Seoul, Korea
{comfact,
2 CONTENTS
- Introduction
- Background & Motivation
- MiAMI
- Performance Measurement
- Conclusions & Future Work
3 INTRODUCTION
- Multi-Core Processors
  - Leveraging throughput and scalability
  - Boosting concurrent processes: parallel programs, network server programs
- [Chart: Top 500 Processor Generation, June 2009]
4 INTRODUCTION
- High-Speed Interconnects: Gigabit Ethernet, 10 Gigabit Ethernet, InfiniBand, Myrinet
- [Chart: Top 500 Interconnect Family, June 2009]
- Multiple Network Interfaces
  - Cost-effective high network bandwidth
  - High availability (e.g., Tsukuba Univ. MegaProto/E)
5 MOTIVATION
- Are multi-core processors being utilized efficiently for multiple network interfaces?
  - Awareness of multi-core processors
  - Support for single to multiple network interfaces
  - Transparency for existing middleware and applications
6 ITEM #1: Multi-Core Processors
- Cache Layout
  - [Diagram: 2-Way Quad-Core SMP (Intel Clovertown) — two packages, eight cores, private L1 caches, one L2 shared by each core pair]
- Processor Affinity
  - Process affinity: mapping between processor and process
  - Interrupt affinity: mapping between processor and device (interrupt)
7 ITEM #1: Multi-Core Processors
- [Diagram: tasks scheduled over a 2-Way Quad-Core SMP (Intel Clovertown), with a Gigabit Ethernet NIC raising its IRQ on one core]
- Process affinity relative to the interrupt's core: Good (core affinity), Not bad (cache affinity), Bad (other affinity)
8 ITEM #2: Multiple Network Interfaces
- Multi-port network interfaces: Silicom, Chelsio, Intel, etc.
- Receive-Side Scaling (RSS): enables packet receive-processing to scale with the number of available processors (Intel, Chelsio, etc.)
- Self-virtualized I/O devices: virtual interfaces
9 ITEM #2: Multiple Network Interfaces
- Tricky Issues
  - Relative optimal process affinity: the optimal core depends on which interface a process uses
  - Processes using multiple interfaces?
  - Generalization for N network interfaces (N ≥ 1)
- [Diagram: processes P0..Pk mapped over cores, caches, and NIC0/NIC1]
10 ITEM #3: Transparency
- autopin and VTune: static tuning
- sched_setaffinity() system call: application-level modifications
- SyMMer: middleware-level modifications (e.g., MPI)
- Kernel-level process scheduling
  - Considerations for multi-core processors
  - Considerations for N network interfaces
11 MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
- Basic Ideas
  - Multi-core awareness: cache layout and interrupt affinity aware process scheduling
  - N network interfaces: arbitration between processes
  - Transparency: adaptive kernel-level scheduling
12 OVERALL DESIGN
- [Diagram: communication system calls trigger the Communication Load Monitor; the Scheduling Agent, driven by a timer, reads communication intensiveness, cache layout, interrupt affinity, and processor load; the Configuration Tool sets/gets parameters through the /proc file system]
13 SYSTEM INFORMATION
- Interrupt affinity: to know the optimal core
- Cache layout: to decide sub-optimal cores
- Processor loads: to deal efficiently with the tradeoff between optimal process affinity and processor overloading
14 COMMUNICATION LOAD MONITOR
- Traces the communication intensiveness from recent communication history
- Intensiveness = 10^6 / Average
  - Average: the average value of recent communication intervals; range from 1 to 10^6
  - Intensiveness: a higher value means higher communication intensiveness; range from 1 to 10^6
15 COMMUNICATION INTENSIVENESS
- [Plots: bandwidth (Mbps) and communication intensiveness vs. interval between send system calls (us)]
16 COMMUNICATION INTENSIVENESS
- Communication intensiveness tables
  - Three classes: high, middle, low
  - Roughly sorted lists: can insert into a list in O(1)
  - Can still choose a (sub)optimal victim to migrate onto an idle/optimal core
  - Dynamic class boundaries, adjusted by the scheduling agent
17 SCHEDULING AGENT
- Input parameters: communication intensiveness, cache layout, interrupt affinity, processor load
- Three levels of affinity: core-level, cache-level, other-level
- Basic policy: move communication-intensive processes into the core-level affinity state as much as possible, if the processor load allows
18 SCHEDULING POLICY
- Step 1: Release idle networking processes
  - Scheduled by the legacy Linux process scheduler
- Step 2: Move communication-intensive processes
  - Other-level or cache-level -> core-level affinity
  - Other-level -> cache-level affinity
- Step 3: Mitigate the processor's load
  - Adjusts the class boundaries
  - Moves less communication-intensive processes onto idle cores
19 PROCESS AFFINITY STATES
- [State diagram: four states — core affinity, cache affinity, other affinity, no affinity — with transition conditions:]
  - Communication intensive & optimal core is idle
  - Communication intensive & sub-optimal core is idle
  - Processor overload & no cache-level cores
  - Processor underload & cache-level cores
20 DATA STRUCTURES
- [Diagram: per-core miami_cpu structures hold system information (cache layer, processor load) and process-affinity lists — core-level, cache-level, and other-level, each split into high/middle/low intensiveness classes; each MiAMI list is made of miami_node entries (PCB pointer, timestamp, history, intensiveness) linking the TCP/IP control block to the process control block and the process run queue]
21 EXPERIMENTAL SYSTEM
- [Diagram: 2-way quad-core processors (Intel Xeon Clovertown and AMD Opteron Barcelona) — two packages, eight cores, private L1 caches, shared L2 caches]
- Two 4-port 1GigE NICs (Silicom PXG4)
- [Table: interrupt-affinity cases 2S, 2N, 4S, and 4N, each mapping 2 or 4 NIC ports]
22 MICROBENCHMARK
- Large volume data transmission (bandwidth test)
- 8 to 32 TCP connections
23 MICROBENCHMARK: SMP
- [Charts: CPU utilization (%) and aggregate bandwidth (Mbps) vs. # of connections / interrupt affinity (2S, 2N, 4S, 4N, IrqB), comparing CPU, MiAMI-CPU, Net, and MiAMI-Net]
- MiAMI can save more processor resources (up to 65%) compared with the others
24 SCENARIO-BASED BENCHMARK 1
- Large volume data transmission: 8 TCP connections
- Ping-pong communication: 8 to 24 TCP connections
25 SCENARIO-BASED BENCHMARK
- [Charts: improvement (%) in bandwidth, latency, and processor utilization for (8,8) and (8,24) processes under the 2S, 2N, 4S, 4N, and IrqB interrupt affinities, on the SMP and NUMA systems]
- MiAMI can prevent frequent migration and deliver better performance
26 SCENARIO-BASED BENCHMARK 2
- Process type A: communication interval 50us, start point 0s, 2 processes
- Process type B: communication interval 20us, start point 10s, 2 processes
- Process type C: communication interval 0us, start point 20s, 1 process
27 SCENARIO-BASED BENCHMARK
- [Charts: improvement (%) in aggregate bandwidth, bandwidth of process C, and processor utilization on the SMP and NUMA systems]
- MiAMI can figure out which connection is the most communication intensive and decide the optimal distribution of networking processes over multiple cores
28 CONCLUSIONS
- MiAMI
  - Multi-core aware networking process scheduling
  - Generalization for multiple network interfaces
  - Transparent scheduling
- Experimental Results
  - Processor utilization savings: up to 65% (SMP), up to 63% (NUMA)
  - Network performance: more than 30% improvement with less processor resources
29 FUTURE WORK
- Performance Measurement
  - Real applications: web servers, parallel file systems
  - Various scenarios for multiple interfaces: RSS, self-virtualized network devices
- Extension of MiAMI
  - Better scheduling policies
  - Other I/O peripheral devices
30 Thanks! Hyun-Wook (JIN) Jin System Software Laboratory Konkuk University
31 NETWORK DATA PATH
- [Diagram: send path — user application -> socket send system call -> network protocol -> kernel buffer -> device driver -> device; receive path — device -> hard interrupt -> device driver (interrupt handler) -> soft interrupt -> bottom half (softirq) and network protocol -> kernel buffer -> socket receive system call -> user application]
32 ITEM #1: Multi-Core Processors
- [Diagram: 2-Way Quad-Core NUMA (AMD Barcelona) — two packages, eight cores, private L1 and L2 caches, one L3 shared per package]
33 CONFIGURATION TOOL
- The user-level MiAMI interfaces
  - To set up the policies
  - Behavior parameters (e.g., the interval of the scheduling agent)
34 COMMUNICATION LOAD MONITOR
- [Diagram: communication system calls feed per-connection monitoring through the TCP control blocks (TCB A..Z) and PCBs; the scheduling agent sees each connection classified from low to high intensiveness]
35 EXPERIMENTAL SYSTEM
- [Diagram: 2-way AMD Opteron 2356 (Barcelona quad-core) NUMA system — private L1/L2 caches per core, one L3 shared per package, local RAM per package]
- 4GB RAM
- Dual onboard NVIDIA MCP55 NICs
- One HotLava 6CGNIC-X 6-port 1GigE NIC
- [Table: interrupt-affinity cases 2S, 2N, 4S, and 4N, each mapping 2 or 4 NIC ports]
36 MICROBENCHMARK: SMP
- [Charts: CPU utilization (%) and aggregate bandwidth (Mbps) vs. # of connections / interrupt affinity (2S, 2N, 4S, 4N, IrqB), for sending and receiving on the SMP system]
37 MICROBENCHMARK: NUMA
- [Charts: CPU utilization (%) and aggregate bandwidth (Mbps) vs. # of connections / interrupt affinity (2S, 2N, 4S, 4N, IrqB), for sending and receiving on the NUMA system]
- MiAMI can save more processor resources compared with the others
38 What we are missing
- Optimal priority: core-level, cache-level, and other-level affinity, each with high, middle, and low classes
- Optimal equation for evaluating the communication intensiveness
  - Data volume
  - Tradeoff between migration overhead and optimal affinity
- Optimal timer resolution for the scheduling agent
- Optimal interrupt affinity
  - Static is not sufficient; IrqB is not sufficient
- Optimal thresholds
  - Processor overloading
  - Boundaries for communication intensiveness
  - Number of migrations
More informationRedbooks Paper. Network Performance Evaluation on an IBM e325 Opteron Cluster. Introduction. Ruzhu Chen Lerone Latouche
Redbooks Paper Ruzhu Chen Lerone Latouche Network Performance Evaluation on an IBM e325 Opteron Cluster Network performance evaluation is an important measurement that is used to adjust and tune up cluster
More informationMemcached Design on High Performance RDMA Capable Interconnects
Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan
More informationChapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms
Overview Chapter 13 TRANSPORT Motivation Simple analysis Various TCP mechanisms Distributed Computing Group Mobile Computing Winter 2005 / 2006 Distributed Computing Group MOBILE COMPUTING R. Wattenhofer
More informationDesigning Power-Aware Collective Communication Algorithms for InfiniBand Clusters
Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters Krishna Kandalla, Emilio P. Mancini, Sayantan Sur, and Dhabaleswar. K. Panda Department of Computer Science & Engineering,
More informationOptimizing TCP Receive Performance
Optimizing TCP Receive Performance Aravind Menon and Willy Zwaenepoel School of Computer and Communication Sciences EPFL Abstract The performance of receive side TCP processing has traditionally been dominated
More informationPerformance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology
Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Shinobu Miwa Sho Aita Hiroshi Nakamura The University of Tokyo {miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp
More informationIntroduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses
Introduction Electrical Considerations Data Transfer Synchronization Bus Arbitration VME Bus Local Buses PCI Bus PCI Bus Variants Serial Buses 1 Most of the integrated I/O subsystems are connected to the
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationDesigning Next Generation Data-Centers with Advanced Communication Protocols and Systems Services
Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji, K. Vaidyanathan, S. Narravula, H. W. Jin and D. K. Panda Network Based Computing Laboratory
More informationOceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.
OceanStor 9000 Issue V1.01 Date 2014-03-29 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in
More informationIO-Lite: A Unified I/O Buffering and Caching System
IO-Lite: A Unified I/O Buffering and Caching System Vivek S. Pai, Peter Druschel and Willy Zwaenepoel Rice University (Presented by Chuanpeng Li) 2005-4-25 CS458 Presentation 1 IO-Lite Motivation Network
More informationPerformance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware
Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical
More informationNUMA replicated pagecache for Linux
NUMA replicated pagecache for Linux Nick Piggin SuSE Labs January 27, 2008 0-0 Talk outline I will cover the following areas: Give some NUMA background information Introduce some of Linux s NUMA optimisations
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationSingle-Points of Performance
Single-Points of Performance Mellanox Technologies Inc. 29 Stender Way, Santa Clara, CA 9554 Tel: 48-97-34 Fax: 48-97-343 http://www.mellanox.com High-performance computations are rapidly becoming a critical
More informationChapter 2 Computer-System Structure
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationHIGH PERFORMANCE AND SCALABLE MPI INTRA-NODE COMMUNICATION MIDDLEWARE FOR MULTI-CORE CLUSTERS
HIGH PERFORMANCE AND SCALABLE MPI INTRA-NODE COMMUNICATION MIDDLEWARE FOR MULTI-CORE CLUSTERS DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the
More informationCOSC 6374 Parallel Computation. Parallel Computer Architectures
OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy
More informationSEDA: An Architecture for Well-Conditioned, Scalable Internet Services
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles
More informationXCo: Explicit Coordination for Preventing Congestion in Data Center Ethernet
XCo: Explicit Coordination for Preventing Congestion in Data Center Ethernet Vijay Shankar Rajanna, Smit Shah, Anand Jahagirdar and Kartik Gopalan Computer Science, State University of New York at Binghamton
More informationPCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate
NIC-PCIE-1SFP+-PLU PCI Express x8 Single Port SFP+ 10 Gigabit Server Adapter (Intel 82599ES Based) Single-Port 10 Gigabit SFP+ Ethernet Server Adapters Provide Ultimate Flexibility and Scalability in Virtual
More informationPAC094 Performance Tips for New Features in Workstation 5. Anne Holler Irfan Ahmad Aravind Pavuluri
PAC094 Performance Tips for New Features in Workstation 5 Anne Holler Irfan Ahmad Aravind Pavuluri Overview of Talk Virtual machine teams 64-bit guests SMP guests e1000 NIC support Fast snapshots Virtual
More information