Agent-Based Adaptive Computing for Ground Stations. Rodney Price, Ph.D., and Stephen Dominic. TRW Data Technologies Division, February 1998.

Target application
- Mission ground stations now in development.
- Very large signal processing task: uses ~10 large parallel processing machines with up to 64 processors each.
- Very data intensive: input data rates of up to tens of megabytes per second.
- 7/24 operation: time available for repairs, maintenance, and software upgrades is minimal.

Long-range plans
- Reduce up-front costs -- do more with less: the budget for today's big iron may not be available.
- Reduce maintenance costs: run the ground station in darkened mode -- no on-site operators.
- Increase the ability to handle surges in input data rates.

Why are we doing this?

64-processor SGI Origin 2000:
- 195 MHz R10000 64-bit CPUs, 4 MB cache.
- HIPPI networking, 800 Mb/s throughput.
- Price: $1.1 million (after 30% discount); estimate from SGI sales office.

100-processor PC farm:
- 300 MHz Pentium II CPUs, 2 CPUs per machine.
- Myrinet networking, 250 Mb/s throughput under IP, >1000 Mb/s raw.
- Price: $360,000 (unit pricing); estimates from Dell and Myricom.

Because the PC farm has many identical, interchangeable parts, we can build very adaptive, very fault-tolerant behavior into the software.

Some philosophy
Nature is full of adaptive, robust systems. What makes an ant colony work so well?
- Interactions between ants are always local, never global.
- Global behavior (bringing food home) results from the ants following simple rules in their purely local interactions with other ants.

What has this got to do with computing?
Our computations are carried out by a system of many software agents:
- Agents are mobile -- they can move around the network.
- Agents always interact locally with other agents.
- Agents are goal-oriented -- they have a task to perform.
- Agents are aware of their immediate surroundings.
- Agents follow simple rules to use system resources efficiently.

Good dynamic load-balancing occurs because:
- Each agent has a part of the overall signal processing problem.
- The agents compete for processor cycles and network bandwidth.
- The agents have simple heuristics for moving from processor to processor (one such rule is sketched below).

Fault-tolerance and adaptive behavior are side effects! They are emergent behaviors, just as flocking is an emergent behavior of birds.
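The slides do not give the heuristics themselves, so the following is only a minimal sketch of one plausible migration rule, written in Java since that is the project's test platform. Every name here (HostInfo, MobileAgent, moveTo, the MIN_GAIN constant) is an illustrative assumption, not the paper's framework API.

    import java.util.List;

    // Illustrative host descriptor: an address plus a scalar load figure.
    class HostInfo {
        final String address;
        final double load;   // e.g., run-queue length or CPU utilization
        HostInfo(String address, double load) {
            this.address = address;
            this.load = load;
        }
    }

    abstract class MobileAgent {
        // These would be supplied by the agent-host server (assumed).
        abstract HostInfo localHost();
        abstract List<HostInfo> neighbors();   // locally known hosts only
        abstract void moveTo(String address);  // serialize state and migrate

        static final double MIN_GAIN = 0.25;   // assumed tuning constant

        // Simple rule: migrate only when some neighbor is markedly less
        // loaded; the hysteresis keeps agents from thrashing back and
        // forth between similarly loaded hosts.
        void maybeMigrate() {
            HostInfo here = localHost();
            HostInfo best = here;
            for (HostInfo h : neighbors()) {
                if (h.load < best.load) best = h;
            }
            if (best != here && best.load < here.load * (1.0 - MIN_GAIN)) {
                moveTo(best.address);
            }
        }
    }

Note that the decision uses only locally visible information, in keeping with the purely local interactions described above.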

Proof-of-concept system
- Our current system consists of four dual Pentium Pro machines (eight processors) running Windows NT, connected by 10 Mb/s Ethernet.
- We have chosen a simple but important signal processing algorithm, clustering by region-growing, as a test problem for our agent system.
- Java is our test platform: on our clustering application, the Symantec Café 2.0 Java JIT compiler beats MS C++ 5.0 and is only slightly slower than Symantec C++ 7.5.
- We use our own lightweight agent framework.

Clustering by region-growing
The region-growing clustering algorithm proceeds in three steps:
1. Start by forming preliminary clusters of radius d1 or less.
2. Remove all singleton clusters, i.e., clusters with only one point.
3. Use the centroids of the preliminary clusters as new data points for the final clustering. The final clusters have radius d2 or less.

We scatter the incoming points among the machines in the network.

[Figure: incoming points scattered across machine 1, machine 2, and machine 3.]
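For concreteness, here is a minimal single-machine Java sketch of the two-pass algorithm above. Reading "radius d1 or less" as a greedy rule that joins each point to the first cluster whose centroid lies within d1 is our interpretation, and all class and method names are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    class RegionGrow {
        static class Cluster {
            final List<double[]> points = new ArrayList<>();
            double[] centroid() {
                double[] c = new double[points.get(0).length];
                for (double[] p : points)
                    for (int i = 0; i < c.length; i++) c[i] += p[i];
                for (int i = 0; i < c.length; i++) c[i] /= points.size();
                return c;
            }
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++)
                s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        // Greedy pass: each point joins the first cluster whose centroid
        // lies within radius r, or starts a new cluster of its own.
        static List<Cluster> grow(List<double[]> data, double r) {
            List<Cluster> clusters = new ArrayList<>();
            for (double[] p : data) {
                Cluster home = null;
                for (Cluster c : clusters) {
                    if (dist(p, c.centroid()) <= r) { home = c; break; }
                }
                if (home == null) { home = new Cluster(); clusters.add(home); }
                home.points.add(p);
            }
            return clusters;
        }

        static List<Cluster> cluster(List<double[]> data, double d1, double d2) {
            List<double[]> centroids = new ArrayList<>();
            for (Cluster c : grow(data, d1)) {
                if (c.points.size() > 1)       // step 2: drop singletons
                    centroids.add(c.centroid());
            }
            return grow(centroids, d2);        // step 3: final clusters
        }
    }

Recomputing centroids inside the inner loop is wasteful; a real implementation would cache running sums. In the distributed version, each machine presumably runs the first pass on its share of the scattered points, and the merge agents of the next slide combine the surviving centroids.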

Agent system architecture

[Figure: architecture diagram. Master agents (master A, master B), slave agents, merge agents (merge A, merge B, merge C), and load-balancing agents run on machines A through E; report, load-balancing, merge, and buddy timers drive the flow from "begin processing" to "report out".]

Performance
- Agent overhead is about 15-25%, tested by running optimized sequential clustering against agent-based clustering on a single machine.
- Parallel performance: we get nearly linear speedup on up to four machines.

Adaptive behavior
- Machines can be withdrawn cleanly from a computation without compromising the results.
- New machines added to the network are discovered and put to use without operator intervention (other than starting an agent-host server).
- An agent-host server that has been shut down repels all agents; clearing the shutdown makes the machine available for use again (see the sketch below).
- Agents cannot go anywhere an agent-host server is not running.
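The slides do not show the repel mechanism; a minimal sketch, assuming it amounts to an admission check in the agent-host server (class and method names are ours):

    class AgentHostServer {
        private volatile boolean shuttingDown = false;

        void beginShutdown() { shuttingDown = true; }   // repel all agents
        void clearShutdown() { shuttingDown = false; }  // host usable again

        // Called when an agent asks to migrate here; a refusal sends the
        // agent looking for another host, so a shut-down server drains.
        boolean admit(byte[] agentState) {
            if (shuttingDown) return false;
            // ... deserialize agentState and schedule the agent locally ...
            return true;
        }
    }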

Fault-tolerance
- Unexpected thread or agent death: if an agent sends another a message and the receiving agent is gone, an exception is generated and a new agent is created (sketched below). The new agent may or may not finish operating on its data in time; if not, some data is lost, so the result loses some accuracy.
- Unexpected processor crash: exceptions are generated on messaging, but data on the processor is lost, and agents ignore the processor thereafter. Data processing always proceeds correctly after the next timer interval; no operator intervention is required.
- Unexpected network failure: operations continue on the separated processors. Partial results are generated on each machine, and separate reports are produced.
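The recovery rule for a dead receiver might look like the following sketch; the Transport and AgentFactory interfaces are assumptions, and only the catch-and-recreate pattern comes from the slide.

    class AgentMessenger {
        interface Transport {
            void deliver(String agentId, Object msg) throws Exception;
        }
        interface AgentFactory {
            void recreate(String agentId);
        }

        private final Transport transport;
        private final AgentFactory factory;

        AgentMessenger(Transport transport, AgentFactory factory) {
            this.transport = transport;
            this.factory = factory;
        }

        void send(String agentId, Object msg) {
            try {
                transport.deliver(agentId, msg);
            } catch (Exception receiverGone) {
                // The receiving agent has died: recreate it. The new agent
                // may miss this timer interval's data, trading a little
                // accuracy for liveness without operator intervention.
                factory.recreate(agentId);
            }
        }
    }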

Configuration management
- Each machine in the network needs an identical copy of the agent-host server.
- All application code starts execution on a single machine; the agents themselves distribute the code as the algorithm runs.
- Application code therefore has to be introduced only on the machine where the computation originates.
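The slides do not say how the agents carry code with them. A common mobile-agent pattern in Java, offered here only as a guess at the mechanism, is to ship an agent's bytecode along with its state and define the class on arrival:

    // Hypothetical: define an agent class from bytecode that arrived over
    // the network, so code need only be installed on the origin machine.
    class AgentClassLoader extends ClassLoader {
        Class<?> defineAgentClass(String className, byte[] bytecode) {
            return defineClass(className, bytecode, 0, bytecode.length);
        }
    }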

Summary
- Our system consists of many mobile, lightweight agents.
- Fault-tolerance and adaptive behavior emerge from the competition of agents for system resources.
- Our approach addresses the requirements of next-generation mission ground stations with:
  - Inexpensive COTS hardware and system software.
  - Adaptive behavior in response to load fluctuations.
  - Robust fault-tolerance, suited to unattended 7/24 operation.