Performance Measurement and Evaluation Tool for Large-scale Systems


1 Performance Measurement and Evaluation Tool for Large-scale Systems
Hong Ong, ORNL
December 7th, 2005

2 Acknowledgements
This work is sponsored in part by:
- The High Performance Computing Modernization Office, U.S. Department of Defense.
- The Office of Advanced Scientific Computing Research, U.S. Department of Energy.
This work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.

3 Outline
- Performance Measurement and Evaluation
- Existing Tools
- Limitations and Challenges
- Workload Characterization
- OpenWLC
- Concluding Remarks

4 Performance Evaluation Life Cycle
[Figure: life-cycle diagram. Labels recovered from the slide:]
- Stages: Initial Sizing, Test, Production Roll out, On-going Operation
- Driving Forces: Competitions, Hardware, Software, Growth
- Performance Analysis: Analyze Performance, Optimize Resource Usage, Predict Requirements, Predict Application Responsiveness
- Performance Monitoring: Report Performance, Report Resource Usage
- Performance Modeling: Analyze Requirements, Predict Requirements
- Workload Analysis: Analyze Performance, Profile Application
- Performance Tuning: Optimize Application Responsiveness, Predict Impact of Changes

5 Measurement Tools - Usage
Measurement tools are typically used for:
- Performance analysis and enhancement of system operations
- Troubleshooting and recovery of system components
- Support for system services, e.g.,
  - Job scheduling
  - Resource management (e.g., load balancing)
- Collection of information on applications
- Fault detection or prevention (HA)
- Detection of security threats and holes

6 Measurement Tools - Features
Desirable features include, but are not limited to:
- Graphical interface (standard GUI or Web portal)
- Integration with batch job management systems
- System usage statistics retrieval
- Availability in cluster distributions
- Ability to scale to a large system
- Non-intrusiveness


7 Typical Performance Evaluation
- Determine a machine's applicability to a range of applications.
- Use synthetic codes and application kernels to capture performance properties of:
  - Hardware and architecture (e.g., NAS, SPEC, SPLASH).
  - Operating system components (e.g., lmbench).
  - Software systems and applications (e.g., NPB, HPCToolkit, Genesis, STSWM).
  - Various kinds of middleware.
- Focus: evaluation of emerging platforms to understand their benefits.

8 Existing Performance Tools
- System monitoring
  - Provides continuous collection and aggregation of system performance data.
  - Examples: PerfMiner, Ganglia, NWPerf, SuperMon, CluMon, Nagios, PCP, MonALISA
- Application monitoring
  - Measures actual application performance via a batch system.
  - Examples: PerfMiner, NWPerf
- [Slide callout: Main focus]
- Profiling
  - Provides a static instrumentation tool that focuses on source code over which users have direct control.
  - Examples: PerfSuite, HPCToolkit, SvPablo, TAU, Kojak, Vampir, IPM

9 Results Generated by Tools

10 Limitations && Challenges
Generally, there is no lack of performance tools for data gathering and analysis:
mpiP, gprof, psrun, PAPI, TAU, KOJAK (MPI/OpenMP), HPCToolkit, Paradyn, DynInst, Ganglia, Perfmon, ...
The problem is at two levels:
- Tool developers who want to integrate more than one tool
- Application developers who want to use more than one tool

11 Limitations && Challenges (cont.)
Challenges for tool developers:
- Managing software dependencies on other tools
- High overhead in implementing one-to-one interoperability
- Maintaining interoperability as tools evolve
Challenges for application developers:
- What do the different tools do?
- How hard is it to install, learn, and use these tools?
- Is a given tool appropriate for my application?
- What should be done with the amount of performance data collected?
- Are automatic tuning tools available, and what exactly are they?
- Can multiple tools be used together?
- What's the overhead of instrumentation?
- How does code evolution affect the performance analysis process?

12 Limitations && Challenges (cont.)
Additionally, there has been very limited effort in providing tools to characterize (system or application) workloads.
- Currently, users often have to sort out the collected data themselves.
- Maybe that's why there exist so many performance tools!
In short, the challenges and limitations come from:
- Various interfaces, levels of automation, and approaches to information presentation
- No integrated environment for performance monitoring, analysis, and optimization
- Most (previous) efforts concentrate on one-to-one tool interoperability

13 So, what can (or needs to) be done?
A possible solution is to encourage software reuse by enabling a systematic integration of existing tools, i.e.:
- A component-based performance evaluation framework
- A set of modules, both for tools and the application software
- Explore new algorithms (e.g., interactive/dynamic techniques, algorithm composition)
- Enable the framework to evolve
- Only develop new tools when none exists
  - Reasons such as awkward usability, support for only a few of the features we want, or, worse yet, building a new tool just for the sake of building one, are not good enough.

14 OSPAT
More recently, OSPAT working groups are investigating:
- A common installation procedure, e.g., a common directory hierarchy and convention for where to put files.
- A common instrumentation API for source-level, compiler-level, library-level, and binary instrumentation
- A project to track I/O handling in an application.
- A common set of names and definitions for performance metrics.
- A common API for thread creation and the fork interface
- A common probe interface for routine entry and exit events
- A common profile database schema
- An API to walk the call stack and examine the heap memory
- Visualization components for drawing the histograms and hierarchical displays typically used by performance tools

15 Workload Characterization (WLC)

16 Performance Evaluation Life Cycle
[Figure: life-cycle diagram. Labels recovered from the slide:]
- Stages: Initial Sizing, Test, Production Roll out, On-going Operation
- Driving Forces: Competitions, Hardware, Software, Growth
- Performance Analysis: Analyze Performance, Optimize Resource Usage, Predict Requirements, Predict Application Responsiveness
- Performance Modeling: Analyze Requirements, Predict Requirements
- Workload Analysis: Analyze Performance, Profile Application
- Performance Reporting: Report Performance, Report Resource Usage
- Performance Tuning: Optimize Application Responsiveness, Predict Impact of Changes

17 What is Workload Characterization?
- WLC plays a key role in all performance evaluation studies.
- It provides a synthetic description of a workload by means of quantitative parameters and functions.
- The objective is to formulate a model that shows, captures, and reproduces the static and dynamic behavior of real workloads.
- However, WLC is a difficult and neglected task:
  - It must deal with a large amount of collected measurements.
  - Extensive analysis has to be performed.


18 A Recent Study
After analyzing 19,883 jobs:
- Median sustained FLOP performance for jobs over 256 CPUs is 3%.
- Less than 20% of the memory is in use at any given point in time.
- Less than 5% of the jobs use heavy I/O (>25 MB/s per GF DGEMM).
- Over 60% of the cycles are for jobs that use less than 1/8 of the system.
- 10% of the >256-CPU jobs have the CPUs scheduled for idle >50% of the time.
[Figure: user expertise (low to high) versus scalability of application on platform (low to high). Region labels: Platform Evaluation, Computer Design, Most Utilization.]

19 Workload: CPU is underutilized
- 10% of the >256-CPU jobs have the CPU scheduled for idle >50% of the time.
- The median sustained performance for jobs over 256 CPUs is 3%.
[Figure panels: Sustained Performance as a function of CPU count; Busy Cycles as a function of job size.]

20 Most users do not utilize the full capability of the system.
[Figure: memory footprint per node during FY04 on the 11.4 TF HPCS2 system at PNNL. Axes: Percent of jobs, CPUs Used.]
- Most jobs use <25% of available memory (maximum available is 6-8 GB).
- Large jobs use more memory.
- Quoted on the slide: "I need 6GB of memory per CPU"

21 Workload: CPU and Job Size Relation
[Figure: percent of cycles used by job size, October through July. Series: <12% of the computer, <33% of the computer, <50% of the computer, >50% of the computer.]

22 What do we want to know?
Methods to answer the following:
- What percentage of jobs use all that memory/disk?
- What is the average job size?
- How does job size impact the efficiency of the calculation?
- Where is the bottleneck for the average run of a job?
- What is the sustained performance for the average job?
- Could we have less storage on the next system if we had a shared memory system or a shared disk pool?
- And more.
Basically, we want to characterize more accurately what parts of the system are important for efficient job performance from an average user's perspective.
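Several of the questions above reduce to simple aggregations over per-job accounting records. The following is a minimal sketch under stated assumptions: the `Job` record, its field names, the 1024-CPU system size, and the sample values are all hypothetical and are not taken from the slides or from OpenWLC.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    cpus: int            # CPUs allocated to the job (hypothetical field)
    hours: float         # wall-clock hours the job ran (hypothetical field)
    mem_fraction: float  # peak memory used / memory available, in [0, 1]


def average_job_size(jobs):
    """Mean number of CPUs per job."""
    return mean(j.cpus for j in jobs)


def cycle_share_below(jobs, system_cpus, fraction):
    """Share of all CPU-hours consumed by jobs using less than
    `fraction` of the system's CPUs."""
    total = sum(j.cpus * j.hours for j in jobs)
    small = sum(j.cpus * j.hours for j in jobs
                if j.cpus < fraction * system_cpus)
    return small / total if total else 0.0


def percent_jobs_using_memory(jobs, threshold):
    """Percentage of jobs whose peak memory fraction is at least `threshold`."""
    return 100.0 * sum(j.mem_fraction >= threshold for j in jobs) / len(jobs)


# Made-up records, for illustration only.
jobs = [
    Job(cpus=64, hours=10.0, mem_fraction=0.10),
    Job(cpus=64, hours=10.0, mem_fraction=0.20),
    Job(cpus=512, hours=5.0, mem_fraction=0.80),
    Job(cpus=128, hours=10.0, mem_fraction=0.15),
]
```

With records like these, `cycle_share_below(jobs, 1024, 1/8)` gives the kind of figure quoted in the study (share of cycles used by jobs on less than 1/8 of the system), here computed on invented data.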

23 OpenWLC Project Goals
http://www.openwlc.org/
- Evaluation of deployed platforms to understand how they are used by average users.
  - Efficiency of applications, system utilization, average job sizes.
  - How effectively is the box used as opposed to what the box was designed for?
- Logical grouping of resource usage.
  - Who (users)
  - What (processes)
  - Where (servers)
- Work with application developers, users and manufacturers to decrease application runtime and increase resource utilization.
[Figure labels: Systems, Users, Workloads, Transactions]

24 Design constraints
- Load > O(40) metrics (e.g., FLOPs, Int Ops, Mem BW, Network BW, Disk BW) for all jobs run on the system into a (central) database.
- Cannot impact job performance by more than 1% (on average).
- Fine data granularity, to see the need to use different algorithms.
- Keep all data to develop mathematical center profiles.
- Cannot be architecture-specific, for portability.
[Figure: efficiency (FLOPS, 0% to 80%) over time for ccsd and ccsd(t); average efficiency over 128 nodes (~1/3 TF).]
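The 1% constraint can be turned into a rough sampling budget. The sketch below assumes a deliberately simple model in which the only overhead is the CPU time spent taking each sample; it ignores cache perturbation and network traffic, and the function names and the 0.5 ms per-sample cost are illustrative assumptions, not measurements from the talk.

```python
def overhead_fraction(cost_per_sample_s, samples_per_s):
    """Fraction of one CPU consumed by sampling, under the simple model
    overhead = (CPU seconds per sample) * (samples per second)."""
    return cost_per_sample_s * samples_per_s


def max_sampling_rate(cost_per_sample_s, budget=0.01):
    """Highest sampling rate (samples per second) that keeps the modeled
    overhead within `budget` (default: the 1% constraint)."""
    if cost_per_sample_s <= 0:
        raise ValueError("cost_per_sample_s must be positive")
    return budget / cost_per_sample_s


# Assumed cost of 0.5 ms of CPU time per sample (illustrative value).
rate = max_sampling_rate(0.0005)
```

Under this model, a collector that costs 0.5 ms per sample could run at roughly 20 samples per second before exceeding the 1% budget; collecting 40 metrics in one pass counts as one sample only if the measured cost covers all of them.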

25 Methodology Mathematics Models

26 OpenWLC System Architecture
[Figure: OpenWLC workflow diagram. Numbered stages as labeled on the slide: 1. Requirements Analysis; 2. Measurements; 3. Investigate; 3. Model Construction; 4. Model Validation; 5. Evaluation; 6. Visualize.]
[Other labels: input; Real-Time Analysis; Collect (UNIX/Linux/Windows XP); Web Access; XML; Database (ODBC: SQL, M/S Access); Data Mining; Analytical/Statistical Tools; Graphical Analysis; Workload Model; Workload Characterization; Results; Analyze; Predict Response Time; Analysis; Predictive Analysis; Execute Model; Criterion: Representative (Yes / No, then Calibrate Model).]
[Inset chart: % Proc Node CPU Utilization by Priority (High, Medium, Low).]

27 Component-based Framework
NWPerf, for example.
[Figure: four service nodes, numbered 1 to 4 on the slide, around a data repository. Labels recovered from the slide:]
- Monitoring Service Node: Module, Module, Module; Client Core System
- Data Aggregation Service Node: Packet Gatherer(s), Queue, Aggregation Manager, DB Manager
- Modeling Service Node: Algo1, Algo2, Algo3; Model Manager
- Visualization Service Node: Filter, System View, Analysis View, Application View, Web View; Visualization Manager
- Data Repository: PerfDMF (arrows labeled push and extract)
Notes on the slide:
- One Packet Gatherer per Monitoring Service Node.
- The Aggregation Service Node can be layered hierarchically.
- Possible local repository at sub-collectors. There is a global repository.
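The plug-in idea behind the monitoring service node can be sketched as a small registry: each module is a callable that returns named metrics, and the core merges them into one sample. This is an illustration only; the class name, method names, and the two fake modules are hypothetical and do not reflect the NWPerf or OpenWLC code.

```python
from typing import Callable, Dict

# A module is any callable returning {metric_name: value}.
Module = Callable[[], Dict[str, float]]


class MonitoringNode:
    """Hypothetical client core that holds pluggable measurement modules."""

    def __init__(self) -> None:
        self._modules: Dict[str, Module] = {}

    def register(self, name: str, collect: Module) -> None:
        """Add a module under a unique name."""
        if name in self._modules:
            raise ValueError(f"module {name!r} already registered")
        self._modules[name] = collect

    def sample(self) -> Dict[str, float]:
        """Call every module once and merge results as 'module.metric'."""
        merged: Dict[str, float] = {}
        for name, collect in self._modules.items():
            for metric, value in collect().items():
                merged[f"{name}.{metric}"] = value
        return merged


# Fake modules returning fixed values, for illustration only.
node = MonitoringNode()
node.register("cpu", lambda: {"busy": 0.25})
node.register("mem", lambda: {"used_mb": 512.0, "free_mb": 1536.0})
snapshot = node.sample()
```

Prefixing each metric with its module name keeps modules independent: a new module can be added without coordinating metric names with existing ones, which is one way to "do more without breaking or rewriting code".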

28 OpenWLC Challenges
Scalability:
- Single node:
  - Software scalability and features. Doing more without breaking or rewriting code.
  - Number of metrics.
- System-wide (due to the increasing number of nodes):
  - Communication management.
  - Network load.
  - Data volume at the collection node.
  - Data management.
  - Visualization.
  - Storage.
A sensible workload model to characterize what parts of the system are important for efficient job performance from an average user's perspective.

29 Solutions to Scalability
Solving single-node scalability:
- Variable sampling frequency, e.g., collect a set of 10 parameters.
- Modules/drivers may run as a kernel thread.
- Encode collected data for transmission to the collection node to reduce transmission overhead and network traffic.
Solving system-wide scalability:
- Hierarchically layered architecture.
- Scheduled communication with sub-collectors.
- Identify a medium-term data store for highly structured time series data. Existing solutions, MySQL and PostgreSQL, are less than ideal.
- Better data aggregation methods.
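One way to read "encode collected data for transmission" is a fixed-width binary record instead of text. The sketch below uses Python's standard `struct` module; the wire layout (node id, UNIX timestamp, ten 32-bit floats in network byte order) is an assumption made for illustration and is not the format used by OpenWLC, NWPerf, or Supermon.

```python
import struct

N_METRICS = 10  # the slide's example of a set of 10 parameters

# Hypothetical layout: network byte order, uint32 node id,
# uint32 UNIX timestamp, then N_METRICS float32 values.
RECORD = struct.Struct(f"!II{N_METRICS}f")


def encode_sample(node_id, timestamp, values):
    """Pack one sample into a fixed-size binary record."""
    if len(values) != N_METRICS:
        raise ValueError(f"expected {N_METRICS} values, got {len(values)}")
    return RECORD.pack(node_id, timestamp, *values)


def decode_sample(payload):
    """Unpack a record produced by encode_sample."""
    node_id, timestamp, *values = RECORD.unpack(payload)
    return node_id, timestamp, values


# Values chosen to be exactly representable as float32.
values = [0.5 * i for i in range(N_METRICS)]
payload = encode_sample(7, 1133913600, values)
```

Each record here is 48 bytes regardless of the values, which makes the network load per node predictable; the trade-off is that float32 loses precision for large counters, so a real design might mix integer and floating-point fields.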

30 Data Aggregation and Management
- We want a high-throughput, low-overhead, highly synchronized, massively scalable data collection subsystem for collecting arbitrary data.
- NWPerf does a pretty good job of most of this, except for the arbitrary data part and ease of use.
- Supermon has some interesting capabilities in its data encapsulation, but its collection infrastructure has some weaknesses.
- OpenWLC uses a hybrid approach.

31 How Does Data Collection Work?
[Figure: data collection flow. Labels recovered from the slide:]
- Proactive Monitoring: Thresholds, Status/Alerts, Detect Problems, Recovery Actions, Availability, Problem Determination, Problem Resolution
- Proactive Planning: Performance Check, Workload Analysis, Bottleneck Analysis, Performance Reporting, Predictive Analysis, Capacity Planning
- Data path: Client OS, KM, Collector, Client Kernel Data, History File, Daily Performance Files

32 What types of Statistics?
- Kernel Level: system totals for activity counts and resource usage as measured by the kernel
  - CPU, Memory, I/O, Network Adapter Activity
  - Disk, Logical Volume information
- Process Level: taken from the process table
  - Process ID, parent Process ID
  - CPU execution time -> CPU Utilization
  - Page Faults, Memory Usage
- Application Level: taken from instrumentation code
Note: Application and Process Level data can be used as a data source for feedback.
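The step from CPU execution time to CPU utilization is a difference quotient between two process-table snapshots. A minimal sketch, assuming each snapshot is a hypothetical pair of (wall-clock seconds, mapping from process ID to cumulative CPU seconds); how the snapshots are obtained is platform-specific and left out.

```python
def cpu_utilization(prev, curr):
    """Per-process CPU utilization between two snapshots.

    Each snapshot is (wall_time_s, {pid: cumulative_cpu_s}).
    Utilization = (CPU seconds consumed) / (wall seconds elapsed).
    Processes missing from either snapshot are skipped, since they
    have no baseline or have already exited.
    """
    t0, cpu0 = prev
    t1, cpu1 = curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return {
        pid: (cpu1[pid] - cpu0[pid]) / elapsed
        for pid in cpu1
        if pid in cpu0
    }


# Made-up snapshots taken 10 seconds apart.
before = (100.0, {1: 2.0, 2: 5.0})
after = (110.0, {1: 7.0, 2: 5.0, 3: 1.0})
util = cpu_utilization(before, after)
```

In this example process 1 used 5 CPU seconds over 10 wall seconds (50% of one CPU), process 2 was idle, and process 3 is skipped because it started after the first snapshot.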

33 Concluding Remarks
- No shortage of performance evaluation, analysis, and optimization technology (and new capabilities are continuously added)
- Little shared infrastructure
- Components can be used to manage complexity and enable dynamic adaptation or multi-method tools
- A life-cycle environment may be the best long-term solution

34 Concluding Remarks
[Figure: two panels plotting user expertise (low to high) against scalability of application on platform (low to high). Region labels: Computer Design, Platform Evaluation, Most Utilization; the second panel also shows Users, Capacity, and Center Utilization.]
Work with the application community to raise users' awareness of how to use the systems efficiently, and to ensure that users' applications scale on the given NLCF platforms.

35 Questions?
