GPPD Grupo de Processamento Paralelo e Distribuído Parallel and Distributed Processing Group


1 GPPD Grupo de Processamento Paralelo e Distribuído (Parallel and Distributed Processing Group)
Philippe O. A. Navaux
HPC and New Architectures
CPTEC, Cachoeira Paulista, SP, March 8, 2017

2 Team
Professor: Philippe O. A. Navaux
Post-Doc: Francieli Z. Boito (ended 11/2015), Marco A. Zanata Alves (ended 3/2016), Matthias Diener, Eduardo Cruz
Master Students: Jean Bez, Jimmy Sanchez, Matheus Serpa
PhD Students: Daniel Oliveira, Eduardo Roloff, Edson Padoin, Emmanuell Carreño, Francis Birck Moreira, Rafael Tesser, Rodrigo Kassick, Victor Abaunza

3 Research Areas
#1: Memory Hierarchy Optimization
- Data and Thread Mapping
- Scheduling
- Placement
- Memory Request Scheduling
People: A. Carissimi, E. Cruz, F. Moreira, M. Diener, P. Navaux
#2: Multi-core Architecture and Power Consumption Optimization
- Automatic frequency control in heterogeneous systems
- Energy efficiency of cache memories
People: E. Padoin, M. Alves, P. Navaux
#3: I/O Optimization
- Application-Guided I/O Scheduling
- Dynamic I/O Reconfiguration
People: Francieli Zanon Boito, Jean Bez, Rodrigo Kassick, P. Navaux

4 Research Areas (2)
#4: Software errors on HPC architectures
- Fault Tolerance Techniques Efficiency
- Radiation Sensitivity
People: Daniel Oliveira, Luigi Carro, Paolo Rech, Philippe Navaux
#5: Applications in HPC Systems
- Dynamic Load Balancing
- Distributed Systems Applications
- GPU implementation
- Models and Metrics
People: Rafael Tesser, Victor Abaunza, Philippe Navaux
#6: Cloud Computing (including IoT and Big Data)
- HPC in Cloud
- Data Intensive Analysis
People: A. Carissimi, E. Roloff, Emmanuell Carreño, J. Sanchez, P. Navaux


5 #1: Affinity-based thread and data mapping
Objective: Improve performance and energy efficiency of memory accesses by:
1. executing threads that access the same data close to each other in the hierarchy (thread mapping)
2. placing memory pages on memory controllers that perform most accesses to them (data mapping)
Previous Results: speedup of up to 4x using online mechanisms
Papers:
- Matthias Diener et al. kMAF: Automatic Kernel-Level Management of Thread and Data Affinity. PACT.
- Eduardo Cruz et al. Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols. JPDC 2014.
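The two mapping steps on this slide can be illustrated with a minimal sketch. It is not the mechanism from the cited papers: the function names, the sharing matrix `comm`, and the per-node access counts are hypothetical inputs chosen for illustration, and the greedy pairing is just one simple heuristic.

```python
from itertools import combinations

def greedy_thread_pairs(comm):
    """Thread mapping sketch: pair threads that share the most data.

    comm[i][j] is a hypothetical, symmetric count of accesses by threads
    i and j to the same data. Each returned pair would be placed on cores
    that sit close together in the memory hierarchy (e.g. sharing a cache).
    """
    n = len(comm)
    # Consider the pairs with the most shared accesses first.
    candidates = sorted(combinations(range(n), 2),
                        key=lambda p: comm[p[0]][p[1]], reverse=True)
    placed = set()
    pairs = []
    for a, b in candidates:
        if a not in placed and b not in placed:
            pairs.append((a, b))
            placed.update((a, b))
    return pairs

def map_pages(page_accesses):
    """Data mapping sketch: put each page on the node that accesses it most.

    page_accesses[page] is a hypothetical list of access counts per node.
    """
    return {page: max(range(len(counts)), key=counts.__getitem__)
            for page, counts in page_accesses.items()}

comm = [[0, 10, 1, 0],
        [10, 0, 0, 2],
        [1, 0, 0, 8],
        [0, 2, 8, 0]]
pairs = greedy_thread_pairs(comm)
placement = map_pages({"p0": [5, 1], "p1": [0, 7]})
```

Here threads 0 and 1 end up paired, as do threads 2 and 3, and each page goes to the node with the highest access count. An online mechanism would have to gather these counts while the application runs, which the sketch leaves out.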


6 #2: Power for Exascale: DVFS and load balancing
[Figure labels: EnergyLB, Ondes, Lulesh]
Applying DVFS and load balancing during the execution reduces the total energy consumption by up to:
- 11% over ScotchLB and 8.7% over GreedyLB (Ondes3D)
- 10% over ScotchLB and 6.3% over GreedyLB (Lulesh)
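One way to see why DVFS complements load balancing is the sketch below. It is an illustrative assumption, not the algorithm evaluated on the slide: it supposes that run time equals load divided by frequency (ignoring memory-bound phases), and the function name, load values, and frequency levels are hypothetical.

```python
def pick_frequencies(loads, freqs):
    """For each core, pick the lowest frequency that still lets it finish
    no later than the most loaded core running at the highest frequency.

    loads: residual work per core after load balancing (arbitrary units).
    freqs: available frequency levels (e.g. in GHz).
    Assumption: run time = load / frequency.
    """
    deadline = max(loads) / max(freqs)
    chosen = []
    for load in loads:
        # The highest frequency is always feasible, so this list is never empty.
        feasible = [f for f in sorted(freqs) if load / f <= deadline]
        chosen.append(feasible[0])
    return chosen

chosen = pick_frequencies([100, 50, 80], [1.0, 1.5, 2.0])
```

With these inputs the lightly loaded core can drop to the lowest level while the other two stay at the highest, so the overall finish time is unchanged under the stated assumption. How much energy that saves depends on the power drawn at each level, which the sketch does not model.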

7 #2: Power for Exascale: ARM Processor (+GPU) in HPC
Objective: Verify the energy efficiency of a seismic model on a low-power heterogeneous architecture.
o Jetson TK1 (heterogeneous cores)
o Seismic model: Ondes 3D (BRGM - France)
Previous Results: ARM Cortex-A15 + NVIDIA GK20a GPU (Jetson TK1) compared to:
o 2 Intel Xeon E5645 + 8 Tesla M2075 (2.99x more efficient)
o 1 Intel i7-930 + 1 Tesla K20c (5.82x more efficient)
Víctor Martínez, et al. Task-based programming on low-power Nvidia Jetson TK1 manycore architecture: Application to earthquake modeling. Latin America High Performance Computing Conference (CARLA 2015), 2015.


8 #3: Storage for Exascale: Parallel I/O for HPC
Objective: To provide scalable high-performance I/O for HPC architectures
Research topics:
- I/O scheduling for parallel file systems
- I/O scheduling in the I/O forwarding layer
- Pattern matching for access pattern detection (collaboration with Barcelona Supercomputing Center)
- Low-power storage servers
Papers:
- Boito, F. et al. Automatic I/O Scheduling Algorithm Selection for Parallel File Systems. Concurrency and Computation: Practice and Experience. Wiley.
- Boito, F. et al. AGIOS: Application-Guided I/O Scheduling for Parallel File Systems. International Conference on Parallel and Distributed Systems (ICPADS), 2013.


9 #3: Coordinate Access to Parallel File System Servers
Objective: Coordinate the access of I/O nodes to the data servers to reduce contention
TWINS Scheduler: We designed a scheduler that uses time windows to coordinate the I/O nodes' accesses to different data servers
Results: Improved read performance of shared files by up to 28% over alternatives and by up to 50% over not forwarding I/O requests
Bez et al., TWINS: Server Access Coordination in the I/O Forwarding Layer, in Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2017.
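The idea of coordinating accesses through time windows can be sketched as follows. This is a simplified illustration under stated assumptions, not the TWINS implementation: the rotation rule `(node_id + window) % num_servers`, the function names, and the request format are all hypothetical.

```python
def server_for_window(node_id, window, num_servers):
    """Hypothetical rotation rule: in each time window an I/O node targets
    one data server, offset by its id so that different nodes tend to
    target different servers during the same window."""
    return (node_id + window) % num_servers

def schedule(requests, node_id, num_servers, num_windows):
    """Serve the queued requests of one I/O node window by window.

    requests: list of (name, server) pairs queued at this I/O node.
    Returns (timeline, pending), where timeline holds one
    (window, target_server, served_names) tuple per window.
    """
    pending = list(requests)
    timeline = []
    for window in range(num_windows):
        target = server_for_window(node_id, window, num_servers)
        served = [name for name, server in pending if server == target]
        # Requests for other servers wait for a later window.
        pending = [(name, server) for name, server in pending
                   if server != target]
        timeline.append((window, target, served))
    return timeline, pending

timeline, pending = schedule([("a", 0), ("b", 1), ("c", 0)],
                             node_id=0, num_servers=2, num_windows=2)
```

In this example node 0 serves requests `a` and `c` in the first window and `b` in the second, while a node with id 1 would start on the other server. A real scheduler also has to choose the window length, which the sketch does not address.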


10 #3: Towards Energy-Efficient Storage Servers in HPC
Objective: Evaluate the viability of using low-power architectures as file system servers
- Data servers: processing power is less important
- ARM processors as an alternative
Experiments with:
- Representative access patterns
- Hou10ni application
Results: Replacing one regular data server with two ARM boards would double the bandwidth and decrease energy consumption by 85% without compromising performance, especially for read-intensive workloads
Machado et al., Towards Energy-Efficient Storage Servers, in the 32nd ACM Symposium on Applied Computing (SAC), 2017.

11 #4: Transient errors on HPC architectures
Objective: Evaluation and mitigation of radiation-induced errors in HPC.
- We perform neutron radiation experiments at Los Alamos and Didcot to measure the error rate of Xeon Phi, K40, APU (CPU+GPU), TK1, and other devices.
- We design experimentally tuned mitigation strategies.
[Figure: UFRGS setup at LANSCE, Los Alamos. Nov]
Previous Results:
- Predicted and validated the radiation-induced error rate of Titan (HPCA 2015).
- Designed Algorithm-Based Fault Tolerance for MxM and FFT, and Duplication With Comparison for other codes (Trans. Comp. 2015).
- Compared, for the first time, the error rate of Xeon Phi and K40 (submitted to SELSE 2016).
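Algorithm-Based Fault Tolerance for matrix multiplication (MxM) is commonly built on row and column checksums. The sketch below shows that classic checksum scheme in plain Python. It is not the group's implementation: the function names and the `inject` parameter (used to simulate a corrupted element) are hypothetical, and exact integer comparison is assumed, whereas floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def abft_matmul(a, b, inject=None):
    """Multiply a by b with checksums and report inconsistent rows/columns.

    inject: optional (row, col, delta) that corrupts one result element
    to simulate a transient error. Returns (c, bad_rows, bad_cols).
    """
    n, m = len(a), len(b[0])
    # Append a row of column sums to A and a column of row sums to B.
    a_check = a + [[sum(col) for col in zip(*a)]]
    b_check = [row + [sum(row)] for row in b]
    full = matmul(a_check, b_check)
    if inject is not None:
        i, j, delta = inject
        full[i][j] += delta
    # Each data row must sum to its checksum entry; same for each column.
    bad_rows = [i for i in range(n) if sum(full[i][:m]) != full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(full[i][j] for i in range(n)) != full[n][j]]
    c = [row[:m] for row in full[:n]]
    return c, bad_rows, bad_cols

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
clean = abft_matmul(a, b)
faulty = abft_matmul(a, b, inject=(0, 1, 5))
```

A single corrupted element shows up as exactly one inconsistent row and one inconsistent column, which locates it, and the size of the mismatch gives the value needed to correct it.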


12 #5: Software for Exascale: Improving scheduling on heterogeneous architectures
Objective: Improve scheduling on heterogeneous architectures:
1. Splitting a seismic model called Ondes 3D (BRGM - France) into tasks.
2. Executing parallel tasks on heterogeneous architectures (CPU+accelerators) using all available cores for computing.
Previous Results: Maximum speedup
o Only accelerator cores, if the simulation fits in memory (in-core): up to 7x.
o All cores (CPU+accelerator), if the simulation doesn't fit in memory (out-of-core): up to 25x.
Víctor Martínez, et al. Towards seismic wave modeling on heterogeneous many-core architectures using task-based runtime system. SBAC-PAD 2015.


13 #6: Cloud Computing
Objective: Study of Cloud as an environment for HPC
1. Study the cost-efficiency of public clouds for HPC.
2. Port HPC applications to the cloud.
3. Improve the performance of cloud resources using task mapping.
Previous Results:
- Comprehensive model for cost-efficiency
o Tested using Azure, EC2 and Rackspace
o CLOUD 2012, CloudCom 2012 and book chapter
- BRAMS (weather prediction) ported to Azure
o Several improvements made using cloud features
o ICCS 2015
- Task mapping improves performance by up to 40%
o CCGRID 2016

14 H2020 Project Participation WP2 Disruptive Exascale Computer Architecture

15 General Organization

16 MAIN COLLABORATIONS
Columns: Tasks, Partners, Tasks, Transversal WPs, Deliverables
Memory Pages Mapping UFRGS, LNCC, INRIA 2.2
Full Waveform Inversion BSC, COPPE, 2.1, 2.2, 2.3, WP6 UFRGS, LNCC, INRIA , 2.2, 2.3, 2.4, 2.6
Acoustic Propagation on GPUs ITA, UFRGS, Petrobras, BSC WP6 2.1, 2.3
Elastic Propagation on Intel's Architectures BSC, REPSOL, LNCC, 2.1, 2.2, 2.4 COPPE WP6 2.1, 2.2, 2.4
BOAST Kernels for ALYA INRIA, BSC, UFRGS 2.1, 2.2, 2.4 WP4, WP5 2.1, 2.2, 2.3, 2.4
GPU Kernels for ALYA BSC, COPPE, ITA 2.1, 2.4 WP4, WP5 2.1, 2.3, 2.4
Radiation-induced Error Criticality UFRGS, BSC 2.1, 2.4 WP4, WP5, WP6 2.1
Porting libmesh to MontBlanc COPPE, BSC 2.1, 2.3 WP3, WP 2.1, 2.3 WP4, WP5, WP6 2.2

17 Thanks! GPPD Grupo de Processamento Paralelo e Distribuído Parallel and Distributed Processing Group HPC4E Project: Research has received funding from the EU H2020 Programme and from MCTI/RNP-Brazil under grant agreement n
