Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing

1 Morpheus: Creating Application Objects Efficiently for Heterogeneous Computing
Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, Steven Swanson
Department of Computer Science and Engineering, University of California, San Diego

2 Applications interact with files

3 How we process files today
(Figure: CPU, GPU, DRAM, and SSD; raw file data from the SSD becomes application objects in CPU DRAM before it reaches the GPU.)

4 The conventional model
CPU/APU: retrieve the file, parse the data and create objects, then run the compute kernel
GPU: run the compute kernel
Creating objects generates traffic on the CPU-memory bus and results in system overhead.
(Figure: the data path runs from the SSD through DRAM and the CPU before reaching the GPU.)
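A minimal sketch of this conventional flow, assuming a simple "%d %d" edge-list format and an Edge type like the one used later in the talk (the function name and parameters are illustrative, not code from the slides): the CPU parses the file into objects in host DRAM and only then copies the finished objects to the GPU.

/* Conventional model: parse on the CPU, then copy the objects to the GPU. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct { int first, second; } Edge;

Edge *load_edges_conventional(const char *path, size_t max_edges, size_t *out_count)
{
    FILE *fin = fopen(path, "r");
    if (fin == NULL)
        return NULL;

    /* Retrieve the file and create application objects in host DRAM. */
    Edge *host_edges = (Edge *)malloc(max_edges * sizeof(Edge));
    size_t n = 0;
    while (n < max_edges &&
           fscanf(fin, "%d %d", &host_edges[n].first, &host_edges[n].second) == 2)
        n++;
    fclose(fin);

    /* The finished objects cross the CPU-memory bus again on their way to
       GPU device memory, where the compute kernel will consume them. */
    Edge *dev_edges = NULL;
    cudaMalloc((void **)&dev_edges, n * sizeof(Edge));
    cudaMemcpy(dev_edges, host_edges, n * sizeof(Edge), cudaMemcpyHostToDevice);

    free(host_edges);
    *out_count = n;
    return dev_edges;
}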

5 Overhead of creating objects
(Figure: percentage of execution time spent on object creation, GPU computation, other CPU computation, and moving data to the GPU for PageRank, CC, bfs, gaussian, hybridsort, kmeans, lud, nn, srad, and JASPA; object creation accounts for 64% of execution time on average.)
Creating objects is now the bottleneck in these applications.

6 High-speed storage doesn't help
(Figure: throughput of parsing input data in MB/sec with an HDD, an SSD, and a RAM drive for each GPU-accelerated application.)
There is very little difference among the different storage technologies.

7 Preventing P2P communication between peripherals
P2P is useless in the current model because we need the CPU to create the application objects.
(Figure: the desired data path runs directly from the SSD to the GPU, but the real data path in the current model detours through the CPU and DRAM.)

8 We need to rethink the processing model: Morpheus!

9 Outline
The Morpheus model
The system architecture
Experimental results
Conclusion

10 Morpheus: creating application objects in SSDs
(Figure: the processor inside the SSD creates the application objects, which then flow to the GPU and to the CPU/DRAM.)

11 The Morpheus model
CPU/APU: retrieve objects, then run the compute kernel
SSD: run the StorageApp that creates the objects
GPU: run the compute kernel
(Figure: the object-creation stage moves from the CPU into the SSD.)

12 Benefits of Morpheus
Bypass system overheads
Allow applications to take advantage of P2P data communication
Reduce traffic over system interconnects
Lower energy consumption
(Figure: recap of the Morpheus data path, with application objects created by the SSD's processor instead of the CPU.)

13 Outline
The Morpheus model
The system architecture
Experimental results
Conclusion

14 Implementing the Morpheus model
(Figure: system stack. Application layer: the host application, the Morpheus runtime, and the GPU runtime. Operating system: the Morpheus-NVMe driver and NVMe-P2P. Hardware: the PCIe interconnect, the GPU, and the Morpheus-SSD.)

15 Morpheus-NVMe
NVMe: an interface that defines how the host computer interacts with non-volatile memory devices
Morpheus-NVMe extensions:
MInit: installs a StorageApp and prepares it for execution
MRead: reads data and applies the StorageApp to the data being read
MWrite: applies the StorageApp to the data being written, then writes it
MDeinit: completes and releases the StorageApp
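The slide names the four commands but not their encoding. The sketch below is a hypothetical host-side view of how a driver might submit them; the opcode values, struct layout, and stub submit routine are made up for illustration and are not the real Morpheus-NVMe interface.

#include <stdint.h>
#include <stdio.h>

enum morpheus_opcode {          /* names follow the slide; values are made up */
    M_INIT   = 0x80,            /* install and prepare a StorageApp           */
    M_READ   = 0x81,            /* read data and apply the StorageApp to it   */
    M_WRITE  = 0x82,            /* apply the StorageApp, then write the data  */
    M_DEINIT = 0x83             /* complete and release the StorageApp        */
};

struct morpheus_cmd {           /* illustrative command descriptor */
    uint8_t  opcode;
    uint32_t storageapp_id;     /* which StorageApp the command targets        */
    uint64_t lba;               /* starting block address on the SSD           */
    uint32_t nblocks;           /* transfer length in blocks                   */
    uint64_t dma_addr;          /* host buffer or, with NVMe-P2P, a GPU address */
};

/* Stub: a real system would place the command on an NVMe submission queue
   through the Morpheus-NVMe driver. */
static int morpheus_submit(int fd, const struct morpheus_cmd *cmd)
{
    printf("fd %d: opcode 0x%x, app %u\n", fd, cmd->opcode, cmd->storageapp_id);
    return 0;
}

/* Typical lifetime of one StorageApp-backed read. */
static int morpheus_read_with_app(int fd, uint32_t app_id, uint64_t lba,
                                  uint32_t nblocks, uint64_t dma_addr)
{
    struct morpheus_cmd init   = { M_INIT,   app_id, 0,   0,       0        };
    struct morpheus_cmd read   = { M_READ,   app_id, lba, nblocks, dma_addr };
    struct morpheus_cmd deinit = { M_DEINIT, app_id, 0,   0,       0        };

    if (morpheus_submit(fd, &init))  return -1;  /* MInit: load the StorageApp */
    if (morpheus_submit(fd, &read))  return -1;  /* MRead: parse while reading */
    return morpheus_submit(fd, &deinit);         /* MDeinit: release resources */
}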

16 Morpheus-SSD
Manages Morpheus-NVMe commands and executes StorageApps on the SSD's embedded cores
(Figure: SSD internals. A PCIe/NVMe interface connects the drive to the host over PCI Express; an in-storage interconnect links the embedded cores, accelerators, DMA engine, the DDR3/DDR4 DRAM controller with the SSD DRAM, and the flash interface with the flash memory.)

17 NVMe-P2P
Maps GPU device memory to a PCIe BAR using AMD DirectGMA or NVIDIA GPUDirect
Generates Morpheus-NVMe commands that use GPU memory addresses as the DMA targets
Morpheus directly pulls/pushes data from/to GPU addresses without going through main memory
(Figure: application objects move directly between the SSD and GPU memory.)
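Continuing the hypothetical Morpheus-NVMe sketch above, and assuming a placeholder map_gpu_memory_to_bar() helper standing in for the vendor-specific GPUDirect/DirectGMA mapping step, an NVMe-P2P read would differ only in its DMA target:

/* Assumed helper: pins GPU device memory, exposes it through a PCIe BAR
   (via NVIDIA GPUDirect or AMD DirectGMA), and returns a DMA-able bus
   address. The name and signature are placeholders. */
uint64_t map_gpu_memory_to_bar(void *gpu_dev_ptr, size_t bytes);

static int morpheus_read_to_gpu(int fd, uint32_t app_id, uint64_t lba,
                                uint32_t nblocks, void *gpu_dev_ptr, size_t bytes)
{
    /* 1. Expose the GPU buffer on the PCIe bus. */
    uint64_t gpu_bar_addr = map_gpu_memory_to_bar(gpu_dev_ptr, bytes);

    /* 2. Issue MRead with the GPU address as the DMA target: the SSD runs the
          StorageApp and pushes the finished objects straight into GPU memory,
          never touching host DRAM. */
    struct morpheus_cmd cmd = { M_READ, app_id, lba, nblocks, gpu_bar_addr };
    return morpheus_submit(fd, &cmd);
}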

18 Creating a StorageApp
Use C to compose a StorageApp
Use the Morpheus-SSD library to access SSD resources
The compiler generates machine code that the embedded processors can execute

StorageApp:

/* Parses "%d %d" edge pairs from an SSD-side input stream and copies them
   into the destination buffer in 4096-edge batches. */
int inputapplet(ms_stream ssd_input_stream, void *edge_array)
{
    Edge ssd_edge_array[4096];   /* staging buffer inside the SSD */
    int i = 0;
    while (ms_scanf(ssd_input_stream, "%d %d",
                    &ssd_edge_array[i % 4096].first,
                    &ssd_edge_array[i % 4096].second) == 2) {
        i++;
        if (i % 4096 == 0) {     /* flush a full batch to the destination */
            ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * 4096);
            edge_array = (char *)edge_array + sizeof(Edge) * 4096;
        }
    }
    /* flush the final, partially filled batch */
    ms_memcpy(edge_array, ssd_edge_array, sizeof(Edge) * (i % 4096));
    return i;                    /* number of edges parsed */
}

19 Invoking a StorageApp in host applications
Like calling a function
Prepare arguments using the Morpheus runtime library
The runtime library interacts with the driver to utilize the SSD facilities

void test_distributed_page_rank(char *graphfilename, int num_ofvertex,
                                int num_ofedges, int iterations)
{
    FILE *fin;
    ms_stream ssd_input_stream;
    void **arg_list;

    /* Wrap the open file in an SSD-side input stream. */
    fin = fopen(graphfilename, "r");
    ssd_input_stream = ms_stream_create(fin);

    /* Allocate the destination buffer and let the StorageApp fill it. */
    Edge *edge_array = (Edge *)malloc(sizeof(Edge) * num_ofedges);
    inputapplet(ssd_input_stream, edge_array);

    ms_stream_destroy(ssd_input_stream);
    /* The rest of the code... */
}

20 Outline
The Morpheus model
The system architecture
Experimental results
Conclusion

21 Experimental setup
Intel Xeon E v2 processor
NVIDIA K20 GPU
Morpheus-SSD: a 512 GB SSD with a PMCS (now Microsemi) NVMe controller
(Photo: the K20 GPU and the Morpheus-SSD in the test machine.)

22 Morpheus improves performance
(Figure: speedup for each GPU-accelerated application. Morpheus-SSD alone achieves a 1.32x average speedup; Morpheus-SSD with NVMe-P2P achieves 1.39x.)

23 Morpheus saves power/energy
(Figure: power and energy, normalized to the baseline, for each GPU-accelerated application under the 1.32x and 1.39x configurations, with highlighted reductions of 7% and 42%.)

24 Morpheus makes wimpy servers more competitive
(Figure: speedup over 2.5 GHz CPUs for a 1.2 GHz CPU alone, Morpheus-SSD on the 1.2 GHz CPU, and Morpheus-SSD on the 1.2 GHz CPU with NVMe-P2P; the Morpheus configurations average 1.08x and 1.12x.)
Morpheus-SSD plus wimpy CPUs can compete with high-end servers.

25 Conclusion
Object creation (deserialization/serialization) becomes a new bottleneck for high-performance heterogeneous computers
The Morpheus model leverages under-utilized computing resources in the storage device to bypass system overheads and to enable efficient data communication mechanisms
Morpheus-SSD improves application performance by 1.39x and allows wimpy servers to compete with high-end servers

26 Thank you!
Hung-Wei Tseng will be an assistant professor at North Carolina State University starting this August.
