Near-Data Processing for Differentiable Machine Learning Models
Hyeokjun Choe 1, Seil Lee 1, Hyunha Nam 1, Seongsik Park 1, Seijoon Kim 1, Eui-Young Chung 2 and Sungroh Yoon 1,3
1 Electrical and Computer Engineering, Seoul National University
2 Electrical and Electronic Engineering, Yonsei University
3 Neurology and Neurological Sciences, Stanford University
Correspondence: sryoon@snu.ac.kr
Homepage:
Hyeokjun Choe et al., MSST 2017, May 19th, 2017

Outline
1. Introduction
2. Background
3. Proposed Methodology
4. Experimental Results
5. Discussion and Conclusion

Machine Learning's Success
- Big data
- Powerful parallel processors
- Sophisticated models

Issues with the Conventional Memory Hierarchy
- Data movement in the memory hierarchy affects:
  - Computational efficiency
  - Power consumption

Near-Data Processing (NDP)
- Memory or storage with intelligence (i.e., computing power)
- Processes the data where it is stored, in memory or storage
- Reduces data movement and offloads work from the CPU

ISP-ML
- ISP-ML: a full-fledged SSD platform that supports in-storage processing (ISP)
- Machine learning algorithms are easy to implement in C/C++
- For validation, three SGD algorithms were implemented and evaluated on ISP-ML
[Figure: host (user application, OS, CPU, main memory) connected to the SSD, which contains the host I/F, the SSD controller (embedded ARM processor, SRAM, ISP SW), DRAM, a cache controller with ISP HW, channel controllers with ISP HW, and NAND flash]

Machine Learning as an Optimization Problem
- Machine learning categories: supervised learning, unsupervised learning, reinforcement learning
- The main purpose of supervised machine learning: find the optimal θ that minimizes F(D; θ)

$$F(D, \theta) = L(D, \theta) + r(\theta) \tag{1}$$

  D: input data
  θ: model parameters
  L: loss function
  r: regularization term
  F: objective function
[Figure: neural network with an input layer and an output layer]

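As a minimal sketch of equation (1), the Python below evaluates F(D, θ) = L(D, θ) + r(θ) for a small linear classifier. The logistic loss, the L2 form of r(θ), the weight `lam`, and all function names are illustrative assumptions; the slides do not specify them.

```python
import math


def logistic_loss(data, theta):
    """L(D, theta): mean negative log-likelihood for labels in {0, 1}."""
    total = 0.0
    for x, y in data:
        z = sum(w * xi for w, xi in zip(theta, x))
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(data)


def l2_regularizer(theta, lam=0.01):
    """r(theta): assumed L2 penalty, 0.5 * lam * ||theta||^2."""
    return 0.5 * lam * sum(w * w for w in theta)


def objective(data, theta, lam=0.01):
    """F(D, theta) = L(D, theta) + r(theta), as in equation (1)."""
    return logistic_loss(data, theta) + l2_regularizer(theta, lam)


if __name__ == "__main__":
    # Two hypothetical samples: (features, label).
    D = [((1.0, 0.0), 1), ((0.0, 1.0), 0)]
    print(objective(D, [0.0, 0.0]))
```

With θ = 0 the regularizer vanishes and every prediction is 0.5, so the objective reduces to the loss alone.
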
Gradient Descent

$$\theta_{t+1} = \theta_t - \eta \nabla F(D, \theta_t) \tag{2}$$
$$\phantom{\theta_{t+1}} = \theta_t - \eta \sum_i \nabla F(D_i, \theta_t) \tag{3}$$

  η: learning rate
  t: iteration index
  i: data sample index
- Gradient descent
  - 1st-order iterative optimization algorithm
  - Uses all samples per iteration
- Stochastic gradient descent (SGD)
  - Uses only one sample per iteration
- Minibatch stochastic gradient descent
  - Between gradient descent and SGD
  - Uses multiple samples per iteration

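The three variants differ only in how many samples feed each update. The sketch below illustrates this on a one-parameter least-squares problem. The model, the data, the hyperparameters, and the function names are assumptions made for illustration, and the gradient is averaged over the batch rather than summed as in equation (3), which only rescales η.

```python
import random


def grad_sample(theta, x, y):
    """Gradient of the per-sample squared error 0.5 * (theta * x - y)**2."""
    return (theta * x - y) * x


def train(data, batch_size, eta=0.05, iterations=200, seed=0):
    """Run gradient descent with the given number of samples per iteration.

    batch_size == len(data): gradient descent (all samples)
    batch_size == 1:         stochastic gradient descent
    otherwise:               minibatch SGD
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iterations):
        if batch_size >= len(data):
            batch = data
        else:
            batch = rng.sample(data, batch_size)
        g = sum(grad_sample(theta, x, y) for x, y in batch) / len(batch)
        theta -= eta * g
    return theta


if __name__ == "__main__":
    # Hypothetical noise-free data generated from y = 2 * x.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    for name, b in [("GD", 4), ("SGD", 1), ("minibatch", 2)]:
        print(name, train(data, b))
```

All three settings approach θ = 2 on this toy problem; they differ in the cost and noise of each step.
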
Parallel and Distributed SGD
- Synchronous SGD
  - The parameter server aggregates θ_slave synchronously
- Downpour SGD
  - Workers communicate with the parameter server asynchronously
- Elastic Averaging SGD (EASGD)
  - Each worker has its own parameters
  - Workers transfer the difference (θ_slave − θ_master), not θ_slave itself

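To make the EASGD bullet concrete, here is a sequential simulation in which each worker keeps its own parameter and, every `tau` local steps, exchanges only the scaled difference between its parameter and the master's. The moving rate `alpha`, the quadratic objective, the round-robin scheduling, and all names are assumptions for illustration. This is a sketch of the commonly described elastic update, not the ISP-ML implementation.

```python
def easgd(num_workers=4, tau=4, eta=0.1, alpha=0.2, steps=300, target=3.0):
    """Simulate elastic averaging on f(theta) = 0.5 * (theta - target)**2.

    Workers are visited round-robin, so this only mimics parallel execution.
    """
    def grad(theta):
        return theta - target

    master = 0.0
    # Start workers at different points so the elastic term has work to do.
    workers = [float(i) for i in range(num_workers)]

    for t in range(steps):
        for w in range(num_workers):
            if t % tau == 0:
                # Only the elastic difference crosses the worker/master link.
                diff = alpha * (workers[w] - master)
                workers[w] -= diff
                master += diff
            # Local gradient step on the worker's own parameter.
            workers[w] -= eta * grad(workers[w])
    return master, workers


if __name__ == "__main__":
    master, workers = easgd()
    print(master, workers)
```

Raising `tau` reduces how often the difference is exchanged, which is the communication period studied later in the deck.
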
Fundamentals of Solid-State Drives (SSDs)
- SSD controller
  - Embedded processor for the FTL
    - HDD emulation
    - Wear leveling, garbage collection, etc.
  - Cache controller
  - Channel controller
- DRAM
  - Cache and buffer
  - 512MB - 2GB
- NAND flash arrays
  - Simultaneously accessible
- Host interface logic
  - SATA, PCIe

Previous Work on Near-Data Processing: PIM
- Performs computation inside the main memory
- 3D-stacked memory (e.g., HMC) has recently been used for PIM
  - Processing units are implemented in the logic layer
- Applications: sorting, string matching, CNN, matrix multiplication, etc.

Previous Work on Near-Data Processing: ISP
- Performs computation inside the storage
- ISP with an embedded processor
  - Pros: easy to implement, flexible
  - Cons: no parallelism
- ISP with dedicated hardware logic
  - Pros: channel parallelism, hardware acceleration
  - Cons: hard to implement and change
- Applications: DB query (scan, join), linear regression, k-means, string matching, etc.

ISP-ML: ISP Platform for Machine Learning on SSDs
- ISP-supporting SSD simulator
  - Implemented in SystemC on the Synopsys Platform Architect
  - Software/hardware co-simulation
  - Easily executes various machine learning algorithms in C/C++
- Transaction-level simulator
  - For reasonable simulation speed
- ISP components
  - SRAM ISP SW, ISP HW
[Figure: ISP-ML block diagram. Host: user application, OS, CPU, main memory. SSD: host I/F, SSD controller with embedded ARM processor, SRAM and ISP SW, DRAM, cache controller with ISP HW, channel controllers with ISP HW, NAND flash. Simulator view: embedded processor, clk/rst, host I/F, cache controller, DRAM, channel controller, NAND flash]

ISP-ML: ISP Platform for Machine Learning on SSDs
- We implemented two types of ISP hardware components
  - Channel controller: performs primitive operations on the stored data
  - Cache controller: collects the results from each channel controller
- Master-slave architecture
  - The two components communicate with each other
[Figure: ISP-ML block diagram. Host: user application, OS, CPU, main memory. SSD: host I/F, SSD controller with embedded ARM processor, SRAM and ISP SW, DRAM, cache controller with ISP HW, channel controllers with ISP HW, NAND flash]

Parallel SGD Implementation on ISP-ML

Methodology for IHP-ISP Performance Comparison
Ideal ways to fairly compare ISP and in-host processing (IHP):
1. Implementing ISP-ML in a real semiconductor chip
   - High chip manufacturing costs
2. Simulating IHP in the ISP-ML framework
   - Long simulation time for IHP
3. Implementing both ISP and IHP using FPGAs
   - Requires significant additional development effort
- Each option makes a fair comparison of ISP and IHP performance hard to carry out
- We propose a practical comparison methodology

Methodology for IHP-ISP Performance Comparison
- In the host (real system)
  - Measure the observed IHP execution time (T_total)
  - Measure the data IO time (T_IO)
  - Extract the IO trace while executing the application
- In the SSD (simulator)
  - Measure the baseline simulation time with the IO trace (T_IOsim)
- Observed IHP execution time = T_total = T_nonIO + T_IO
- Expected IHP simulation time = T_nonIO + T_IOsim = T_total − T_IO + T_IOsim
  T_IO: data IO latency time of the storage
  T_nonIO: non-data IO time
  T_IOsim: data IO time of the baseline SSD in ISP-ML
[Figure: comparison setup with panels (a) and (b), showing the host, the real system with its storage, the ISP command path, and the simulator with ISP-ML (baseline) and ISP-ML (ISP implemented)]

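The expected IHP simulation time follows directly from the two relations above. The helper below is a small sketch of that arithmetic. The function name and the example timings are hypothetical and are not measurements from the talk.

```python
def expected_ihp_time(t_total, t_io, t_io_sim):
    """Return T_nonIO + T_IOsim, where T_nonIO = T_total - T_IO.

    t_total:  observed IHP execution time on the real host
    t_io:     data IO time measured on the real storage
    t_io_sim: data IO time of the baseline SSD replayed in the simulator
    """
    if t_io > t_total:
        raise ValueError("data IO time cannot exceed the total execution time")
    t_non_io = t_total - t_io
    return t_non_io + t_io_sim


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to show the substitution.
    print(expected_ihp_time(t_total=10.0, t_io=4.0, t_io_sim=2.5))
```

Only the IO portion of the measured time is swapped for its simulated counterpart; the non-IO portion is carried over unchanged.
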
Setup and Implementation
Host specifications
  CPU:          8-core Intel(R) Core i7-3770k (3.50GHz)
  Main memory:  DDR3 32GB RAM
  Storage:      Samsung SSD 840 Pro
  OS:           Ubuntu LTS
ISP-ML specifications
  Embedded processor:              ARM 926EJ-S (400MHz)
  FTL:                             DFTL
  Page size:                       8KB
  t_prog / t_read / t_blockerase:  300us / 75us / 5ms
  FPU:                             0.5 instruction/cycle (pipelined)
Dataset
  x10 amplified MNIST (handwritten digits)

Performance Comparison: ISP-Based Optimization
- EASGD showed the best performance in this experiment
  - 2.96x faster than synchronous SGD on average
  - 1.41x faster than Downpour SGD on average
- With 4 and 8 channels, synchronous SGD was slower than Downpour SGD
- With 16 channels, synchronous SGD was faster than Downpour SGD
[Figure: test accuracy vs. time (sec) for synchronous SGD, Downpour SGD and EASGD; panels (a) 4-channel, (b) 8-channel, (c) 16-channel]

Performance Comparison: IHP versus ISP
- Compared IHP under memory shortage with ISP
  - In large-scale machine learning, the computing systems used may suffer from memory shortages
  - Assumption: the host had already loaded all the data into main memory for IHP
- ISP-based EASGD with 16 channels obtained the best performance in our experiments
[Figure: test accuracy vs. time (sec) for ISP (EASGD with 4, 8 and 16 channels) and IHP (2GB, 4GB, 8GB, 16GB and 32GB memory)]

Channel Parallelism
- The speed-up tends to be proportional to the number of channels
  - This is because the communication overhead in ISP is negligible
  - In distributed computing systems, by contrast, a communication bottleneck commonly occurs
[Figure: test accuracy vs. time (sec) with 4, 8 and 16 channels; panels (a) synchronous SGD, (b) Downpour SGD, (c) EASGD; panel (d) speed-up vs. number of channels for synchronous SGD, Downpour SGD and EASGD]

Effects of Communication Period in Async. SGD
- Downpour SGD
  - Fast for a low communication period (τ = 1, 4)
  - Unstable for a high communication period (τ = 16, 64)
- EASGD
  - As the communication period increases, the convergence speed decreases
  - This contrasts with conventional distributed computing systems
  - The cause is the low communication overhead
[Figure: test accuracy vs. time (sec) for τ = 1, 4, 16 and 64; panels (a) Downpour SGD, (b) EASGD]

Experimental Results Summary
1. EASGD shows the best performance in our ISP-ML environment.
2. ISP is more efficient than IHP when the host suffers from insufficient main memory.
   - ISP may be useful in large-scale machine learning.
3. The speed-up from parallelization is proportional to the number of channels.
   - This is due to the ultra-fast on-chip communication.
4. The performance of EASGD decreases as the communication period increases, unlike in conventional distributed systems.

Parallelism in ISP
- ISP can provide various advantages for the data processing involved in machine learning
  - E.g., ultra-fast on-chip communication
  - Increased energy efficiency, security, and reliability
- A high degree of parallelism could be achieved
  - By increasing the number of channels inside an SSD
- Exploiting a hierarchy of parallelism
  - Distributed systems + ISP-based SSDs

Opportunities for Future Research
1. Implementing deep neural networks in the ISP-ML framework
2. Implementing adaptive optimization algorithms
   - E.g., Adagrad and Adadelta
3. Pre-computing metadata during data writes
4. Implementing data shuffling functionality
5. Investigating the effect of NAND flash design on performance

Conclusion
- Created a full-fledged ISP-supporting SSD simulator for machine learning
- Implemented and compared multiple versions of parallel SGD
- Proposed a fair comparison methodology between IHP and ISP
- Pointed to future research opportunities in exploiting channel parallelism

Acknowledgments

Q/A
Near-Data Processing for Differentiable Machine Learning Models
Near-Data Processing for Differentiable Machine Learning Models Hyeokjun Choe 1, Seil Lee 1, Hyunha Nam 1, Seongsik Park 1, Seijoon Kim 1, Eui-Young Chung 2, Sungroh Yoon 1,3 1 Electrical and Computer
More informationVSSIM: Virtual Machine based SSD Simulator
29 th IEEE Conference on Mass Storage Systems and Technologies (MSST) Long Beach, California, USA, May 6~10, 2013 VSSIM: Virtual Machine based SSD Simulator Jinsoo Yoo, Youjip Won, Joongwoo Hwang, Sooyong
More informationArchitecture Exploration of High-Performance PCs with a Solid-State Disk
Architecture Exploration of High-Performance PCs with a Solid-State Disk D. Kim, K. Bang, E.-Y. Chung School of EE, Yonsei University S. Yoon School of EE, Korea University April 21, 2010 1/53 Outline
More informationLETTER Solid-State Disk with Double Data Rate DRAM Interface for High-Performance PCs
IEICE TRANS. INF. & SYST., VOL.E92 D, NO.4 APRIL 2009 727 LETTER Solid-State Disk with Double Data Rate DRAM Interface for High-Performance PCs Dong KIM, Kwanhu BANG, Seung-Hwan HA, Chanik PARK, Sung Woo
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationOSSD: A Case for Object-based Solid State Drives
MSST 2013 2013/5/10 OSSD: A Case for Object-based Solid State Drives Young-Sik Lee Sang-Hoon Kim, Seungryoul Maeng, KAIST Jaesoo Lee, Chanik Park, Samsung Jin-Soo Kim, Sungkyunkwan Univ. SSD Desktop Laptop
More informationMoneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010
Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed
More informationUnblinding the OS to Optimize User-Perceived Flash SSD Latency
Unblinding the OS to Optimize User-Perceived Flash SSD Latency Woong Shin *, Jaehyun Park **, Heon Y. Yeom * * Seoul National University ** Arizona State University USENIX HotStorage 2016 Jun. 21, 2016
More informationSummarizer: Trading Communication with Computing Near Storage
Summarizer: Trading Communication with Computing Near Storage Gunjae Koo*, Kiran Kumar Matam*, Te I, H.V. Krishina Giri Nara*, Jing Li, Hung-Wei Tseng, Steven Swanson, Murali Annavaram* *University of
More informationEnabling Cost-effective Data Processing with Smart SSD
Enabling Cost-effective Data Processing with Smart SSD Yangwook Kang, UC Santa Cruz Yang-suk Kee, Samsung Semiconductor Ethan L. Miller, UC Santa Cruz Chanik Park, Samsung Electronics Efficient Use of
More informationHybrid Memory Platform
Hybrid Memory Platform Kenneth Wright, Sr. Driector Rambus / Emerging Solutions Division Join the Conversation #OpenPOWERSummit 1 Outline The problem / The opportunity Project goals Roadmap - Sub-projects/Tracks
More informationPresented by: Nafiseh Mahmoudi Spring 2017
Presented by: Nafiseh Mahmoudi Spring 2017 Authors: Publication: Type: ACM Transactions on Storage (TOS), 2016 Research Paper 2 High speed data processing demands high storage I/O performance. Flash memory
More informationOptimizing Translation Information Management in NAND Flash Memory Storage Systems
Optimizing Translation Information Management in NAND Flash Memory Storage Systems Qi Zhang 1, Xuandong Li 1, Linzhang Wang 1, Tian Zhang 1 Yi Wang 2 and Zili Shao 2 1 State Key Laboratory for Novel Software
More informationCSCI-GA Database Systems Lecture 8: Physical Schema: Storage
CSCI-GA.2433-001 Database Systems Lecture 8: Physical Schema: Storage Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com View 1 View 2 View 3 Conceptual Schema Physical Schema 1. Create a
More informationPerformance Benefits of Running RocksDB on Samsung NVMe SSDs
Performance Benefits of Running RocksDB on Samsung NVMe SSDs A Detailed Analysis 25 Samsung Semiconductor Inc. Executive Summary The industry has been experiencing an exponential data explosion over the
More informationDongjun Shin Samsung Electronics
2014.10.31. Dongjun Shin Samsung Electronics Contents 2 Background Understanding CPU behavior Experiments Improvement idea Revisiting Linux I/O stack Conclusion Background Definition 3 CPU bound A computer
More informationDecentralized and Distributed Machine Learning Model Training with Actors
Decentralized and Distributed Machine Learning Model Training with Actors Travis Addair Stanford University taddair@stanford.edu Abstract Training a machine learning model with terabytes to petabytes of
More informationParallel Stochastic Gradient Descent: The case for native GPU-side GPI
Parallel Stochastic Gradient Descent: The case for native GPU-side GPI J. Keuper Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Mark Silberstein Accelerated Computer
More informationToday s Presentation
Banish the I/O: Together, SSD and Main Memory Storage Accelerate Database Performance Today s Presentation Conventional Database Performance Optimization Goal: Minimize I/O Legacy Approach: Cache 21 st
More informationBeyond Block I/O: Rethinking
Beyond Block I/O: Rethinking Traditional Storage Primitives Xiangyong Ouyang *, David Nellans, Robert Wipfel, David idflynn, D. K. Panda * * The Ohio State University Fusion io Agenda Introduction and
More informationBaoping Wang School of software, Nanyang Normal University, Nanyang , Henan, China
doi:10.21311/001.39.7.41 Implementation of Cache Schedule Strategy in Solid-state Disk Baoping Wang School of software, Nanyang Normal University, Nanyang 473061, Henan, China Chao Yin* School of Information
More informationSolid State Storage Technologies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Solid State Storage Technologies Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu NVMe (1) The industry standard interface for high-performance NVM
More informationA Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks. Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo
A Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo 1 June 4, 2011 2 Outline Introduction System Architecture A Multi-Chipped
More informationMoneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories
Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories Adrian M. Caulfield Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson Non-Volatile Systems
More informationTRADITIONAL search engines utilize hard disk drives
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 1.119/TC.216.268818,
More informationNEXT-GENERATION SSD VERIFICATION PLATFORM WITH TERA-SCALE NAND CAPACITY AND STORAGE SIGNAL PROCESSING SUPPORT
NEXT-GENERATION SSD VERIFICATION PLATFORM WITH TERA-SCALE CAPACITY AND STORAGE SIGNAL PROCESSING SUPPORT NVRAMOS 2012-10-16 Seil Lee, Myunghyun Rhee, and Sungroh Yoon Advanced Computing Laboratory, Seoul
More informationSoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research
SoftFlash: Programmable Storage in Future Data Centers Jae Do Researcher, Microsoft Research 1 The world s most valuable resource Data is everywhere! May. 2017 Values from Data! Need infrastructures for
More informationLinux Storage System Bottleneck Exploration
Linux Storage System Bottleneck Exploration Bean Huo / Zoltan Szubbocsev Beanhuo@micron.com / zszubbocsev@micron.com 215 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications
More informationToward Scalable Deep Learning
한국정보과학회 인공지능소사이어티 머신러닝연구회 두번째딥러닝워크샵 2015.10.16 Toward Scalable Deep Learning 윤성로 Electrical and Computer Engineering Seoul National University http://data.snu.ac.kr Breakthrough: Big Data + Machine Learning
More informationHARD Disk Drives (HDDs) have been widely used as mass
878 IEEE TRANSACTIONS ON COMPUTERS, VOL. 59, NO. 7, JULY 2010 Architecture Exploration of High-Performance PCs with a Solid-State Disk Dong Kim, Kwanhu Bang, Student Member, IEEE, Seung-Hwan Ha, Sungroh
More informationSort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot
Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack
More informationChapter 14 HARD: Host-Level Address Remapping Driver for Solid-State Disk
Chapter 14 HARD: Host-Level Address Remapping Driver for Solid-State Disk Young-Joon Jang and Dongkun Shin Abstract Recent SSDs use parallel architectures with multi-channel and multiway, and manages multiple
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationToward SLO Complying SSDs Through OPS Isolation
Toward SLO Complying SSDs Through OPS Isolation October 23, 2015 Hongik University UNIST (Ulsan National Institute of Science & Technology) Sam H. Noh 1 Outline Part 1: FAST 2015 Part 2: Beyond FAST 2
More informationMQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices
MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata Ghose, Onur Mutlu February 13, 2018 Executive Summary
More informationCOMP6237 Data Mining Data Mining & Machine Learning with Big Data. Jonathon Hare
COMP6237 Data Mining Data Mining & Machine Learning with Big Data Jonathon Hare jsh2@ecs.soton.ac.uk Contents Going to look at two case-studies looking at how we can make machine-learning algorithms work
More informationBlueDBM: An Appliance for Big Data Analytics*
BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting
More informationCSEE W4824 Computer Architecture Fall 2012
CSEE W4824 Computer Architecture Fall 2012 Lecture 8 Memory Hierarchy Design: Memory Technologies and the Basics of Caches Luca Carloni Department of Computer Science Columbia University in the City of
More informationHigh-Speed NAND Flash
High-Speed NAND Flash Design Considerations to Maximize Performance Presented by: Robert Pierce Sr. Director, NAND Flash Denali Software, Inc. History of NAND Bandwidth Trend MB/s 20 60 80 100 200 The
More informationPerformance Modeling and Analysis of Flash based Storage Devices
Performance Modeling and Analysis of Flash based Storage Devices H. Howie Huang, Shan Li George Washington University Alex Szalay, Andreas Terzis Johns Hopkins University MSST 11 May 26, 2011 NAND Flash
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does
More informationFundamentals of Computer Systems
Fundamentals of Computer Systems Caches Martha A. Kim Columbia University Fall 215 Illustrations Copyright 27 Elsevier 1 / 23 Computer Systems Performance depends on which is slowest: the processor or
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationSSDs vs HDDs for DBMS by Glen Berseth York University, Toronto
SSDs vs HDDs for DBMS by Glen Berseth York University, Toronto So slow So cheap So heavy So fast So expensive So efficient NAND based flash memory Retains memory without power It works by trapping a small
More informationCatapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud
Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015 The Cloud is a Growing Disruptor for HPC Moore s
More informationJoin Processing for Flash SSDs: Remembering Past Lessons
Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of
More informationThe Challenges of System Design. Raising Performance and Reducing Power Consumption
The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software
More informationBenchmark: In-Memory Database System (IMDS) Deployed on NVDIMM
Benchmark: In-Memory Database System (IMDS) Deployed on NVDIMM Presented by Steve Graves, McObject and Jeff Chang, AgigA Tech Santa Clara, CA 1 The Problem: Memory Latency NON-VOLATILE MEMORY HIERARCHY
More informationBring x3 Spark Performance Improvement with PCIe SSD. Yucai, Yu BDT/STO/SSG January, 2016
Bring x3 Spark Performance Improvement with PCIe SSD Yucai, Yu (yucai.yu@intel.com) BDT/STO/SSG January, 2016 About me/us Me: Spark contributor, previous on virtualization, storage, mobile/iot OS. Intel
More informationFlash Memory Based Storage System
Flash Memory Based Storage System References SmartSaver: Turning Flash Drive into a Disk Energy Saver for Mobile Computers, ISLPED 06 Energy-Aware Flash Memory Management in Virtual Memory System, islped
More informationOpenSSD Platform Simulator to Reduce SSD Firmware Test Time. Taedong Jung, Yongmyoung Lee, Ilhoon Shin
OpenSSD Platform Simulator to Reduce SSD Firmware Test Time Taedong Jung, Yongmyoung Lee, Ilhoon Shin Department of Electronic Engineering, Seoul National University of Science and Technology, South Korea
More informationReducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University
Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck
More informationLocality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example
Locality CS429: Computer Organization and Architecture Dr Bill Young Department of Computer Sciences University of Texas at Austin Principle of Locality: Programs tend to reuse data and instructions near
More informationLarge and Fast: Exploiting Memory Hierarchy
CSE 431: Introduction to Operating Systems Large and Fast: Exploiting Memory Hierarchy Gojko Babić 10/5/018 Memory Hierarchy A computer system contains a hierarchy of storage devices with different costs,
More informationPCIe Storage Beyond SSDs
PCIe Storage Beyond SSDs Fabian Trumper NVM Solutions Group PMC-Sierra Santa Clara, CA 1 Classic Memory / Storage Hierarchy FAST, VOLATILE CPU Cache DRAM Performance Gap Performance Tier (SSDs) SLOW, NON-VOLATILE
More informationDeveloping Low Latency NVMe Systems for HyperscaleData Centers. Prepared by Engling Yeo Santa Clara, CA Date: 08/04/2017
Developing Low Latency NVMe Systems for HyperscaleData Centers Prepared by Engling Yeo Santa Clara, CA 95054 Date: 08/04/2017 Quality of Service IOPS, Throughput, Latency Short predictable read latencies
More informationDesigning a True Direct-Access File System with DevFS
Designing a True Direct-Access File System with DevFS Sudarsun Kannan, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau University of Wisconsin-Madison Yuangang Wang, Jun Xu, Gopinath Palani Huawei Technologies
More informationarxiv: v1 [cs.ar] 11 Apr 2017
FMMU: A Hardware-Automated Flash Map Management Unit for Scalable Performance of NAND Flash-Based SSDs Yeong-Jae Woo Sang Lyul Min Department of Computer Science and Engineering, Seoul National University
More informationTransparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh
Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,
More informationGPUs and Emerging Architectures
GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs
More informationThe Long-Term Future of Solid State Storage Jim Handy Objective Analysis
The Long-Term Future of Solid State Storage Jim Handy Objective Analysis Agenda How did we get here? Why it s suboptimal How we move ahead Why now? DRAM speed scaling Changing role of NVM in computing
More informationMiddleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives
Middleware and Flash Translation Layer Co-Design for the Performance Boost of Solid-State Drives Chao Sun 1, Asuka Arakawa 1, Ayumi Soga 1, Chihiro Matsui 1 and Ken Takeuchi 1 1 Chuo University Santa Clara,
More informationHADP Talk BlueDBM: An appliance for Big Data Analytics
HADP Talk BlueDBM: An appliance for Big Data Analytics Sang-Woo Jun* Ming Liu* Sungjin Lee* Jamey Hicks+ John Ankcorn+ Myron King+ Shuotao Xu* Arvind* *MIT Computer Science and Artificial Intelligence
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationDon t Forget the Memory. Dean Klein, VP Memory System Development Micron Technology, Inc.
Don t Forget the Memory Dean Klein, VP Memory System Development Micron Technology, Inc. Memory is Everywhere 2 One size DOES NOT fit all 3 Question: How many different memories does your computer use?
More informationHandling Memory Accesses in Big Data Environment Chipex 2016
Professor Uri Weiser Technion Haifa, Israel Handling Memory Accesses in Big Data Environment Chipex 2016 The talk covers research done by: T. Horowitz, Prof. A. Kolodny, T. Morad,, Prof. A. Mendelson,
More informationGaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu Machine Learning and Big
More informationLarge Scale Distributed Deep Networks
Large Scale Distributed Deep Networks Yifu Huang School of Computer Science, Fudan University huangyifu@fudan.edu.cn COMP630030 Data Intensive Computing Report, 2013 Yifu Huang (FDU CS) COMP630030 Report
More informationSOS : Software-based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs
SOS : Software-based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs Sangwook Shane Hahn, Sungjin Lee and Jihong Kim Computer Architecture & Embedded Systems Laboratory School of Computer
More informationNOHOST: A New Storage Architecture for Distributed Storage Systems. Chanwoo Chung
NOHOST: A New Storage Architecture for Distributed Storage Systems by Chanwoo Chung B.S., Seoul National University (2014) Submitted to the Department of Electrical Engineering and Computer Science in
More informationDENALI STORAGE INTERFACE. Laura Caulfield Senior Software Engineer. Arie van der Hoeven Principal Program Manager
THE DENALI NEXT-GENERATION HIGH-DENSITY STORAGE INTERFACE Laura Caulfield Senior Software Engineer Arie van der Hoeven Principal Program Manager Outline Technology
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Memory Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationPersistent Memory Productization driven by AI & ML. Danny Sabour VP Marketing, Avalanche Technology
Persistent Memory Productization driven by AI & ML Danny Sabour VP Marketing, Avalanche Technology Persistent Memory Usage from Cloud to Node CLOUD Compute Storage Deep Learning Training Big data processing
More informationCOMP283-Lecture 3 Applied Database Management
COMP283-Lecture 3 Applied Database Management Introduction DB Design Continued Disk Sizing Disk Types & Controllers DB Capacity 1 COMP283-Lecture 3 DB Storage: Linear Growth Disk space requirements increase
More informationLinear Regression Optimization
Gradient Descent Linear Regression Optimization Goal: Find w that minimizes $f(w) = \|Xw - y\|_2^2$ Closed form solution exists Gradient Descent is iterative (Intuition: go downhill!) Scalar objective:
More informationCSCI-UA.0201 Computer Systems Organization Memory Hierarchy
CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer's Wish List Memory Private Infinitely large Infinitely fast Non-volatile
More informationMemory Expansion Technology Using Software-Controlled SSD
Memory Expansion Technology Using Software-Controlled SSD S. Kazama*, S. Gokita*, S. Kuwamura*, E. Yoshida*, J. Ogawa*, Y. Honda** *Fujitsu Laboratories Ltd. **Fujitsu Ltd. Contact: sc-ssd-fms2017@ml.labs.fujitsu.com
More informationFlash-Conscious Cache Population for Enterprise Database Workloads
IBM Research ADMS 2014 1st September 2014 Flash-Conscious Cache Population for Enterprise Database Workloads Hyojun Kim, Ioannis Koltsidas, Nikolas Ioannou, Sangeetha Seshadri, Paul Muench, Clem Dickey,
More informationNext Generation Architecture for NVM Express SSD
Next Generation Architecture for NVM Express SSD Dan Mahoney CEO Fastor Systems Copyright 2014, PCI-SIG, All Rights Reserved 1 NVMExpress Key Characteristics Highest performance, lowest latency SSD interface
More informationNAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
NAND Flash-based Storage Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Today's Topics NAND flash memory Flash Translation Layer (FTL) OS implications
More informationSHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device
SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device Hyukjoong Kim 1, Dongkun Shin 1, Yun Ho Jeong 2 and Kyung Ho Kim 2 1 Samsung Electronics
More informationLearning with Purpose
Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts
More informationstec Host Cache Solution
White Paper stec Host Cache Solution EnhanceIO SSD Cache Software and the stec s1120 PCIe Accelerator speed decision support system (DSS) workloads and free up disk I/O resources for other applications.
More informationDCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture
DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,
More informationHEAD HardwarE Accelerated Deduplication
HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version
More informationLightweight KV-based Distributed Store for Datacenters
Lightweight KV-based Distributed Store for Datacenters Chanwoo Chung, Jinhyung Koo*, Arvind, and Sungjin Lee Massachusetts Institute of Technology (MIT) Daegu Gyeongbuk Institute of Science & Technology
More informationFlash Trends: Challenges and Future
Flash Trends: Challenges and Future John D. Davis work done at Microsoft Research - Silicon Valley in collaboration with Laura Caulfield*, Steve Swanson*, UCSD* 1 My Research Areas of Interest Flash characteristics
More informationHigh Performance SSD & Benefit for Server Application
High Performance SSD & Benefit for Server Application AUG 12 th, 2008 Tony Park Marketing INDILINX Co., Ltd. 2008-08-20 1 HDD SATA 3Gbps Memory PCI-e 10G Eth 120MB/s 300MB/s 8GB/s 2GB/s 1GB/s SSD SATA
More informationIEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. X, XXXXX SSD In-Storage Computing for Search Engines
IEEE TRANSACTIONS ON COMPUTERS, VOL. 66, NO. X, XXXXX 2017 1 SSD In-Storage Computing for Search Engines Jianguo Wang, Dongchul Park, Yannis Papakonstantinou, and Steven Swanson Abstract SSD in-storage
More informationWorkload Optimized Systems: The Wheel of Reincarnation. Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013
Workload Optimized Systems: The Wheel of Reincarnation Michael Sporer, Netezza Appliance Hardware Architect 21 April 2013 Outline Definition Technology Minicomputers Prime Workstations Apollo Graphics
More informationCS 261 Fall Mike Lam, Professor. Memory
CS 261 Fall 2016 Mike Lam, Professor Memory Topics Memory hierarchy overview Storage technologies SRAM DRAM PROM / flash Disk storage Tape and network storage I/O architecture Storage trends Latency comparisons
More informationImplications of Non Volatile Memory on Software Architectures. Nisha Talagala Lead Architect, Fusion-io
Implications of Non Volatile Memory on Software Architectures Nisha Talagala Lead Architect, Fusion-io Overview Non Volatile Memory Technology NVM in the Datacenter Optimizing software for the iomemory Tier
More informationBringing Intelligence to Enterprise Storage Drives
Bringing Intelligence to Enterprise Storage Drives Neil Werdmuller Director Storage Solutions Arm Santa Clara, CA 1 Who am I? 28 years experience in embedded Lead the storage solutions team Work closely
More informationLinux Storage System Analysis for e.mmc With Command Queuing
Linux Storage System Analysis for e.mmc With Command Queuing Linux is a widely used embedded OS that also manages block devices such as e.mmc, UFS and SSD. Traditionally, advanced embedded systems have
More informationKey Points. Rotational delay vs seek delay Disks are slow. Techniques for making disks faster. Flash and SSDs
IO 1 Today IO 2 Key Points CPU interface and interaction with IO IO devices The basic structure of the IO system (north bridge, south bridge, etc.) The key advantages of high speed serial lines. The benefits
More informationNVMe : Redefining the Hardware/Software Architecture
NVMe : Redefining the Hardware/Software Architecture Jérôme Gaysse, IP-Maker Santa Clara, CA 1 NVMe Protocol How to implement the NVMe protocol? SW, HW/SW or HW? 2- NVMe command ready CPU 1-Host driver
More informationI/O. Disclaimer: some slides are adopted from book authors' slides with permission 1
I/O Disclaimer: some slides are adopted from book authors' slides with permission 1 Thrashing Recap A process is busy swapping pages in and out Memory-mapped I/O map a file on disk onto the memory space
More informationExploring System Challenges of Ultra-Low Latency Solid State Drives
Exploring System Challenges of Ultra-Low Latency Solid State Drives Sungjoon Koh Changrim Lee, Miryeong Kwon, and Myoungsoo Jung Computer Architecture and Memory systems Lab Executive Summary Motivation.
More informationSSD Applications in the Enterprise Area
SSD Applications in the Enterprise Area Tony Kim Samsung Semiconductor January 8, 2010 Outline Part I: SSD Market Outlook Application Trends Part II: Challenge of Enterprise MLC SSD Understanding SSD Lifetime
More informationIt Takes Guts to be Great
It Takes Guts to be Great Sean Stead, STEC Tutorial C-11: Enterprise SSDs Tues Aug 21, 2012 8:30 to 11:20AM 1 Who's Inside Your SSD? Full Data Path Protection Host Interface It's What's On The Inside That
More information