Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Similar documents
High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

Decision Forest: A Scalable Architecture for Flexible Flow Matching on FPGA

Energy Optimizations for FPGA-based 2-D FFT Architecture

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline

High-performance Pipelined Architecture for Tree-based IP lookup Engine on FPGA*

High-throughput Online Hash Table on FPGA*

FPGA: What? Why? Marco D. Santambrogio

Large-scale Multi-flow Regular Expression Matching on FPGA*

Dynamically Configurable Online Statistical Flow Feature Extractor on FPGA

Introduction to Field Programmable Gate Arrays

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

Decision Forest: A Scalable Architecture for Flexible Flow Matching on FPGA

High-Performance Packet Classification on GPU

FPGA Matrix Multiplier

Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs

OPTIMIZING INTERCONNECTION COMPLEXITY FOR REALIZING FIXED PERMUTATION IN DATA AND SIGNAL PROCESSING ALGORITHMS. Ren Chen and Viktor K.

High Throughput Sketch Based Online Heavy Change Detection on FPGA

Automation Framework for Large-Scale Regular Expression Matching on FPGA. Thilan Ganegedara, Yi-Hua E. Yang, Viktor K. Prasanna

Parallel graph traversal for FPGA

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

Don t Forget Memories A Case Study Redesigning a Pattern Counting ASIC Circuit for FPGAs

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

Network-on-chip (NOC) Topologies

SCALABLE HIGH-THROUGHPUT SRAM-BASED ARCHITECTURE FOR IP-LOOKUP USING FPGA. Hoang Le, Weirong Jiang, Viktor K. Prasanna

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

Scalable Packet Classification on FPGA

Evolution of Implementation Technologies. ECE 4211/5211 Rapid Prototyping with FPGAs. Gate Array Technology (IBM s) Programmable Logic

EITF35: Introduction to Structured VLSI Design

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

INTRODUCTION TO FPGA ARCHITECTURE

FlexRIO. FPGAs Bringing Custom Functionality to Instruments. Ravichandran Raghavan Technical Marketing Engineer. ni.com

Multi-core Implementation of Decomposition-based Packet Classification Algorithms 1

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Technology Mapping and Packing. FPGAs

Digital system (SoC) design for lowcomplexity. Hyun Kim

Architecture and Performance Models for Scalable IP Lookup Engines on FPGA*

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Scalable High Throughput and Power Efficient IP-Lookup on FPGA

An Efficient Implementation of LZW Compression in the FPGA

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform

Field Programmable Gate Array (FPGA)

PROGRAMMABLE MODULES SPECIFICATION OF PROGRAMMABLE COMBINATIONAL AND SEQUENTIAL MODULES

High Performance Architecture for Flow-Table Lookup in SDN on FPGA

Resource-Efficient SRAM-based Ternary Content Addressable Memory

Analysis of High-performance Floating-point Arithmetic on FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

ECE 331 Digital System Design

Line-rate packet processing in hardware: the evolution towards 400 Gbit/s

CS310 Embedded Computer Systems. Maeng

Multi-dimensional Packet Classification on FPGA: 100 Gbps and Beyond

Resource Efficient Multi Ported Sram Based Ternary Content Addressable Memory

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

IP packet forwarding, or simply, IP-lookup, is a classic

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

User Manual for FC100

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

Fast Scalable FPGA-Based Network-on-Chip Simulation Models

What is Xilinx Design Language?

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform*

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University

MULTI-RATE HIGH-THROUGHPUT LDPC DECODER: TRADEOFF ANALYSIS BETWEEN DECODING THROUGHPUT AND AREA

Advanced FPGA Design Methodologies with Xilinx Vivado

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

DESIGN OF AN FFT PROCESSOR

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

Memories: a practical primer

HES-7 ASIC Prototyping

A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs

CDA 4253 FPGA System Design Op7miza7on Techniques. Hao Zheng Comp S ci & Eng Univ of South Florida

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

Utilizing SDSoC to Port Convolutional Neural Network to a Space-grade FPGA

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms

Scalable Ternary Content Addressable Memory Implementation Using FPGAs

AES1. Ultra-Compact Advanced Encryption Standard Core AES1. General Description. Base Core Features. Symbol. Applications

EE219A Spring 2008 Special Topics in Circuits and Signal Processing. Lecture 9. FPGA Architecture. Ranier Yap, Mohamed Ali.

A Distributed Canny Edge Detector and Its Implementation on FPGA

XPU A Programmable FPGA Accelerator for Diverse Workloads

The Virtex FPGA and Introduction to design techniques

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs

Simplify System Complexity

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

Core Facts. Documentation Design File Formats. Verification Instantiation Templates Reference Designs & Application Notes Additional Items

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

A Reconfigurable Architecture for Load-Balanced Rendering

Optimizing Memory Performance for FPGA Implementation of PageRank

Multivariate Time Series Classification Using Inter-leaved Shapelets

High Throughput Iterative VLSI Architecture for Cholesky Factorization based Matrix Inversion

Power Efficient Register Assignment

Scalable Packet Classification on FPGA

Transcription:

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089 Email: yunqu@usc.edu prasanna@usc.edu

Outline Introduction Background Data Structures & Algorithms Architecture Evaluation Conclusion 2

Introduction Outline High-performance computing Background Data Structures & Algorithms Architecture Evaluation Conclusion 3

High-performance Computing (HPC) HPC Superior performance Image/signal processing Scientific computing, simulation Networking [1] Performance High throughput Fast updates [1] http://www.bgp.com.cn/service/processing.html 4

Emerging Trends in HPC Software centralization Adapt to the change in demand Dynamically updatable Heterogeneous computing Accelerated by hardware FPGA, GPU, VLSI chips, etc. 5

Introduction Background Decision-trees Challenges Outline Data Structures & Algorithms Architecture Evaluation Conclusion 6

Decision-trees in HPC Usage Data mining, classification Our focus Problem definition Given a decision-tree NN leaf nodes, MM fields High-throughput classification Support dynamic updates Decision-tree 7

State-of-the-art Implementation Straightforward mapping onto hardware [2][3] Each level Processing Element (PE) No. of levels Memory consumption per stage Decision-tree pipeline PE PE PE [2] D. Sheldon and F. Vahid, Don t Forget Memories: A Case Study Redesigning a Pattern Counting ASIC Circuit for FPGAs, CODES+ISSS, 2008. [3] Y.-H. E. Yang and V. K. Prasanna, High Throughput and Large Capacity Pipelined Dynamic Search Tree on FPGA, FPGA, 2010. 8

Scalability Challenges (1) Balanced: OO(log NN) lookup time, OO(NN) wire length Imbalanced: OO(1) wire length, OO(NN) lookup time NN performance (clock rate, throughput) Challenge 1: a scalable implementation Binary tree 5 2 8 1 3 6 9 Wire length: OO(NN) 9

Dynamic updates Challenges (2) Deletion/insertion: nodes being pushed around Modification: as complex as deletion/insertion Global changes: a single update too many operations Challenge 2: a dynamically updatable implementation 5 6 2 8 1 3 6 9 update 2 8 1 3 9 10

Field Programmable Gate Array Logic Cell Matrix Programmable k 0 1.. Logic Cell D SET CLR Q Q Logic elements 0 1 D SET CLR Q Q Lookup Table (LUT) Programmable interconnect Connect logic cells complex functions Interconnect Long wire Short wire 11

Outline Introduction Background Data Structures & Algorithms Rule table Optimizations Dynamic updates Architecture Evaluation Conclusion 12

Rule table Converting Decision-trees Alternative representation of a decision-tree Each leaf node a root-to-leaf path a rule Intuition: parallel searching on the rule table NN = 5 rules, MM = 2 fields, WW = 28 data width 13

Optimization Techniques Splitting (vertically) and partitioning (horizontally) Multiple WW mm -bit sub-fields Multiple sub-tables having NN nn rules Split Mapped to a PE Partition 14

Dynamic Updates (1) Modification of a tree node Identify the sub-rules to be modified Change the values stored, or Change the polarity of the comparison modify this node modify these two rules 15

Dynamic Updates (2) Deletion of a tree node Modify sub-rule(s) to produce invalid sub-rule(s) delete this node modify these two rules 16

Dynamic Updates (3) Inserting an internal node All the descendants of the new node has to be modified Inserting a leaf node Pre-allocate (NN max NN) invalid rules as placeholders Modifying an invalid rule into the new rule 17

Outline Introduction Background Data Structures & Algorithms Architecture 2-dimensional pipelined architecture Evaluation Conclusion 18

Architecture 2-dimensional pipelined architecture Each sub-table a modular PE Updates: memory write accesses propagate diagonally Modular PE 19

Outline Introduction Background Data Structures & Algorithms Architecture Evaluation Peak throughput Resource consumption Conclusion 20

Experimental Setups Prototypes on state-of-the-art FPGA Dual-port distributed RAM (distram) Empirical optimizations: WW mm = 4, NN nn = 2 FPGA device Virtex-7 XC7VX1140t FLG1930-2L No. of I/O pins 1100 No. of logic slices 218800 BRAM / distram 67 Mb / up to 17 Mb Design Suite Xilinx Vivado 2013.4 Programming Language Verilog Timing constraint 250 MHz Reports Post-place-and-route 21

Peak Throughput Comparison with state-of-the-art implementation 1.8~2.0 peak throughput (without any updates) Sustain high lookup throughput as NN max and/or WW 22

Resource Consumption No. of logic slices, no. of I/O pins Largest design: NN = 512 leaf nodes, WW = 320 bits data Resources consumed evenly 23

Outline Introduction Background Data Structures & Algorithms Architecture Evaluation Conclusion 24

Conclusion Scalable and dynamically updatable lookup engine Mapping decision-trees onto hardware more efficiently Works for generic decision-trees in HPC Future work Self-learning on this architecture Investigate performance tradeoffs on various platforms 25

Questions? Thanks http://ganges.usc.edu