
Title: A 19.4 nJ/decision 364K decisions/s in-memory random forest classifier in 6T SRAM array
Archived version: Accepted manuscript: the content is identical to the published paper, but without the final typesetting by the publisher
Published version DOI: 1.119/ESSCIRC
Authors (contact): Mingu Kang (mkang17@illinois.edu), Sujan K. Gonugondla (gonugon2@illinois.edu), Naresh R. Shanbhag (shanbhag@illinois.edu)
Affiliation: University of Illinois at Urbana Champaign

A 19.4 nJ/decision 364K decisions/s In-memory Random Forest Classifier in 6T SRAM Array

Mingu Kang, Sujan K. Gonugondla, Naresh R. Shanbhag
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA.

Abstract
This paper presents an IC realization of a random forest (RF) machine learning classifier. The algorithm, architecture, and circuit are co-optimized to minimize the energy-delay product (EDP). Deterministic subsampling (DSS) and balanced trees reduce interconnect complexity and avoid irregular memory accesses. Low-swing analog in-memory computations embedded in a standard 6T SRAM enable massively parallel processing, thereby minimizing memory fetches and reducing the EDP further. The 65nm CMOS prototype achieves a 6.8× lower EDP compared to a conventional design at the same accuracy (94%) for an 8-class traffic sign recognition problem.

Keywords
machine learning; random forest; in-memory computing; pattern recognition; traffic sign recognition

I. INTRODUCTION
The random forest (RF) classifier [1] is attractive due to its high accuracy, simple operations (comparisons), applicability to multi-class problems, and robustness to non-ideal computations thanks to its majority-voting-based decision [1]. However, an energy-efficient implementation of the RF algorithm is challenging because of its high data access rate combined with its highly irregular data access pattern.

This paper presents an energy-efficient and high-throughput RF classifier IC by employing: 1) deterministic subsampling (DSS) to reduce interconnect complexity, 2) balanced trees to regularize the memory access pattern, and 3) deeply embedded analog computations [3,4,5] in the periphery of an SRAM bitcell array (BCA) to exploit the inherent algorithmic error tolerance. To the best of our knowledge, this is the first IC implementation of the RF algorithm, as only FPGA, GPU, and multi-core processor implementations of the RF algorithm [2] exist today.
These fail to take advantage of the opportunities afforded by analog computations.

II. THE RF ALGORITHM
This section explains the RF algorithm and its implementation challenges.

A. RF Algorithm

Fig. 1. Random forest algorithm: (a) conventional, (b) proposed w/ deterministic subsampling (DSS), and (c) number of required operations (proposed / conventional, 8 bytes per SRAM access assumed; crossbar mux ratio 64:1 / 256:1).
Symbols used in Fig. 1:
τ_{m,n}: threshold level of the n-th node in the m-th tree
p_{m,n}: pixel index of the n-th node in the m-th tree
P_m: [p_{m,1}, p_{m,2}, ..., p_{m,N}]
RSS: random subsampling by sample pattern P_m
x(p_{m,n}): the p_{m,n}-th pixel of input image X
c_{m,l}: label corresponding to the l-th leaf node in the m-th tree
(m: 1~M, n: 1~N, l: 1~N+1)

The RF algorithm (Fig. 1(a)) consists of M trees. The m-th tree processes data obtained by random subsampling (RSS) of the input image X using a pseudo-random pattern vector P_m. The n-th node in the m-th tree compares x(p_{m,n}), the pixel (or feature) indexed by p_{m,n}, with a threshold τ_{m,n} to obtain a node-level binary decision q_{m,n}. Either the left or the right branch is taken based on q_{m,n}. This process is repeated until a leaf node is reached. The label c_{m,l} corresponding to the l-th leaf node is the tree-level decision. The final decision is obtained by majority voting over the M tree-level decisions.

B. Implementation Challenges
Two different architectures can be considered to implement the RF algorithm: serial and parallel. A serial architecture needs to process nodes sequentially, resulting in a large delay, and requires reading two 11-b (for a 16 KB array) child node addresses per node, which takes roughly [...] of the storage space. On the other hand, a fully parallel architecture computes all q_{m,n} in parallel and uses these to address a look-up table (LUT) to obtain c_{m,l}. Doing so requires a large number of memory accesses, e.g., 78 bytes per tree (Fig. 1(c)), which in turn limits the achievable throughput and energy efficiency. Additionally, a complex crossbar (i.e., 256:1 for a 256-pixel image X) is needed to route the pixel indexed by p_{m,n} from X for comparison.

III. THE PROPOSED RF ALGORITHM AND ARCHITECTURE
This paper co-optimizes the algorithm and architecture to achieve energy and throughput benefits.

A. The Proposed RF Algorithm
The modified RF algorithm (Fig. 1(b)) employs a fixed-pattern deterministic subsampling (DSS) step prior to RSS to solve the crossbar problem mentioned above. A 4:1 DSS factor is chosen to balance the loss in classification accuracy against crossbar complexity. The complexity of the RSS crossbar is reduced from 256:1 to 64:1 when the input X is a 256-pixel image. Thus, the precision of p_{m,n} is also reduced from 8-b to 6-b.
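The tree walk, the majority vote, and the effect of 4:1 DSS on the pixel index width can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit behavior: the heap-order node layout, the "right branch when x > τ" convention, the single DSS phase shared by all trees, and the names `dss`, `classify_tree`, and `classify_forest` are all hypothetical choices made for illustration.

```python
from collections import Counter

DSS_FACTOR = 4  # 4:1 deterministic subsampling, as stated in the text


def dss(image, phase=0):
    """Keep every 4th pixel of a flattened image (fixed, data-independent pattern)."""
    return image[phase::DSS_FACTOR]


def classify_tree(pixels, idx, thr, labels):
    """Walk one balanced tree stored in heap order (an assumed layout).

    idx[n], thr[n]: pixel index p and threshold tau of node n (0-based)
    labels[l]:      class label c of leaf l
    """
    n = 0
    n_nodes = len(idx)
    while n < n_nodes:
        q = pixels[idx[n]] > thr[n]   # node-level binary decision
        n = 2 * n + 1 + int(q)        # assumed: left child if q == 0, right if q == 1
    return labels[n - n_nodes]        # positions past the last node are leaves


def classify_forest(image, trees):
    """Majority vote over the tree-level decisions."""
    sub = dss(image)
    votes = [classify_tree(sub, idx, thr, labels) for idx, thr, labels in trees]
    return Counter(votes).most_common(1)[0][0]
```

With a 256-pixel input, `dss` leaves 64 candidate pixels, so each index fits in 6 bits instead of 8, which matches the crossbar reduction from 256:1 to 64:1 described above.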
Additionally, the trees are balanced (Fig. 1(b)) by filling some empty nodes in order to regularize the memory access pattern. The memory access problem is addressed by reducing the number of memory accesses via in-memory comparison (Fig. 3), which eliminates the need to fetch τ_{m,n}. The class address generator (CAG) generates the address of the chosen c_{m,l} from the q_{m,n} values, eliminating the need to fetch all the c_{m,l}. Only 24.5 bytes of data need to be fetched per tree compared to 78 bytes per tree in the parallel architecture.

B. Proposed Architecture and Operations
The proposed RF architecture (Fig. 2(a)) includes an SRAM BCA, a multi-row wordline (WL) driver, a 64-b I/O with a 4:1 column mux, a DSS input buffer to store the streamed X, RSS crossbars, the CAG, a label finder, a majority voter, and the peripherals for standard read/write operations. A group of four trees is processed in parallel, and 16 such groups are processed sequentially for a total of M = 64 trees. The classifier: 1) writes the pixel index register, 2) enables the crossbar, 3) performs the in-memory comparison enabled by the multi-row WL driver and analog comparators, 4) sequentially fetches four tree-level labels using the addresses generated by the CAG, and 5) majority votes in the final tree.

Fig. 2. Proposed RF: (a) architecture, and (b) timing diagram.

C. In-memory comparison
In-memory comparison requires the 8-b thresholds τ_{m,n} (T in Fig. 3(a)) and the indexed pixels x(p_{m,n}) (X in Fig. 3(a)) to be stored in a column-major pattern, i.e., the bits of a word are stored in a column. The comparison begins with the simultaneous application of WL access pulses with binary-weighted pulse widths to all the rows storing T and X_B, so that each row's pulse width is proportional to the binary weight of the bit it stores. Doing so creates a bitline (BL) voltage swing ΔV_BLB (ΔV_BL) proportional to T−X (X−T) [3,4]. Linearity of this multi-row read is improved by reading the 4-b MSBs and LSBs separately from adjacent columns, followed by a capacitively weighted charge sharing that assigns 16× greater weight to the MSBs. The WL voltage is reduced (e.g., 0.65 V) to prevent destructive reads and to improve the linearity further. Storing X_B in the replica bit-cell array allows fast writing through a separate write BL (WBL) and write wordline (WWL), eliminating the overhead of slow write operations into the normal BCA. The resulting bitline voltages feed into analog comparators to generate the node-level decisions (q).

In-memory comparison is an intrinsically and massively parallel operation, as it processes all the words stored across 256 columns in parallel, whereas a conventional memory fetches only 64 bits (= 8 words) per read access when the sense amplifier is shared across four columns. In addition, multi-row read saves energy by accessing 4 bits per precharge.

Fig. 3. In-memory comparison: (a) bit-cell column for in-memory comparison of T and X, with V_WL < V_DD and q = 1 if X > T (V_BL < V_BLB), 0 otherwise, and (b) measured accuracy of comparison (comparison error rate in % vs. ΔV_BL per LSB in mV, with the minimum ΔV_BL needed to achieve 93% classification accuracy marked for 64 trees and for 4 trees).

IV. CHIP MEASURED RESULTS
The in-memory RF classifier is implemented in a 65nm CMOS process (chip micrograph in Fig. 5 and summarized in Table I) to prove the application-level benefits.

A. Component-level Accuracy Characterization
Measured in-memory comparison results (Fig. 3(b)) show the comparator error rate increasing from 1.6% to 14.5% as ΔV_BL reduces from 25mV to 5mV. The RF algorithm with 64 trees needs an error rate of less than 9.5% at the comparator output q to avoid a discernible loss in 8-class classification accuracy. With four trees, only a 4% error rate is tolerated, which restricts further reduction in ΔV_BL.

B. Application-level Accuracy, Energy, and Throughput
Measured results (Fig. 4) of the energy vs. accuracy trade-off for binary classification (face detection) with 64 trees show that the proposed IC achieves a 3.1× energy savings over the conventional architecture (SRAM + digital processor). The energy of the conventional architecture is obtained via post-layout simulations of the digital blocks and the read access energy measured from the prototype IC. These energy savings come from multi-row read, in-memory comparison, and the low-complexity crossbar. Fewer memory accesses also reduce the delay by 2.2× over the conventional architecture, thereby providing a 6.8× lower energy-delay product (EDP) at the same accuracy of > 93% as the conventional architecture. The prototype IC achieves a throughput of 364K decisions/s and an energy efficiency of 19.4 nJ/decision, achieving at least 5.6× smaller EDP compared to prior multi-class classifier ICs as listed in Table II.

Fig. 4. Energy vs. error rate w.r.t. ΔV_BL with 64 trees (binary classification), where ΔV_BL of Conv. = 8 ΔV_BL per LSB. Axes: core energy per decision (nJ) and classification error rate (1 − P_DET) (%) vs. ΔV_BL per LSB for proposed (mV).
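The dependence of the comparator error rate on ΔV_BL (Section IV-A) follows from how the comparison of Section III-C turns bit values into a voltage difference. The following behavioral model is a rough sketch only. It assumes an ideal linear multi-row read, models the 16:1 capacitive weighting as a weighted average, compares X and T directly rather than through X_B, and lumps all non-idealities into one Gaussian noise term. The parameter names `dv_lsb` and `sigma` and all function names are hypothetical, and no attempt is made to reproduce the measured error rates.

```python
import random


def nibble_swing(word4, dv_lsb):
    """Ideal multi-row read of a 4-b word: binary-weighted WL pulses make the
    bitline swing proportional to the stored value."""
    return dv_lsb * sum(((word4 >> i) & 1) << i for i in range(4))


def bitline_swing(word8, dv_lsb):
    """Read MSB and LSB nibbles on adjacent columns, then charge-share them
    with a 16:1 capacitive weighting (modeled here as a weighted average)."""
    msb, lsb = word8 >> 4, word8 & 0xF
    return (16 * nibble_swing(msb, dv_lsb) + nibble_swing(lsb, dv_lsb)) / 17


def compare(x, t, dv_lsb=1.0, sigma=0.0, rng=random):
    """Node-level decision q = 1 if X > T, taken on a noisy analog difference."""
    diff = bitline_swing(x, dv_lsb) - bitline_swing(t, dv_lsb)
    return int(diff + rng.gauss(0.0, sigma) > 0)


def error_rate(dv_lsb, sigma, trials=2000, seed=0):
    """Fraction of random (X, T) pairs for which the noisy decision is wrong."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x, t = rng.randrange(256), rng.randrange(256)
        errors += compare(x, t, dv_lsb, sigma, rng) != int(x > t)
    return errors / trials
```

In this model, shrinking `dv_lsb` at fixed `sigma` can only increase the error rate, which is the qualitative trend reported in Fig. 3(b). The majority vote across trees is what allows a nonzero comparator error rate to be tolerated.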

Table I: Chip summary.
Technology                                        65 nm CMOS
Die size                                          1.2 mm × 1.2 mm
SRAM capacity                                     16 KB (512 × 256 bit-cells)
Bit-cell size                                     2.11 × 0.92 µm²
CTRL CLK freq.                                    1 GHz
Supply voltage (V)                                CORE 1.0, CTRL 0.75
Energy per decision (4 trees, 64 trees) (nJ)      CORE (0.9, 14.4), CTRL (0.3, 5.0)
Decision throughput (4 trees, 64 trees)           (5.6M, 364k) decisions/s

V. CONCLUSION
This paper has presented an IC realization of the random forest (RF) algorithm that achieves high energy efficiency and high throughput by co-optimizing the algorithm, architecture, and circuit design. The prototype IC achieves a 3.1× energy savings and a 2.2× speed-up, providing a 6.8× lower energy-delay product (EDP) at the same accuracy of > 93% compared to a conventional digital architecture. The proposed IC achieves a throughput of 364K decisions/s and an energy efficiency of 19.4 nJ/decision. To the best of our knowledge, this is the first IC realization of the RF algorithm. The benefits of the proposed architecture are expected to increase with image resolution and data size. This is because the subsampling ratio can be increased without losing classification accuracy, and the random noise components in the low-swing analog in-memory comparison are averaged out better as the data size grows.

ACKNOWLEDGMENT
This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by SRC and DARPA. The authors would like to acknowledge constructive discussions with S. Eilert, K. Curewitz, N. Verma, B. Murmann, and P. Hanumolu.

Fig. 5. Chip micrograph (1.2 mm × 1.2 mm), showing the bitcell array, replica bitcell array, R/W circuitry, analog comparators, input buffer, pixel index register, crossbar, 64-b bus, digital CTRL, and test block.

REFERENCES
[1] L. Breiman, "Random forests," Machine Learning, vol. 45, 2001.
[2] B. Van Essen, C. Macaraeg, M. Gokhale, and R. Prenger, "Accelerating a random forest classifier: Multi-core, GP-GPU, or FPGA?," IEEE FCCM, 2012.
[3] M. Kang, M. S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," IEEE ICASSP, 2014.
[4] M. Kang, S. Gonugondla, A. Patil, and N. Shanbhag, "A 481pJ/decision 3.4M decision/s multifunctional deep in-memory inference processor using standard 6T SRAM array," arXiv preprint arXiv: , 2016.
[5] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE JSSC, 2017.
[6] J. Park, et al., "A 92-mW real-time traffic sign recognition system with robust illumination adaptation and support vector machine," IEEE JSSC, 2012.
[7] H. Kaul, et al., "A 21.5M-query-vectors/s 3.37nJ/vector reconfigurable k-nearest-neighbor accelerator with adaptive precision in 14nm tri-gate CMOS," ISSCC Dig. Tech. Papers, 2016.

Table II: Comparison with prior art.
                            [6]                       [7]                   Ours (M = 64)
Process                     130nm CMOS                14nm tri-gate         65nm CMOS
Algorithm                   Support vector machine    K-nearest neighbor    Random forest
Dataset                     Traffic sign video        Not reported          KUL traffic signs
Input size (8b)
Throughput (decisions/s)    [K]*                      21.5M [498.8K]*       364.4K
Energy (nJ/decision)        1.5M [125]*               3.4 [145.3]*          19.4 (w/ CTRL)
EDP (fJ·s/decision)         45G [3125]*               0.2 [292.3]*
Accuracy                    90%                       Not reported
*throughput and energy scaled to a 65nm process w/ [...] pixels; SRAM memory access cost not included
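The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes that the EDP per decision is the energy per decision multiplied by the reciprocal of the throughput. That definition is not spelled out in the text, but it is consistent with the unscaled entry for [7] in Table II. The helper name `edp_fjs` is hypothetical.

```python
def edp_fjs(energy_nj, decisions_per_s):
    """Energy-delay product in fJ*s per decision, taking delay = 1 / throughput
    (an assumed definition)."""
    return energy_nj * 1e-9 / decisions_per_s * 1e15


# Table I, 64 trees: CORE and CTRL energy per decision in nJ
core_nj, ctrl_nj = 14.4, 5.0
total_nj = core_nj + ctrl_nj          # headline energy per decision

# Section IV-B: energy savings times delay reduction gives the EDP reduction
edp_gain = 3.1 * 2.2

# EDP of the proposed IC under the assumed definition (Table II throughput)
ours_edp = edp_fjs(total_nj, 364.4e3)

# Unscaled entry for [7] in Table II, same definition
knn_edp = edp_fjs(3.4, 21.5e6)

print(f"energy = {total_nj:.1f} nJ/decision, EDP gain = {edp_gain:.1f}x")
print(f"EDP (ours) = {ours_edp:.1f} fJ*s, EDP ([7], unscaled) = {knn_edp:.2f} fJ*s")
```

The CORE and CTRL energies add up to the 19.4 nJ/decision quoted in the title, and the product of the 3.1× energy savings and the 2.2× delay reduction rounds to the quoted 6.8× EDP improvement.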
