Random Kernel Perceptron on ATTiny2313 Microcontroller


Nemanja Djuric
Department of Computer and Information Sciences, Temple University
Philadelphia, PA 19122, USA
nemanja.djuric@temple.edu

Slobodan Vucetic
Department of Computer and Information Sciences, Temple University
Philadelphia, PA 19122, USA
vucetic@ist.temple.edu

ABSTRACT
Kernel Perceptron is a very simple and efficient online classification algorithm. However, it requires increasingly large computational resources with growing data stream size, and is therefore not applicable to large-scale problems or to resource-limited computational devices. In this paper we describe an implementation of Kernel Perceptron on the ATTiny2313 microcontroller, one of the most primitive computational devices, with only 128B of data memory and 2kB of program memory. ATTiny2313 is representative of devices that are popular in embedded systems and sensor networks due to their low cost and low power consumption. Implementation on microcontrollers is possible thanks to two properties of Kernel Perceptrons: (1) the availability of budgeted Kernel Perceptron algorithms that bound the model size, and (2) the relatively simple calculations required to perform online learning and provide predictions. Since ATTiny2313 is a fixed-point controller that supports floating-point operations only through software, which introduces significant computational overhead, we considered implementation of the basic Kernel Perceptron operations through fixed-point arithmetic. In this paper, we present a new approach to approximate one of the most widely used kernel functions, the RBF kernel, on fixed-point microcontrollers. We conducted simulations of the resulting budgeted Kernel Perceptron on several datasets, and the results show that accurate Kernel Perceptrons can be trained using ATTiny2313. The success of our implementation opens the door to implementing powerful online learning algorithms on the most resource-constrained computational devices.

Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special purpose and application-based systems - Real-time and embedded systems

General Terms
Algorithms, Performance, Experimentation.

Keywords
Kernel Perceptron, Budget, RBF kernel, Microcontroller.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SensorKDD '10, July 25th, 2010, Washington, DC, USA. Copyright 2010 ACM 978-1-4503-0224-1...$10.00.

1. INTRODUCTION
Kernel Perceptron is a powerful class of machine learning algorithms which can solve many real-world classification problems. Initially proposed in [6], it was proven to be both accurate and very easy to implement. Kernel Perceptron learns a mapping f: X -> R from a stream of training examples S = {(x_i, y_i), i = 1...N}, where x_i in X is an M-dimensional input vector, called the data point, and y_i in {-1, +1} is a binary variable, called the label. The resulting Kernel Perceptron can be represented as

    f(x) = sign( Σ_{i=1}^{N} α_i K(x, x_i) ),    (1)

where the α_i are weights associated with the training examples, and K is the kernel function. The RBF kernel is probably the most popular choice for Kernel Perceptron because it is intuitive and often results in very accurate classifiers. It is defined as

    K(x, x_i) = exp( -||x - x_i||^2 / A ),    (2)

where ||.|| is the Euclidean distance and A is a positive number defining the width of the kernel.
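To make (1) and (2) concrete, a floating-point reference of the predictor can be written in a few lines of C (the language used for the hardware experiments in Section 3.1). This is a minimal sketch, not the authors' code: the identifiers (predict, NUM_SV, DIM) are illustrative, and it uses the convention, explained below, that every stored weight α_i equals the label y_i.

    #include <math.h>

    #define NUM_SV 32   /* illustrative budget */
    #define DIM     2   /* illustrative input dimension */

    static double sv[NUM_SV][DIM];  /* support vectors */
    static int    svy[NUM_SV];      /* their labels, -1 or +1 */
    static double A = 1.0;          /* kernel width of Eq. (2) */

    /* Predictor of Eq. (1) with the RBF kernel of Eq. (2). */
    int predict(const double *x)
    {
        double sum = 0.0;
        for (int i = 0; i < NUM_SV; i++) {
            double dist2 = 0.0;     /* squared Euclidean distance */
            for (int j = 0; j < DIM; j++) {
                double d = x[j] - sv[i][j];
                dist2 += d * d;
            }
            sum += svy[i] * exp(-dist2 / A);  /* alpha_i = y_i */
        }
        return (sum >= 0.0) ? +1 : -1;
    }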
Training of the Kernel Perceptron is simple: starting from the zero function f(x) = 0 at time t = 0, data points are observed sequentially, and f(x) is updated as f(x) <- f(x) + α_i K(x_i, x), where α_i = y_i if point x_i is misclassified (y_i f(x_i) <= 0) and α_i = 0 otherwise. Examples for which α_i = y_i have to be stored, and they are named the support vectors. Despite the simplicity of this training algorithm, Kernel Perceptrons often achieve impressive classification accuracies on highly non-linear problems. On the other hand, the number of support vectors grows linearly with the number of training examples on noisy classification problems. Therefore, these algorithms require O(N) memory and O(N^2) time to learn in a single pass from N examples. Because the number of examples can be extremely high, Kernel Perceptrons can be infeasible on noisy real-world classification problems. This is the reason why budgeted versions of Kernel Perceptron have been proposed, with the objective of retaining high accuracy while removing the unlimited memory requirement. Budget Kernel Perceptron [4] has been proposed to address the problem of the unbounded growth in resource consumption with training data size. The idea is to maintain a constant number of support vectors during training by removing a single support vector every time the budget is exceeded upon addition of a new support vector.

Given a budget of T support vectors, Budget Kernel Perceptrons achieve constant space scaling O(T), linear time to train O(TN), and constant time to predict O(T). There are several ways one may choose to maintain the budget. The most popular approach in the literature is removal of an existing support vector to accommodate the new one. For example, Forgetron [5] removes the oldest support vector, Random [3] a randomly selected one, Budget [4] the one farthest from the margin, and Tightest [13] the one that impacts the accuracy the least. There are also more advanced and computationally expensive approaches, such as merging two support vectors [12] and projecting a support vector onto the remaining ones [9]. Although Budget Kernel Perceptrons are very frugal, requiring constant memory and linear time to train, they are still not applicable to one of the simplest classes of computational devices, microcontrollers. These devices have extremely small memory and computational power. In this paper we consider the microcontroller ATTiny2313, which has only 128B of memory available to store the model, a maximum processor speed of 8MHz, and hardware that does not support floating-point operations. We propose a modification of Budget Kernel Perceptron that uses only integers, which makes it very suitable for use on resource-limited microcontrollers. Our method approximates the floating-point Budget Kernel Perceptron operations using fixed-point arithmetic that provides nearly the same accuracy, while allowing much faster execution and requiring less memory. We use Random Removal as the budget maintenance method [3, 11]: when the budget is full and a new support vector is to be included in the support vector set, one existing support vector is randomly chosen for deletion. Although this update rule is extremely simple, the resulting Perceptron can achieve high accuracy when the allowed budget is sufficiently large. The pseudocode of Budget Kernel Perceptron is given as Algorithm 1.

It should be noted that a significant body of research exists on the topic of efficient hardware implementation of various machine learning and signal processing algorithms. In Compressed Kernel Perceptron [11] the authors consider efficient memory utilization through data quantization, but assume a floating-point processor. Several solutions have been proposed for implementation of kernel machines on fixed-point processors as well. In [1], the authors propose special hardware suitable for budgeted Kernel Perceptron. Similarly, in [2] the authors consider implementation of Support Vector Machines [10] on fixed-point hardware. The two proposed algorithms, however, assume that the classifiers are already trained. Additionally, the use of problem-specific hardware is limited to a single problem. We, on the other hand, implement our method on general-purpose hardware, which makes the implementation much easier and more cost-effective. Moreover, we combine the two proposed approaches, quantization of data and the use of fixed-point hardware, to obtain an accurate classifier that can be used on the simplest existing computational devices.

The paper is organized as follows. In Section 2 the proposed method is explained in detail. In Section 3 the results are presented and discussed. Finally, Section 4 concludes the paper.

Algorithm 1 - Budget Kernel Perceptron
    Inputs: data sequence ((x_1, y_1), ..., (x_N, y_N)), budget T
    Output: support vector set SV = {SV_i, i = 1...I}

    I <- 0; SV <- Ø
    for i = 1 : N {
        if ( y_i * Σ_{j=1}^{I} y_j K(x_j, x_i) <= 0 ) {
            if (I == T) new <- random(I)
            else { I <- I + 1; new <- I }
            SV_new <- (x_i, y_i)
        }
    }
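Algorithm 1 translates almost line-for-line into C, which is part of what makes the method attractive for embedded targets. The following sketch is our illustration, not the authors' firmware: rand() stands in for random(I), rbf implements the kernel of (2), and all names are assumptions.

    #include <stdlib.h>
    #include <math.h>

    #define T   32   /* budget: maximum number of support vectors (illustrative) */
    #define DIM  2

    static double sv[T][DIM];
    static int    svy[T];
    static int    I = 0;     /* current number of support vectors */
    static double A = 1.0;

    static double rbf(const double *a, const double *b)
    {
        double dist2 = 0.0;
        for (int j = 0; j < DIM; j++) {
            double d = a[j] - b[j];
            dist2 += d * d;
        }
        return exp(-dist2 / A);
    }

    /* One online step of Algorithm 1: observe (x, y) and update the SV set. */
    void budget_perceptron_step(const double *x, int y)
    {
        double f = 0.0;
        for (int i = 0; i < I; i++)
            f += svy[i] * rbf(sv[i], x);

        if (y * f <= 0) {           /* misclassified: x becomes a support vector */
            int new_idx = (I == T) ? rand() % T   /* budget full: random removal */
                                   : I++;
            for (int j = 0; j < DIM; j++)
                sv[new_idx][j] = x[j];
            svy[new_idx] = y;
        }
    }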
2. METHODOLOGY
In this paper we focus on the implementation of Kernel Perceptron on a specific microcontroller. We first describe its properties and then explain the proposed algorithm.

2.1 Microcontroller
The microcontroller ATTiny2313 [14] has been chosen as the target platform because its specifications make it extremely challenging to implement a data mining algorithm. It has very limited speed (0-8 MHz, depending on voltage, which ranges from 1.8 to 5.5V), low power requirements (in 1.8V and 1MHz mode it draws only 300µA, or 540µW of power), and very limited memory (2kB of program memory, used to store executable code and constants, and 128B of data memory, used to store the model and program variables). Because of these limitations it is also very cheap, with a price of around $1 [15]. It supports 16-bit integers and fixed-point operations. Multiplication and division of integers are very fast, but at the expense of potential overflow problems with multiplication and round-off error with division. ATTiny2313 does not directly support floating-point numbers, but can cope with them using software solutions. Floating-point calculations can be supported through the C library, but only after incurring significant overhead in both memory and execution time.

2.2 Attribute quantization
In order to use the limited memory efficiently, the data are first normalized and then quantized using B bits, as proposed in [11]. Using quantized data, instead of (1), the predictor becomes

    f(x) = sign( Σ_{i=1}^{N} α_i K(q(x), q(x_i)) ),    (3)

where q(x) is the quantized data point x. Quantization introduces quantization error, which can have a significant impact on classification, especially for data points located near the separating boundary. The loss suffered by the prediction model due to quantization is discussed in much more detail in [11]. The objective of this paper is to approximate (3) using fixed-point arithmetic.
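As an illustration of the B-bit quantization step (the exact integer mapping adopted by the paper appears as equation (5) in Section 2.3 below), a host-side quantizer might look as follows. The helper name and the use of double are our assumptions; on the device, the data would typically already arrive as integers.

    #include <stdint.h>

    /* Quantize a normalized attribute x in [0, 1] to B bits, i.e., to an
     * integer in [0, 2^B - 1]; q(x) is then recovered as I(x) * 2^(-B).
     * Illustrative sketch; B would be fixed at compile time on the device. */
    static uint16_t quantize(double x, unsigned B)
    {
        double scaled = x * (double)((1u << B) - 1);  /* map [0,1] to [0, 2^B - 1] */
        return (uint16_t)(scaled + 0.5);              /* round to nearest integer */
    }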

2.3 Fixed-point method for prediction
Our algorithm should be implemented on a microcontroller with extremely low memory. Although quantization can save valuable memory space, there are additional memory-related issues that must be considered. First, the number of program variables must be minimal; prudent use of variables leaves more space to store the model and leads to increased accuracy. On the computational side, all unnecessary operations, such as excess multiplications and exponentials, should be avoided. If this is not taken into consideration, the program can take a lot of memory and a lot of time to execute. Furthermore, due to the inability of ATTiny2313 to deal with floating-point numbers in hardware, use of the floating-point library results in an even larger program size. Instead of the RBF kernel (2), our algorithm uses its modified version, as proposed in [2],

    K(x, x_i) = e^{ -||x - x_i||_1 / 2^A },    (4)

where several differences from the RBF kernel in (2) can be noticed. First, instead of the Euclidean distance, the Manhattan distance is used. In [2], a kernel function with the Manhattan distance is used because multiplications can then be completely omitted from the algorithm. For our application, we use the Manhattan distance to limit the range of distances between the quantized data, which will lead to improved performance of our algorithm. Furthermore, without loss of generality, we represent the kernel width as 2^A, where A is any number. Although the simplifications in equation (4) have been introduced, the algorithm cannot be implemented without a penalty in the speed of computation. This is because very simple devices are not designed to compute exponentials or other operations involving floating-point arithmetic efficiently, which results in slow run-time and excessive memory usage. Therefore, an alternative way needs to be devised to perform the kernel calculation. The idea of our method is to use only integers, which are handled easily by even the simplest devices, and still manage to use the RBF kernel for predicting the label of a new point. The normalized data point x is quantized using B bits, and its integer representation I(x) is

    I(x) = round( x * (2^B - 1) ).    (5)

The value of I(x) is in the range [0, 2^B - 1]. Since x is now approximated by q(x) = I(x) * 2^{-B}, we can define a kernel function which takes quantized data,

    K_q(x, x_i) = K(q(x), q(x_i)) = e^{ -||I(x) - I(x_i)||_1 / 2^{A+B} }.    (6)

If we represent ||I(x) - I(x_i)||_1 as d_i and, for simplicity of notation, introduce

    g = e^{ -1 / 2^{A+B} },    (7)

we get

    K_q(x, x_i) = g^{d_i}.    (8)

We need to calculate sign( Σ_{i=1}^{T} y_i K_q(x, x_i) ), where T is the budget. Since sign( Σ_{i=1}^{T} y_i K_q(x, x_i) ) = sign( Σ_{i=1}^{T} y_i * c * K_q(x, x_i) ) for any c > 0, we can replace g^{d_i} from (8) with C * g^{d_i - d_nn}, where d_nn is the distance between the newly observed data point x and its nearest support vector, and C is a positive integer. Then, to allow an integer representation, we replace this in turn with the integers

    w(d_i - d_nn) = round( C * g^{d_i - d_nn} ),    (9)

and approximate K_q(x, x_i), up to a positive factor, by w(d_i - d_nn). It is clear that if d_i = d_nn then w = C; if d_i - d_nn = 1 then w = round(C*g); and if d_i - d_nn is sufficiently large then w = 0. Notice that if the i-th support vector is close to the data point x, its weight w, and hence its influence on the classification, will be large. Our final classifier is approximated as

    f(x) = sign( Σ_{i=1}^{T} y_i w(d_i - d_nn) ),    (10)

where y_i is the label of the i-th support vector and w is its weight, which depends on its distance d_i from the data point x. The drawback of the proposed method is that the support vector set needs to be scanned every time a new data point x arrives, in order to find the nearest support vector distance. In addition, there is a computational problem related to the calculation of the weights. These issues are addressed in Section 2.4.
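A sketch of the predictor (10) in integer-only C is given below, assuming attributes quantized to at most 8 bits, a routine weight(d) that returns w(d) = round(C * g^d) (one realization is discussed in Section 2.4), and illustrative names throughout.

    #include <stdint.h>

    #define T   32   /* budget (illustrative) */
    #define DIM  2

    static uint8_t sv[T][DIM];   /* quantized support vectors (B <= 8 bits) */
    static int8_t  svy[T];       /* labels, -1 or +1 */

    extern uint16_t weight(uint16_t d);   /* w(d) = round(C * g^d), Section 2.4 */

    /* Fixed-point predictor of Eq. (10); x is the quantized query point. */
    int8_t predict_fixed(const uint8_t *x)
    {
        uint16_t d[T], d_nn = UINT16_MAX;
        /* Pass 1: Manhattan distances in integer space, and the nearest SV. */
        for (uint8_t i = 0; i < T; i++) {
            uint16_t dist = 0;
            for (uint8_t j = 0; j < DIM; j++)
                dist += (x[j] > sv[i][j]) ? (x[j] - sv[i][j]) : (sv[i][j] - x[j]);
            d[i] = dist;
            if (dist < d_nn) d_nn = dist;
        }
        /* Pass 2: accumulate y_i * w(d_i - d_nn). */
        int32_t sum = 0;
        for (uint8_t i = 0; i < T; i++)
            sum += (int32_t)svy[i] * weight(d[i] - d_nn);
        return (sum >= 0) ? +1 : -1;
    }

On the real ATTiny2313, the d[] array would likely be avoided by scanning the support vector set twice and recomputing the distances, trading time for the scarce 128B of RAM.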
2.3.1 Error of weight approximation
By using the rounding operation (9) we inevitably introduce an additional error, on top of the error already made by quantizing the data point using B bits [11]. In order to understand the effect of the approximation error, let us define the relative error R as

    R = ( true_value - approximation ) / true_value,    (11)

where true_value is the actual value of the number and approximation is its value after rounding. It can be shown [7] that the relative error R introduced by the rounding in (9) depends on the parameter C as R ~ 1/C; intuitively, rounding changes a value by at most 1/2, so for the weight at distance d we have R <= 1/(2 * C * g^d), which for any fixed d decays as 1/C. This shows that a large C leads to a small relative error. However, because the microcontroller supports only 16-bit integers, too large a C will render our method useless on the chosen platform.

2.4 Weight calculation
To be able to use the predictor (10), we need to calculate w for every support vector. For given parameters (A, B, C) we can calculate the weight for every possible distance d = d_i - d_nn, as illustrated in Table 1. In the naive approach, we could precalculate all the weights and save them as an array in the microcontroller memory. However, if the dataset is M-dimensional, then d is in the range [0, M * (2^B - 1)], and storing every weight could become impossible.

Table 1. Distances and corresponding weights
    Distance (d):    0        1        2        ...    d
    Weight (w(d)):   C*g^0    C*g^1    C*g^2    ...    C*g^d   (each rounded to the nearest integer)

Therefore, we must find a way to represent the entire array of weights with only a subset of the weights, in order to save memory.
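To see the scale of the problem (a worked example of ours, using parameter values that appear elsewhere in the paper): for Pendigits, M = 16, and with B = 5 bits the distance d ranges over [0, 16 * (2^5 - 1)] = [0, 496]. A full table of 16-bit weights would therefore occupy 497 * 2 = 994 bytes, nearly eight times the entire 128B data memory of ATTiny2313. The log method of Section 2.4.2 instead stores the weight at distance 0 plus one weight per power of 2 up to 496, i.e., 10 weights or 20 bytes.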

Two approaches are discussed next, both requiring very limited memory.

2.4.1 Sequential method
The most obvious way is to store the weights w(0) and w(1), for distances 0 and 1, respectively. From these two weights we can calculate all the other weights in Table 1. Every other weight can be iteratively calculated, in integer arithmetic, as

    w(d) = w(d-1)^2 / w(d-2).    (12)

The memory requirement of this approach is minimal, but there are certain problems. In order to calculate a particular weight we need to calculate all the weights up to that one by repeatedly applying (12), which can take substantial time. In addition, the division of two integers results in a rounding error that propagates and increases for large d.

2.4.2 Log method
A different idea is to save only the weight for distance 0, which equals C, together with the weights at distances that are powers of 2, namely the weights for distances 1, 2, 4, 8, 16 and so on. In this way the number of weights we need to save is logarithmic in the total number of distances, which is approximately log(M * 2^B). Using the saved weights, we calculate the weight for any distance using Algorithm 2.

Algorithm 2 - Find Weight
    Inputs: array of weights W and corresponding distances D;
            distance d for which the weight is being calculated
    Output: weight w

    i <- index of the nearest smaller distance to d in array D
    w <- W[i]
    d <- d - D[i]
    while (d > 0) {
        i <- index of the nearest smaller distance to d in array D
        w <- w * W[i] / W[0]
        d <- d - D[i]
    }

As can be seen in Algorithm 2, if we are looking for the weight of a distance d that is a power of 2, the weight will immediately be found in the array W. If not, we look for the distance in array D that is the closest smaller value to d, and start the calculation from there. We repeat the process until convergence. In this way, we always make the smallest possible error while calculating the weight, and the algorithm runs using O(log(M * 2^B)) space and O(log(M * 2^B)) time, which is efficient and fast even for highly-dimensional data and large bit-lengths. The log method is illustrated in Figure 1. Assume that we are looking for the weight for distance 25, and that we saved the weights at distances 0, 1, 2, 4, 8, 16 and 32. It is clear that 25 can be represented as 16 + 8 + 1. In the first step we will use w(16), then w(8), because 8 is the closest to 25 - 16, and finally w(1). The sought-after weight will be found both very quickly and with minimal error.

[Figure 1. Illustration of the log method]
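In C, Algorithm 2 reduces to a walk over a small table. The sketch below is our illustration, not the authors' firmware: the W values were computed for A = 0, B = 5 and C = 255 (so g = e^(-1/32) ~ 0.9692), and the 32-bit cast documents the bound, discussed with Figure 3 in Section 3, that a product of two weights never exceeds C^2 = 65025 < 65535.

    #include <stdint.h>

    #define NUM_W 12

    /* D[k] = 0 and the powers of 2; W[k] = round(C * g^D[k]), precomputed on
     * a PC for given (A, B, C) and stored as constants. Values below were
     * computed for A = 0, B = 5, C = 255. */
    static const uint16_t D[NUM_W] =
        {0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    static const uint16_t W[NUM_W] =
        {255, 247, 240, 225, 199, 155, 94, 35, 5, 0, 0, 0};

    /* Algorithm 2: w(d) = round(C * g^d), using w(d1 + d2) ~ w(d1)*w(d2)/C. */
    uint16_t weight(uint16_t d)
    {
        uint8_t i = NUM_W - 1;
        while (D[i] > d) i--;    /* nearest smaller (or equal) stored distance */
        uint16_t w = W[i];
        d -= D[i];
        while (d > 0) {
            i = NUM_W - 1;
            while (D[i] > d) i--;
            w = (uint16_t)(((uint32_t)w * W[i]) / W[0]);  /* product <= C^2 */
            d -= D[i];
        }
        return w;
    }

For the worked example above, weight(25) multiplies the stored weights at distances 16, 8 and 1, exactly the decomposition 25 = 16 + 8 + 1.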
3. EXPERIMENTAL RESULTS
We evaluated our algorithm on several benchmark datasets from the UCI ML Repository, whose properties are summarized in Table 2. The digit dataset Pendigits, which was originally multi-class, was converted to a binary dataset in the following way: the classes representing digits 1, 2, 4, 5, 7 (non-round digits) were converted to the negative class, and those representing digits 3, 6, 8, 9, 0 (round digits) to the positive class. Banana and Checkerboard were originally binary-class datasets. The Checkerboard dataset is shown in Figure 8.

Table 2. Dataset summaries
    Dataset        Training set size    Testing set size    Number of attributes
    Banana         4800                 500                 2
    Checkerboard   3500                 500                 2
    Pendigits      2998                 500                 16

In all reported results, A is the kernel width parameter, B is the bit-length, and C is the positive integer, all defined in Section 2.3. Figure 2 shows the absolute difference between the true weights and those calculated using the sequential and log methods for weight calculation, described in Sections 2.4.1 and 2.4.2. It can be seen that the log method makes a very small error, with minimal memory overhead, over the whole range of distances. We use the log method as our default for weight calculation.

[Figure 2. Error of the two weight calculation methods (Banana dataset; A = -1, B = 5 bits, C = 100)]

In the next set of experiments, the quality of the approximation was evaluated. First, the Random Kernel Perceptron was trained using equation (6), which requires floating-point calculations. Then, our fixed-point method was executed, and the predictions of the two algorithms were compared.

The approximation accuracy is defined as

    accuracy = (1/N) Σ_{i=1}^{N} I( y_i^{fixed}, y_i^{floating} ),    (13)

where I is the indicator function, equal to 1 when the two predictions are the same and 0 otherwise, y_i is the classifier prediction, and N is the size of the test set.

[Figure 3. Approximation accuracy as a function of C (Banana dataset; A = 0, B = 5 bits, budget = 70 bytes)]

Figure 3 shows that the approximation accuracy is relatively large even for small values of C, and that it quickly increases with C, as expected from the discussion in Section 2.3.1. Since the ATTiny2313 microcontroller supports 16-bit integers, the maximum value of a number is 65535. However, as can be seen from Algorithm 2, multiplications are required in order to calculate the weights, and at no point should the product exceed this maximum value. Therefore, the best choice for C is the square root of 65535, which is 255 (each factor in the product is at most C, so the product is bounded by 255^2 = 65025 <= 65535). In the following experiments C was set to 255.

In Figure 4 the approximation accuracy is given for the 3 datasets. The total budget was fixed to only 70 bytes. Each dataset was shuffled before each experiment, and the reported results are averages over 10 experiments. The value of the parameter A was varied in order to estimate its impact on performance. Highly negative values of A correspond to an algorithm that acts like nearest neighbour; highly positive values of A correspond to the majority vote algorithm. It can be seen that over a large range of A values the approximation accuracy was around 99%, and that it was lower at the extreme values of A. At the negative extreme, this behavior is due to the numerical limitations of the double-precision format used when calculating equation (6): the exponentials of large negative numbers are so close to 0 that they are rounded to 0, which results in much lower accuracy. On the other hand, for large values of A, corresponding to the algorithm that acts like majority vote, the approximation accuracy starts to decrease as well. This was expected, because in this case all the weights calculated by our method are practically the same: the weights decrease extremely slowly, and the very small differences between the actual weights are lost when they are calculated using the log method. However, it should be noted that in this problem setting the kernel width depends on A as 2^A, and the values where the approximation accuracy starts to fall are practically never used. Experiments were conducted for different values of the parameter B as well, but the results for other bit-lengths were very similar and are thus not shown.

[Figure 4. Approximation accuracy (B = 5 bits)]

The next set of experiments was conducted to evaluate the prediction accuracy of the proposed Random Kernel Perceptron implementation. The values of the parameters A and B were varied in order to estimate their influence on the accuracy of the methods. The results are averages over 10 experiments and are reported in Figures 5, 6 and 7.

[Figure 5. Accuracy of the two methods (Banana dataset; B = 5 bits, budget = 70 bytes)]

It can be seen that the floating- and fixed-point implementations achieve essentially the same accuracy, as expected, since the approximation has been shown to be very good. In some cases, for very small values of A, the accuracy of the floating-point method drops sharply, as in Figure 6. This happens because of numerical problems: the predictions are so close to 0 that, when calculated in double precision, they are rounded to 0. Our method does not have this problem.
It is also worth noticing that for large A the accuracy drops sharply for both algorithms, which supports the claim that large kernel widths are usually not useful. It is exactly for these large values of A that the approximation accuracy drops as well; therefore, as can be seen in Figures 5, 6 and 7, these values are not of practical relevance. Only results for a bit-length of 5 bits are shown, since for the other bit-lengths the results were nearly the same.

[Figure 6. Accuracy of the two methods (Pendigits dataset; B = 5 bits, budget = 507 bytes)]
[Figure 7. Accuracy of the two methods (Checkerboard dataset; B = 5 bits, budget = 70 bytes)]
[Figure 8. Checkerboard dataset]

3.1 Hardware implementation
The algorithm was tested on the microcontroller ATTiny2313. The program was written in the C programming language using the free IDE AVR Studio 4, available for download from Atmel's website. Different methods were used in order to assess the complexity of the implementation on the microcontroller; the details are given in Tables 3, 4 and 5. Program and Data represent the required program and data memory in bytes. Time, in milliseconds, is measured after 100 observed examples, with the kernel width parameter A set to 6. Accuracy is calculated after all data points have been observed. The numbers in brackets represent the total size of the corresponding memory block of the microcontroller. The microcontroller speed is set to 4 MHz. The number of support vectors was chosen so that the entire data memory of the microcontroller is utilized in the case of the fixed-point method; this number of support vectors was then imposed on both methods. Therefore, as the bit-length increases, the number of support vectors decreases. For Tables 3 and 4 the target microcontroller was ATTiny2313. For Table 5, a microcontroller with twice as much memory was used because of the high dimensionality of the Pendigits dataset: ATTiny48 was chosen, which has 4kB of program memory and 256B of data memory.

It is clear that the proposed fixed-point algorithm takes much less memory and is faster than the floating-point algorithm; both the time and space costs of the fixed-point algorithm are nearly 3 times smaller. In fact, in order to run the simulations using floating-point numbers at all, the microcontroller ATTiny84, with four times larger data memory, had to be used, since the memory requirement was too big for both ATTiny2313 and the more powerful ATTiny48 (in the tables, these are the Float entries whose memory requirements exceed the device limits). The accuracy of the two methods is practically the same, and the difference in computational cost is therefore even more striking. In addition, it can be concluded that accuracy depends greatly on the number of support vectors: a higher number of support vectors usually means higher accuracy. However, in our experiments a higher number of support vectors is achieved at the expense of smaller bit-lengths, so the results in the tables can be misleading; in some experiments a larger support vector set does not lead to better performance, because the quantization error is too big and affects the predictions considerably. It should be noted that the support vector set size/bit-length trade-off is not the topic of this work and will be addressed in our future study.

4. CONCLUSION
In this paper we proposed an approximation of Kernel Perceptron that is well-suited for implementation on resource-limited devices. The new algorithm uses only integers, without any need for floating-point numbers, which makes it a good fit for simple devices such as microcontrollers. The presented results show that the approach considered in this paper can be successfully implemented. The described algorithm approximates the kernel function defined in (6) with high accuracy, and yields good results on several different datasets. The results are, as expected, not perfect, which is a consequence of the highly limited computational capabilities of ATTiny2313, which required several approximations in the calculation of the prediction.
While the quantization and approximation errors limit the performance, the degradation is relatively moderate considering the very limited computational resources. The simulations on ATTiny microcontrollers showed that the algorithm is very frugal and fast, while being able to match the performance of its unconstrained counterpart. Although the kernel function defined in (6), a slightly modified RBF kernel, was used, the fixed-point method can also be applied to approximate the original RBF kernel (2). The proposed idea could probably be extended to other kernel functions that require operations with floating-point numbers, such as the polynomial kernel.

This would be of great practical importance for problems that do not work well with the RBF kernel, and the issue will be pursued in our future work. The proposed algorithm could also be improved by a more advanced budget maintenance method, such as the one described in [8]; with support vector merging the performance of the predictor would certainly improve, but it remains to be seen whether the limited device can support such an approach. Also, in this paper we assumed that the parameters A and B (the kernel width and the bit-length, respectively) are given. Although this is reasonable if the method is used for applications where the classification problem is well understood, it would be interesting to implement, on a resource-limited device, an algorithm that can automatically find the optimal parameter values.

Table 3. Microcontroller implementation costs - Banana

    ATTiny2313             2 bits         4 bits         6 bits         8 bits
                         Fixed  Float   Fixed  Float   Fixed  Float   Fixed  Float
    Program [B] (2048B)    744   6036     720    602     720    602     748   6040
    Data [B] (128B)        128    379     128    379     128    379     128    381
    Time [ms]               92   6604     985   7505     883    760     739   7496
    Accuracy [%]         67.32   65.2   81.08  81.00   79.36  79.60   78.00  77.80
    # of SVs                  112             62             43             32

Table 4. Microcontroller implementation costs - Checkerboard

    ATTiny2313             2 bits         4 bits         6 bits         8 bits
                         Fixed  Float   Fixed  Float   Fixed  Float   Fixed  Float
    Program [B] (2048B)    744   6036     720    602     720    602     748   6040
    Data [B] (128B)        128    379     128    379     128    379     128    381
    Time [ms]              568   7626    2226   6725     240   7366    2232   7703
    Accuracy [%]         52.20  51.20   72.60  72.64   78.20  78.60   74.04  73.80
    # of SVs                  112             62             43             32

Table 5. Microcontroller implementation costs - Pendigits

    ATTiny48               2 bits         4 bits         6 bits         8 bits
                         Fixed  Float   Fixed  Float   Fixed  Float   Fixed  Float
    Program [B] (4096B)    836   6078      80   6058      82   6056      86    592
    Data [B] (256B)        256    513     256    511     256    507     256    503
    Time [ms]             3280    486    3572   5352     488    702    5063   7373
    Accuracy [%]         93.80  94.20   92.76  93.92   86.44  86.32   82.00  81.68
    # of SVs                   46             23             15             11

5. ACKNOWLEDGMENTS
This work is funded in part by NSF grant IIS-0546155.

6. REFERENCES
[1] Anguita, D., Boni, A., Ridella, S., Digital Kernel Perceptron, Electronics Letters, 38:10, pp. 445-446, 2002.
[2] Anguita, D., Pischiutta, S., Ridella, S., Sterpi, D., Feed-Forward Support Vector Machine without Multipliers, IEEE Transactions on Neural Networks, 17, pp. 1328-1331, 2006.
[3] Cesa-Bianchi, N., Gentile, C., Tracking the Best Hyperplane with a Simple Budget Perceptron, Proc. of the Nineteenth Annual Conference on Computational Learning Theory, pp. 483-498, 2006.
[4] Crammer, K., Kandola, J., Singer, Y., Online Classification on a Budget, Advances in Neural Information Processing Systems, 2003.
[5] Dekel, O., Shalev-Shwartz, S., Singer, Y., The Forgetron: A Kernel-Based Perceptron on a Fixed Budget, Advances in Neural Information Processing Systems, pp. 259-266, 2005.
[6] Freund, Y., Schapire, R., Large Margin Classification Using the Perceptron Algorithm, Machine Learning, pp. 277-296, 1998.

[7] Hildebrand, F. B., Introduction to Numerical Analysis, 2nd edition, Dover, 1987.
[8] Nguyen, D., Ho, T., An Efficient Method for Simplifying Support Vector Machines, Proceedings of ICML, pp. 617-624, 2005.
[9] Orabona, F., Keshet, J., Caputo, B., The Projectron: A Bounded Kernel-Based Perceptron, Proceedings of ICML, 2008.
[10] Vapnik, V., The Nature of Statistical Learning Theory, Springer, 1995.
[11] Vucetic, S., Coric, V., Wang, Z., Compressed Kernel Perceptrons, Data Compression Conference, pp. 153-162, 2009.
[12] Wang, Z., Crammer, K., Vucetic, S., Multi-Class Pegasos on a Budget, Proceedings of ICML, 2010.
[13] Wang, Z., Vucetic, S., Tighter Perceptron with Improved Dual Use of Cached Data for Model Representation and Validation, Proceedings of IJCNN, 2009.
[14] ATTiny2313 Datasheet, www.atmel.com/dyn/resources/prod_documents/doc2543.pdf
[15] www.digikey.com