IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING: SPECIAL ISSUE ON EMERGING TECHNOLOGIES IN COMPUTER DESIGN 1


High Quality Down-Sampling for Deterministic Approaches to Stochastic Computing

M. Hassan Najafi, Student Member, IEEE, David J. Lilja, Fellow, IEEE

Abstract: Deterministic approaches to stochastic computing (SC) have been recently proposed to remove the random fluctuation and correlation problems of SC and so produce completely accurate results with stochastic logic. For many applications of SC, such as image processing and neural networks, completely accurate computation is not required for all input data. Decision-making on some input data can be done in a much shorter time using only a good approximation of the input values. While the deterministic approaches to SC are appealing by generating completely accurate results, the cost of precise results makes them energy inefficient for the cases when slight inaccuracy is acceptable. In this work, we propose a high quality down-sampling method for previously proposed deterministic approaches to SC by generating pseudo-random but accurate stochastic bit-streams. The result is a much better accuracy for a given number of input bits. Experimental results show that the processing time and the energy consumption of these deterministic methods are improved up to 61% and 41%, respectively, while allowing a mean absolute error (MAE) of 0.1%, and up to 500X and 334X improvement, respectively, for a MAE of 3.0%. The accuracy and the energy consumption are also improved compared to conventional random stream-based stochastic implementations.

Index Terms: Stochastic computing, high quality down-sampling, deterministic computing, energy-efficient processing, unary bit-stream, pseudo-randomized bit-stream.

1 INTRODUCTION

Stochastic Computing (SC) [1], [6], [21] has been around for many years as a noise-tolerant approximate computing approach. Logical computation is performed on probability data represented by uniformly distributed random bit-streams.
Image and video processing [2], [5], [14], [20], digital filters [15], low-density parity check decoding [16], [22] and neural networks [3], [4], [9], [11], [12], [13] have been the main target applications for SC. Low hardware cost and low power consumption advantages of this computing paradigm have encouraged designers to implement complex calculations in the stochastic domain. Multiplication, for instance, can be implemented using a simple standard AND gate. Inherent skew tolerance [18] and progressive precision [2] are other interesting advantages of computation on stochastic bit-streams. Specifically, the computation results are correct even when the inputs are misaligned temporally, and the quality of the results improves as the computation proceeds. These two properties can be exploited in improving the processing speed of stochastic systems by optimizing clock distribution networks and making quick decisions on the input data.

Random fluctuation, however, has always made stochastic computation somewhat inaccurate. Due to random fluctuation, stochastic operations often need to run for a very long time to produce highly accurate results. Recent progress in the idea of SC [17], [8], however, has revolutionized the paradigm and has changed the common belief on stochastic processing, that is, SC does not necessarily have to be an approximate computing approach. If properly structured, random fluctuation can be removed and SC circuits can produce deterministic and completely accurate results.

The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Twin Cities, MN USA (najaf011@umn.edu; lilja@umn.edu). Manuscript received June 28, 2017; revised Dec 7.

Fig. 1. Progressive precision comparison of the conventional random stream-based SC with the unary stream-based deterministic approaches of SC when multiplying two 8-bit precision input values. (Axes: mean absolute error (%) versus operation cycles, $2^8$ to $2^{16}$. Series: Deter-PrimeLength, Deter-Rotation, Deter-ClkDiv, Conv-Random.)
Najafi et al. [17] have shown that by choosing relatively prime lengths for a specific class of stochastic streams called unary streams, and repeating the streams up to the least common multiple of the stream lengths, a deterministic and completely accurate output can be produced by stochastic logic. Jenson and Riedel [8] further proposed two deterministic approaches of processing unary streams, rotation of streams and clock division. The proposed approaches not only are able to produce completely accurate results (i.e., zero percent error rate), but they also improve the hardware cost and the processing time of stochastic operations significantly when compared to those of the computations performed on the conventional randomized stochastic bit-streams.

While the unary stream-based deterministic approaches proposed in [17] and [8] are able to produce completely accurate results (i.e., results that are the same as the results of binary-radix computation), they suffer from a poor progressive precision property: the output converges to the expected correct value very slowly. This drawback can be a major limitation to wide use of these approaches in different applications. While ideally we are interested in producing completely accurate results, decision making on some inputs, particularly in image processing and neural network applications, does not require high precision operation, and a low-precision estimate of the output value is sufficient. In such cases, due to the poor progressive precision property of unary streams, stochastic operations must run for a much longer time than the cases with conventional random bit-streams to produce acceptable results with small levels of inaccuracy. When small rates of inaccuracy are acceptable, using the unary stream-based deterministic approaches will lead to a very long operation time and consequently a very high energy consumption.

Fig. 1 compares the progressive precision of different stochastic approaches when multiplying 8-bit precision input values. As can be seen in the figure, the conventional random stream-based stochastic approach shows a much better progressive precision than the unary stream-based deterministic approaches and so is the preferable choice for any application that can tolerate some errors, such as image processing and neural network applications.

In this paper, we propose a down-sampling method for the deterministic approaches introduced in [17] and [8] that improves their progressive precision property.
By modifying the structure of the stream generators, the deterministic methods not only are able to produce completely accurate results, they are also able to produce acceptable results in a much shorter time and so with a much lower energy consumption compared to the current architectures that generate and process unary streams. Our experimental results further show that, for the same operation time, deterministic down-sampling of the rotation and the relatively prime length approaches produces results with a lower average error rate than the error rate of processing conventional random stochastic streams.

This paper is structured as follows: Section 2 presents background information on SC including the conventional random stream-based approach and the deterministic approaches to SC. In Section 3, we describe our proposed down-sampling method. In Section 4, we validate and compare the proposed idea to the prior methods with experimental results. Finally, in Section 5, we present conclusions.

2 BACKGROUND

2.1 Stochastic Computing

In SC, computation is performed on random [21] or unary bit-streams [19], [17], [8] where the input value is encoded by the probability of obtaining a one versus a zero. Unipolar and bipolar formats are the two general representations for numbers in the stochastic domain. While the unipolar format can only be used for representing positive data in the interval [0, 1], the bipolar format can deal with both positive and negative values in [-1, 1].

Fig. 2. Structure of a stochastic stream generator: an $n$-bit number source (LFSR, counter) and an $n$-bit constant number (register) feed an $n$-bit comparator.

In the unipolar representation, the ratio of the number of ones to the length of the bit-stream determines the value, while in the bipolar format, the value is determined by the difference between the number of ones and zeros compared to the stream length. For example, a bit-stream in which 40% of the bits are ones represents 0.4 in the unipolar format and -0.2 in the bipolar format. For the rest of the paper we use the unipolar format.
However, the proposed idea is independent of the format of bit-streams and can also be applied to the bipolar representation.

The inputs to stochastic systems must first be converted to stochastic bit-streams to be processed by stochastic logic. The common approach for converting digital data in binary radix format into random stochastic bit-streams is by comparing a random value generated by a random or pseudo-random source to the target value. Linear feedback shift registers (LFSRs) are often used as the pseudo-random source in these stream generators. To convert binary input data to unary streams, an increasing/decreasing value from an up/down counter is compared to the target value. Fig. 2 shows the structure of these basic stochastic stream generators.

When performing computation on random stochastic bit-streams, due to the inherent random fluctuations, the length of the bit-streams has to be much longer than the precision expected for the computation result. Some operations, such as multiplication, also suffer from correlation between bit-streams. For these operations, the input bit-streams must be independent to produce accurate results. To produce an output with $n$-bit precision, the input bit-stream length, and so the number of cycles performing the operation, must be greater than $2^{2in-2}$, where $i$ is the number of independent inputs in the circuit [8]. Due to these properties, stochastic processing of random bit-streams is an approximate computation, as illustrated in Fig. 3(a).

2.2 Deterministic Approaches to Stochastic Computing

Recent work on SC [17], [8], [19] has shown that SC does not necessarily have to be an approximate approach and the result of computation can actually be completely accurate and deterministic. Instead of random stochastic bit-streams, logical computation is performed on a specific class of bit-streams, called unary streams. A unary stream consists of a sequence of 1s followed by a sequence of 0s. For example, a stream whose leading 40% of bits are 1s and whose remaining bits are 0s is a unary stream representing 0.4 in the unipolar format.
To represent a value with resolution of $1/2^n$ ($n$-bit precision), the unary stream must be $2^n$ bits long. For operations that require independent inputs, the independence between the input unary streams is provided by using relatively prime stream lengths [17], rotation, or clock division [8]. Fig. 3(b)-(d) exemplifies these three deterministic approaches to SC.
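The three ways of pairing unary streams described above can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware of [8] or [17]: the function names are hypothetical, the AND gate is modeled with the `&` operator, and the rotation variant assumes both streams have the same length.

```python
from fractions import Fraction
from math import gcd


def unary_stream(ones, length):
    """Unary stream: `ones` 1s followed by (length - ones) 0s."""
    return [1] * ones + [0] * (length - ones)


def multiply_relatively_prime(a, b):
    """AND two repeated streams of coprime lengths for len(a)*len(b) cycles."""
    la, lb = len(a), len(b)
    assert gcd(la, lb) == 1, "stream lengths must be relatively prime"
    hits = sum(a[t % la] & b[t % lb] for t in range(la * lb))
    return Fraction(hits, la * lb)


def multiply_clock_division(a, b):
    """Stream b advances by one bit only after a full pass over stream a."""
    la, lb = len(a), len(b)
    hits = sum(a[t % la] & b[(t // la) % lb] for t in range(la * lb))
    return Fraction(hits, la * lb)


def multiply_rotation(a, b):
    """Equal-length streams; b is stalled for one cycle after each pass over a."""
    n = len(a)
    assert len(b) == n, "this rotation sketch assumes equal stream lengths"
    hits = sum(a[t % n] & b[(t - t // n) % n] for t in range(n * n))
    return Fraction(hits, n * n)


if __name__ == "__main__":
    # 1/3 times 1/4 with stream lengths 3 and 4
    print(multiply_relatively_prime(unary_stream(1, 3), unary_stream(1, 4)))
    # 2/4 times 3/4 with two 4-bit streams
    print(multiply_clock_division(unary_stream(2, 4), unary_stream(3, 4)))
    print(multiply_rotation(unary_stream(2, 4), unary_stream(3, 4)))
```

In each variant every bit of the first stream meets every bit of the second stream exactly once over the full run, which is why the count of output ones equals the product of the two input counts.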

Fig. 3. Examples of performing stochastic multiplication: (a) conventional approximate SC with random bit-streams; (b)-(d) recently proposed deterministic approaches to SC with unary bit-streams: (b) relatively prime stream lengths, (c) clock division, (d) rotation.

To produce accurate results with these deterministic approaches, the operation must run for an exact number of clock cycles which is equal to the product of the lengths of the input bit-streams. For example, when multiplying two $n$-bit precision input values represented using two $2^n$-bit streams, the operation must run for exactly $2^{2n}$ cycles [8]. Running the operation for fewer cycles will lead to a poor result with an error outside the acceptable error bound. This important source of inaccuracy in performing computations on unary streams is called truncation error [17].

As an example, assume we want to multiply two 8-bit precision numbers, represented using unary streams, with the rotation or clock division deterministic approaches. The operation must run for exactly $2^{16}$ = 65,536 cycles to produce a completely accurate result. Exhaustively testing the multiplication operation for every possible pair of input values when running the operation for $2^{15}$ and $2^{10}$ cycles shows a mean absolute error (MAE) of 3.10% and 7.99%, respectively, for the rotation approach, and 12.3% and 24.4% for the clock division approach. With the conventional approach of processing random bit-streams, tested on a large set of random pairs of input values, completely accurate multiplication results could not be produced in $2^{16}$ cycles, but the good progressive precision property leads to acceptable results when running the operation for the same numbers of cycles (MAE of 0.15% after $2^{15}$ and 1.20% after $2^{10}$ cycles).
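The truncation error described above can be explored with a small exhaustive experiment. The sketch below is an illustrative software model only: the function names are hypothetical, it uses a 4-bit precision so that the pure-Python loops stay fast, and its numbers are therefore not comparable to the 8-bit MAE values quoted in the text.

```python
from fractions import Fraction


def unary_stream(ones, length):
    """Unary stream: `ones` 1s followed by (length - ones) 0s."""
    return [1] * ones + [0] * (length - ones)


def truncated_product(a, b, cycles, approach):
    """Estimate a*b from only the first `cycles` output bits of the AND gate."""
    n = len(a)  # both streams are assumed to have the same length
    hits = 0
    for t in range(cycles):
        if approach == "clock_division":
            j = (t // n) % n          # b advances once per pass over a
        elif approach == "rotation":
            j = (t - t // n) % n      # b is stalled once per pass over a
        else:
            raise ValueError("unknown approach: " + approach)
        hits += a[t % n] & b[j]
    return Fraction(hits, cycles)


def truncation_mae(bits, cycles, approach):
    """Mean absolute error over every pair of input values 0..2**bits."""
    n = 2 ** bits
    total, pairs = Fraction(0), 0
    for x in range(n + 1):
        for y in range(n + 1):
            estimate = truncated_product(
                unary_stream(x, n), unary_stream(y, n), cycles, approach)
            total += abs(estimate - Fraction(x * y, n * n))
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    for approach in ("clock_division", "rotation"):
        for cycles in (256, 128, 64):
            mae = truncation_mae(4, cycles, approach)
            print(approach, cycles, f"{100 * float(mae):.2f}%")
```

Running the full $2^{2n}$ cycles (256 here) gives zero error for both approaches, while stopping early leaves a non-zero mean error because the unskipped part of a unary stream no longer reflects its overall ratio of ones.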
3 DOWN-SAMPLING FOR DETERMINISTIC SC

While the randomness inherent in stochastic bit-streams was one of the main sources of inaccuracy in SC, distributing the ones across the stream instead of grouping them (i.e., first all ones and then all zeros) may be able to provide a good progressive precision property for representing stochastic numbers and, therefore, for the computation. With randomized bit-streams the result quality improves as the computation proceeds. This is because short sub-sequences of long random stochastic bit-streams provide low-precision estimates of the streams' values. This property can be exploited in many applications of SC for making quick decisions on the input data and so increasing the processing speed.

Fig. 4. Different types of stochastic bit-streams: (a) stochastic randomized bit-stream (7/16), (b) deterministic unary bit-stream (8/16), (c) deterministic pseudo-randomized bit-stream (8/16).

Deterministic approaches proposed in [17] and [8] perform computation on unary streams. Due to the nature of unary representation, truncating the bit-stream leads to a high truncation error and so a significant change in the represented value. In this work, we propose a high quality down-sampling approach for the deterministic approaches to SC by bringing randomization back into the representation of bit-streams. Similar to processing unary streams, the computations are completely accurate when the operations are executed for the required number of cycles. However, by pseudo-randomizing the streams, the computation will have a good progressive precision property and truncating the output streams by running for fewer clock cycles still produces high quality outputs.

For a deterministic and predictable randomization of the bit-streams, we propose to use maximal period pseudo-random sources (i.e., a maximal period LFSR) to generate the bit-streams. The important point is that the period of the pseudo-random source should be equal to the length of the bit-stream.
By using such a source to generate random numbers, we are able to convert an input value into a pseudo-random but completely accurate stochastic representation. Fig. 4 illustrates an example of representing the value 0.5 with a random, a unary, and our proposed pseudo-randomized bit-stream.

Table 1 compares the MAEs of the conventional random stream-based SC and the unary stream-based deterministic approaches of [17] and [8] with the approach proposed in this work. Multiplication of two 8-bit precision stochastic streams is tested on a large set of random input values for the conventional random SC and for the proposed approach, and on every possible input value for the unary deterministic approaches. For the conventional random stochastic approach, we evaluate the accuracy with two different structures for converting the input values to randomized stochastic bit-streams: 1) using maximal period 8-bit LFSRs, and 2) using maximal period 16-bit LFSRs to emulate a true-random number generator. Two different LFSRs (i.e., different designs with different seeds; see footnote 1) are used in each case to generate independent bit-streams. While the first structure can accurately convert the input values to $2^8$-bit pseudo-random bit-streams, the second structure converts the inputs to any stream with a length less than $2^{16}$ to give an approximate representation of the value. With the first structure, after 256 cycles, the generated bit-streams repeat and so the accuracy of the operation never improves after this time. Due to a more precise representation, the first structure shows a better MAE for low stream lengths. However, for very long bit-stream lengths, the second structure can produce a better MAE. The hardware cost of the second structure is twice that of the first one because of using larger LFSRs. Note that due to random fluctuation and correlation, neither of these two structures can produce completely accurate results in $2^{16}$ cycles.

Footnote 1: Two out of 16 different designs of maximal period 8-bit LFSRs and two out of 2,048 different designs of maximal period 16-bit LFSRs described in [10] are randomly selected for each run.

TABLE 1. Mean Absolute Error (%) comparison of the prior random and deterministic approaches to stochastic computing and the proposed deterministic approaches based on pseudo-randomized streams when multiplying two 8-bit precision stochastic streams with different numbers of operation cycles.

Design Approach                | SNG
Conventional Random Stochastic | Prior work [1], [21]: two 8-bit LFSRs
Conventional Random Stochastic | Prior work [1], [21]: two 16-bit LFSRs
Deterministic Prime Length     | Prior work [8], [17]: two counters
Deterministic Prime Length     | This work: two LFSRs
Deterministic Clock Division   | Prior work [8]: two counters
Deterministic Clock Division   | This work: two LFSRs
Deterministic Rotation         | Prior work [8]: two counters
Deterministic Rotation         | This work: two LFSRs

As shown in Table 1, the deterministic approaches proposed in [17] and [8] are able to produce completely accurate results when running the operation for $2^{16}$ cycles. Due to using unary bit-streams, however, the MAE of the computation increases significantly when running the operation for fewer cycles. This change clearly shows the poor progressive precision property and the high truncation error of these methods. Instead of unary streams, in this work we use pseudo-randomized but accurate bit-streams.
Integrating these bit-streams with the deterministic approaches results in completely accurate computation when it is run for the required number of cycles while still producing high quality results if the output stream is truncated.

As discussed in Section 2, in the deterministic approaches to SC, the required independence between input streams is provided by using relatively prime lengths, rotation, or clock division. When running the operations for the product of the lengths of the streams, these three methods cause every bit of the first stream to interact with every bit of the second stream [8], [17]. The computation is therefore performed deterministically and accurately irrespective of the location of the ones in each stream. Thus, as demonstrated in Fig. 5 with the interaction of two pseudo-randomized bit-streams, there is actually no requirement to use unary-style streams and, instead, we use pseudo-randomized bit-streams for the deterministic approaches.

We use different LFSRs (different LFSR designs and different seeds) for generating pseudo-randomized bit-streams. The period of the LFSR should be maximal and equal to the length of the bit-stream to accurately represent each value (see footnote 2). Thus, for 8-bit precision inputs, an 8-bit maximal period LFSR is required.

Footnote 2: An $n$-bit maximal period LFSR has a period of $2^n-1$, as the 0-state in the LFSR is normally not used. Here, for a fair comparison with the unary stream-based deterministic approaches, we add a 0-state to the set of the states of each LFSR to generate $2^n$ unique numbers.

Fig. 5. Deterministic approaches to SC by two pseudo-randomized bit-streams.
a) Relatively prime lengths
   a: a0 a3 a1 a2 | a0 a3 a1 a2 | a0 a3 a1 a2
   b: b1 b0 b3 | b1 b0 b3 | b1 b0 b3 | b1 b0 b3
b) Clock division
   a: a0 a3 a1 a2 | a0 a3 a1 a2 | a0 a3 a1 a2 | a0 a3 a1 a2
   b: b1 b1 b1 b1 | b0 b0 b0 b0 | b3 b3 b3 b3 | b2 b2 b2 b2
c) Rotation
   a: a0 a3 a1 a2 | a0 a3 a1 a2 | a0 a3 a1 a2 | a0 a3 a1 a2
   b: b1 b0 b3 b2 | b2 b1 b0 b3 | b3 b2 b1 b0 | b0 b3 b2 b1
Table 1 compares the MAE of the deterministic approaches when multiplying the input streams generated using the proposed approach. Similar to the unary stream-based deterministic approaches of [8], the proposed method produces completely accurate results when running the operation for $2^{16}$ cycles, but produces a much lower MAE when running for fewer cycles. Compared to the conventional random SC, the relatively prime length and the rotation approaches produce results with a lower MAE.

Note that, similar to the unary stream-based deterministic approaches that require separate counters for generating independent input bit-streams [8], sharing LFSRs in the proposed method is not possible. In the clock division deterministic approach, each LFSR must be driven with a different clock source, which prevents using optimization techniques such as sharing LFSRs+shifting [7] to save hardware cost. Similarly, the limitation of using number sources with different periods in the relatively prime approach and stalling number generators in the rotation approach prevents us from sharing pseudo-random number generators in the proposed method.
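A small software model can illustrate why replacing the counter with a maximal-period LFSR (plus an added 0-state) keeps the full-length result exact while improving truncated results. The sketch below makes several assumptions that are not taken from the paper: it uses a Galois-style LFSR, 4-bit sources, the two feedback masks `0xC` and `0x9`, a seed of 1, the clock-division pairing only, and it places the added 0-state at the start of the sequence. All function names are hypothetical.

```python
from fractions import Fraction


def lfsr_numbers(bits, mask, seed=1):
    """One period of a Galois LFSR, with the all-zero state prepended."""
    seq, state = [0], seed
    for _ in range(2 ** bits - 1):
        seq.append(state)
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask
    # A maximal-period LFSR visits every non-zero state exactly once.
    assert sorted(seq) == list(range(2 ** bits)), "mask is not maximal-period"
    return seq


def stream(value, numbers):
    """Comparator-based generator: emit 1 when the source number is below value."""
    return [1 if r < value else 0 for r in numbers]


def clock_division_estimate(a, b, cycles):
    """AND-gate output ratio after `cycles` cycles; b advances once per pass of a."""
    n = len(a)
    hits = sum(a[t % n] & b[(t // n) % n] for t in range(cycles))
    return Fraction(hits, cycles)


def mae(bits, cycles, src_a, src_b):
    """Mean absolute multiplication error over all input pairs 0..2**bits."""
    n = 2 ** bits
    errors = [
        abs(clock_division_estimate(stream(x, src_a), stream(y, src_b), cycles)
            - Fraction(x * y, n * n))
        for x in range(n + 1) for y in range(n + 1)
    ]
    return sum(errors) / len(errors)


BITS = 4
COUNTER = list(range(2 ** BITS))     # counter source, yields unary streams
LFSR_A = lfsr_numbers(BITS, 0xC)     # two different maximal-period
LFSR_B = lfsr_numbers(BITS, 0x9)     # 4-bit feedback masks (assumed)

if __name__ == "__main__":
    for cycles in (256, 128, 64):
        unary = float(mae(BITS, cycles, COUNTER, COUNTER))
        pseudo = float(mae(BITS, cycles, LFSR_A, LFSR_B))
        print(cycles, f"unary {100 * unary:.2f}%", f"pseudo {100 * pseudo:.2f}%")
```

Because each source emits every number from 0 to $2^n-1$ exactly once per period, a stream for input value `v` always contains exactly `v` ones, which is what preserves exactness over the full $2^{2n}$ cycles regardless of where the ones fall.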

4 EXPERIMENTAL RESULTS

To evaluate the proposed idea, we used the stochastic implementation of a well-known digital image processing algorithm, Roberts cross edge detection. In this edge detector, each operator consists of a pair of 2×2 convolution kernels that process the pixels of the input images based on their three immediate neighbors:

$$Y_{i,j} = 0.5 \times \left( |X_{i,j} - X_{i+1,j+1}| + |X_{i,j+1} - X_{i+1,j}| \right)$$

where $X_{i,j}$ is the value of the pixel at location $(i, j)$ of the input image and $Y_{i,j}$ is the corresponding output value. Fig. 6 shows the stochastic implementation of this algorithm proposed in [2].

Fig. 6. Stochastic circuit for the Roberts cross edge detection algorithm [2]: one XOR gate takes $X_{i,j}$ and $X_{i+1,j+1}$, a second XOR gate takes $X_{i+1,j}$ and $X_{i,j+1}$, and a MUX combines the two XOR outputs into $Y_{i,j}$.

Fig. 7. Proposed sources of generating pseudo-random numbers for the three deterministic approaches to SC: (a) relatively prime lengths (two $n$-bit LFSRs, with a comparator on the stop state resetting LFSR 2), (b) clock division (an $n$-input AND on LFSR 1 clocking LFSR 2), (c) rotation (an $n$-input AND on LFSR 1 inhibiting LFSR 2).

The two XOR gates compute absolute value subtraction when they are fed with correlated input streams (streams with maximum overlap between 1s). Sharing the same source of numbers (i.e., the same LFSR) for generating the input streams can provide correlated streams. The multiplexer (MUX) unit, on the other hand, performs scaled addition irrespective of correlation between its main input streams. The important point, however, is that the select input stream (here, a stream with the value 0.5) should be independent of the main input streams to the MUX. Thus, for the Roberts cross stochastic circuit, the four main input streams (the inputs to the XOR gates) should be correlated to each other, but should be independent of the select input of the MUX. Two number generators are, therefore, required for this circuit: one for converting the main inputs and one for generating the select input stream.
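The behavior of the XOR/XOR/MUX circuit can be checked with a short bit-level model. The sketch below rests on assumptions of its own: pixel values are integers in 0..n for an n-number source, the number sources are plain Python lists holding a permutation of 0..n-1 (standing in for a counter or an LFSR with an added 0-state), the select source is the one advanced once per pass of the main source (clock-division pairing), and the function name is hypothetical.

```python
from fractions import Fraction


def roberts_cross_sc(x00, x01, x10, x11, main_src, sel_src):
    """Full-length deterministic run of the XOR/XOR/MUX circuit.

    x00, x01, x10, x11 stand for X[i][j], X[i][j+1], X[i+1][j], X[i+1][j+1].
    """
    n = len(main_src)
    half = n // 2                        # select stream encodes the value 0.5
    hits = 0
    for t in range(n * n):
        r = main_src[t % n]              # shared source: correlated main streams
        s = sel_src[(t // n) % n]        # advances once per pass of main_src
        diag = (r < x00) ^ (r < x11)     # XOR of correlated streams: |x00 - x11|
        anti = (r < x01) ^ (r < x10)     # XOR of correlated streams: |x01 - x10|
        hits += diag if s < half else anti   # MUX performs the scaled addition
    return Fraction(hits, n * n)


if __name__ == "__main__":
    source = list(range(16))             # 4-bit counter as the number source
    # Inputs 12/16, 3/16, 9/16, 4/16: 0.5 * (8/16 + 6/16) = 7/16
    print(roberts_cross_sc(12, 3, 9, 4, source, source))
```

Since the four main streams share one number source, each XOR output is 1 exactly when the source value falls between its two inputs, and the select stream picks each XOR output for half of the passes, so the full run reproduces the scaled sum of the two absolute differences.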
We evaluate the performance, the hardware area, the power, and the energy consumption of the Roberts cross stochastic circuit in three different cases: 1) the conventional approach of processing random streams, 2) the prior deterministic approaches of processing unary streams, and 3) the proposed deterministic approaches of processing pseudo-randomized streams. The circuit shown in Fig. 6 is the core stochastic logic and will be shared between all cases.

Fig. 7 shows our proposed structures of the sources for generating pseudo-random numbers for the three deterministic approaches. For the relatively prime length approach, we assume the first number source has a period of $2^n-1$ and we control the period of the second source by setting a stop state. Here, for the Roberts cross circuit, pseudo-random number sources with periods of $2^8-1$ and $2^8-2$ are implemented. When the state (the output number) of LFSR 2 equals the stop state, LFSR 2 is restarted to its initial state. For the clock division structure, LFSR 2 is clock divided by the period of LFSR 1 through detecting the all-one state using an AND gate. Similarly, the rotation structure uses an AND gate to inhibit or stall every $2^n-1$ cycles when the all-one state is detected. These units are used as the number source in the stochastic stream generator shown in Fig. 2.

For the unary stream-based prior deterministic approaches, we optimized and implemented the counter-based architectures of [8]. For the conventional random stream-based implementations, we used two different 8-bit or two different 16-bit LFSRs as the required sources of random numbers.

We used the Synopsys Design Compiler vh with a 45nm gate library to synthesize the designs. As shown in Table 2, the hardware area cost of the proposed deterministic designs is slightly (<10%) more than that of their corresponding prior deterministic implementations. Due to replacing counters with LFSRs in the proposed architectures, the power consumption has also increased in all cases.
The important metric, however, in evaluating the efficiency of the implemented designs is energy consumption, defined as the product of the power consumption and processing time. We evaluate the energy-efficiency of the different designs by measuring the energy consumption of each one in achieving a specific accuracy in processing the inputs. MAE is used as the accuracy metric (a lower MAE means a higher accuracy).

To comprehensively test the designs, we simulate the operation of the Roberts cross circuit in each design approach by processing 10,000 sets of 8-bit precision random input values. For accurate representation of input values in each design approach, we randomly choose an integer value between zero and the period of the (pseudo-random) number generator and divide it by the period. Fig. 8 and Fig. 9 present the MAE and the standard deviation of processing random input values in different design approaches. Table 2 further shows the number of processing cycles and the energy consumption of each design to achieve different accuracies.
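The energy metric defined above can be written out explicitly. The following is a worked sketch rather than a formula quoted from the paper: the symbols $P$, $N_{\text{cycles}}$ and $f_{\text{clk}}$ are introduced here for illustration, and the 1 GHz clock is the operating point stated with the power figures of Table 2.

```latex
E = P \cdot T = P \cdot \frac{N_{\text{cycles}}}{f_{\text{clk}}},
\qquad
1\,\text{mW} \times 1\,\text{ns} = 1\,\text{pJ}
\;\Rightarrow\;
E\,[\text{pJ}] = P\,[\text{mW}] \times N_{\text{cycles}}
\quad (f_{\text{clk}} = 1\,\text{GHz})

\frac{E_{\text{proposed}}}{E_{\text{prior}}}
= \frac{P_{\text{proposed}}}{P_{\text{prior}}}
  \cdot \frac{N_{\text{proposed}}}{N_{\text{prior}}}
```

As a consistency check under the assumption that energy is exactly power times time, the 61% shorter processing time and 41% energy saving quoted in the abstract would correspond to a power ratio of roughly $0.59 / 0.39 \approx 1.5$, which agrees in direction with the higher power of the LFSR-based designs.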

TABLE 2. Area ($\mu m^2$), power (mW at 1 GHz), and energy consumption (pJ) of the Roberts cross stochastic circuit synthesized with the 8- and 16-bit conventional random approach, and also the prior structures and the proposed structures of the three deterministic approaches (Relatively Prime Lengths, Clock Division, and Rotation). Columns: design approach, area, power, and the cycle count and energy needed to reach each target error (MAE) of 0%, 0.1%, 0.3%, 0.5%, 1.0%, 2.0%, and 3.0%. Rows: Conv-Random-8, Conv-Random-16, PrimeL-Prior, PrimeL-Proposed, CLKDIV-Prior, CLKDIV-Proposed, Rotation-Prior, Rotation-Proposed.

Fig. 8. Mean Absolute Error (%) when processing random input values with the Roberts cross stochastic circuit using different stochastic approaches. (Horizontal axis: operation cycles, $2^5$ to $2^{16}$. Series: CLKDIV-Prior, Rotation-Prior, PrimeLength-Prior, Conv-Random-8, CLKDIV-Proposed, Rotation-Proposed, PrimeLength-Proposed, Conv-Random-16.)

Fig. 9. Standard deviation of the absolute error of processing random input values with the Roberts cross stochastic circuit using different stochastic approaches. (Same axis and series as Fig. 8.)

When completely accurate results are expected, the proposed designs must run for the same number of cycles as required by the prior deterministic designs (product of the periods of the number generators). Considering the higher power consumption of the proposed designs, the prior deterministic implementations consume less energy to achieve completely accurate results.
The great advantage of the proposed architectures starts when slight inaccuracy in the computation is acceptable. In such cases, the proposed designs start showing a much lower energy consumption by converging to the expected accuracy in a much shorter time. For the relatively prime and the rotation approaches, the proposed designs improve the processing time by 61% and 55%, respectively, resulting in energy consumption savings of 41% and 33% when accepting a MAE of as low as 0.1%. For a MAE of 3.0%, these architectures consume 324 and 334 times lower energy by improving the processing time by up to 500X compared to prior architectures of [8]. For the clock division approach, the proposed design is more energy efficient if at least a MAE of 1.0% is acceptable. The energy consumption is reduced 10 times for this method for a MAE of 3%.

Compared to the conventional random stream-based architectures (Conv-Random-8 with 8-bit LFSRs and Conv-Random-16 with 16-bit LFSRs), the proposed structures are more energy-efficient than the 16-bit conventional architecture but are at the same level as the 8-bit implementation. The important point, however, is that the 8-bit conventional architecture cannot achieve a MAE of 1.0% or lower and the 16-bit architecture requires a very long processing time and consumes significant energy to get close to completely accurate results.

5 CONCLUSION

Recent work on SC has shown that computation using stochastic logic can be performed deterministically and accurately by properly structuring unary-style bit-streams. The hardware cost and the latency of operations are much lower than those of the conventional random SC when completely accurate results are expected. For applications in which slight inaccuracy is acceptable, however, these unary stream-based deterministic approaches must run for a relatively long time to produce acceptable results. This processing time, which is often much longer than the latency of the conventional random SC in achieving the same accuracy levels, makes the deterministic approaches energy-inefficient.
While radomess was a source of iaccuracy i the covetioal radom stream-based SC, we exploited pseudoradomess i improvig the progressive precisio prop-

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING: SPECIAL ISSUE ON EMERGING TECHNOLOGIES IN COMPUTER DESIGN 7 erty of the determiistic approaches to SC.

Whe slight iaccuracy is acceptable, however, sigificat improvemet i the processig time ad eergy cosumptio is observed compared to the prior uary stream-based determiistic approaches ad also the

erty of the deterministic approaches to SC. Completely accurate results are still produced if running the operation for the required number of cycles. When slight inaccuracy is acceptable, however, significant improvement in the processing time and energy consumption is observed compared to the prior unary stream-based deterministic approaches and also the conventional random-stream based approaches. The proposed approach is applicable to any operation covered by the deterministic approaches of SC.

ACKNOWLEDGMENTS

This work was supported in part by National Science Foundation grant no. CCF. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

REFERENCES

[1] A. Alaghi and J. P. Hayes. Survey of stochastic computing. ACM Trans. Embed. Comput. Syst., 12(2s):92:1-92:19.
[2] A. Alaghi, C. Li, and J. Hayes. Stochastic circuits for real-time image-processing applications. In Design Automation Conference (DAC), th ACM/EDAC/IEEE, pages 1-6, May.
[3] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross. VLSI implementation of deep neural network using integral stochastic computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1-12.
[4] B. Brown and H. Card. Stochastic neural computation. I. Computational elements. Computers, IEEE Transactions on, 50(9), Sep.
[5] D. Fick, G. Kim, A. Wang, D. Blaauw, and D. Sylvester. Mixed-signal stochastic computation demonstrated in an image sensor with integrated 2D edge detection and noise filtering. In Proceedings of the IEEE 2014 Custom Integrated Circuits Conference, pages 1-4, Sept.
[6] B. Gaines. Stochastic computing systems. In Advances in Information Systems Science. Springer US.
[7] H. Ichihara, S. Ishii, D. Sunamori, T. Iwagaki, and T. Inoue. Compact and accurate stochastic circuits with shared random number sources. In 2014 IEEE 32nd International Conference on Computer Design (ICCD), Oct.
[8] D. Jenson and M. Riedel. A deterministic approach to stochastic computation. In Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD '16, pages 102:1-102:8, New York, NY, USA.
[9] K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi. Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks. In Proceedings of the 53rd Annual Design Automation Conference, DAC '16, pages 124:1-124:6, New York, NY, USA. ACM.
[10] P. Koopman. Maximal Length LFSR Feedback Terms. koopman/lfsr/index.html.
[11] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze. Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing. In Proceedings of the Conference on Design, Automation & Test in Europe, DATE '17, pages 13-18, 3001 Leuven, Belgium. European Design and Automation Association.
[12] B. Li, M. H. Najafi, and D. J. Lilja. Using stochastic computing to reduce the hardware requirements for a restricted Boltzmann machine classifier. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '16, pages 36-41, New York, NY, USA. ACM.
[13] B. Li, Y. Qin, B. Yuan, and D. J. Lilja. Neural network classifiers using stochastic computing with a hardware-oriented approximate activation function. In 2017 IEEE International Conference on Computer Design (ICCD), Nov.
[14] P. Li, D. Lilja, W. Qian, K. Bazargan, and M. Riedel. Computation on stochastic bit streams digital image processing case studies. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 22(3).
[15] Y. Liu and K. K. Parhi. Architectures for recursive digital filters using stochastic computing. IEEE Transactions on Signal Processing, 64(14), July.
[16] A. Naderi, S. Mannor, M. Sawan, and W. Gross. Delayed stochastic decoding of LDPC codes. Signal Processing, IEEE Transactions on, 59(11), Nov.
[17] M. H. Najafi, S. Jamali-Zavareh, D. J. Lilja, M. D. Riedel, K. Bazargan, and R. Harjani. Time-encoded values for highly efficient stochastic circuits. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 25(5):1-14.
[18] M. H. Najafi, D. J. Lilja, M. D. Riedel, and K. Bazargan. Polysynchronous clocking: Exploiting the skew tolerance of stochastic circuits. IEEE Transactions on Computers, 66(10), Oct.
[19] M. H. Najafi, D. J. Lilja, M. D. Riedel, and K. Bazargan. Power and Area Efficient Sorting Networks using Unary Processing. In Computer Design (ICCD), 2017 IEEE 35th International Conference on.
[20] M. H. Najafi and M. E. Salehi. A Fast Fault-Tolerant Architecture for Sauvola Local Image Thresholding Algorithm using Stochastic Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(2), Feb.
[21] W. Qian, X. Li, M. Riedel, K. Bazargan, and D. Lilja. An architecture for fault-tolerant computation with stochastic logic. Computers, IEEE Trans. on, 60(1):93-105, Jan.
[22] S. Tehrani, S. Mannor, and W. Gross. Fully parallel stochastic LDPC decoders. Signal Processing, IEEE Transactions on, 56(11), Nov.

M. Hassan Najafi (S'15) received the B.Sc. degree in computer engineering from University of Isfahan, Isfahan, Iran, and the M.Sc. degree in computer architecture from University of Tehran, Tehran, Iran, in 2011 and 2014, respectively. He is currently pursuing the Ph.D. degree with ARCTIC Labs, Department of Electrical and Computer Engineering, University of Minnesota, Twin Cities, MN, USA. His current research interests include stochastic and approximate computing, computer-aided design of integrated circuits, low-power design, and designing fault tolerant systems. In recognition of his research, he received the Doctoral Dissertation Fellowship at the University of Minnesota and the Best Paper Award at the th IEEE International Conference on Computer Design.

David J. Lilja (F'06) received the B.S. degree in computer engineering from Iowa State University in Ames, IA, USA, and the M.S. and Ph.D. degrees in electrical engineering from the University of Illinois at Urbana-Champaign in Urbana, IL, USA. He is currently the Schnell Professor of Electrical and Computer Engineering at the University of Minnesota in Minneapolis, MN, USA, where he also serves as a member of the graduate faculties in Computer Science, Scientific Computation, and Data Science. Previously, he served ten years as the head of the ECE department at the University of Minnesota, and worked as a research assistant at the Center for Supercomputing Research and Development at the University of Illinois, and as a development engineer at Tandem Computers Incorporated in Cupertino, California. He was elected a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a Fellow of the American Association for the Advancement of Science (AAAS).
