PAPER A Systolic FPGA Architecture of Two-Level Dynamic Programming for Connected Speech Recognition


Yong KIM a), Student Member and Hong JEONG, Nonmember

SUMMARY In this paper, we present an efficient architecture for connected word recognition that can be implemented on a field programmable gate array (FPGA). The architecture is built on a newly derived form of two-level dynamic programming (TLDP) that uses only bit addition and shift operations. Its advantages are spatial efficiency, which accommodates more words in limited space, and the absence of multiplications, which increases computational speed by reducing propagation delays. The architecture is highly regular, consisting of identical and simple processing elements with only nearest-neighbor communication; external communication occurs only at the end processing elements. To verify the proposed architecture, we designed and implemented it, prototyping with Xilinx FPGAs running at 33 MHz.
key words: speech recognition, hidden Markov model (HMM), two-level dynamic programming (TLDP), FPGA

1. Introduction

Speech recognition is a process that allows a computer to map acoustic speech signals to text. That is, speech recognition converts acoustic speech signals provided by a microphone or a telephone into words, groups of words, or sentences. Recognition results may be used as final results in applications such as instructions, controls, data input, and documentation, and may also be used as inputs to language processing in the field of speech understanding. Furthermore, speech recognition is an attractive technique for interactive communication between humans and computers, making computer usage environments more convenient for people. For most speech recognition applications, it is sufficient to produce results in real time, and software solutions that perform recognition in real time already exist.
However, to increase the use of speech recognition in embedded systems, we need a speech recognition chip with low power consumption, small size, and low cost.

In previous work, dedicated hardware architectures for hidden Markov model (HMM)-based speech recognition were introduced in [1]–[8]. These are summarized in Table 1. The existing architectures are designed for isolated speech recognition; there is no direct hardware implementation for connected speech recognition. In this paper, we derive a new architecture for connected word recognition that is based on bit additions and shift operations only, excluding any integer multiplications, using the TLDP [12], [13] algorithm. In connected word recognition, most of the problems arise from the difficulty of reliably determining the word boundaries. TLDP is a well-known speech recognition algorithm that assigns word strings to speech segments.

We introduce an efficient linear systolic array architecture that is appropriate for FPGA implementation. The array is highly regular, consisting of identical and simple processing elements. The design is scalable and, since these arrays can be concatenated, easily extensible: the hardware resources can be matched to varying conditions by making small modifications to the architecture. The architecture is suited to ASIC and FPGA chip design, and allows the realization of small devices with low power consumption and low cost by using an algorithm optimized for the chip.

Manuscript received May 22,
Manuscript revised September 18,
The authors are with the department of electronics and electrical engineering, POSTECH, Pohang, Kyungbuk, , Korea.
a) ddda@postech.ac.kr
DOI: /ietisy/e90 d
Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

The hardwired speech recognition system can be installed easily in a device that uses speech recognition through a small and convenient interface, without a computer, and achieves real-time speech recognition owing to its parallel architecture.

The organization of this paper is as follows. Section 2 derives the TLDP algorithm for connected speech recognition. Section 3 shows the detailed systolic architecture of TLDP. The test results are discussed in Sect. 4 and, finally, conclusions are given in Sect. 5.

Table 1 Comparison of HMM systems.
Author | Year | Implementation method | Vocabulary size | Recognition rate
V. Upadhyaya [1] | 1993 | Using Multiple-Compare-Select (MCS) operation | |
B.S. Kim [2] | 2000 | Using IHMM, removing redundant computation of path matrix | |
F.L. Vargas [3] | 2001 | Hardware/software co-design implementation approach, small-speech recognition | | over %
J.M. Jou [4] | 2001 | Using look-ahead pipelining technique | 25,000 | 92%
B.G. Park [5] | 2002 | Using modified Viterbi scoring procedure and precomputing logic | |
S. Yoshizawa [6] | 2002 | Using continuous HMM (CHMM) based speech recognition | |
W. Han [7] | 2003 | Using multi-mixture Gaussian observation probability within each state of the models | | %
F.A. Elmisery [8] | 2003 | Using modified HMM algorithm, isolated Arabic word recognition | | 98%
G.C. Caradarilli [9] | 2004 | Continuous-speech speaker-independent ASR systems | 200 |

2. Background of Two Level Dynamic Programming Algorithm

When word boundaries are unclear (the connected speech case), the TLDP algorithm can be used to find them. A very brief outline is given below. For a more detailed description of TLDP theory, see [11]. The notation here is based on [11].

2.1 Basic Principles of TLDP

The basic idea of TLDP is to break up the computation of connected speech recognition into two stages. At the first level, the algorithm matches each individual word reference pattern, $R_v$, against an arbitrary portion of the test string, $T$. The utterances $T$ and $R_v$ are expressed as

$$T = \{t(1), t(2), \ldots, t(M)\} = \{t(m)\}_{m=1}^{M}, \qquad R_v = \{r_v(1), r_v(2), \ldots, r_v(N_v)\} = \{r_v(m)\}_{m=1}^{N_v}, \tag{1}$$

where $t(m)$ is a test pattern, $r_v(m)$ ($1 \le v \le V$) is a pattern of the $v$-th word from among $V$ reference patterns to be recognized, and $N_v$ is the duration of the $v$-th word reference pattern. For the range of beginning test frames of the match, $s$, $1 \le s \le M$, and for the range of ending test frames, $e$, $1 \le e \le M$ ($e > s$), the minimum distance $\hat{D}(v, s, e)$ for every possible vocabulary pattern, $R_v$, between each possible pair of beginning and ending frames $(s, e)$ is defined as

$$\hat{D}(v, s, e) = \min_{w(m)} \sum_{m=s}^{e} d(t(m), r_v(w(m))), \tag{2}$$

where $d(\cdot, \cdot)$ is a local spectral distance measure, and $w(m)$ is the time-alignment (warping) function that maps test frame $m$ to a frame of the reference pattern $R_v$. We can eliminate $v$ by finding the best match between $s$ and $e$ for any $v$, giving

$$D(s, e) = \min_{1 \le v \le V} [\hat{D}(v, s, e)] = \text{best score}, \qquad \tilde{N}(s, e) = \arg\min_{1 \le v \le V} [\hat{D}(v, s, e)] = \text{best reference index}, \tag{3}$$

thereby significantly reducing the data storage without losing optimality.
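The reduction in (3) can be illustrated with a short software sketch. This is not the authors' implementation: the function name `first_level_reduce`, the nested-list layout `d_hat[v][s][e]`, and the 0-based frame indexing are assumptions made for illustration, and the per-word costs of (2) are taken as given rather than computed.

```python
import math

def first_level_reduce(d_hat):
    """Collapse per-word costs d_hat[v][s][e] into a best score and word index.

    d_hat[v][s][e] stands in for D-hat(v, s, e) of Eq. (2); frames are 0-based
    here (an assumption of this sketch). Only entries with e > s are used.
    """
    V = len(d_hat)          # number of reference words
    M = len(d_hat[0])       # number of test frames
    best = [[math.inf] * M for _ in range(M)]   # D(s, e)
    index = [[None] * M for _ in range(M)]      # N-tilde(s, e)
    for v in range(V):
        for s in range(M):
            for e in range(s + 1, M):
                # keep the smaller cost seen so far and remember which word gave it
                if d_hat[v][s][e] < best[s][e]:
                    best[s][e] = d_hat[v][s][e]
                    index[s][e] = v
    return best, index

# Hypothetical costs for V = 2 words and M = 3 frames.
INF = math.inf
d_hat = [
    [[INF, 4, 9], [INF, INF, 5], [INF, INF, INF]],   # word 0
    [[INF, 6, 7], [INF, INF, 2], [INF, INF, INF]],   # word 1
]
best, index = first_level_reduce(d_hat)
```

Only one cost and one index per (s, e) pair survive the reduction, which is the storage saving noted above: the table shrinks from V·M·M entries to M·M.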
Given the array of best scores, $D(s, e)$, the second level of the computation pieces together the individual reference pattern scores to minimize the overall accumulated distance over the entire test string. This can be accomplished using dynamic programming as

$$D_l(e) = \min_{1 \le s < e} [D(s, e) + D_{l-1}(s-1)], \tag{4}$$

where $D_l(e)$ is the distance of the best path ending at frame $e$ using a concatenated sequence of $l$ reference patterns. The best path ending at frame $e$ using exactly $l$ reference patterns is the one with minimum distance over all possible beginning frames, $s$, of the concatenation of the best path ending at frame $s-1$ using exactly $l-1$ reference patterns plus the distance of (3) of the best path from frame $s$ to frame $e$.

3. The Systolic Architecture of Two-Level Dynamic Programming

Figure 1 is a basic block diagram of a TLDP. The system includes a first processing element group, a comparison module, a second processing element group, and a backtracking module. The first processing element group includes a plurality of parallel processing elements that have the same configuration, and calculate matching costs by using the HMM algorithm. The comparison module determines the minimum matching cost from the first processing element group, and stores it for later calculation. The second processing element group finds the optimized matching cost with the reference pattern for the total frame by using the minimum value determined by the comparison module, detects the word's end point, and recognizes a connected word. The second processing element group also includes a plurality of parallel processing elements having the same configuration. The backtracking module finds a word arrangement of the reference pattern that corresponds to the speech recognition result based on the calculation result by the second processing element group.
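As a software reference for what the second processing element group and the backtracking module compute together, recurrence (4) can be sketched as follows. The function name `second_level_dp`, the 1-based table layout, the back-pointer array, and the rule of picking the word count with the smallest final cost are all assumptions of this sketch rather than details taken from the hardware.

```python
import math

def second_level_dp(D, N, M, L_max):
    """Recurrence (4) plus a traceback, with 1-based frames.

    D[s][e] and N[s][e] are the best score and best word index of Eq. (3);
    row/column 0 is unused padding. Returns (total cost, list of word indices).
    """
    INF = math.inf
    # dist[l][e]: cost of the best path ending at frame e using exactly l words
    dist = [[INF] * (M + 1) for _ in range(L_max + 1)]
    back = [[0] * (M + 1) for _ in range(L_max + 1)]   # best start frame s
    dist[0][0] = 0.0        # D_0(0) = 0; every other boundary value stays infinite
    for l in range(1, L_max + 1):
        for e in range(1, M + 1):
            for s in range(1, e):                      # 1 <= s < e, as in (4)
                cand = D[s][e] + dist[l - 1][s - 1]
                if cand < dist[l][e]:
                    dist[l][e] = cand
                    back[l][e] = s
    # Assumption: the recognized string is the word count with the lowest final cost.
    best_l = min(range(1, L_max + 1), key=lambda l: dist[l][M])
    if dist[best_l][M] == INF:
        raise ValueError("no word sequence covers the whole test string")
    words, l, e = [], best_l, M
    while l > 0:                                       # walk the back-pointers
        s = back[l][e]
        words.append(N[s][e])
        e, l = s - 1, l - 1
    words.reverse()
    return dist[best_l][M], words

# Hypothetical first-level tables for M = 4 frames.
INF = math.inf
M = 4
D = [[INF] * (M + 1) for _ in range(M + 1)]
N = [[None] * (M + 1) for _ in range(M + 1)]
D[1][2], N[1][2] = 1, 0     # word 0 spans frames 1..2
D[3][4], N[3][4] = 2, 1     # word 1 spans frames 3..4
D[1][4], N[1][4] = 5, 2     # word 2 spans frames 1..4
cost, words = second_level_dp(D, N, M, L_max=2)
```

In this toy case two short words (total cost 3) beat one long word (cost 5), so the traceback returns the two-word string. The triple loop is the sequential form of the computation; the architecture described next distributes the loop over e across parallel processing elements.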
In this arrangement, the first processing element group and the comparison module form the first level dynamic programming (first level DP), while the second processing element group and the backtracking module form the second level dynamic programming (second level DP).

Fig. 1 Overall structure of TLDP.

3.1 Architecture of First Level DP

The processing elements of the first group use the HMM algorithm and the dynamic programming scheme to calculate the matching costs. The matching cost $PE_{\mathrm{lev.1}}(v, p, m)$ at the $p$-th processing element is

$$PE_{\mathrm{lev.1}}(v, p = s, m = e) = \hat{D}(v, s, e), \qquad 1 \le p, m \le M, \tag{5}$$

where $M$ is the dimension of the total frame. Equation (5) gives the matching cost between the test pattern and the reference pattern over the interval $(s, e)$. Hardware architectures for HMM-based speech recognition were introduced in [1]–[8] and are therefore not presented in this paper.

Fig. 2 Architecture of first level DP. (a) Overall architecture. (b) Architecture of comparison module.

As demonstrated in (5) and Fig. 2, the $p$-th processing element sequentially calculates the matching costs for end points from $p$ to $M$ when the start point is $p$. There are $M$ processing elements; hence, the matching costs from all the start points to all the end points can be calculated. Realizing this process in software requires a matching time of $M^2$ clock cycles, whereas the parallel hardwired configuration of the present architecture generates the same results in $M$ clock cycles, corresponding to the dimension of the total frame.

Figure 2 (a) shows the systolic array architecture of the first level DP. The first level DP includes the first processing element group and the comparison module. The first processing element group includes a state input unit and a plurality of parallel processing elements that have the same configuration for calculating the matching cost. The first level DP calculates the matching costs of a test pattern against the reference patterns for each start point and end point by using the HMM algorithm and the dynamic programming scheme, determines the minimum matching cost, and extracts the index of the corresponding reference pattern. That is, because each processing element uses a different start point for comparing the test pattern with the reference pattern, the matching costs for all start points can be calculated in $M$ input clock cycles when the test pattern has $M$ components.

When the state input unit receives a feature vector of a speech signal from the feature vector generator, the HMM parameters, namely the state transition probability distribution $A_v$ and the observation symbol probability distribution $B_{v,m}$, are calculated from the learned probability values and provided to the state input unit. To calculate the matching costs, the HMM parameters are input sequentially to the processing elements in step with the clock.

Figure 2 (b) shows the comparison module of Fig. 2 (a) in detail. The comparison module stores the minimum matching cost from the first processing element group and the index of the corresponding reference pattern. The minimum matching cost is calculated as

$$C_{\mathrm{memory}}(v, p, m) = \min[C_{\mathrm{memory}}(v-1, p, m),\ PE_{\mathrm{lev.1}}(v, p, m)], \qquad I_{\mathrm{memory}}(v, p, m) = \arg\min[C_{\mathrm{memory}}(v-1, p, m),\ PE_{\mathrm{lev.1}}(v, p, m)], \tag{6}$$

where $C_{\mathrm{memory}}(v, p, m)$ is the stored matching cost, and $I_{\mathrm{memory}}(v, p, m)$ is the corresponding index. As (6) shows, the previously stored minimum matching cost, $C_{\mathrm{memory}}(v-1, p, m)$, is compared with the current input matching cost, $PE_{\mathrm{lev.1}}(v, p, m)$, and the smaller one is stored in the memory. That is, the memory always holds the minimum of the matching costs that have been input up to that time. Since the values of $e$ in $PE_{\mathrm{lev.1}}(v, p, m)$ are input sequentially from 1 to $M$, the memory in the comparison module is configured as $M$ first-in first-out (FIFO) memories for sequential comparison. Also, as shown in Fig. 2 (a), the vertical axis corresponds to the start point and the horizontal axis to the end point of each stored cost. Since the start point cannot be greater than the end point, the available values correspond to those with slash marks in the comparison module of Fig. 2 (a). The required information is not only the minimum cost but also the corresponding word index. Therefore, at $v = V$ all memory elements hold the minimum cost and its index:

$$C_{\mathrm{memory}}(V, p = s, m = e) = D(s, e), \qquad I_{\mathrm{memory}}(V, p = s, m = e) = \tilde{N}(s, e). \tag{7}$$

Fig. 5 The block diagram of the overall hardware.

Table 2 Results from TLDP implementation (M = 40).
 | First Level DP | Second Level DP | TLDP
Number of Slices | 32,692 | 20,629 | 44,027
Number of Slice FF | 46,184 | 5,801 | 52,391
Number of 4 input LUTs | 49,883 | 36,490 | 87,394
Max. Frequency | MHz | MHz | MHz

3.2 Architecture of Second Level DP

At the second level DP, we compute the distance of the best path ending at frame $e$, $D_l(e)$, using a concatenated sequence of $l$ reference patterns as in (4). Figure 3 illustrates the algorithm for finding the optimized matching cost $D_l(e)$. Let us define the cost of the $p$-th second level DP processing element at $l$ reference patterns as

$$PE_{\mathrm{lev.2}}(l, p = e) = \min_{1 \le s < e} [D(s, e) + D_{l-1}(s-1)] = D_l(e), \tag{8}$$

where $D_0(0)$ is 0, and $D_l(0)$ is $\infty$ ($1 \le l \le L_{\max}$).

Fig. 3 Second level dynamic programming.

As shown in Fig. 3, the second level finds $D_l(e)$ by using the values of $D_{l-1}(0), D_{l-1}(1), \ldots, D_{l-1}(e-1)$, and the number of cases to be compared increases as $e$ increases. That is, as demonstrated in (8) and Fig. 3, the second level adds $D(s, e)$ found by the first level to $D_{l-1}(s-1)$, the cost of the best path ending at frame $(s-1)$ using $(l-1)$ reference patterns, and finds the matching costs with $l$ reference patterns by using the dynamic programming scheme.

Fig. 4 Architecture of second level DP. (a) Overall architecture. (b) Architecture of second level processing element.

As shown in Fig. 4 (a), the second level includes a second processing element group with a plurality of processing elements that have the same configuration, and a register for storing the matching costs calculated by the respective processing elements. The second level is easy to design and modify since all processing elements have the same configuration.

The remaining task is to describe the internal structure of the second level DP processing element, shown in Fig. 4 (b). The trapezoidal block represents a comparator. For $M$ clock cycles, the block chooses the smaller of $D(s, e) + D_{l-1}(s-1)$ and $E_l(s-1)$. At clock cycle $M + 1$, it updates $D_l(s)$ with the register value, $E_l(s)$. Notice that no multiplier is involved in this design, or in other parts of the system.

As shown in Fig. 4 (b), the processing elements of the second processing element group sequentially receive the value of $D(s, e)$ from the memory module, and concurrently receive the value of $D_{l-1}(s-1)$ calculated and stored in the register. While $M$ clock signals are applied, these two inputs, $D(s, e)$ and $D_{l-1}(s-1)$, are transmitted to the processing elements, the minimum $D_l(e)$ is selected among the sums of the two inputs, and the value of $D_l(e)$ is output when the $(M + 1)$-th clock signal is applied. The output value of $D_l(e)$ is stored in the register, and the matching cost with the $(l + 1)$-th reference pattern is then calculated. That is, the processing elements repeatedly calculate the matching costs with the reference patterns during the $M$ clock signals and write the updated results to the register at the $(M + 1)$-th clock signal. After this has been repeated up to $L_{\max}$ reference patterns, the final matching cost is found by using all the $D_l(e)$ values stored in the register.
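The clocked behaviour of the second level processing elements can be mimicked in software as a rough functional model. The sketch below is an assumption-laden illustration, not the VHDL design: the function name `run_level`, the list `E` standing in for the per-element running-minimum registers, and the use of infinity for invalid (s ≥ e) entries are all hypothetical, and the movement of data between neighbouring elements is not modelled.

```python
import math

def run_level(D, prev, M):
    """Emulate one (M + 1)-clock pass of the second-level PE array.

    D[s][e]: first-level best scores, 1-based, infinite where s >= e.
    prev[k]: D_{l-1}(k) for k = 0..M, as held in the register.
    Returns the list D_l(0..M) that would be latched at clock M + 1.
    """
    E = [math.inf] * (M + 1)              # E[e]: running minimum inside PE e
    for s in range(1, M + 1):             # clocks 1..M, one start frame per clock
        d_prev = prev[s - 1]              # D_{l-1}(s-1) read from the register
        for e in range(1, M + 1):         # in hardware, all PEs work in the same clock
            if s < e:
                total = D[s][e] + d_prev  # adder only, no multiplier
                if total < E[e]:          # comparator keeps the smaller value
                    E[e] = total
    # Clock M + 1: register update; D_l(0) is infinite for l >= 1.
    return [math.inf] + E[1:]

# Hypothetical first-level scores for M = 4 frames.
INF = math.inf
M = 4
D = [[INF] * (M + 1) for _ in range(M + 1)]
D[1][2] = 1
D[3][4] = 2
D[1][4] = 5
d0 = [0] + [INF] * M                      # D_0(0) = 0, all other entries infinite
d1 = run_level(D, d0, M)                  # costs using exactly one word
d2 = run_level(D, d1, M)                  # costs using exactly two words
```

Each call corresponds to one value of l, so L_max passes take roughly L_max·(M + 1) clock cycles in this model; only additions and comparisons appear in the inner loop, matching the multiplier-free claim above.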

The backtracking module performs traceback on the reference patterns stored in the memory by using the final matching cost $D_l(e)$, and extracts the corresponding reference indices, thereby recognizing the speech signal.

4. System Implementation and Experimental Results

The system configuration is shown in Fig. 5. The speech signal is band-limited and sampled at 16 kHz with 12 bits. Feature extraction produces Mel Frequency Cepstral Coefficients (MFCCs) as 13-dimensional vectors, each with 12 bits, from the pre-processed data. The HMM parameters are extracted using the feature vectors from the preprocessor and the trained data in memory. The pattern matching element chooses the reference that matches the parameter set of the input signal by using the HMM and TLDP algorithms.

The system is designed for an FPGA (Xilinx Virtex-II XC2V8000) running at 33 MHz. The entire chip is designed in VHDL and has been fully tested and found error free. The following experimental results are all based on VHDL simulation. The chip has been simulated extensively with the Cadence simulation tools. It is designed to interface with the PLX9656 PCI chip, which has a maximum clock frequency of 66 MHz. The PLX9656 PCI is used within a PC with a 3.06 GHz Pentium 4 processor. The full design (M = 40) occupies 44,027 of the XC2V8000's slices, equal to 94%, and requires 87,394 LUTs and 52,391 FFs (see Table 2).

The speech data used for testing and training were taken from the database designed by the Speech Technology Research Center of Korea. Both the test and training groups consisted of 10 male and 10 female speakers. We achieved a 91.4% correctness rate on a vocabulary of more than 500 words.

5. Conclusion

Hardware needs different algorithms for the same application in terms of performance and quality.
Since the algorithms used for hardware and software implementation differ significantly, it is difficult, if not impossible, to migrate software implementations directly to hardware. We have presented a fast and efficient architecture and implementation of the previously presented TLDP algorithm. A systolic TLDP was derived and tested with VHDL code simulation. The scheme is fast and reliable since the architecture is highly regular. In addition, the processing can be done in real time owing to the parallel hardware implementation. A full-scale system can be obtained easily by scaling the number of processing elements and the number of words.

References

[1] V. Upadhyaya, S.J. Upadhyaya, and A. Kundu, "A parallel VLSI implementation of Viterbi algorithm for accelerated word recognition," VLSI, Design Automation of High Performance VLSI Systems, Proceedings. Third Great Lakes Symposium on, pp.37–41.
[2] B.-S. Kim, B. Park, J.-D. Cho, and Y.-H. Chang, "Low power Viterbi search architecture using inverse hidden Markov model," Signal Processing Systems, SiPS IEEE Workshop on.
[3] F.L. Vargas, R.D.R. Fagundes, and D.B. Junior, "A FPGA-based Viterbi algorithm implementation for speech recognition systems," Acoustics, Speech, and Signal Processing, Proceedings. (ICASSP '01) IEEE International Conference on, vol.2.
[4] J.M. Jou, Y.-H. Shiau, and C.-J. Huang, "An efficient VLSI architecture for HMM-based speech recognition," Electronics, Circuit and Systems, ICECS The 8th IEEE International Conference, vol.1.
[5] B.-G. Park, K.-S. Cho, and J.-D. Cho, "Low power VLSI architecture of Viterbi scorer for HMM-based isolated word recognition," Quality Electronic Design, Proceedings. International Symposium on.
[6] S. Yoshizawa, Y. Miyanaga, and N. Wada, "A low-power VLSI design of an HMM based speech recognition system," Circuits and Systems, MWSCAS The th Midwest Symposium on, vol.2, pp.II-489–II-492.
[7] W. Han, K.-W. Hon, C.-F. Chan, T. Lee, C.-S. Choy, K.-P. Pun, and P.C. Ching, "An HMM-based speech recognition IC," Circuits and Systems, ISCAS '03. Proceedings of the 2003 International Symposium on, vol.2, pp.II-744–II-747.
[8] F.A. Elmisery, A.H. Khalil, A.E. Salama, and H.F. Hammed, "A FPGA-based HMM for a discrete Arabic speech recognition system," Microelectronics, ICM Proceedings of the 15th International Conference on.
[9] G.C. Caradarilli, A. Malatesta, M. Re, L. Arnone, and S. Bocchio, "Hardware oriented architectures for continuous-speech speaker-independent ASR systems," Signal Processing and Information Technology, Proceedings of the Fourth IEEE International Symposium on, Dec.
[10] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol.77, no.2.
[11] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall.
[12] H. Sakoe, "Two-level DP-matching: A dynamic programming-based pattern matching algorithm for connected word recognition," IEEE Trans. Acoust. Speech Signal Process., vol.27, no.6, Dec.
[13] S. Nakagawa, "A connected spoken word recognition method by O(n) dynamic programming pattern matching algorithm," IEEE International Conference on ASSP, vol.8, April.
[14] H. Ney, "A comparative study of two search strategies for connected word recognition: Dynamic programming and heuristic search," IEEE Trans. Pattern Anal. Mach. Intell., vol.14, no.5, May.
[15] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall.

Yong Kim received the BS degree from the MSE and EE Dept. at POSTECH in 2000, and the MS degree from the EE Dept. at POSTECH in 2002. Since 2002, he has been working towards the PhD degree at POSTECH. His current research interests include speech signal processing.

Hong Jeong received the BS degree from the EE Dept. at the Seoul National University in 1977. In 1979, he received the MS degree from the EE Dept. at KAIST. He received the SM, EE, and PhD degrees from the EECS Dept. at MIT, and he taught at the Kyungbuk National University. Since 1988, he has worked at POSTECH, where he is now an associate professor. He has also worked at Bell Labs in Murray Hill and was on leave at USC as a visiting professor. His major research area is multimedia signal processing.
