Neural Machine Translation In Sogou, Inc. Feifei Zhai and Dongxu Yang


1 Neural Machine Translation In Sogou, Inc. Feifei Zhai and Dongxu Yang


2 Sogou Company. No. 2 Chinese Internet company in terms of user base. Strong R&D capabilities: 2,100 employees, of which 76% are technology staff, the highest share in China's Internet industry; 38% of employees hold master's or doctoral degrees. PC MAU 520MM, mobile MAU 560MM, covering 96% of the Internet users in China. Robust revenue growth: revenue CAGR of 126% from 2011 to 2015; in 2015 revenue reached $592 million, with a profit of $110 million.

3 Rich Product Line. Sogou Search: Web Search and 24 vertical search products. UGC platform: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance. Sogou exclusive: WeChat search, Zhihu search, English search.

4 Outline 1. Neural Machine Translation 2. Related application scenarios

5 Machine Translation. Automatically translate a sentence from the source language into the target language: 布什与沙龙举行了会谈 -> Bush held talks with Sharon. Methods: rule-based machine translation (RBMT), example-based machine translation (EBMT), statistical machine translation (SMT).


6 Neural Machine Translation: A New Era. Model the direct mapping between source and target language with a neural network: 布什与沙龙举行了会谈 -> Bush held talks with Sharon. Really amazing translation quality. [Figure: Edinburgh's WMT results over the years for phrase-based SMT, syntax-based SMT, and neural MT. From (Sennrich, 2016).]


7 Neural Machine Translation: A New Era. Encoder-Decoder Framework. Encoder: represents the source sentence as a vector with a neural network. Decoder: generates target words one by one based on the vector from the encoder. 布什与沙龙举行了会谈 <\s> -> Bush held talks with Sharon <\s>. What do we actually have in the encoded vector? (Sutskever et al., 2014)


8 Neural Machine Translation: A New Era. Attention Mechanism: for each target word to be generated, dynamically calculate the source-language information related to it, as a weighted average over the source positions. [Figure: 布什与沙龙举行了会谈 <\s> -> weighted average -> Bush held talks]
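The weighted average on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the engine's implementation: the slide does not say how relevance is scored, so dot-product scoring is assumed here, and the names `attention_context`, `decoder_state`, and `encoder_states` are hypothetical.

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    decoder_state:  shape (hidden,)
    encoder_states: shape (src_len, hidden)
    Dot-product scoring is an assumption made for brevity.
    """
    scores = encoder_states @ decoder_state      # one score per source position
    scores = scores - scores.max()               # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions
    context = weights @ encoder_states           # weighted average of encoder states
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((6, 8))     # 6 source positions, toy hidden size 8
decoder_state = rng.standard_normal(8)
weights, context = attention_context(decoder_state, encoder_states)
```

The weights are recomputed for every target word, which is what makes the source information "dynamic" rather than a single fixed vector.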

9 Sogou Neural Machine Translation Engine. A pure neural-based commercial machine translation engine: stacked encoders and decoders, residual network, length normalization, domain adaptation, dual learning, zero-shot learning. [Figure: 布什与沙龙举行了会谈 -> encoder hidden states -> attention mechanism -> softmax -> Bush held talks with Sharon]
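As an illustration of why length normalization appears in the list above, here is a small sketch. The slide does not give the formula used in the engine, so the simplest variant is assumed: dividing the summed log-probability of a hypothesis by its length. The function name and the toy scores are hypothetical.

```python
def normalized_score(token_log_probs):
    """Average log-probability per token (assumed normalization, not Sogou's formula)."""
    return sum(token_log_probs) / len(token_log_probs)

# Two toy beam-search hypotheses with per-token log-probabilities.
hypotheses = {
    "short": [-0.9, -0.9],               # total -1.8
    "long": [-0.5, -0.5, -0.5, -0.5],    # total -2.0
}

# Without normalization the shorter hypothesis wins on total score.
best_raw = max(hypotheses, key=lambda k: sum(hypotheses[k]))
# With normalization the hypothesis with better per-token scores wins.
best_norm = max(hypotheses, key=lambda k: normalized_score(hypotheses[k]))
```

Raw scores favor short outputs because every extra token adds a negative term; normalizing by length removes that bias.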

10 Sogou Neural Machine Translation Engine. We keep optimizing the translation engine in terms of translation model, bilingual data mining, distributed training, and decoding. Current focus: Chinese-English and English-Chinese translation, with good performance in both directions. [Figure: human evaluation on English-Chinese translation and on Chinese-English translation, Sogou initial performance vs. current performance; surviving data labels: 2.9 and 4.2.]

11 Challenges in Real Application. Training is too slow (Sutskever et al., 2014) (Wu et al., 2016). Decoding is slow: a translation request must take less than 200 ms on average to meet the real-time standard. Take a one-layer GRU NMT system as an example. Vocabulary size: Word embedding: 620. Hidden state: 1000. Encoder (bidirectional): ~16M MACs per word (forward only), 2*3*2000* *3*620*1000. Decoder: ~70M MACs per word (forward only). For training: 3*3620* *2000* *620. For beam-search inference, the decoder computation is BeamSize times larger. We need fast training and decoding.

12 Distributed Training. Parameter server: keeps the current model parameters; receives gradients from workers and updates the parameters accordingly. Workers: use GPUs for model training; communicate with the parameter server to update parameters.

13 Distributed Training. Asynchronous: each worker sends its locally updated parameters to the parameter server; the parameter server averages the worker's parameters with its own version and returns the updated parameters to the worker. Synchronous: each worker sends its gradients to the parameter server; the parameter server updates the parameters after it has received the gradients from all workers.
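The two update modes can be contrasted with a single-process toy sketch. It is an illustration under stated assumptions, not the production system: the class `ParameterServer`, the plain SGD step, the learning rate, and the equal 0.5/0.5 averaging weight are all assumed, since the slide does not specify them.

```python
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for a parameter server (hypothetical)."""

    def __init__(self, params, lr=0.1):
        self.params = np.asarray(params, dtype=float).copy()
        self.lr = lr

    def sync_update(self, worker_grads):
        """Synchronous: called once gradients from ALL workers have arrived."""
        mean_grad = np.mean(worker_grads, axis=0)
        self.params -= self.lr * mean_grad       # plain SGD step (assumption)
        return self.params.copy()

    def async_update(self, worker_params):
        """Asynchronous: average one worker's parameters with the server copy."""
        self.params = 0.5 * (self.params + worker_params)  # equal weights assumed
        return self.params.copy()                # sent back to that worker

server = ParameterServer(np.zeros(2), lr=0.1)
after_sync = server.sync_update([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
after_async = server.async_update(np.array([1.0, 1.0]))
```

In the synchronous mode the slowest worker sets the pace of every step, while in the asynchronous mode workers never wait for each other but train on slightly stale parameters.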

14 Distributed Training: acceleration ratio. Asynchronous: around 3x acceleration with 10 GPU cards. Synchronous: [Figure: acceleration ratio and acceleration efficiency vs. number of GPUs (same batch size * number of GPUs).]

15 Training acceleration: acceleration on a single card. Corpus shuffle: global random shuffle, then local sort (sort by sentence length inside every 20 mini-batches, so that sentence lengths within each mini-batch are similar). Optimization function selection: Adadelta, Momentum, Adam (about 2 times faster than the other two).
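The shuffle-then-local-sort scheme can be sketched as follows. This is one reading of the slide, with hypothetical names (`local_sort_batches`, `window`); sentences are assumed to be token lists.

```python
import random

def local_sort_batches(sentences, batch_size, window=20, seed=0):
    """Globally shuffle, then sort by length inside each block of `window` mini-batches."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)                              # global random shuffle
    block_size = batch_size * window               # e.g. 20 mini-batches at a time
    batches = []
    for start in range(0, len(data), block_size):
        block = sorted(data[start:start + block_size], key=len)  # local sort by length
        for i in range(0, len(block), batch_size):
            batches.append(block[i:i + batch_size])
    return batches

# Toy corpus: 100 "sentences" with lengths 1..100.
corpus = [["w"] * n for n in range(1, 101)]
batches = local_sort_batches(corpus, batch_size=5, window=2)
```

Sentences of similar length end up in the same mini-batch, so little computation is wasted on padding, while the global shuffle keeps the overall order random.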

16 Training acceleration: acceleration on a single card. Use a better GPU or newer CUDA if possible. [Figure: batch time (s) and speed-up (x).]

17 Decoding acceleration. Compute acceleration by fusing computations: fuse element-wise operations together; fuse matrix multiplications into larger ones; also fuse parameter matrices ahead of time; compute the input embedding projections together instead of at each step. CUDA function selection: for batch size 1, use level 2 cuBLAS functions instead of level 3.
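The matrix-level fusions above can be checked numerically with NumPy. This sketch only shows that the fused forms give the same result as the unfused ones; it does not reproduce the CUDA kernels. The toy sizes and the names `W_r`, `W_z`, `W_h` (three GRU gate input matrices) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb, hidden, steps = 50, 6, 4, 5            # toy sizes (hypothetical)

embedding = rng.standard_normal((vocab, emb))
W_r, W_z, W_h = (rng.standard_normal((emb, hidden)) for _ in range(3))
token_ids = rng.integers(0, vocab, size=steps)

# Unfused: three small products at every time step.
unfused = np.stack([
    np.concatenate([embedding[t] @ W_r, embedding[t] @ W_z, embedding[t] @ W_h])
    for t in token_ids
])

# 1) Fuse the three gate matrices into one larger matrix.
W_all = np.concatenate([W_r, W_z, W_h], axis=1)    # shape (emb, 3 * hidden)

# 2) Project the inputs of all time steps in a single multiplication.
fused = embedding[token_ids] @ W_all               # shape (steps, 3 * hidden)

# 3) Fold the embedding table into the projection ahead of time,
#    so decoding only needs a row lookup.
projected_table = embedding @ W_all                # shape (vocab, 3 * hidden)
lookup = projected_table[token_ids]
```

Fewer, larger multiplications mean fewer kernel launches and better GPU utilization, and the precomputed table removes the input projection from the per-step cost entirely.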


18 Decoding acceleration. Batch processing: about 3x faster than single-sentence decoding; use batch mode if possible. Sentence reordering: sentence length may vary greatly. Encoder: reorder sentences by length; scale the batch size at each step. Decoder: rearrange beams at each step; also scale the batch size according to the remaining beams.
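The sentence reordering on the encoder side can be sketched as below: requests are grouped by length, translated batch by batch, and the outputs are put back in the original request order. The function names are hypothetical, and `fake_translate_batch` is a stand-in for the real model call, which the slides do not show.

```python
def make_length_sorted_batches(sentences, batch_size):
    """Return batches of sentence indices, ordered by sentence length."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def translate_in_batches(sentences, batch_size, translate_batch):
    """Translate in length-sorted batches and restore the original order."""
    results = [None] * len(sentences)
    for index_batch in make_length_sorted_batches(sentences, batch_size):
        outputs = translate_batch([sentences[i] for i in index_batch])
        for i, out in zip(index_batch, outputs):
            results[i] = out                      # write back at the original position
    return results

def fake_translate_batch(batch):
    """Placeholder for the model: reverses each token list."""
    return [list(reversed(sentence)) for sentence in batch]

requests = [["a", "b", "c"], ["d"], ["e", "f"]]
translations = translate_in_batches(requests, batch_size=2,
                                    translate_batch=fake_translate_batch)
```

Grouping by length keeps padding small inside each batch; the decoder-side rearrangement of beams follows the same idea but is applied at every step as hypotheses finish.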

19 Decoding acceleration: other acceleration methods. Use a better GPU or newer CUDA if possible. [Figure: batch time (s) and speed-up (x).]

20 Comparison with training: P40 vs. P100

                    P40        P100
TFLOPS              12T        9.3T
Memory bandwidth    346GB/s    732GB/s

Batch size for training: 80 or more, so computation dominates. Batch size for inference: 10 or less, so memory bandwidth also plays an important role.
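A rough roofline-style estimate shows why the batch size decides which resource matters. The sketch below uses the TFLOPS and bandwidth figures from the table; everything else is a simplifying assumption: FP32 weights (4 bytes each), a square weight matrix, and only weight reads counted as memory traffic.

```python
GPUS = {
    "P40": {"tflops": 12.0, "bandwidth_gbs": 346.0},
    "P100": {"tflops": 9.3, "bandwidth_gbs": 732.0},
}

def machine_balance(gpu):
    """FLOPs the GPU can execute per byte it can read from memory."""
    return gpu["tflops"] * 1e12 / (gpu["bandwidth_gbs"] * 1e9)

def matmul_intensity(batch_size, bytes_per_weight=4):
    """FLOPs per byte for (batch x H) @ (H x H), counting only weight reads.

    FLOPs = 2 * batch * H * H and bytes = bytes_per_weight * H * H, so H cancels.
    """
    return 2.0 * batch_size / bytes_per_weight

def bound(batch_size, gpu):
    """'compute' if the multiplication can saturate the ALUs, else 'memory'."""
    if matmul_intensity(batch_size) >= machine_balance(gpu):
        return "compute"
    return "memory"

for name, gpu in GPUS.items():
    print(name, round(machine_balance(gpu), 1), bound(10, gpu), bound(80, gpu))
```

Under these assumptions a batch of 10 is memory-bound on both cards, which favors the P100's higher bandwidth, while a batch of 80 is compute-bound on both, which favors the P40's higher TFLOPS.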

21 Outline 1. Neural Machine Translation 2. Related application scenarios

22 Sogou translate related products: translation box in search results, translation vertical channel, translation with OCR.

23 Sogou translate related products. Overseas search: Chinese query -> machine translation -> English query -> English results -> machine translation -> Chinese abstracts and Chinese webpages.


S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

Over 100x faster, is it really possible? Douglas Adams' Babel Fish: Neural Machine Translation Unit.