Knowledge-Based Word Lattice Rescoring in a Dynamic Context. Todd Shore, Friedrich Faubel, Hartmut Helmke, Dietrich Klakow

Size: px

Start display at page:

Download "Knowledge-Based Word Lattice Rescoring in a Dynamic Context. Todd Shore, Friedrich Faubel, Hartmut Helmke, Dietrich Klakow"

Ethelbert Snow
5 years ago
Views:

1 Knowledge-Based Word Lattice Rescoring in a Dynamic Context Todd Shore, Friedrich Faubel, Hartmut Helmke, Dietrich Klakow

2 Section I Motivation

3 Motivation Problem: difficult to incorporate higher-level knowledge sources into automatic speech recognition (ASR)

However: there are domains in which the situational context of

4 Motivation Problem: difficult to incorporate higher-level knowledge sources into automatic speech recognition (ASR) However: there are domains in which the situational context of utterances is available, e.g. air traffic control (ATC) or command and control tasks

5 Motivation Problem: difficult to incorporate higher-level knowledge sources into automatic speech recognition (ASR) However: there are domains in which the situational context of utterances is available, e.g. air traffic control (ATC) or command and control tasks Approach taken in this work: incorporate contextual knowledge into ASR by rescoring word lattice output of an ATC task

6 Section II The Air Traffic Control Task

7 The Air Traffic Control Task Air traffic controllers at their workstations

8 The Air Traffic Control Task Air traffic controllers at their workstations Primary objective: maintain aircraft separation safely guide approaching aircraft to their runway threshold integrate departing and passing aircraft

9 The Air Traffic Control Task Air traffic controllers at their workstations Have access to: radar screens revealing aircraft positions and speeds flight plans weather reports, indicators for speed of wind, etc.

10 The Air Traffic Control Task Air traffic controllers at their workstations Implementation: issue verbal commands to aircraft pilots use of a standardized subset of English which is formally specified by the International Civil Aviation Organization (ICAO)

11 The Air Traffic Control Task ICAO Phraseology for ATC commands comprises single aircraft callsign (identifying the aircraft) goal actions to execute goal values to be achieved

12 The Air Traffic Control Task ICAO Phraseology for ATC commands comprises single aircraft callsign (identifying the aircraft) goal actions to execute goal values to be achieved

13 The Air Traffic Control Task ICAO Phraseology for ATC commands comprises single aircraft callsign (identifying the aircraft) goal actions to execute goal values to be achieved Example: Delta four three niner turn right heading two two zero

14 The Air Traffic Control Task ICAO Phraseology for ATC commands comprises single aircraft callsign (identifying the aircraft) goal actions to execute goal values to be achieved Example: Delta four three niner turn right heading two two zero Relevant Information: DL 439 TURN, DIR=right, HDG=220

15 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content

16 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content Example: Delta four three niner turn right heading two two zero

17 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content Example: Delta four three niner turn right heading two two zero XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading>

18 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content Example: Delta four three niner turn right heading two two zero XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading>

19 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content Example: Delta four three niner turn right heading two two zero XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading>

20 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content Example: Delta four three niner turn right heading two two zero XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading>

21 The Air Traffic Control Task Use of a recognition grammar is sufficient for recognizing ICAO phraseology simplifies extraction of semantic content Example: Delta four three niner turn right heading two two zero XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading>

22 The Air Traffic Control Task Main idea of this work: rescore recognized utterances through use of context knowledge about the situation in the airspace

23 The Air Traffic Control Task Main idea of this work: rescore recognized utterances through use of context knowledge about the situation in the airspace But where does this information come from?

24 The Air Traffic Control Task Modernized ATC workspace as envisioned by the German Aerospace Center (DLR)

25 The Air Traffic Control Task including the arrival manager 4D-CARMA, which assists controllers in managing aircraft arrivals

The Air Traffic Control Task including the arrival manager 4D-CARMA, which assists controllers in managing aircraft arrivals This system allows us to

26 The Air Traffic Control Task including the arrival manager 4D-CARMA, which assists controllers in managing aircraft arrivals This system allows us to extract: the callsigns of aircraft in the airspace the aircraft positions relative to the radar Their speeds, altitudes, climb/descend rates, reduce rates, etc.

27 Section III Knowledge-Based Lattice Rescoring

28 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search

29 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search eight EY -b(100) q 167 /callsign ε callsign ε q21 q22 speedbird S -b(141) q 23 two T -b(175) q 307 /callsign ε three T -b(67) q 233 /callsign ε

30 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search This allows us to directly extract semantic frames by simply adding a pointer to a semantic template structure, which is filled on the fly

31 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search This allows us to directly extract semantic frames by simply adding a pointer to a semantic template structure, which is filled on the fly callsign q q21 22 speedbird q 23 eight q 167 /callsign

32 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search This allows us to directly extract semantic frames by simply adding a pointer to a semantic template structure, which is filled on the fly callsign q q21 22 speedbird q 23 eight q 167 /callsign semantic template callsign:

33 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search This allows us to directly extract semantic frames by simply adding a pointer to a semantic template structure, which is filled on the fly callsign q q21 22 speedbird q 23 eight q 167 /callsign semantic template callsign: BA

34 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search This allows us to directly extract semantic frames by simply adding a pointer to a semantic template structure, which is filled on the fly callsign q q21 22 speedbird q 23 eight q 167 /callsign semantic template callsign: BA 8

35 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search This allows us to directly extract semantic frames by simply adding a pointer to a semantic template structure, which is filled on the fly callsign q q21 22 speedbird q 23 eight q 167 /callsign semantic template callsign: BA 8

36 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 discourse representation

37 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 check if exists discourse representation

38 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 check if exists if no apply penalty

39 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 q 167 : /callsign edge information - start / end time - acoustic / LM score q 273

40 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 q 167 : /callsign edge information - start / end time - acoustic / LM score - knowledge score q 273

41 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 discourse representation

42 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 discourse representation

Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid

43 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 check speed discourse representation

44 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 check speed apply penalty

45 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 kn q 729 : /speed ( q, q ) c i j K k 1 k cost k q 785 (callsign,cmd,val)

46 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 kn q 729 : /speed ( q, q ) c i j K k 1 k cost k q 785 (callsign,cmd,val) penalty score

47 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 kn q 729 : /speed ( q, q ) c i j K k 1 k cost k q 785 (callsign,cmd,val) constraint penalty function c k ( ) [0,1]

48 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Principal Idea: penalize invalid callsigns and unlikely command values based on discourse representation system semantic template callsign: BA 8 cmd: REDUCE val: SPD=220 q 729 : /speed total cost: q 785 ( ) () () () ac ac lm lm kn kn

49 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties

50 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties 1 Propagate scores along the nodes

51 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties callsign q q21 22 speedbird q 23 eight q 167 /callsign token - acscore - lmscore - knscore

52 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties callsign q q21 22 speedbird q 23 eight q 167 /callsign token - acscore - lmscore - knscore acscore ac lmscore lm knscore kn ac lm kn () () [? lp] ( )

53 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties callsign q q21 22 speedbird q 23 eight q 167 /callsign token - acscore - lmscore - knscore acscore ac lmscore lm knscore kn ac lm kn () () [? lp] ( )

54 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties 1 Propagate scores along the nodes

55 Knowledge-Based Lattice Rescoring WFST Decoder: can be used to generate phone-to-word transducer lattice during Viterbi search Rescoring: in analogy to rescoring with different language model scales and word insertion penalties 1 Propagate scores along the nodes 2 Trace back from best final state

56 Section IV Rescoring Experiments

57 Rescoring Experiments Corpus: recorded using the 4D-CARMA software from DLR includes aircraft state vectors (5 second intervals) total of 1,107 ATC commands 9.5 words per sentence approx. 100 minutes of speech

58 Rescoring Experiments GUI used by the participants

59 Rescoring Experiments Recognition Grammar: branching factor of ,220 unique callsigns sentence perplexity of 3.9E+09

60 Rescoring Experiments Recognition Grammar: branching factor of ,220 unique callsigns sentence perplexity of 3.9E+09 Used Constraints: callsign speed altitude

Rescoring Experiments Recognition Grammar: branching factor of 7.68 244,220 unique callsigns sentence perplexity of 3.

61 Rescoring Experiments Recognition Grammar: branching factor of ,220 unique callsigns sentence perplexity of 3.9E+09 Used Constraints: callsign speed altitude Rescoring WER SER MRR None (baseline) Callsign Callsign, Spd, Alt Oracle

62 Section V Conclusions

63 Conclusions We have shown how dynamic context knowledge can profitably be used for rescoring ASR hypotheses

64 Conclusions We have shown how dynamic context knowledge can profitably be used for rescoring ASR hypotheses This is of particular interest in scenarios where explicit context information is available, such as ATC: radar-derived aircraft state vectors video games virtual reality

65 Thank you very much for your attention!

66 The Air Traffic Control Task XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading>

67 The Air Traffic Control Task XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading> Can easily be extended to class-based N-grams [callsign] [cmd_turn] [direction] heading [heading].

68 The Air Traffic Control Task XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading> Can easily be extended to class-based N-grams [callsign] [cmd_turn] [direction] heading [heading].

69 The Air Traffic Control Task XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading> Can easily be extended to class-based N-grams [callsign] [cmd_turn] [direction] heading [heading].

70 The Air Traffic Control Task XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading> Can easily be extended to class-based N-grams [callsign] [cmd_turn] [direction] heading [heading].

71 The Air Traffic Control Task XML-tags for easy extraction of semantic frames: <callsign> delta four three niner </callsign> <cmd_turn> turn </cmd_turn> <direction> right </direction> heading <heading> two two zero </heading> Can easily be extended to class-based N-grams [callsign] [cmd_turn] [direction] heading [heading].

Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) February 2018 Reza Yazdani Aminabadi Universitat Politecnica de Catalunya (UPC) State-of-the-art State-of-the-art ASR system: DNN+HMM Speech (words) Sound Signal Graph