This short paper used self-collected web data and a class-dependent LM mixture to build a better LM for conversational speech. Web search engines were queried with conversational n-grams and topic words to find conversation-like web pages. In the class-dependent mixture, the mixing weight depends on the class of the previous word. The collected web data plus in-domain data outperformed a large amount of news data plus in-domain data.
Reviewed by zzb3886 - 2008-09-06 00:36:27
Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task, but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams.
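A minimal sketch of class-dependent interpolation, assuming the mixture takes the form P(w | h) = sum over sources s of lambda_s(c(w_prev)) * P_s(w | h), where c(w_prev) is the class of the previous word. The function name, the word classes, the weights, and the toy component models below are all hypothetical illustrations and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a component LM maps (word, history) to a probability.
ComponentLM = Callable[[str, Sequence[str]], float]


def class_dependent_mixture(
    word: str,
    history: Sequence[str],
    components: List[ComponentLM],
    weights_by_class: Dict[str, List[float]],
    word_class: Dict[str, str],
    default_class: str = "OTHER",
) -> float:
    """Interpolate component LMs with weights chosen by the previous word's class."""
    prev = history[-1] if history else "<s>"
    cls = word_class.get(prev, default_class)
    lambdas = weights_by_class[cls]
    if len(lambdas) != len(components):
        raise ValueError("need one weight per component model")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("mixture weights for a class must sum to 1")
    return sum(lam * lm(word, history) for lam, lm in zip(lambdas, components))


# Toy components (assumed for illustration): they ignore the history entirely.
IN_DOMAIN = {"yeah": 0.5, "the": 0.3, "market": 0.2}
WEB = {"yeah": 0.2, "the": 0.3, "market": 0.5}


def in_domain_lm(word: str, history: Sequence[str]) -> float:
    return IN_DOMAIN.get(word, 0.0)


def web_lm(word: str, history: Sequence[str]) -> float:
    return WEB.get(word, 0.0)


# Assumed classes and weights: trust in-domain data more after fillers,
# and web data more after content nouns.
WORD_CLASS = {"uh": "FILLER", "stock": "NOUN"}
WEIGHTS = {
    "FILLER": [0.9, 0.1],
    "NOUN": [0.3, 0.7],
    "OTHER": [0.5, 0.5],
}

p_after_filler = class_dependent_mixture(
    "yeah", ["uh"], [in_domain_lm, web_lm], WEIGHTS, WORD_CLASS
)
p_after_noun = class_dependent_mixture(
    "market", ["stock"], [in_domain_lm, web_lm], WEIGHTS, WORD_CLASS
)
```

With a single class this reduces to ordinary linear interpolation with one global weight per source. In practice the per-class weights would be estimated on held-out in-domain data rather than set by hand as they are here.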