The paper's language is somewhat hard to follow. The authors propose classifying each word according to the position it occupies in the prediction. Taking a bigram model as an example, we estimate p(w2|w1) and p(w3|w2). The word w2 is classified differently in the two estimates, because in the former it is the predicted word and in the latter it is the conditioning word. The same principle applies to higher-order LMs. The word clustering algorithm minimizes the average KL distance between each word's distribution and the distribution of its class centroid.
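To make the position-dependent classification concrete, here is a minimal sketch of a k-means-style clustering that uses the KL distance to a centroid distribution. It is an illustration under assumptions, not the authors' implementation: the function names (`neighbor_distributions`, `cluster`), the round-robin initialization, the unweighted centroid, and the toy corpus are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def neighbor_distributions(tokens, side):
    """Return, for each word, the distribution of its adjacent word.

    side="next": distribution of the following word (word acts as condition).
    side="prev": distribution of the preceding word (word acts as prediction).
    """
    counts = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        if side == "next":
            counts[left][right] += 1
        else:
            counts[right][left] += 1
    return {
        w: {v: n / sum(c.values()) for v, n in c.items()}
        for w, c in counts.items()
    }


def kl(p, q, eps=1e-12):
    """KL(p || q) over sparse dicts; eps floors probabilities missing from q."""
    return sum(pv * math.log(pv / max(q.get(v, 0.0), eps)) for v, pv in p.items())


def centroid(dists):
    """Unweighted mean of the member distributions (an assumption of this sketch)."""
    total = Counter()
    for d in dists:
        for v, pv in d.items():
            total[v] += pv
    return {v: s / len(dists) for v, s in total.items()}


def cluster(dists, k, iters=10):
    """Assign each word to the class whose centroid is closest in KL distance."""
    words = sorted(dists)
    assign = {w: i % k for i, w in enumerate(words)}  # deterministic start
    for _ in range(iters):
        cents = {}
        for c in range(k):
            members = [dists[w] for w in words if assign[w] == c]
            if members:  # skip classes that became empty
                cents[c] = centroid(members)
        new = {w: min(cents, key=lambda c: kl(dists[w], cents[c])) for w in words}
        if new == assign:
            break
        assign = new
    return assign


# Toy corpus (hypothetical); each word receives two independent class labels.
tokens = "the cat sat on the mat the dog sat on the rug".split()
cond_classes = cluster(neighbor_distributions(tokens, "next"), k=3)
pred_classes = cluster(neighbor_distributions(tokens, "prev"), k=3)
```

The point of the sketch is that `cond_classes` and `pred_classes` are computed independently, so a word such as w2 can share a class with one set of words when it is the condition and with a different set when it is the prediction.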
The authors evaluated this LM on a relatively small data set, obtaining a perplexity (PPL) below 20 and a fairly good improvement in ASR.
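For readers unfamiliar with how PPL is obtained from such a model, the following is a minimal sketch. It assumes the usual class-bigram factorization, p(w2|w1) = p(c_pred(w2) | c_cond(w1)) * p(w2 | c_pred(w2)), with separate class maps for the conditioning and predicted positions; this form, the function names, and the probability tables are assumptions made for illustration and are not taken from the paper.

```python
import math


def class_bigram_prob(w1, w2, cond_class, pred_class, p_class, p_word):
    """p(w2|w1) under the assumed two-sided class factorization.

    cond_class: class of a word when it is the condition (hypothetical map).
    pred_class: class of a word when it is the prediction (hypothetical map).
    p_class[(c1, c2)]: probability of predicted class c2 given condition class c1.
    p_word[(w, c)]: probability of word w given its predicted class c.
    """
    c1 = cond_class[w1]
    c2 = pred_class[w2]
    return p_class[(c1, c2)] * p_word[(w2, c2)]


def perplexity(tokens, prob):
    """Perplexity of a bigram model over consecutive token pairs."""
    log_sum = sum(math.log(prob(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
    n = len(tokens) - 1  # number of predicted words
    return math.exp(-log_sum / n)
```

Lower perplexity means the model assigns higher average probability to the test words; a model that gives every word probability 1/4 has a perplexity of exactly 4.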
Reviewed by zzb3886 - 2009-01-13 06:21:42
The authors propose a method to generate a compact, highly reliable language model for speech recognition based on the efficient classification of words. In this method, the connectedness with the words immediately before and after the word is taken to represent separate attributes, and individual classification is performed for each word. The resulting composite word class is created separately based on the distribution of words connected before or after. As a result, classification of classes is efficient and reliable. In a multiclass composite N-gram, which uses the proposed method for the variable-order N-gram to bring in chain words, the entry size is reduced to one-tenth, and the word recognition rate is higher than that of a conventional composite N-gram for particles or variable-length word arrays. © 2003 Wiley Periodicals, Inc. Syst Comp Jpn, 34(7): 108-114, 2003; Published online in Wiley InterScience. DOI 10.1002/scj.1210