Word rates in text vary according to global factors such as genre, topic, author, and expected readership (Church and Gale 1995). Models that summarize such global factors at the text or document level are called "text models." A finite mixture of Dirichlet distributions (Dirichlet Mixture, or DM for short) was investigated as a new text model. When the parameters of a multinomial are drawn from a DM, the compound distribution for discrete outcomes is a finite mixture of Dirichlet-multinomials. A Dirichlet-multinomial can be regarded as a multivariate version of the Poisson mixture, a reliable univariate model for global factors (Church and Gale 1995). In the present paper, the DM and its compounds are introduced, with parameter estimation methods derived from Minka's fixed-point methods (Minka 2003) and the EM algorithm. The method can estimate the parameters of a large DM, on the order of a few hundred thousand parameters. After a discussion of the relationships among the DM, probabilistic latent semantic analysis (PLSA) (Hofmann 1999), the mixture of unigrams (Nigam et al. 2000), and latent Dirichlet allocation (LDA) (Blei et al. 2001, 2003), applications to statistical language modeling are discussed and their performance in perplexity is compared. The DM model achieves the lowest perplexity despite its unitopic nature.
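
To make the compound distribution concrete, the following is a minimal sketch, not the paper's implementation, of how a document's word counts could be scored under a finite mixture of Dirichlet-multinomials. The function names (`log_polya`, `log_dm_mixture`, `perplexity`) and the toy parameters are illustrative assumptions. The perplexity shown is a simple per-word figure over whole documents, which may differ from the evaluation protocol used in the paper. Parameter estimation (the fixed-point and EM steps) is not shown.

```python
import math


def log_polya(counts, alpha):
    """Log probability of one word sequence with the given per-word counts
    under a single Dirichlet-multinomial with parameter vector alpha.
    The multinomial coefficient is omitted, so this scores a specific
    sequence rather than a bag of words."""
    total_alpha = sum(alpha)
    n = sum(counts)
    logp = math.lgamma(total_alpha) - math.lgamma(total_alpha + n)
    for c, a in zip(counts, alpha):
        if c:  # words with zero count contribute a factor of 1
            logp += math.lgamma(a + c) - math.lgamma(a)
    return logp


def log_dm_mixture(counts, weights, alphas):
    """Log probability under a finite mixture of Dirichlet-multinomials:
    log sum_k weights[k] * P(counts | alphas[k]), via log-sum-exp."""
    comps = [math.log(w) + log_polya(counts, a)
             for w, a in zip(weights, alphas)]
    m = max(comps)
    return m + math.log(sum(math.exp(c - m) for c in comps))


def perplexity(docs, weights, alphas):
    """Per-word perplexity of a list of count vectors (simplified:
    each document is scored as a whole, an assumption of this sketch)."""
    total_logp = sum(log_dm_mixture(d, weights, alphas) for d in docs)
    total_words = sum(sum(d) for d in docs)
    return math.exp(-total_logp / total_words)


if __name__ == "__main__":
    # Toy two-word vocabulary with two hypothetical mixture components.
    weights = [0.5, 0.5]
    alphas = [[2.0, 0.5], [0.5, 2.0]]
    docs = [[3, 0], [0, 4], [1, 1]]
    print(perplexity(docs, weights, alphas))
```

Because each document is generated from a single component, the model is unitopic in the sense used above: the mixture weights select one Dirichlet per document rather than one topic per word.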