A translation lexicon can be learned from two monolingual documents. This sounds surprising, and so do the results presented in this paper: the authors did learn a lexicon from documents that are not parallel. They created a canonical space that connects the source space and the target space, and its elements are treated as latent concepts. Matchings between source and target words are established through those latent concepts. In non-parallel documents, however, the labeling of the latent concepts is not available, so the authors treated it as missing data and used the EM algorithm to learn the mapping.
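The alternation described above can be illustrated with a small sketch. This is not the paper's model. It assumes a hard-EM style loop (the summary only says "EM"), and it swaps the canonical space for a much simpler stand-in: one scale factor per feature dimension that maps source vectors onto target vectors. The function names, the greedy matching, the seed pair, and the toy data are all hypothetical.

```python
# Toy hard-EM over a latent one-to-one matching (illustrative assumption,
# not the paper's method). The "mapping" is a per-dimension scale factor.

def fit_scales(src, tgt, pairs):
    """M-step: per-dimension least-squares scale from matched pairs."""
    scales = []
    for d in range(len(src[0])):
        num = sum(src[i][d] * tgt[j][d] for i, j in pairs)
        den = sum(src[i][d] ** 2 for i, j in pairs)
        scales.append(num / den if den else 1.0)
    return scales

def match(src, tgt, scales):
    """E-step (hard): greedy one-to-one matching, closest pairs first."""
    candidates = []
    for i, x in enumerate(src):
        for j, y in enumerate(tgt):
            dist = sum((s * a - b) ** 2 for s, a, b in zip(scales, x, y))
            candidates.append((dist, i, j))
    candidates.sort()
    used_src, used_tgt, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(pairs)

def hard_em(src, tgt, seed_pairs, iterations=10):
    """Alternate M-step and E-step until the matching stops changing."""
    pairs = sorted(seed_pairs)
    scales = fit_scales(src, tgt, pairs)
    for _ in range(iterations):
        scales = fit_scales(src, tgt, pairs)
        new_pairs = match(src, tgt, scales)
        if new_pairs == pairs:
            break
        pairs = new_pairs
    return pairs, scales

# Hypothetical data: target vectors are source vectors scaled by (2, 0.5)
# and shuffled, so the true matching is unknown to the algorithm.
source_vectors = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.0]]
target_vectors = [[6.0, 1.5], [2.0, 1.0], [8.0, 0.5], [4.0, 0.5]]
learned_pairs, learned_scales = hard_em(source_vectors, target_vectors, [(0, 1)])
```

The single seed pair plays the role of a small amount of initial supervision; without some starting point, the first M-step has nothing to fit. A real system would also need a matching step that can leave words unmatched, which this greedy version does not do.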
The feature sets used were EditDist, orthographic features, context features, and a combination of these. I don't quite understand how they are defined and combined yet, and need to look into them further.
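As a starting point for looking into these, here is a rough sketch of what each feature type could look like. The definitions are assumptions, since this note does not record the paper's exact ones: edit distance is plain Levenshtein distance, orthographic features are counts of character n-grams up to length 3 with `#` as a boundary marker, context features are counts of neighboring words in a fixed window, and the combination simply concatenates the two feature sets under prefixed names. All function names are hypothetical.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

def orthographic_features(word, max_n=3):
    """Counts of character n-grams (n = 1..max_n) of the padded word."""
    padded = "#" + word + "#"
    feats = Counter()
    for n in range(1, max_n + 1):
        for k in range(len(padded) - n + 1):
            feats[padded[k:k + n]] += 1
    return feats

def context_features(tokens, word, window=2):
    """Counts of words appearing within `window` positions of `word`."""
    feats = Counter()
    for k, tok in enumerate(tokens):
        if tok == word:
            lo = max(0, k - window)
            for neighbor in tokens[lo:k] + tokens[k + 1:k + 1 + window]:
                feats[neighbor] += 1
    return feats

def combined_features(tokens, word, max_n=3, window=2):
    """Concatenate both feature sets, prefixing names to avoid collisions."""
    feats = Counter()
    for name, value in orthographic_features(word, max_n).items():
        feats["orth:" + name] = value
    for name, value in context_features(tokens, word, window).items():
        feats["ctx:" + name] = value
    return feats

corpus_tokens = "the cat sat on the mat".split()
cat_features = combined_features(corpus_tokens, "cat", window=1)
```

Edit distance compares two words directly, so it works as a similarity score on its own. The orthographic and context features instead describe a single word within its own language, which is why they need a shared space before source and target words can be compared.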
Reviewed by zzb3886 - 2008-07-03 20:49:38