Unknown Word Extraction for Chinese Documents
by: Keh-Jiann Chen, Wei-Yun Ma
Abstract
Chinese text has no blanks to mark word boundaries. As a result, identifying words is difficult because of segmentation ambiguities and occurrences of unknown words. Most previous work focuses only on resolving ambiguous segmentation. The problem of unknown word identification is considered more difficult and needs further investigation. Conventionally, unknown words were extracted by statistical methods, because statistical methods are simple and efficient. However, statistical methods that do not use linguistic knowledge suffer from low precision and low recall: character strings with statistical significance might be phrases or partial phrases instead of words, and low-frequency new words are hardly identifiable by statistical methods. In addition to statistical information, we try to use as much information as possible, such as morphology, syntax, semantics, and world knowledge. The identification system fully utilizes the context and content information of unknown words in the detection, extraction, and verification steps. A practical unknown word extraction system was implemented which identifies new words online, including low-frequency new words, with high precision and high recall rates.
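To make the criticism of purely statistical extraction concrete, here is a minimal Python sketch of a frequency-only baseline. It is not the authors' system: the function name `frequent_ngram_candidates`, the parameters `max_len` and `min_count`, and the toy text and lexicon are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def frequent_ngram_candidates(text, known_words, max_len=4, min_count=2):
    """Return character n-grams (length 2..max_len) that occur at least
    min_count times and are not in the known-word lexicon.

    Hypothetical frequency-only baseline; it uses no morphology,
    syntax, or semantics.
    """
    counts = Counter()
    # Split on common Chinese punctuation and whitespace so that
    # n-grams never span a sentence boundary.
    for sentence in re.split(r"[。，！？、；：\s]+", text):
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {gram: c for gram, c in counts.items()
            if c >= min_count and gram not in known_words}

# Toy example (assumed data): the name 王小明 is missing from the lexicon.
lexicon = {"访问", "台北", "发表", "演讲"}
text = "王小明访问台北。王小明发表演讲。"
print(frequent_ngram_candidates(text, lexicon))
# → {'王小': 2, '小明': 2, '王小明': 2}
```

The output shows both weaknesses named in the abstract. The fragments 王小 and 小明 are reported alongside the full name 王小明, which hurts precision, and a new word that occurred only once would fall below `min_count` and be missed, which hurts recall.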