| |
In HLT-NAACL 2003: Main Proceedings (# 2003)
posted to fst paraphrasing
by abhaga
on 2007-05-27 01:59:27
Abstract
We describe a syntax-based algorithm that automatically builds Finite State Automata (word lattices) from semantically equivalent translation sets. These FSAs are good representations of paraphrases. They can be used to extract lexical and syntactic paraphrase pairs and to generate new, unseen sentences that express the same meaning as the sentences in the input sets. Our FSAs can also predict the correctness of alternative semantic renderings, which may be used to evaluate the quality of...
|
| |
posted to mt-eval paraphrasing
by abhaga
on 2007-05-27 01:55:09
Abstract
In this paper we present a novel method for deriving paraphrases during automatic MT evaluation using only the source and reference texts, which are necessary for the evaluation, and word and phrase alignment software. Using target language paraphrases produced through word and phrase alignment, a number of alternative reference sentences are constructed automatically for each candidate translation. ...
|
| |
No. LAMP-TR-126, CS-TR-4755, UMIACS-TR-2005-58. (J 2005)
posted to mt-eval
by abhaga
on 2007-01-20 03:05:56
Abstract
We define a new, intuitive measure for evaluating machine translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Error Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We also compute a human-targeted TER (or HTER), where the minimum TER of the translation is computed against a human "targeted reference" that preserves the meaning (provided ...
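Illustration (not from the paper): the edit-counting idea in this abstract can be sketched as a word-level edit distance divided by the reference length. This is a simplification under stated assumptions: it counts only insertions, deletions, and substitutions, it uses a single reference, and it tokenizes on whitespace. It does not reproduce the full TER definition. The names `word_edit_distance` and `simple_ter` are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over token lists (insert, delete, substitute)."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j reference tokens.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Edits per reference word; an assumed simplification, not official TER."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    print(simple_ter("the cat sat on mat", "the cat sat on the mat"))
```

A score of 0.0 means the output already matches the reference, and higher values mean more editing is needed.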
|
| |
Abstract
We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu's correlation with human judgments of quality. This offers new potential for research which was previously deemed unpromising by an inability to improve upon Bleu scores. ...
|
| |
Abstract
Most state-of-the-art evaluation measures for machine translation assign high costs to movements of word blocks. In many cases though such movements still result in correct or almost correct sentences. In this paper, we will present a new evaluation measure which explicitly models block reordering as an edit operation. ...
|
| |
posted to mt-eval recall
by abhaga
on 2006-10-19 04:31:22
Abstract
Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show...
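Illustration (not from the paper): a minimal sketch of a weighted harmonic mean of unigram precision and recall. The parameter `alpha` (the weight on precision) and the function names are assumptions made for this example. The abstract does not state which weighting the authors use. With `alpha=0.5` the formula reduces to the balanced F1, and a smaller `alpha` puts more weight on recall.

```python
from collections import Counter


def unigram_precision_recall(hypothesis, reference):
    """Clipped unigram matches between whitespace-tokenized strings."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    matches = sum((hyp & ref).values())
    hyp_len = sum(hyp.values())
    ref_len = sum(ref.values())
    precision = matches / hyp_len if hyp_len else 0.0
    recall = matches / ref_len if ref_len else 0.0
    return precision, recall


def weighted_f(precision, recall, alpha=0.5):
    """Weighted harmonic mean: 1 / (alpha / P + (1 - alpha) / R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * recall + (1 - alpha) * precision)


if __name__ == "__main__":
    p, r = unigram_precision_recall("the cat", "the cat sat on")
    print(p, r, weighted_f(p, r), weighted_f(p, r, alpha=0.1))
```

In the usage line, the short hypothesis has perfect precision but low recall, so the recall-weighted score is lower than the balanced one.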
|
| |
In Machine Translation Summit IX (September 2003)
Abstract
Evaluation of MT evaluation measures is limited by inconsistent human judgment data. Nonetheless, machine translation can be evaluated using the well-known measures precision, recall, and the F-measure. The F-measure has significantly higher correlation with human judgments than recently proposed alternatives. More importantly, the standard measures have an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved. The relevant software is publicly...
|
| |
In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (October 2005), pp. 740-747
posted to blanc mt-eval
by abhaga
on 2006-10-19 04:27:20
|
| |
In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (June 2005), pp. 65-72
|
| |
In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (June 2005), pp. 57-64
posted to confidence_interval mt-eval
by abhaga
on 2006-10-19 04:23:59
|
| |
In LREC 2004
posted to mt-eval
by abhaga
on 2006-10-19 04:04:10
|
| |
In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume (July 2004), pp. 605-612
posted to mt-eval rouge
by abhaga
on 2006-10-19 03:59:39
|
| |
In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference (June 2006), pp. 455-462
|
| |
In International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2004), Baltimore, MD USA, October 4-6, 2004
posted to confidence_interval mt-eval
by abhaga
on 2006-10-19 03:54:53
|
| |
In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume (July 2004), pp. 621-628
posted to bleu mt-eval
by abhaga
on 2006-10-19 03:44:13
|
| |
In Proceedings of Coling 2004 (Aug 2004), pp. 106-112
|
| |
In Proceedings of Coling 2004 (Aug 2004), pp. 501-507
|
| |
In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions (July 2006), pp. 539-546
|
| |
posted to general mt-eval
by abhaga
on 2006-10-19 02:48:43
|
| |
posted to mt-eval paraphrasing
by abhaga
on 2006-10-19 02:23:59
Note (first note only)
Shows a comparison with METEOR results.
|
| |
In European Association for Machine Translation
posted to eamt2005 mt-eval sentence_level
by abhaga
on 2006-10-19 02:07:23
|
| |
Abstract
Automatic evaluation of machine translation, based on computing n-gram similarity between system output and human reference translations, has revolutionized the development of MT systems. We explore the use of syntactic information, including constituent labels and head-modifier dependencies, in computing similarity between output and reference. Our results show that adding syntactic information to the evaluation metric improves both sentence-level and corpus-level correlation with...
|
| |
(2004)
Abstract
The problem of evaluating machine translation (MT) systems is more challenging than it may first appear, as diverse translations can often be considered equally correct. The task is even more difficult when practical circumstances require that evaluation be done automatically over short texts, for instance, during incremental system development and error analysis. ...
|
| |
posted to mt-eval
by abhaga
on 2006-10-19 01:56:40
Abstract
This paper experimentally compares two automatic evaluators, RED and BLEU, to determine how close the evaluation results of each automatic evaluator are to average evaluation results by human evaluators, following the ATR standard of MT evaluation. This paper gives several cautionary remarks intended to prevent MT developers from drawing misleading conclusions when using the automatic evaluators. In addition, this paper reports a way of using the automatic evaluators so that their results...
|
| |
(2001)
Abstract
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. ...
|