Aligning sentences in parallel corpora

(1991), pp. 169-176.

Abstract

In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentence, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
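
The idea of aligning on token counts alone can be illustrated with a small dynamic program. What follows is a minimal sketch, not the model from the paper: the set of bead types in BEADS, their penalties, the log-ratio cost in length_cost, and the ratio and spread parameters are all illustrative assumptions. The function align takes only the token counts of each sentence and returns groups of sentence indices that it pairs together.

```python
import math

# Hypothetical bead types: (source sentences consumed, target sentences consumed)
# mapped to an illustrative fixed penalty. These values are assumptions chosen
# for the example, not parameters taken from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_len, tgt_len, ratio=1.0, spread=0.5):
    """Penalty for pairing src_len tokens with tgt_len tokens.

    Assumes the log of the length ratio is roughly normal around log(ratio).
    """
    if src_len == 0 or tgt_len == 0:
        return 0.0  # unmatched sentences pay only the bead penalty
    z = math.log(tgt_len / (src_len * ratio)) / spread
    return 0.5 * z * z


def align(src_lens, tgt_lens):
    """Return the cheapest alignment as a list of (src_indices, tgt_indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = penalty + length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Trace back from the end of both texts to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling align([10, 20, 5], [11, 9, 12, 5]) pairs the 20-token source sentence with the two target sentences of 9 and 12 tokens, because their combined length is the closest match. No words are ever compared, which is why this kind of computation stays cheap on large collections. Anchor points, as mentioned in the abstract, could be used to split the texts into shorter sections that are aligned independently, but that step is not shown here.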

