António Ribeiro, Gabriel Pereira Lopes, João Tiago Mexia
About 15% of the vocabulary found in large texts of the Official Journal of the European Communities is the same across its various official languages. For the Portuguese-Spanish pair, for example, the rate rises to more than 30% because the two languages are similar; for the opposite reason, it drops to about 10% for the Portuguese-German pair. This is a rich source of information for parallel text alignment that should not be left unused. Bearing this in mind, this paper describes a language-independent method that uses words that are homographs for a pair of languages, that is, words spelled identically in both, to align parallel texts. This work was inspired by, and extends, work by Pascale Fung & Kathleen McKeown and by Melamed. To filter out words that may cause misalignment, we use confidence bands from linear regression analysis instead of statistically unsupported heuristics. We do not reach 100% text alignment precision, mostly because term order differs between languages. The parallel segments obtained have an average length of four words for case law texts.
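The filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, only words that occur exactly once in each text are used as candidate points (a simplifying assumption), and the band is a prediction-style band with a fixed multiplier of 1.96 rather than an exact Student's t quantile. Each candidate point pairs the position of a homograph in one text with its position in the other; points that fall outside the band around the fitted regression line are discarded as likely misalignments.

```python
import math
from collections import Counter


def candidate_points(tokens_a, tokens_b):
    """Pair positions of words spelled identically in both texts.

    Assumption for this sketch: only words occurring exactly once in
    each text are used, so every word yields one unambiguous point.
    """
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    pos_b = {w: i for i, w in enumerate(tokens_b) if count_b[w] == 1}
    return [(i, pos_b[w]) for i, w in enumerate(tokens_a)
            if count_a[w] == 1 and w in pos_b]


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, mean_x, sxx


def filter_by_band(points, z=1.96):
    """Keep only the points that lie inside a band around the fitted line.

    Assumption: z = 1.96 approximates the 95% quantile; an exact band
    would use a Student's t quantile with n - 2 degrees of freedom.
    """
    n = len(points)
    if n < 3:
        return list(points)  # too few points to estimate the residual error
    slope, intercept, mean_x, sxx = fit_line(points)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    s = math.sqrt(sse / (n - 2))  # residual standard error
    kept = []
    for x, y in points:
        half_width = z * s * math.sqrt(1 + 1 / n + (x - mean_x) ** 2 / sxx)
        if abs(y - (intercept + slope * x)) <= half_width:
            kept.append((x, y))
    return kept
```

In use, `candidate_points` would be called on the two tokenized texts and its result passed to `filter_by_band`; the surviving points could then serve as anchors that split both texts into parallel segments. Repeating the fit on the surviving points is one possible refinement, again an assumption rather than something stated in the abstract.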
Keywords: Natural Language Processing, Text Mining, Knowledge Acquisition, Machine Learning
Citation: António Ribeiro, Gabriel Pereira Lopes, João Tiago Mexia: Linear Regression Based Alignment of Parallel Texts Using Homograph Words. In W. Horn (ed.): ECAI 2000, Proceedings of the 14th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2000, pp. 446-450.