Predicting the Components of German Nominal Compounds

Marco Baroni, Johannes Matiasek, Harald Trost

Compounding is a common cross-linguistic means of forming complex words. In German and many other languages, compounds are commonly written as single orthographic strings. Because compounding is a productive process, a considerable number of compounds cannot, even in principle, be listed in a lexicon. This poses a challenge to word prediction systems (such as those embedded in most current augmentative and alternative communication systems), which aim to predict the next word a user wants to type on the basis of corpus-extracted n-gram counts. We present a solution to this problem based on the idea that compounds should be predicted not as units, but as the concatenation of their components. In particular, we designed a word prediction system in which the prediction of German two-element nominal compounds (by far the most common compound type in German) is split into the prediction of the modifier (left element) and the prediction of the head (right element). Both components are predicted on the basis of uni- and bigram statistics collected by treating modifiers and heads as independent units, and on the basis of the type frequency of nouns in head and modifier context in the training corpus. We show that our system brings a dramatic improvement in keystroke saving rate over a word prediction scheme in which compounds are treated as units (from 50.8% to 58.1% for a test set of compound targets). In particular, our results indicate that the type frequency of nouns in head/modifier context in the training corpus is a very good predictor of which nouns will occur in head/modifier context in new text.
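The type-frequency component can be illustrated with a minimal sketch. It assumes that the training compounds have already been split into (modifier, head) pairs, and it takes the type frequency of a noun in head (or modifier) context to be the number of distinct compounds in which the noun fills that slot. The function names, the toy data, and the prefix-based ranking are hypothetical, chosen for illustration only; the sketch leaves out the uni- and bigram statistics, and the abstract does not say how the system combines them with type frequency.

```python
from collections import defaultdict


def type_frequencies(compounds):
    """Count, for each noun, the number of distinct compounds in which it
    occurs as modifier and as head (an assumed reading of 'type frequency')."""
    distinct = set(compounds)  # each compound type counts once, however often it occurs
    modifier_tf = defaultdict(int)
    head_tf = defaultdict(int)
    for modifier, head in distinct:
        modifier_tf[modifier] += 1
        head_tf[head] += 1
    return dict(modifier_tf), dict(head_tf)


def rank_candidates(prefix, type_freq, k=3):
    """Return up to k nouns that start with the typed prefix,
    highest type frequency first (ties broken alphabetically)."""
    matches = [noun for noun in type_freq if noun.startswith(prefix)]
    return sorted(matches, key=lambda noun: (-type_freq[noun], noun))[:k]


# Hypothetical, already-split training compounds (lowercased for simplicity).
toy_compounds = [
    ("haus", "tor"),
    ("garten", "tor"),
    ("stadt", "tor"),
    ("haus", "dach"),
    ("stadt", "teil"),
    ("haus", "tor"),  # repeated token: does not add to type frequency
]

modifier_tf, head_tf = type_frequencies(toy_compounds)
head_suggestions = rank_candidates("t", head_tf)
modifier_suggestions = rank_candidates("", modifier_tf, k=2)
```

In this toy setting, a user who has typed a modifier and then the letter "t" would be offered head candidates ordered by how many distinct compounds each noun heads in the training data, regardless of how often any single compound was observed.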

Keywords: Natural Language Processing, Human Language Technology

Citation: Marco Baroni, Johannes Matiasek, Harald Trost: Predicting the Components of German Nominal Compounds. In F. van Harmelen (ed.): ECAI 2002, Proceedings of the 15th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2002, pp. 470-474.

