The Linguatec Corpus
Background
In recent years, philologists and computational linguists have concluded that a description of language must cover not only normative rules but also actual language usage. To this end, they began compiling large text corpora and making them available for linguistic work. Technological developments that make it possible to store and manage large quantities of text on computers have supported this effort. Corpus-based computational linguistics aims to model the actual usage of language, to cover frequently occurring phenomena, and to ensure that the rules are not overlooked amid all the exceptions. This not only improves the quality of language software but also makes its development more efficient.

The Linguatec corpus
To ensure wide coverage, Linguatec has built the corpus from a wide variety of sources:
The corpus data is processed with technologies developed by Linguatec (language detection, sentence segmentation, subject area recognition, etc.) and serves as a reference, for example in dictionary work or in writing grammars (frequent constructions, etc.). Using these technologies, the corpus is sorted by language and broken down into smaller units, usually sentences; it currently contains approximately 50 million such units.

To represent a language adequately, a corpus must be of substantial size. The Linguatec corpus consists of more than 1.75 billion word forms (as of February 2007) in English, German, French, Spanish, Italian and Portuguese, and it is continually being expanded.

Benefits (example: dictionary work)
Previous generations of scholars have handed down many terms that are found in dictionaries but not necessarily in everyday speech.
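As a rough illustration of how a corpus can show which dictionary terms are actually in use, the sketch below splits a text into sentences, counts word forms, and flags headwords that never occur. It is a minimal example under stated assumptions, not Linguatec's tooling: the sample text, the function names and the naive punctuation-based segmentation are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical miniature corpus; a real corpus is many orders of magnitude larger.
CORPUS = (
    "The bank approved the loan. The bank raised its interest rates! "
    "Did the river flood the bank? The loan was repaid."
)

def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def count_word_forms(sentences):
    """Count lower-cased word forms (runs of letters) across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[^\W\d_]+", sentence.lower()))
    return counts

def rare_entries(headwords, counts, min_count=1):
    """Return the headwords that occur fewer than min_count times in the corpus."""
    return [w for w in headwords if counts[w.lower()] < min_count]

sentences = split_sentences(CORPUS)
counts = count_word_forms(sentences)
print(len(sentences))                                       # number of units
print(counts["bank"])                                       # frequency of one word form
print(rare_entries(["bank", "loan", "wherefore"], counts))  # unattested headwords
```

Real sentence segmentation has to handle abbreviations, numbers and quotations, so the regular expression here stands in for a much more careful component.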
Linguatec uses multi-lingual corpora for dictionary work to ensure that
At present, however, the greatest benefit is that the context of a term, which only a large corpus can supply, is taken into account when selecting among translation alternatives. Linguatec developed this technology, for which an international patent application has been filed, under the name "neural transfer".
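The general idea of letting context decide between translation alternatives can be sketched very simply. The example below is not "neural transfer", whose workings the text does not describe; it is a toy word-overlap heuristic, and the context profiles, function name and example sentences are assumptions made up for illustration.

```python
import re

# Hypothetical context profiles: words that, in an imagined corpus, tend to
# co-occur with each German translation of the English noun "bank".
CONTEXT_PROFILES = {
    "Bank": {"money", "loan", "account", "interest"},  # financial institution
    "Ufer": {"river", "water", "shore", "boat"},       # riverbank
}

def choose_translation(sentence, profiles, default):
    """Pick the candidate whose cue words overlap most with the sentence.

    Falls back to `default` when no cue word appears in the sentence.
    """
    words = set(re.findall(r"[^\W\d_]+", sentence.lower()))
    best, best_score = default, 0
    for candidate, cues in profiles.items():
        score = len(words & cues)
        if score > best_score:
            best, best_score = candidate, score
    return best

print(choose_translation("She opened an account at the bank.", CONTEXT_PROFILES, "Bank"))
print(choose_translation("They moored the boat at the river bank.", CONTEXT_PROFILES, "Bank"))
```

In practice the cue words would be derived from corpus statistics rather than written by hand, which is why a large corpus matters for this kind of selection.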


In this context, Linguatec has built its own text corpus. It contains sample text in many languages but focuses on German, English and French.

