Breaking the Language Barrier

The Linguatec Corpus

Background

In recent years, philologists and computational linguists have concluded that a description of language must cover not only normative rules but also actual usage. To this end, they have begun gathering large text corpora and making them available for linguistic work.

This aim is supported by technological developments which have made it possible to store and administer large quantities of text on computers.

Corpus-based computational linguistics aims to model the actual usage of language, to cover frequently occurring phenomena, and to ensure that the rules are not overlooked in the face of all the exceptions. This not only increases the quality of language programs but also makes their development more efficient.

The Linguatec corpus

In this context, Linguatec has built its own text corpus. It contains sample text in many languages but focuses on German, English and French.

To ensure wide coverage, Linguatec has built the corpus from a wide variety of sources:

  • General texts, such as can be found on the Web; for this purpose a special agreement was made with Google.
  • News texts, as distributed by newspapers and news agencies.
  • Technical texts from different specialist areas, such as those processed by Linguatec customers. Corpus work is particularly important here because of the large and ever-increasing number of technical terms in specialist areas such as automotive and mechanical engineering or medicine.

The corpus data is processed with technologies developed by Linguatec (language detection, sentence segmentation, subject area recognition, etc.). Using these technologies, the corpus is sorted by language and broken down into smaller units (usually sentences); it currently contains approx. 50 million such units. The corpus is used for reference, for example in dictionary work or for creating grammars (frequent constructions, etc.).
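The sorting and segmentation step described above can be illustrated with a minimal sketch. It is not Linguatec's technology: the stopword lists, the function names (`detect_language`, `split_sentences`, `build_corpus`) and the splitting rule are assumptions chosen only to show the idea of grouping text by language and cutting it into sentence units.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stopword lists; a real language detector would
# rely on much richer statistical models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "des"},
}


def detect_language(text):
    """Guess the language by counting stopword hits per language."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_corpus(documents):
    """Sort documents by detected language and store them as sentence units."""
    corpus = defaultdict(list)
    for doc in documents:
        corpus[detect_language(doc)].extend(split_sentences(doc))
    return dict(corpus)


documents = [
    "The cat is in the house. It is asleep.",
    "Der Hund ist nicht da. Die Katze schläft.",
    "Le chat est sur la table.",
]
corpus = build_corpus(documents)
for lang, units in corpus.items():
    print(lang, len(units), units)
```

A production system would also have to handle abbreviations, numbers and mixed-language documents, which this naive splitting rule ignores.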

To represent a language reasonably well, a corpus must be of significant size. The Linguatec corpus consists of more than 1.75 billion word forms (status: February 2007) in English, German, French, Spanish, Italian and Portuguese. It is continually being expanded.

Benefits (example: dictionary work)

Previous generations of scholars have handed down many terms which are found in dictionaries but not necessarily in everyday speech.

  • New terms (for example, "elk test") often take years to find their way into the dictionaries. Once they are there, however, they prove remarkably long-lived.
  • Multi-lingual dictionaries and even well-known glossaries often give translations for which not a single occurrence can be found in the billions of Web pages indexed by Google. Such translations are inventions or wishful thinking on the part of the authors, and no foreign-language partner will understand what a translator who uses them means.

Linguatec uses multi-lingual corpora for dictionary work to ensure that

  • frequently used terms are actually represented in the dictionaries; this increases the coverage of the dictionaries and thus the quality of the analysis components;
  • translations are actually in use; the translations are selected according to usage frequency to avoid giving special cases more importance than they deserve.
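The second point, selecting translations by usage frequency, can be sketched as follows. This is an illustration only, not Linguatec's procedure: the function name `rank_translations`, the candidate list and the toy target-language sentences are all assumptions, and the simple substring count stands in for proper term matching.

```python
from collections import Counter


def rank_translations(candidates, target_sentences):
    """Order candidate translations by how often they occur in the
    target-language corpus and drop candidates that never occur."""
    counts = Counter()
    for sentence in target_sentences:
        lowered = sentence.lower()
        for cand in candidates:
            # Naive substring count; real term matching would respect
            # word boundaries and inflected forms.
            counts[cand] += lowered.count(cand.lower())
    ranked = sorted(candidates, key=lambda c: counts[c], reverse=True)
    return [(c, counts[c]) for c in ranked if counts[c] > 0]


# Toy data, invented for this example.
candidates = ["elk examination", "moose test", "elk test"]
target_sentences = [
    "The car failed the elk test.",
    "The elk test made headlines.",
    "A moose test is an evasive manoeuvre.",
]
ranking = rank_translations(candidates, target_sentences)
print(ranking)
```

Candidates without a single corpus occurrence are removed entirely, which mirrors the idea that a translation nobody uses should not be offered.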

At the moment, however, the greatest benefit is that the context of a term, which can only be observed in a large corpus, is taken into account when selecting among translation alternatives. Linguatec has developed this technology under the name "neural transfer" and has filed an international patent for it.
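How corpus context can help to choose between translation alternatives is shown in the following sketch. It is explicitly not the "neural transfer" technology, whose details are not described here; it is a simple co-occurrence count. The function names, the toy sentences and the assumption that the surrounding words are already available in the target language are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_profiles(candidates, sentences):
    """For each candidate, count the words it co-occurs with in the corpus."""
    profiles = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for cand in candidates:
            if cand in tokens:
                profiles[cand].update(t for t in tokens if t != cand)
    return profiles


def choose_translation(candidates, context_words, profiles):
    """Pick the candidate whose corpus contexts overlap most with the
    words surrounding the term to be translated."""
    def score(cand):
        return sum(profiles[cand][w] for w in context_words)
    return max(candidates, key=score)


# Toy target-language corpus, invented for this example: two possible
# English translations of the German word "Bank".
candidates = ["bank", "bench"]
sentences = [
    "She sat on the bench in the park.",
    "The bank raised its interest rates.",
    "He opened an account at the bank.",
    "An old bench stood in the garden.",
]
profiles = build_profiles(candidates, sentences)
print(choose_translation(candidates, ["account", "interest"], profiles))
print(choose_translation(candidates, ["park", "sat"], profiles))
```

The larger the corpus, the more reliable such context profiles become, which is why this kind of selection depends on a corpus of significant size.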


Copyright © 2009 Linguatec. All rights reserved.