Breaking the Language Barrier

Text-to-speech technology

What is speech synthesis?

Speech synthesis is the artificial reproduction of natural speech. Spoken texts are generated by a computer. Rather than being played from a previously recorded body of texts, each sentence is individually generated.

How do you put the voice in the program?

The first question is: what is "synthetic" about speech synthesis? Linguatec's Voice Reader is based on detailed voice recordings by trained speakers. So the voices are not artificial!

This recorded material is then divided into small units. These can be individual phonemes, e.g. A and E, or diphthongs, such as EA or IE, and even full syllables. This is important, because depending on the environment, the same letter can sound different. For example, the letter E appears twice in the word “sever”, but is pronounced differently each time.

The units are then chained together by very complex algorithms into a new, flowing audio text. This is the actual synthesis, or more precisely "composition". It requires a specific understanding of the text so that the result sounds as natural as possible. One easy rule is that the voice should rise at a question mark and fall at a full stop. However, the program must also know where the subject is in the sentence, because this word carries a strong accent and shapes the natural speech melody (prosody) of the sentence. These analysis processes are of course considerably more complex: the program has it as tough as any Latin student!
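
The easy rule mentioned above (rise at a question mark, fall at a full stop) can be sketched in a few lines of Python. This is only an illustration: the function name `final_contour` and the labels "rise" and "fall" are assumptions made for this example, not part of Voice Reader, and a real system models the melody over the whole sentence.

```python
def final_contour(sentence: str) -> str:
    """Return a coarse pitch movement for the end of a sentence."""
    stripped = sentence.rstrip()
    if stripped.endswith("?"):
        return "rise"  # questions end with a rising voice
    # Simplification: everything else is treated like a statement.
    return "fall"


print(final_contour("Is it raining?"))
print(final_contour("It is raining."))
```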

What can speech synthesis be used for?

It has numerous applications. It is used where no text display, or only an inadequate one, is available, for example for text messages, on the phone or in dialog systems. Speech synthesis is also helpful in situations where the eyes are occupied with other tasks, such as operating a motor vehicle, which is why it is used in automobile navigation systems. Speech synthesis is very useful for the blind, who can have texts from the Internet or from their computer read to them. People with speech impediments can use it to communicate.

What approaches are there to speech synthesis?

There are different approaches to speech synthesis, for example: text-to-speech and concept-to-speech synthesis.

  • Concept-to-speech synthesis involves a generation component that generates a textual expression from semantic, pragmatic and discourse knowledge. The speech signal can then be generated from this expression.
    Concept-to-speech synthesis can be used in dialog systems, for example. Wherever the input is already in textual form, however, text-to-speech synthesis is used.

  • In text-to-speech synthesis, the text to be spoken is provided; it is not generated by the system. It must, however, be analyzed and interpreted in order to convey the proper pronunciation and emphasis (e.g. to produce a question instead of a statement).



How is a text-to-speech system structured?

Text-to-speech synthesis takes place in several steps. The TTS system gets a text as input, which it first must analyze and then transform into a phonetic description. Then in a further step it generates the prosody. From the information now available, it can produce a speech signal.


  1. Text analysis consists of several steps:
    • First the text is segmented into tokens. The token-to-word conversion creates the orthographic form of the token. For the German token "Nr." the orthographic form "Nummer" is formed by expansion, the token "12" gets the orthographic form "twelve" and "1997" is transformed to "nineteen ninety seven". This expansion is not always easy, as the German number "1" shows: it has to be expanded differently depending on what it denotes. In a street address it becomes "eins", in "1 Kilogramm" it becomes "ein", and in the expression "1 Katze jagt 1 Hund" it becomes first "eine" and then "einen".

    • In the process of text analysis the context of the token is also analyzed. In the case of abbreviations like "tgl.", one does not know without context analysis whether it should be expanded to "täglich", "tägliche", "täglichem", "täglichen", "täglicher" or "tägliches". Context analysis is also required in German to clarify stress patterns: for example, "modern" (to rot, stressed on the first syllable) and "modern" (modern, stressed on the second syllable) cannot be told apart by their spelling.
  2. After the text analysis has been completed, pronunciation rules can be applied.
    Letters cannot be transformed 1:1 into phonemes because correspondence is not always parallel. In certain environments, a single letter can correspond to either no phoneme (for example, "h" in "geht") or several phonemes ("x" in "Fixkosten"). In addition, several letters can correspond to a single phoneme ("ch" in "ich"). Letters can be pronounced differently in different environments ("s" in "Stadt" vs. in "Sachen"). And the same phoneme can correspond to different letters ("Rat" vs. "Rad").

    There are two strategies to determine pronunciation:
    • In dictionary-based solutions with morphological components, as many morphemes (words) as possible are stored in a dictionary. Full forms are generated by means of inflection, derivation and composition rules. Alternatively, a full form dictionary is used in which all possible word forms are stored.
      Pronunciation rules determine the pronunciation of words not found in the dictionary.
    • In a rule-based solution, pronunciation rules are generated from the phonological knowledge of dictionaries. Only words whose pronunciation is a complete exception are included in the dictionary.
    The two approaches differ significantly in the size of their dictionaries; that of the dictionary-based solution is many times larger than the rule-based solution's dictionary of exceptions. However, dictionary-based solutions can be more exact than rule-based solutions if they have a large enough phonetic dictionary available.

  3. After the pronunciation has been determined, the prosody is generated.
    The degree of naturalness of a TTS system is dependent on prosodic factors like intonation modeling (phrasing and accentuation), amplitude modeling and duration modeling (including the duration of the sound and the duration of pauses, which determine the length of the syllables and the tempo of the speech).

    Prosodic characteristics have various functions: they can make the focus of a sentence clear, i.e. a phrase is emphasized as being important or new. In addition they are responsible for the segmentation of a sentence. They can create connections between sentences or parts of sentences and determine the sentence mode (statement or question). Syntactic information is especially important for prosody generation. For most sentences the prosody can be calculated by means of knowledge of the syntactic structure of a sentence.

    For some sentences, on the other hand, semantic and pragmatic information is important: sentences whose syntactic structure is ambiguous often take on a new meaning depending on which component is emphasized. Marking the focus is especially important in negative sentences: the components that the negation refers to need to be highlighted by means of emphasis (e.g. in "Maria didn't go to Hamburg by car" as opposed to "Maria didn't go to Hamburg by car", where a different component is stressed in each reading). Semantic and pragmatic knowledge is available in few TTS systems, however.

  4. The data from the speech processing module is passed to the signal processing module. This is where the actual synthesis of the audio signal happens. In concatenative synthesis, speech segments are selected and linked: for each individual sound, the best option (where several appropriate options are available) is selected from a database, and the selected segments are concatenated.
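
The token-to-word conversion from step 1 can be sketched in Python using the German examples from the text. Everything here is a hypothetical illustration: the lookup tables are tiny, and the rule "the first noun phrase is the subject, later ones are objects" is a crude stand-in for the real context analysis a TTS system needs.

```python
# Hypothetical mini-lexicon; a real system needs full grammar and lexicon data.
ABBREVIATIONS = {"Nr.": "Nummer"}
# Grammatical gender of a few nouns (m = masculine, f = feminine, n = neuter).
GENDER = {"Katze": "f", "Hund": "m", "Kilogramm": "n"}
# Spoken form of "1" before a noun, by (gender, case).
EIN_FORMS = {
    ("f", "nom"): "eine", ("f", "acc"): "eine",
    ("m", "nom"): "ein", ("m", "acc"): "einen",
    ("n", "nom"): "ein", ("n", "acc"): "ein",
}


def expand_tokens(tokens):
    """Expand abbreviations and the number "1" into orthographic words."""
    words = []
    noun_phrases_seen = 0
    for i, token in enumerate(tokens):
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token == "1" and next_token in GENDER:
            # Assumption: first noun phrase is nominative, later ones accusative.
            case = "nom" if noun_phrases_seen == 0 else "acc"
            words.append(EIN_FORMS[(GENDER[next_token], case)])
            noun_phrases_seen += 1
        elif token == "1":
            words.append("eins")  # bare number, e.g. in a street address
        else:
            words.append(token)
    return words


print(expand_tokens("1 Katze jagt 1 Hund".split()))
```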

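The interplay of dictionary and pronunciation rules from step 2 might look roughly like the following sketch. The dictionary entry, the two rules and the phoneme symbols are simplified assumptions chosen to mirror the examples in the text ("geht", "Fixkosten", "ich"); letters without a rule simply stand for themselves, which a real rule set would not allow.

```python
# Dictionary: words whose pronunciation is stored directly (silent "h" in "geht").
LEXICON = {"geht": ["g", "eː", "t"]}

# Letter-to-sound rules (simplified and illustrative).
RULES = {
    "ch": ["ç"],      # several letters, one phoneme ("ich")
    "x": ["k", "s"],  # one letter, several phonemes ("Fixkosten")
}


def pronounce(word):
    """Look the word up in the dictionary, else apply rules left to right."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phonemes = []
    i = 0
    while i < len(word):
        for length in (2, 1):  # try the longer grapheme first
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            phonemes.append(word[i])  # no rule: letter stands for itself
            i += 1
    return phonemes


print(pronounce("ich"))
print(pronounce("Fix"))
```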

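The selection of the best options in step 4 can be illustrated with a small dynamic-programming sketch. It assumes, purely for illustration, that each recorded segment is described only by its pitch at the start and at the end, and that the "best" sequence is the one with the smallest pitch jumps at the joins. The names `Unit`, `join_cost` and `select_units` are hypothetical, and joining the actual audio is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str           # which sound this recording represents
    start_pitch: float  # pitch in Hz at the start of the recording
    end_pitch: float    # pitch in Hz at the end of the recording


def join_cost(left, right):
    """Pitch jump when `right` is appended to `left` (smaller is smoother)."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(targets, database):
    """Pick one candidate per target sound so the total join cost is minimal."""
    # best[c] = (cost of the cheapest sequence ending in candidate c, that sequence)
    best = {c: (0.0, [c]) for c in database[targets[0]]}
    for name in targets[1:]:
        new_best = {}
        for cand in database[name]:
            prev, (cost, path) = min(
                best.items(), key=lambda kv: kv[1][0] + join_cost(kv[0], cand)
            )
            new_best[cand] = (cost + join_cost(prev, cand), path + [cand])
        best = new_best
    return min(best.values(), key=lambda entry: entry[0])[1]


database = {
    "h": [Unit("h", 100, 110), Unit("h", 100, 150)],
    "a": [Unit("a", 150, 120), Unit("a", 112, 118)],
    "t": [Unit("t", 119, 100), Unit("t", 200, 90)],
}
for unit in select_units(["h", "a", "t"], database):
    print(unit)
```

Looking at whole sequences rather than picking each segment on its own matters here: the locally smoothest first join is not part of the smoothest overall sequence in this toy database.
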
Copyright © 2009 Linguatec. All rights reserved.