The Background of Text To Speech

What is Text-to-Speech?

Text-to-speech is a process through which text is rendered as digital audio and then "spoken." Most text-to-speech engines can be categorized by the method that they use to translate phonemes into audible sound.

  • Concatenated Word. Although Concatenated Word systems are not really synthesizers, they are one of the most commonly used text-to-speech systems around. In a concatenated word engine, the application designer provides recordings for phrases and individual words. The engine pastes the recordings together to speak out a sentence or phrase. If you use voice-mail then you've heard one of these engines speaking, "[You have] [three] [new messages]." The engine has recordings for "You have", all of the digits, and "new messages".
  • Synthesis. A text-to-speech engine that uses synthesis generates sounds similar to those created by the human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position. The voice produced by synthesis technology tends to sound less human than a voice produced by diphone concatenation, but it is possible to obtain different qualities of voice by changing a few parameters.
  • Subword Concatenation. A text-to-speech engine that uses subword concatenation links short digital-audio segments together and performs intersegment smoothing to produce a continuous sound. In diphone concatenation, for example, each segment consists of two phonemes, one that leads into the sound and one that finishes the sound. Thus, the word "hello" consists of the phonemes h eh l œ, and the corresponding subword segments are silence-h, h-eh, eh-l, l-œ, and œ-silence.

Subword segments are acquired by recording many hours of a human voice and painstakingly identifying the beginning and ending of phonemes. Although this technique can produce a more realistic voice, it takes a considerable amount of work to create a new voice and the voice is not localizable because the phonemes are specific to the speaker's language.
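
As a rough illustration of how a diphone-concatenation engine decides which recorded segments it needs, the sketch below pairs each phoneme with its successor and pads both ends with silence. The function name and the ARPAbet-style phoneme symbols are assumptions made for this example; a real engine uses its own phoneme set and also smooths the joins between segments.

```python
def diphone_segments(phonemes, boundary="silence"):
    """Return the diphone names needed to speak a phoneme sequence.

    Each diphone covers the transition from one phoneme into the next,
    with a boundary marker padding both ends of the word.
    """
    padded = [boundary, *phonemes, boundary]
    # Pair every unit with its successor: (silence, h), (h, eh), ...
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Illustrative phonemes for "hello" (the symbols are an assumption).
    print(diphone_segments(["h", "eh", "l", "ow"]))
```

A word with n phonemes always needs n + 1 diphones, which is why a complete voice requires a recording for every phoneme pair that can occur in the language.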

Why Use Text-to-Speech?

Text-to-speech should be used to communicate information audibly to the user when digital-audio recordings are inadequate. Generally, text-to-speech is better than audio recordings when:

  1. Audio recordings are too large to store on disk or too expensive to record.
  2. Audio recording is impossible because the application doesn't know ahead of time what it will speak.

Text-to-speech also offers a number of benefits. In general, text-to-speech is most useful for short phrases or for situations when prerecording is not practical. Text-to-speech has the following practical uses:

  • Reading dynamic text. Text-to-speech is useful for phrases that vary too much for every possible alternative to be recorded and stored. For example, speaking the time is a good use for text-to-speech: recording every possible time as a complete phrase is impractical, whereas the effort and storage involved in concatenating the words for any time are manageable.
  • Proofreading. Audible proofreading of text and numbers helps the user catch typing errors missed by visual proofreading.
  • Conserving storage space. Text-to-speech is useful for phrases that would occupy too much storage space if they were prerecorded in digital-audio format.
  • Notifying the user of events. Text-to-speech works well for informational messages. For example, to inform the user that a print job is complete, an application could say "Printing complete" rather than displaying a message box and requiring the user to click OK. (Use this only for noncritical notifications, because the user may have turned the computer's sound off or may be out of hearing range.)
  • Providing audible feedback. Text-to-speech can provide audible feedback when visual feedback is inadequate or impossible. For example, the user's eyes might be busy with another task, such as transcribing data from a paper document. Users who have low vision may rely on text-to-speech as their sole means of feedback from the computer.
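
The "speaking the time" case from the list above can be sketched as a concatenated-word lookup. Recording all 24 × 60 = 1,440 times as whole phrases would be impractical, but the sketch below covers every time of day with about thirty short clips. The clip names, the 12-hour phrasing, and the function names are assumptions made for this example, not part of any particular engine.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}


def number_clips(n):
    """Clip names for a number in the range 0..59."""
    if n < 20:
        return [ONES[n]]
    tens, ones = divmod(n, 10)
    clips = [TENS[tens * 10]]
    if ones:
        clips.append(ONES[ones])
    return clips


def time_clips(hour, minute):
    """Clip sequence for a 24-hour time, spoken in 12-hour form."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("hour must be 0..23 and minute 0..59")
    clips = ["the time is"]
    clips += number_clips(hour % 12 or 12)  # 0 and 12 are both spoken "twelve"
    if minute == 0:
        clips.append("o'clock")
    elif minute < 10:
        clips += ["oh", ONES[minute]]       # 3:05 -> "three oh five"
    else:
        clips += number_clips(minute)
    clips.append("AM" if hour < 12 else "PM")
    return clips


if __name__ == "__main__":
    print(time_clips(15, 45))
```

An engine of this kind would then play the recording that matches each clip name in order.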

Potential Uses By Application Category

The specific use of text-to-speech will depend on the application. Here are some sample ideas and their uses:

Games and Edutainment

Text-to-speech is useful in games and edutainment to allow the characters in the application to "talk" to the user instead of displaying speech balloons. Of course, it's also possible to have recordings of the speech. An application would use text-to-speech instead of recordings in the following cases:

  • It's always possible to use concatenated word/phrase text-to-speech to replace recorded sentences. The application designer can easily pass the desired sentence strings to the text-to-speech engine.
  • Synthesized text-to-speech tends to sound unnatural. However, that quality suits character voices that are supposed to belong to robots or aliens.
  • Of course, if the application cannot afford to have recordings of all the possible dialogs or if the dialogs cannot be recorded ahead of time, then text-to-speech is the only alternative.


Telephony

See the Text-To-Speech for Telephony article for a full description of telephony.

Hardware and Software Requirements

A speech application requires certain hardware and software on the user's computer to run. Not all computers have the memory, speed, or speakers required to support speech, so it is a good idea to design the application so that speech is optional.

These hardware and software requirements should be considered when designing a speech application:

  • Processor speed. Text-to-speech engines currently on the market typically require a 486/33 (DX or SX) or faster processor.
  • Memory. On average, text-to-speech uses about 1 MB of RAM.
  • Sound card. Almost any sound card will work for speech recognition and text-to-speech, including Sound Blaster, Media Vision, ESS Technology, cards that are compatible with the Microsoft Windows Sound System, and the audio hardware built into multimedia computers.
  • Speakers. The user can choose between wearing headphones or using freestanding speakers. Headphones are useful in office cubicles. Some companies manufacture a combination headphone and microphone that can also be used for telephone conversations.
  • Operating system. The Microsoft Speech application programming interface (API) requires either Windows 95 or Windows NT version 4.0.
  • Text-to-speech engine. Text-to-speech software must be installed on the user's system. Many new audio-enabled computers and sound cards are bundled with speech recognition and text-to-speech engines. As an alternative, many engine vendors offer retail packages for speech recognition or text-to-speech, and some license copies of their engines.


Text-to-Speech Voice Quality

Most text-to-speech engines can render individual words successfully. However, as soon as the engine speaks a sentence, it is easy to identify the voice as synthesized because it lacks human prosody, that is, the inflection, accent, and timing of speech. For this reason, most text-to-speech voices are difficult to listen to and require concentration to understand, especially for more than a few words at a time.

Some engines allow an application to define text-to-speech segments with human prosody attached, making the synthesized voice much clearer. The engine provides this capability by prerecording a human voice and allowing the application developer to transfer its intonation and speed to the text being spoken.

In effect, this acts as a highly effective voice compression algorithm. Although text with prosody attached requires more storage than ASCII text (1K per minute compared to a few hundred bytes per minute), it requires considerably less storage than prerecorded speech, which requires at least 30K per minute. These factors also influence the quality of a synthesized voice:
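
The storage comparison above can be worked through for a message of any length. The sketch below assumes that "K" means 1,024 bytes and takes 300 bytes per minute as a stand-in for "a few hundred bytes"; both figures are assumptions chosen for illustration, and the recorded-speech rate is the stated minimum.

```python
# Approximate storage rates in bytes per minute of speech.
BYTES_PER_MINUTE = {
    "plain text": 300,              # assumed value for "a few hundred bytes"
    "text with prosody": 1 * 1024,  # 1K per minute
    "recorded speech": 30 * 1024,   # at least 30K per minute
}


def storage_bytes(minutes):
    """Estimated storage for each representation of `minutes` of speech."""
    return {kind: rate * minutes for kind, rate in BYTES_PER_MINUTE.items()}


if __name__ == "__main__":
    for kind, size in storage_bytes(10).items():
        print(f"{kind:>18}: {size / 1024:7.1f} KB")
```

Under these assumptions, text with prosody attached takes roughly three to four times the space of plain text but only one-thirtieth of the space of the smallest recording.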

  • Emotion. Although many text-to-speech engines can parse and interpret punctuation, such as periods, commas, exclamation points, and question marks, none of the engines that are currently available can render the sound of human emotion.
  • Mispronunciation. Text-to-speech engines use a set of pronunciation rules to translate text into phonemes. This is fairly easy for languages with phonetic alphabets, but it is very difficult for the English language, especially if last names are to be pronounced correctly. (Pronunciation rules seldom fail on common words, but they almost always fail on names that are unusual or of non-English origin.)

If an engine mispronounces a word, the user can change the pronunciation in only two ways: by entering the phonemes directly, which is not an easy task, or by choosing a series of "sound-alike" words that combine to make the correct pronunciation.
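
One way an application might support the "sound-alike" approach is to keep a small table of user-supplied respellings and substitute them before the text reaches the engine. The sketch below is a minimal illustration; the table entries, the function name, and the respellings themselves are hypothetical, and an actual engine may offer its own pronunciation dictionary instead.

```python
import re

# Hypothetical user-supplied overrides: word -> "sound-alike" respelling.
SOUND_ALIKES = {
    "Nguyen": "win",
    "Siobhan": "shiv awn",
}


def apply_sound_alikes(text, overrides=SOUND_ALIKES):
    """Replace whole words that have an override before speaking the text."""
    if not overrides:
        return text
    # Match any overridden word, but only as a complete word.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in overrides) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {word.lower(): alike for word, alike in overrides.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


if __name__ == "__main__":
    print(apply_sound_alikes("Call Siobhan Nguyen now."))
```

Because the substitution happens on the text, it works with any engine, but the respelling should be used only for speech and never shown to the user in place of the original name.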

Creating and Localizing Text-to-Speech Voices

Creating a new voice for an engine that uses synthesis can be done relatively quickly by altering a few parameters of an existing voice. However, although the pitch and timbre of the new voice are different, it uses the same speaking style and prosody rules as the existing voice.

Creating a new voice for a text-to-speech engine that uses diphone concatenation can take a considerable amount of work, because the diphones must be acquired by recording a human voice and identifying the beginning and ending of phonemes, which are specific to the speaker's language.

Whether a text-to-speech engine uses synthesis or diphone concatenation, the work of localizing an engine for a new language requires a skilled linguist to design pronunciation and prosody rules and reprogram the engine to simulate the sound of the language's phonemes. In addition, diphone-concatenation systems require a new voice to be constructed for the new language. As a consequence, most engines support only five to ten major languages.

Application Design Considerations

Using Text-to-Speech for Short Phrases

An application should use text-to-speech only for short phrases or notifications, not for reading long passages of text. Listening to a synthesized voice read more than a few sentences requires extra concentration, and the user can become irritated.

Presenting Important Information Visually

An application should communicate critical information visually as well as audibly, and it should not rely solely on text-to-speech to communicate important information. The user can miss spoken messages for a variety of reasons, such as not having speakers or headphones attached to the computer, being distracted or out of earshot when the application speaks, or simply having turned off text-to-speech.

Avoiding a Mix of Text-to-Speech and Recorded Voice

The synthesized voice provided by even the best text-to-speech engine is noticeably different from that provided by a digital-audio recording. Mixing the two in the same utterance can be disturbing to the user (and usually makes the text-to-speech voice sound worse by comparison).

For example, to have an application speak "The number is 56,738," you should not prerecord "The number is" and use text-to-speech to speak the numbers. You should either prerecord everything or use text-to-speech for everything.

Making Text-to-Speech Optional

An application should always allow the user to turn off text-to-speech. Some users work in environments in which a talking computer may distract coworkers or in which privacy may be important. Also, some users may simply dislike the sound of a synthesized voice.
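
The guidelines on visual presentation and optional speech can be combined in one small wrapper: every message goes to the visual channel, and speech is added only when the user has left it enabled and an engine is available. The class below is a sketch under those assumptions; the names `Notifier`, `show`, and `speak` are invented for this example and do not correspond to any speech API.

```python
class Notifier:
    """Shows every message visually and speaks it only when allowed."""

    def __init__(self, show, speak=None, speech_enabled=True):
        self.show = show                      # callable(str): visual channel, always used
        self.speak = speak                    # callable(str), or None if no engine is installed
        self.speech_enabled = speech_enabled  # user preference; can be changed at any time

    def notify(self, message):
        # The visual channel comes first so the message is never lost.
        self.show(message)
        if self.speech_enabled and self.speak is not None:
            try:
                self.speak(message)
            except Exception:
                # Speech is a convenience; a failing engine must not
                # prevent the application from continuing.
                pass


if __name__ == "__main__":
    spoken = []  # stands in for a real text-to-speech engine
    notifier = Notifier(show=print, speak=spoken.append)
    notifier.notify("Printing complete")
    notifier.speech_enabled = False  # the user turns speech off
    notifier.notify("Disk almost full")
```

Passing the engine in as an optional callable also keeps the application usable on computers that have no text-to-speech engine at all.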

Where the Engine Comes From

Of course, for text-to-speech to work on an end user's PC the system must have a text-to-speech engine installed on it. The application has two choices:

The application can come bundled with a text-to-speech engine and install it itself. This guarantees that text-to-speech will be installed and also guarantees a certain level of quality from the text-to-speech. However, if an application does this, royalties will need to be paid to the engine vendor.

Alternatively, an application can assume that a text-to-speech engine is already on the PC or that the user will purchase one if they wish to use text-to-speech. The user may already have text-to-speech because many PCs and sound cards come bundled with an engine, or because the user has purchased another application that included one. If the user has no text-to-speech engine installed, the application can tell the user that they need to purchase and install one. Several engine vendors offer retail versions of their engines.

