Text-to-speech is a process through which text is rendered as digital audio and then "spoken." Most text-to-speech engines can be categorized by the method that they use to translate phonemes into audible sound.
Subword segments are acquired by recording many hours of a human voice and painstakingly identifying the beginning and ending of phonemes. Although this technique can produce a more realistic voice, it takes a considerable amount of work to create a new voice and the voice is not localizable because the phonemes are specific to the speaker's language.
Text-to-speech should be used to audibly communicate information to the user when digital audio recordings are inadequate. Generally, text-to-speech is better than audio recordings when:
Text-to-speech also offers a number of benefits. In general, text-to-speech is most useful for short phrases or for situations when prerecording is not practical. Text-to-speech has the following practical uses:
The specific use of text-to-speech will depend on the application. Here are some sample ideas and their uses:
Games and Edutainment
Text-to-speech is useful in games and edutainment to allow the characters in the application to "talk" to the user instead of displaying speech balloons. Of course, it's also possible to have recordings of the speech. An application would use text-to-speech instead of recordings in the following cases:
Look in the Text-To-Speech for Telephony article for a full description of telephony.
A speech application requires certain hardware and software on the user's computer to run. Not all computers have the memory, speed, or speakers required to support speech, so it is a good idea to design the application so that speech is optional.
These hardware and software requirements should be considered when designing a speech application:
Text-to-Speech Voice Quality
Most text-to-speech engines can render individual words successfully. However, as soon as the engine speaks a sentence, it is easy to identify the voice as synthesized because it lacks human prosody, that is, the inflection, accent, and timing of speech. For this reason, most text-to-speech voices are difficult to listen to and require concentration to understand, especially for more than a few words at a time.
Some engines allow an application to define text-to-speech segments with human prosody attached, making the synthesized voice much clearer. The engine provides this capability by prerecording a human voice and allowing the application developer to transfer its intonation and speed to the text being spoken.
In effect, this acts as a highly effective voice compression algorithm. Although text with prosody attached requires more storage than ASCII text (1K per minute compared to a few hundred bytes per minute), it requires considerably less storage than prerecorded speech, which requires at least 30K per minute. These factors also influence the quality of a synthesized voice:
If an engine mispronounces a word, the user can change the pronunciation in only two ways: by entering the phonemes, which is not an easy task, or by choosing a series of "sound-alike" words that combine to make the correct pronunciation.
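The "sound-alike" approach can also be applied by the application itself before text reaches the engine. The following is a minimal sketch, assuming a hypothetical application-side lexicon; the name `apply_sound_alikes` and the sample entries are illustrative and do not belong to any particular engine.

```python
import re

# Hypothetical lexicon: word -> "sound-alike" respelling.
# Entries are illustrative; real respellings depend on the engine's
# pronunciation rules and usually have to be tuned by ear.
SOUND_ALIKES = {
    "Nguyen": "win",
    "SQL": "sequel",
}

def apply_sound_alikes(text, lexicon=SOUND_ALIKES):
    """Replace whole words found in the lexicon with their respellings."""
    if not lexicon:
        return text
    # Longest keys first so overlapping entries match predictably.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: lexicon[m.group(1)], text)
```

The substitution matches whole words only, so a longer word that merely contains a lexicon entry is left unchanged. The original text should still be used for anything shown on screen; only the string sent to the engine is rewritten.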
Creating and Localizing Text-to-Speech Voices
Creating a new voice for an engine that uses synthesis can be done relatively quickly by altering a few parameters of an existing voice. However, although the pitch and timbre of the new voice are different, it uses the same speaking style and prosody rules as the existing voice.
Creating a new voice for a text-to-speech engine that uses diphone concatenation can take a considerable amount of work, because the diphones must be acquired by recording a human voice and identifying the beginning and ending of phonemes, which are specific to the speaker's language.
Whether a text-to-speech engine uses synthesis or diphone concatenation, the work of localizing an engine for a new language requires a skilled linguist to design pronunciation and prosody rules and reprogram the engine to simulate the sound of the language's phonemes. In addition, diphone-concatenation systems require a new voice to be constructed for the new language. As a consequence, most engines support only five to ten major languages.
Using Text-to-Speech for Short Phrases
An application should use text-to-speech only for short phrases or notifications, not for reading long passages of text. Listening to a synthesized voice read more than a few sentences takes sustained concentration, and the user can become irritated.
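One way to enforce this guideline is to check the length of a message before speaking it. The sketch below is only an illustration: the two-sentence threshold and the function names are assumptions, and the sentence counter is deliberately rough.

```python
import re

# Assumed threshold; the guideline only says "more than a few sentences".
MAX_SPOKEN_SENTENCES = 2

def count_sentences(text):
    """Roughly count sentences: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])

def should_speak(text, max_sentences=MAX_SPOKEN_SENTENCES):
    """Return True only for non-empty text that is short enough to speak."""
    n = count_sentences(text)
    return 0 < n <= max_sentences
```

A message that fails the check can still be displayed as text; it is simply not handed to the engine.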
Presenting Important Information Visually
An application should communicate critical information visually as well as audibly, and it should not rely solely on text-to-speech to communicate important information. The user can miss spoken messages for a variety of reasons, such as not having speakers or headphones attached to the computer, being distracted or out of earshot when the application speaks, or simply having turned off text-to-speech.
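A minimal sketch of this rule follows, assuming the application supplies two callables: `display` for the visual channel and an optional `speak` for the engine. Both names are hypothetical stand-ins, not part of any speech API.

```python
def notify(message, display, speak=None, speech_enabled=True):
    """Always show the message; speak it only as an optional extra.

    Returns True if the message was also spoken, False otherwise.
    """
    display(message)  # the visual channel is unconditional
    if speak is None or not speech_enabled:
        return False
    try:
        speak(message)
    except Exception:
        # A speech failure must not hide the message; it is already on screen.
        return False
    return True
```

Because the visual call comes first and never depends on speech, the user sees the message whether or not an engine is present, enabled, or working.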
Avoiding a Mix of Text-to-Speech and Recorded Voice
The synthesized voice provided by even the best text-to-speech engine is noticeably different from that provided by a digital-audio recording. Mixing the two in the same utterance can be disturbing to the user (and usually makes the text-to-speech voice sound worse by comparison).
For example, to have an application speak "The number is 56,738," you should not prerecord "The number is" and use text-to-speech to speak the numbers. You should either prerecord everything or use text-to-speech for everything.
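The all-or-nothing choice can be sketched as follows. The function name, the `recordings` mapping, and the file name in the usage note are assumptions made for illustration.

```python
def plan_utterance(segments, recordings):
    """Decide how to voice one utterance without mixing sources.

    segments:   list of text pieces making up the utterance
    recordings: dict mapping a text piece to a prerecorded audio file

    Returns ("recorded", [files]) only if every segment has a recording;
    otherwise returns ("tts", full_text) so the engine speaks everything.
    """
    if segments and all(seg in recordings for seg in segments):
        return ("recorded", [recordings[seg] for seg in segments])
    return ("tts", " ".join(segments))
```

With only "The number is" available as a recording (say, `number_is.wav`), the segments `["The number is", "56,738"]` produce a plan in which the whole sentence is synthesized, so the two voices are never heard in the same utterance.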
Making Text-to-Speech Optional
An application should always allow the user to turn off text-to-speech. Some users work in environments in which a talking computer may distract coworkers or in which privacy may be important. Also, some users may simply dislike the sound of a synthesized voice.
Where the Engine Comes From
Of course, for text-to-speech to work on an end user's PC, the system must have a text-to-speech engine installed. The application has two choices:
The application can come bundled with a text-to-speech engine and install it itself. This guarantees that text-to-speech will be installed and also guarantees a certain level of quality from the text-to-speech. However, an application vendor that does this will need to pay royalties to the engine vendor.
Alternatively, an application can assume that the text-to-speech engine is already on the PC or that the user will purchase one if they wish to use text-to-speech. The user may already have text-to-speech because many PCs and sound cards come bundled with an engine, or because the user has purchased another application that included an engine. If the user has no text-to-speech engine installed, the application can tell the user that they need to purchase and install one. Several engine vendors offer retail versions of their engines.
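The two choices can be combined in a simple startup check. This is a sketch under stated assumptions: `installed_engines` stands for whatever list the platform reports (how that list is obtained is platform-specific and not shown), `bundled_engine` is whatever the application ships, and preferring an installed engine over the bundled one is a design choice made here rather than a requirement.

```python
NO_ENGINE_MESSAGE = (
    "Text-to-speech is unavailable. "
    "Install a text-to-speech engine to enable spoken output."
)

def choose_engine(installed_engines, bundled_engine=None):
    """Pick an engine, or return a message to show the user.

    Returns (engine, None) when an engine is available,
    or (None, message) when the user must install one.
    """
    if installed_engines:
        return installed_engines[0], None  # use what the user already has
    if bundled_engine is not None:
        return bundled_engine, None  # fall back to the shipped engine
    return None, NO_ENGINE_MESSAGE
```

An application that bundles an engine to guarantee a known voice quality might reverse the first two checks so that the bundled engine is always preferred.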