
Text To Speech for Telephony

Telephony applications are applications that are accessed over the telephone rather than locally on the PC. A GUI application may also support telephony features, although the user interface designs for the two interaction mechanisms are significantly different. Many GUI applications support telephony because a long-distance connection to the PC provides extra flexibility.

Here are some typical telephony applications:

  • Voice Mail or Answering Machine Software. Most users are familiar with voice mail or computerized answering machine software, which lets users call into a computer and retrieve audio messages that have been left for them. Voice-mail and answering machine programs are often extended to cover E-mail, address books, and other types of data.
  • Accessing Databases. Large numbers of telephony applications allow users to access databases such as movie listings, stock quotes, or news.
  • Call Routing. Many of the same telephony applications that provide voice mail or database access also allow incoming calls to be routed to other phone lines. Because most contemporary call routing systems rely on DTMF (touch-tone) to route the call, they ask for an extension number, but with speech recognition this could just as easily be a name.

Why Should a GUI Application use Telephony?

It's obvious why voice-mail and movie-listings applications should be accessible by phone. But why GUI applications? A GUI application might provide telephony functionality for the following reasons:

  • Remote Access. Any application that provides functionality users might want to access when they are not near a computer might use telephony. An application that can control the user's VCR is a great example: after all, how often have you forgotten to set the VCR to record a show, only to remember when you were at work?
  • Database. Any application that provides a database that users might want to access when they are not near a computer might use telephony. This includes E-mail systems, address books, etc.
  • Reminders and Alerts. Some applications, such as scheduling programs, need to remind the user to take action at specific times. These can use the telephone lines to call and remind the user no matter where in the world the user is. A scheduling application could remind all of a meeting's participants by giving them a call before the meeting starts. Alternatively, a PC could call a user when something important has happened, such as a hard drive on a server crashing over the weekend.

Why Use Speech in Telephony?

Contemporary telephony applications use recordings to play audio to the users, and wait for DTMF (touch-tone) responses from the user. Speech technology has the following advantages over DTMF and recorded audio:

  • Speech recognition allows a much larger set of inputs than DTMF. While DTMF can only offer the user 12 choices in any state, speech recognition can provide the user with hundreds or thousands of possible responses. Hence, a DTMF system that wishes to ask, "Who do you want to call?" needs a complicated menu structure to maneuver through thousands of names. A speech recognition system only has to hear the full name.
  • In a DTMF application, frequent users either have to memorize or write down arcane DTMF key sequences so they don't have to listen to all of the prompts. Users of speech recognition systems don't need to do this: it's much easier to remember what words to speak than what numbers to press.
  • In some parts of Europe, as many as 80% of telephones are rotary and cannot produce DTMF. In the United States, about 20% of telephones are rotary and cannot produce DTMF.
  • Synthesized text-to-speech allows anything to be spoken, so E-mail and names can be read to the user. It is also cheaper than hiring professional voice talent to make recordings, and it is easier to localize.

Hardware and Software Requirements

Telephony applications use the same speech recognition engines used for Command and Control speech recognition, and the same text-to-speech engines used on the PC.

These hardware and software requirements should be considered when designing a speech application:

  • Processor speed. The speech recognition and text-to-speech engines currently on the market typically require a 486/66 or faster processor.
  • Memory. On the average, the combination of speech recognition and text-to-speech will use 2 megabytes (MB) of random-access memory (RAM) in addition to that required by the running application.
  • Telephony card. A number of telephony cards are on the market today. On the low end are cards that use FAX/MODEM chips that have been augmented to handle speech. These are included in almost every new home PC. Higher end cards include DSPs or support for multiple phone lines.
  • Operating system. The Microsoft Speech application-programming interface (API) requires either Windows 95 or Windows NT version 3.5.
  • Speech-recognition and text-to-speech engine. Speech recognition and text-to-speech software must be installed on the user's system. Many new audio-enabled computers and sound cards are bundled with speech recognition and text-to-speech engines. As an alternative, many engine vendors offer retail packages for speech recognition or text-to-speech, and some license copies of their engines.


If you haven't already, you should look at the Voice Commands and Voice Text sections of the Microsoft Speech API documentation. Telephony applications use the same technologies.

Half-duplex Voice Modems

Early low-end "voice modems" were basically created by bolting record and playback capability onto an existing FAX/MODEM chip set. Although this made the modems cheap, they lacked some critical features needed to make speech recognition viable.

Low-cost voice modems are "half-duplex" and do not support "full-duplex" audio: a half-duplex modem cannot record and play audio at the same time. Because of this, an important feature, "barge-in," is not possible.

"Barge-in" allows a user to interrupt a prompt being spoken by the telephony application. With barge-in, a user can listen to the list of options and speak the one he wants when he hears it. Without barge-in the user must wait for all of the options to be read before speaking. Having to listen to a long prompt can get annoying.

Applications that expect to run on half-duplex voice modems need to make sure that they design around the problems of half-duplex audio, perhaps even using shorter prompts when barge-in is not available on the machine.
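One way to design around half-duplex hardware, as suggested above, is to pick shorter prompts when barge-in is unavailable. A minimal sketch (shown here in Python; the prompt text and capability flag are hypothetical):

```python
# Hypothetical sketch: choose prompt wording based on the audio hardware's
# duplex capability. The prompts below are illustrative only.

def choose_prompt(full_duplex: bool) -> str:
    """Return a prompt suited to the hardware's duplex capability."""
    if full_duplex:
        # Barge-in is possible: a longer, friendlier prompt is acceptable
        # because the caller can interrupt it at any point.
        return ("You can say the name of any department at any time: "
                "sales, support, billing, or shipping.")
    # Half-duplex: the caller must wait for the prompt to finish,
    # so keep it as short as possible.
    return "Say: sales, support, billing, or shipping."
```

The same options are offered either way; only the verbosity changes with the hardware.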

Note: As of 1998, most modems are not voice-capable, and of those that are, most aren't a good match for a telephony server:

  • They may only support a single line.
  • They may only support half-duplex audio.
  • They may take a long time to start (or stop) playing (or recording) audio.
  • They may have unsuitable audio characteristics.

Although you can use a typical voice modem for testing, we suggest that you use a high-end card for real-world products.

Speaker Independence

Speech recognizers for local PC use are designed to work well for most speakers. Some users, because of an unusual accent, dialect, or voice, will not get good accuracy. On the local PC this is not a large problem because the user can always train the speech recognizer to his/her voice, and if that doesn't work, he/she can always use the keyboard or mouse.

If a telephony recognizer does not work well for an individual, then he/she cannot easily use the application. Often it is not possible for him/her to train the speech recognizer to his/her voice. Additionally, many cellular phones (and wired phones) have such poor audio quality that humans can barely understand speech over them, let alone computers. Failure of speech recognition means failure of the application.

Application designers need to be aware that speech recognition will not work for some users, so users must have a back-off. A back-off can take several forms:

  • Often users get poor recognition because they are speaking in a manner that causes the speech recognizer to make mistakes. Give new users tips on how to use speech recognition, such as giving concise responses.
  • Detect that the user is having difficulty with speech recognition and provide queries that have fewer or more distinct answers to improve speech recognition accuracy.
  • Allow the user to respond with DTMF.
  • Detect that the user is having difficulty and forward the call to a real operator.
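The back-off forms above amount to an escalation ladder. A minimal sketch (in Python; the strategy names and the one-failure-per-step policy are invented for illustration):

```python
# Hypothetical sketch of the back-off ladder described above. The outcome
# strings and thresholds are illustrative, not part of any real API.

def next_strategy(failures: int) -> str:
    """Pick an input strategy based on how many recognition failures
    the caller has had so far in this session."""
    if failures == 0:
        return "speech"            # normal case: free spoken responses
    if failures == 1:
        return "speech-with-tips"  # coach the caller, simplify the question
    if failures == 2:
        return "dtmf"              # fall back to touch-tone input
    return "operator"              # give up and forward to a person
```

A real application would tune the thresholds per prompt, and might reset the counter after each successful recognition.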

Writing a Telephony Application

Conversational UI

Ideally, a call into a computer telephony application is just like (or better than) calling a real person. Using airline reservations as an example, it would be ideal for a user to call up a phone number and be able to have the following conversation with their computer:

Computer: Hello. Acme air travel.
User: Hi, could I book a reservation?
Computer: Sure, where do you want to fly from?
User: Yeah, uh, I want to fly from Seattle to Buffalo.
Computer: What day do you want to depart from Seattle?
User: Monday
Computer: Is that Monday the twenty third?
User: Yeah, and I want to arrive back on Friday.
Computer: Friday the twenty seventh?
User: Uh huh.
Computer: Okay. What time would you like to leave Seattle on Monday?
User: Ten O'clock. By the way, what's the weather supposed to be like?

The conversation simulates a human being in its ease of use and flexibility.

Because of the limited number of responses available from DTMF, and the need to explain what each number means in the particular context, today's IVR systems are more like this:

Computer: Welcome to Acme Air Travel. For departure information, press 1. For ticketing, press 2.

You've undoubtedly had to deal with this.

The technology necessary to simulate a fully conversational application is years away. It not only requires very accurate large-vocabulary speech recognition, but also natural language understanding and dialogue understanding. Speech recognition is easy compared to the understanding and dialogue parts.

Current speech recognition technology can provide a dialogue more like this:

Computer: Hello. Acme air travel. Do you want to inquire about today's flights or purchase a ticket?
User: I'd like to buy a ticket.
Computer: What city do you wish to depart from?
User: Uh, Seattle.
Computer: Did you say, "Seattle?"
User: Yes.
Computer: Where do you want to fly?
User: I want to fly to Buffalo.
Computer: I'm sorry. I didn't understand. Please say the city you want to fly to.
User: Buffalo

While it's not as friendly as the conversational UI, the speech UI is much better than DTMF. Notice that the user is led through the system and that his/her responses are limited to individual items; he/she can't combine them in a single statement. In a fully conversational system, the user can steer the conversation and combine any number of responses together. As the speech recognition, natural language, and dialogue technologies improve, the user interface will gain more of the components of a conversational UI.

The Microsoft telephony architecture is designed to facilitate a user interface that current technology can handle and that can gradually improve toward a conversational UI.

Breaking the dialogue into smaller parts

Because current speech technology (and natural language understanding/dialogue technology too) must work in a very limited domain, applications must lead the user through the conversation, asking for specific pieces of information. The airline ticket ordering application needs to ask for the following pieces of information:

  • Departure city
  • Destination city
  • Departure date
  • Departure time
  • Return date
  • Return time
  • Etc. Is it a round trip? How will you pay?

To fill in each of the fields, several questions must be asked. The "departure city" field might require:

  • Ask the user where he/she wishes to depart from. Because the technology is limited, the speech recognizer listens only for departure cities, not for other departure information such as times or dates.
  • If the user doesn't respond then ask again.
  • If the user presses DTMF then tell them they're supposed to speak.
  • If the user speaks but the recognizer doesn't understand then rephrase the question and ask again. If it doesn't work the second time then send the user to an operator.
  • If recognition occurs then verify it with the user. If it wasn't right then ask for the city again, perhaps offering alternatives. If the recognizer doesn't work several times in a row then back off to DTMF with a smaller list, or send the user to an operator.
  • If the user seems to be having a lot of troubles then send them to an operator.

As you can see, getting the departure city is actually pretty complicated. It's easy if the user speaks the right city and the computer recognizes it properly. However, recognition makes mistakes (recognizing "San Francisco" instead of "Seattle"), and more often, users don't give an expected response. (User says, "Uh, I want to leave from, uh, Seattle, please. That's in Washington State.")

A large portion of the code written to get any piece of information is handling the error conditions. In the case of a recognition error, it's verifying the recognition result. In the case of a user providing an unexpected response, it's getting the user to give an expected response.

The application designer must segment any dialogue into small, controllable, sub-dialogues. Each sub-dialogue gets a specific piece of information from the user.
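A sub-dialogue like "get the departure city" can be sketched as a small retry loop. The sketch below (in Python) is illustrative: the `recognize` and `confirm` callbacks, the city list, and the retry limit are all hypothetical stand-ins for a real recognizer and prompt player.

```python
# A minimal sketch of the "departure city" sub-dialogue described above,
# assuming a recognizer callback that returns the recognized city (or None
# for an unrecognized utterance) and a yes/no confirmation callback.
# All names, prompts, and limits here are hypothetical.

CITIES = {"seattle", "buffalo", "boston"}
MAX_ATTEMPTS = 2

def ask_departure_city(recognize, confirm):
    """Run the sub-dialogue; return the verified city, or None to signal
    that the caller should be forwarded to an operator."""
    for attempt in range(MAX_ATTEMPTS):
        city = recognize("What city do you wish to depart from?")
        if city is None:
            continue                      # unrecognized: rephrase and retry
        if city in CITIES and confirm(f'Did you say, "{city}?"'):
            return city                   # recognized and verified
    return None                           # too many failures: operator
```

Most of the code is error handling, which matches the observation above: the happy path is one line, and everything else copes with misrecognitions and unexpected responses.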

Application Design Considerations

Multi-Line Applications

Most telephony applications are designed to handle several phone lines coming into the same PC. Multi-line telephony applications need to be designed to handle the multiple input channels in such a way that one channel doesn't slow down or harm another channel.

The easiest multi-line design has one process running at least one thread per phone line. Because each line has its own thread, the lines are independent and (generally) one line will not slow down another. Multiple threads also allow for improved performance on multiprocessor machines. However, if one of the lines in a multi-threaded application causes a GP fault, all of the threads will die.

The most stable multi-line telephony design is to have one process per phone line. This ensures that one phone line cannot crash and pull down the others. It also parallelizes well, but it is more difficult to code.
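The thread-per-line design can be sketched in a few lines. This is illustrative only (shown in Python): `handle_line` stands in for the real per-call logic that would talk to the telephony card.

```python
# Sketch of the thread-per-line design described above. The worker body
# is a placeholder; a real application would answer calls, run the
# dialogue, and hang up inside handle_line.

import threading

def handle_line(line_number: int, results: dict) -> None:
    """Service one phone line; each line runs independently."""
    # ... answer calls, run the dialogue, hang up ...
    results[line_number] = "idle"

def start_lines(count: int) -> dict:
    """Start one thread per phone line and wait for them to finish."""
    results: dict = {}
    threads = [threading.Thread(target=handle_line, args=(n, results))
               for n in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The process-per-line design trades this simplicity for isolation: a crash in one worker process cannot take the other lines down with it.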

Design the Application for Speech Recognition

Many applications try to use speech recognition by altering their menus and structure only slightly. Rather than telling the user, "To transfer money press one, to open an account press two," and so on, they say, "To transfer money press or say one." While this makes the addition of speech recognition easy, it does not take full advantage of it.

An application using speech recognition should be designed from the start to use speech recognition as a primary input device. DTMF should be available, but only as a back-off for those few users that cannot get speech recognition to work.

Telephones don't have monitors, keyboards, or mice

Telephones don't have monitors, keyboards, or mice. Although this is not a limitation of the speech recognition and text-to-speech technologies, it does force application designers to rely heavily on them, even for purposes where they don't perform as well as monitors, keyboards, and mice. The lack of other input and output devices significantly changes the user interface. Application designers should be aware of the following effects:

  • For many uses, speech recognition is not as good an input device as the keyboard and mouse. When the user is accessing the application over the phone, he/she is forced to use speech recognition exclusively, so special care should be taken in the user interface design so that speech recognition's weaknesses don't preclude the use of the application.
  • Applications that have a GUI continually provide visual cues about the application's state. They have title bars, text displays, and various buttons and other controls that give users a clue about what they can do. Users accessing the application over the telephone do not get this information, or if they do it is delivered to them at the much slower pace of speech. Users of a telephony application often forget the application's state, so it is helpful to remind them occasionally.
  • Speech is slower at communicating information than video, and it does not easily allow users to select which information they want detailed. This means that telephony applications need to give users trimmed-down slices of information and allow the user to specify which pieces he/she wants more information about. For example, an E-mail application designed for a GUI will display a list of hundreds of messages. A telephony E-mail application cannot read out the titles of hundreds of messages; it must provide a user interface that allows the user to focus in on the messages he/she wishes to hear. The E-mail application might first ask the user if he/she wants to hear new messages or ones that were already read. From there it could organize messages by priority, and so on.
  • In any particular state a user might have hundreds of options. A GUI can visually display all of the options on the screen, but a telephony application cannot read them all out. When users enter a new state, telephony applications should read out an abridged list of options and allow users to ask for more options or more detailed information about an option.
  • When a user types a number or word into a field in a GUI application, he/she can see the result. Telephony applications cannot display results, so they must provide audio feedback to indicate that they heard the correct information. Because speech recognition often makes mistakes, telephony applications must also provide an easy mechanism for users to correct them.
  • Because it is not always obvious when the computer has stopped talking or that the computer has heard the user, the application should give audio feedback. The most effective audio feedback seems to be short beeps. Play one short beep at the end of a question to indicate that the user is expected to speak, another one when a response is recognized, and a third when speech is heard but unrecognized. This not only reassures the user that he/she is being heard, but it also hints to users that they're talking to a computer since real people don't beep.
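The beep protocol in the last point above is essentially a fixed mapping from dialogue events to tones. A trivial sketch (in Python; the event names and frequencies are invented for illustration):

```python
# Hypothetical sketch of the beep protocol described above: one tone when
# it is the caller's turn to speak, another on a recognized response, a
# third when speech was heard but not understood. Frequencies are made up.

BEEPS = {
    "your-turn":    440,   # end of question: caller may speak now
    "recognized":   660,   # response understood
    "unrecognized": 220,   # speech heard but not understood
}

def feedback_tone(event: str) -> int:
    """Return the beep frequency (Hz) to play for a dialogue event."""
    return BEEPS[event]
```

What matters is consistency: the same three tones used everywhere in the application quickly become a wordless status display.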

Telephony Users Often Have Different Motivations

A typical PC user has experience using the computer and the software. He/she has either personally purchased the software or needs to use it to get a job done.

Many telephony applications (like those that provide movie listings) are used by people with completely different motivations than PC users. Telephony users don't want to talk to a computer; they want to talk to a human to get their task completed. The telephony application is merely a substitute for the real thing, almost always inferior to the human, and an annoyance they have to live with. Because of this, telephony users are less forgiving of poor user interface design and are quick to hang up.

Furthermore, telephony users are a more diverse population. They come from all age groups and education levels. Many of them don't even know that they're talking to a computer and just think there's a very dense operator on the other end of the phone. Telephony applications using speech recognition get better accuracy if the users know that they're talking to a computer, because the users will adjust their speaking style.

Verification and Undo

Because speech recognition makes mistakes, it's important for the application to verify the data with the user. The level of verification can vary. An application might explicitly ask if it heard a response correctly when a misrecognition would cause the user significant problems.

If a misrecognition would only cause minor problems, the application should mention the recognition result in passing and allow the user to go back and correct the response. For example, if the movie application asks which theater after asking for the time, it could phrase the theater question to include the time: "Where do you want to see the movie, 2001, this evening?"

Getting Digits From the User

Even the best speech recognizer has only about 99% accuracy per digit. While this might seem high, the user has only about a 90% chance of getting all ten digits in a row recognized correctly. Because of this, any sequence of digits should be played back to the user for verification. If the speech engine misrecognized any of them, the application should break the digits into groups and have the user speak and verify each group separately. Because the groups have fewer digits, they have higher accuracy. Almost all long sequences of digits are grouped by convention; for example, phone numbers in the United States are two groups of three digits followed by a group of four digits.
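The arithmetic behind that claim, plus the conventional 3-3-4 grouping, can be checked directly (sketched in Python; the example number is made up):

```python
# With 99% per-digit accuracy, the chance of getting all ten digits right
# is 0.99 ** 10, which is roughly 90%. Grouping a US phone number as
# 3 + 3 + 4 digits lets each group be verified separately, and shorter
# groups fail less often (0.99 ** 3 is about 97%).

def all_correct_probability(per_digit: float, digits: int) -> float:
    """Probability that every digit in a sequence is recognized correctly."""
    return per_digit ** digits

def group_us_phone_number(digits: str) -> list:
    """Split a 10-digit US number into its conventional 3-3-4 groups."""
    assert len(digits) == 10
    return [digits[:3], digits[3:6], digits[6:]]

print(round(all_correct_probability(0.99, 10), 3))   # → 0.904
print(group_us_phone_number("2065551234"))           # → ['206', '555', '1234']
```

Verifying each short group independently means a single misrecognition only forces the caller to repeat three or four digits, not all ten.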

Skipping prompts/menus

Telephony applications that use DTMF can only provide twelve options at each state (0 through 9, *, and #). Speech recognition can provide hundreds or even thousands of options at a state. Even so, it's not practical to list all of the responses to the user because of the slowness of speech playback.

DTMF menus play a game of "20 questions" with the user. A financial telephony application might first ask whether the user wants to get a balance or make a transfer. If the user chooses transfer, it then asks from which account, then to which account, followed by how much. Each menu has a limited number of responses.

Speech recognition menus need to provide the same architecture so that novice users are guided by the menus and know what to do. As a result, novice users will maneuver through a speech enabled telephony application at about the same speed as one that provides only DTMF. However, experienced users will learn shortcuts.

Once the user has completed a task the slow way, a telephony application might give the user a hint about how to accomplish the task more quickly next time. For example, the financial application might say, "Next time you call, you can just say, 'Transfer $500 from checking to savings.'" The user will be able to bypass several menus the next time, saving him/her time.

Detect the User's Experience Level

If possible, an application should determine the user's experience with the system. If the user logs on, the application can keep statistics about him/her over several sessions. Alternatively, the system can notice whether the user is having difficulty maneuvering through prompts or is whizzing through them. As an application determines the user's experience level it can:

  • Shorten or elongate prompts depending upon the user's experience.
  • Emphasize DTMF if the user is having problems with speech recognition.
  • Provide more or less help.
  • Send the user to a real operator if they're having a lot of difficulty.
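The adaptations above reduce to choosing a prompt style from simple usage statistics. A sketch (in Python; the style names and thresholds are invented, and a real system would also factor in per-prompt behavior):

```python
# Hypothetical sketch of experience-based prompt adjustment, assuming the
# application tracks sessions and recent recognition errors per user.
# Style names and thresholds are illustrative only.

def prompt_style(sessions: int, recent_errors: int) -> str:
    """Choose a prompt style from simple per-user statistics."""
    if recent_errors >= 3:
        return "dtmf-emphasis"   # speech isn't working; push touch-tone
    if sessions < 3:
        return "long-prompts"    # novice: full explanations and help
    return "short-prompts"       # experienced: terse prompts, shortcuts
```

For anonymous callers, the same decision can be made within a single call by watching how quickly the user moves through the first few prompts.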

Usability Studies

Traditional GUI applications are intended to be used often and for long periods of time. Any confusing points in the user interface will eventually be overcome through repeated use and trial and error. The user might even have a manual or help file around to help him/her.

Often, telephony applications are used only occasionally by users, and only for short periods of time. It is likely that large portions of the calls will be from first time users. After all, how many people check the movie listings more than once a week or for more than five minutes at a time?

Because users on average will not have much experience with the application, the user interface must be as simple and self-explanatory as possible. To ensure this, the application must be put through extensive usability studies.

The usability work on a telephony application will typically proceed as follows:

  1. Application designers figure out the tasks that a user wishes to accomplish with the application.
  2. A prototype is coded up and a small group of users is given hypothetical tasks to see if they can accomplish them.
  3. The application designers use the feedback from the prototype to modify the application design.
  4. The real application is coded with logging ability. The logging ability keeps track of statistics (listed below) to figure out how successful a call was.
  5. If an existing service is being replaced by a new speech-enabled one, a small percentage of calls are diverted from the existing service into the speech-enabled application. Detailed logging information is kept.
  6. Application designers review the statistics and implement changes to improve performance.
  7. Repeat steps five and six, gradually expanding the number of calls, until the speech-enabled application is handling all of the phone calls.

During the initial stages of an application extensive statistics on performance should be kept. The statistics logging might be disabled when the application goes into full usage, or might be continued so that continual improvement can be made. Some statistics and data to keep are:

  • How many and which users actually succeed at completing their tasks? In many applications it's obvious when a user has completed his/her task. In a movie listing application, a successful caller will hang up after hearing the movie time; an unsuccessful one hangs up before the movie time is played. Some applications are more difficult to measure, and it may be necessary to poll a small sample of users about their satisfaction level.
  • How long does it take successful users to complete a task? The less time it takes a user to complete a task, the happier he/she will be, and the fewer telephony servers will be needed.
  • At each state or prompt, how many and which users speak a valid response? If a prompt verifies the data, how often is it wrong? How often does the speech recognizer return "unrecognized" for a prompt? How long does it take a user to get through a state/prompt? What are the most common user responses? Prompts that take too long to maneuver through, or that produce many unrecognized results or misrecognitions, need to be reworked. Features that aren't used can be hidden or removed. The most common responses should come first in the list of responses spoken to the user.
  • In the early stages of usability tests, applications should keep recordings of everything the user says to the speech recognizer. Application designers can listen to the responses and use them to adjust the wording of prompts or the responses that are accepted by the recognizer. If an application developer is striving for accuracy at any cost and is willing to pay the money to have custom speech recognition models created, the speech recognition vendor can use the audio from real-life users to improve accuracy.
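The per-prompt statistics above can be computed from a simple event log. A sketch (in Python; the log format and outcome labels are hypothetical, and a real system would also record timing and audio):

```python
# Sketch of the per-prompt failure-rate statistic described above.
# `log` is a list of (prompt_name, outcome) pairs, where outcome is
# "recognized", "unrecognized", or "misrecognized". All labels are
# illustrative stand-ins for a real logging format.

from collections import defaultdict

def summarize(log):
    """Return, per prompt, the fraction of attempts that failed."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for prompt, outcome in log:
        totals[prompt] += 1
        if outcome != "recognized":
            failures[prompt] += 1
    return {p: failures[p] / totals[p] for p in totals}
```

Prompts whose failure rate stands out in such a summary are the first candidates for rewording.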

Tweaking the prompts

Current speech recognition technology has the flaw that users must say one of the phrases that the computer expects or accuracy will fall. If the user speaks something that the speech recognizer isn't expecting then either the speech recognizer will return an "unrecognized" response, or worse, it will think it heard another command and do something completely different than what the user wanted.

An application designer should pay close attention to the wording of questions, since the phrasing and vocabulary significantly affect whether or not the user is likely to give one of the expected responses. For example, if the movie application wants to know what time the user wishes to see a movie, it could ask, "What time do you want to see the movie?" However, this can produce responses ranging from "This evening" to "7:00" to "Sometime tomorrow." If the question is reworded to, "Do you want to see an afternoon showing, evening showing, or late night showing?" then the user's responses will be more constrained.

An application should anticipate synonymous responses. Users will tend to use the same phrasing that they hear from the prompts. If the prompts don't hint at any vocabulary or phrasing then the responses will be varied. In the case of the movie time, the application should expect responses like "In the afternoon," "afternoon", and "afternoon showing". Prototypes will show what kind of responses are likely.
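Anticipating synonymous responses amounts to mapping each accepted phrasing to a canonical value. A sketch (in Python; the synonym table is illustrative, and a prototype would be used to discover the phrasings real callers actually use):

```python
# Hypothetical sketch of accepting synonymous responses, as suggested
# above: every phrasing the grammar accepts maps to one canonical value.

SYNONYMS = {
    "afternoon": "afternoon",
    "in the afternoon": "afternoon",
    "afternoon showing": "afternoon",
    "evening": "evening",
    "evening showing": "evening",
    "late night": "late-night",
    "late night showing": "late-night",
}

def normalize(response):
    """Map a caller's phrasing to a canonical showing time, or None
    if the phrasing is not in the grammar."""
    return SYNONYMS.get(response.strip().lower())
```

The rest of the application then deals only with the canonical values, no matter which phrasing the caller chose.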

Word spotting might work well in some prompts if the recognizer is only looking for a key-word like "afternoon", "evening", or "late night". If more than a few keywords are possible then accuracy decreases.

