Telephony applications are applications that are accessed over the telephone rather than locally on the PC. A GUI application may also support telephony features, although the user interface designs for the two interaction mechanisms are significantly different. Many GUI applications support telephony because of the flexibility that a long-distance connection to the PC provides.
Here are some typical telephony applications:
It's obvious why voice-mail and movie-listings applications should be accessible by phone. But why GUI applications? GUI applications might provide telephony functionality for the following reason:
Contemporary telephony applications use recordings to play audio to the users, and wait for DTMF (touch-tone) responses from the user. Speech technology has the following advantages over DTMF and recorded audio:
Telephony applications use the same speech recognition engines used for Command and Control speech recognition, and the same text-to-speech engines used on the PC.
These hardware and software requirements should be considered when designing a speech application:
If you haven't already, you should look at the Voice Commands and Voice Text sections of the Microsoft Speech API documentation. Telephony applications use the same technologies.
Half-duplex Voice Modems
Early low-end "voice modems" were basically created by bolting record and playback ability onto an existing fax/modem chip set. Although this made these voice modems cheap, they lacked some critical features needed to make speech recognition viable.
Low-cost voice modems are "half-duplex"; they do not support "full-duplex" audio. "Half-duplex" means that an application cannot record and play audio at the same time. Because of this, an important feature, "barge-in," is not possible.
"Barge-in" allows a user to interrupt a prompt being spoken by the telephony application. With barge-in, a user can listen to the list of options and speak the one he wants when he hears it. Without barge-in the user must wait for all of the options to be read before speaking. Having to listen to a long prompt can get annoying.
Applications that expect to run on half-duplex voice modems must design around the limitations of half-duplex audio, perhaps even using shorter prompts when barge-in is not available on the machine.
Note: As of 1998, most modems are not voice-capable, and of those that are, most aren't a good match for a telephony server:
Although you can use a typical voice modem for testing, we suggest that you use a high-end card for real-world products.
Speech recognizers for local PC use are designed to work well for most speakers. Some users, because of an unusual accent, dialect, or voice, will not get good accuracy. On the local PC this is not a serious problem because the user can always train the speech recognizer to his/her voice, and if that doesn't work he/she can always use the keyboard or mouse.
If a telephony recognizer does not work well for an individual then he/she cannot easily use the application. Often it is not possible for him/her to train the speech recognizer to his/her voice. Additionally, many cellular phones (and wired phones) have such poor audio quality that humans can barely understand speech on them, let alone computers. Failure of speech recognition also means failure of the application.
Application designers need to be aware that speech recognition will not work for some users, so the users must have a back-off. A back-off can take several forms:
Ideally, a call into a computer telephony application is just like (or better than) calling a real person. Using airline reservations as an example, it would be ideal for a user to call up a phone number and be able to have the following conversation with their computer:
Computer: Hello. Acme air travel.
User: Hi, could I book a reservation.
Computer: Sure, where do you want to fly from?
User: Yeah, uh, I want to fly from Seattle to Buffalo.
Computer: What day do you want to depart from Seattle?
Computer: Is that Monday the twenty third?
User: Yeah, and I want to arrive back on Friday.
Computer: Friday the twenty seventh?
User: Uh huh.
Computer: Okay. What time would you like to leave Seattle on Monday?
User: Ten O'clock. By the way, what's the weather supposed to be like?
The conversation simulates a human being in its ease of use and flexibility.
Because of the limited number of responses available from DTMF, and the need to explain what each number means in the particular context, today's IVR systems are more like this:
Computer: Welcome to Acme Air Travel. For departure information, press 1. For ticketing, press 2.
You've undoubtedly had to deal with this.
The technology necessary to simulate the fully conversational application is years away. It not only requires very accurate large-vocabulary speech recognition, but also natural language understanding and dialogue understanding. Speech recognition is easy compared to the understanding and dialogue parts.
Current speech recognition technology can provide a dialogue more like this:
Computer: Hello. Acme air travel. Do you want to inquire about today's flights or purchase a ticket?
User: I'd like to buy a ticket.
Computer: What city do you wish to depart from?
User: Uh, Seattle.
Computer: Did you say, "Seattle?"
Computer: Where do you want to fly?
User: I want to fly to Buffalo.
Computer: I'm sorry. I didn't understand. Please say the city you want to fly to.
While it's not as friendly as the conversational UI, the speech UI is much better than DTMF. Notice that the user is led through the system: his/her responses are limited to individual items, and he/she can't combine them into a single statement. In a fully conversational system, the user can steer the conversation and combine any number of responses together. As speech recognition, natural language, and dialogue technologies improve, the user interface will gain more of the components of a conversational UI.
The Microsoft telephony architecture is designed to facilitate a user interface that current technology can handle and to gradually improve toward a conversational UI.
Breaking the dialogue into smaller parts
Because current speech technology (and natural language understanding/dialogue technology too) must work in a very limited domain, applications must lead the user through the conversation, asking for specific pieces of information. The airline ticket ordering application needs to ask for the following pieces of information:
To fill in each of the fields, several questions must be asked. The "departure city" field might require:
As you can see, getting the departure city is actually pretty complicated. It's easy if the user speaks the right city and the computer recognizes it properly. However, recognition makes mistakes (recognizing "San Francisco" instead of "Seattle"), and, more often, users don't give an expected response. (User says, "Uh, I want to leave from, uh, Seattle, please. That's in Washington State.")
A large portion of the code written to get any piece of information is handling the error conditions. In the case of a recognition error, it's verifying the recognition result. In the case of a user providing an unexpected response, it's getting the user to give an expected response.
The application designer must segment any dialogue into small, controllable, sub-dialogues. Each sub-dialogue gets a specific piece of information from the user.
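One such sub-dialogue can be sketched as a small retry loop. This is a hypothetical illustration, not part of the Microsoft Speech API; the recognizer and confirmation steps are stubbed out with plain functions.

```python
# A sketch of one sub-dialogue: ask for a single field, verify the
# recognition result, and retry on failure.  The recognize/confirm
# callables are stand-ins for a real speech engine and a real
# "Did you say ...?" verification prompt.
def run_subdialogue(prompt, reprompt, recognize, confirm, max_tries=3):
    for _ in range(max_tries):
        answer = recognize(prompt)
        if answer is None:            # engine returned "unrecognized"
            prompt = reprompt         # fall back to a narrower prompt
            continue
        if confirm(answer):           # explicit verification check
            return answer
        prompt = reprompt
    return None                       # back off to DTMF or an operator

# Stub recognizer: fails once (unexpected response), then succeeds.
replies = iter([None, "Seattle"])
result = run_subdialogue(
    prompt="What city do you wish to depart from?",
    reprompt="Please say just the name of the departure city.",
    recognize=lambda p: next(replies),
    confirm=lambda ans: ans == "Seattle",
)
print(result)   # Seattle
```

Note how most of the code is error handling, as the text above observes: the happy path is one recognition and one confirmation; the rest exists to recover from misrecognitions and unexpected responses.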
Most telephony applications are designed to handle several phone lines coming into the same PC. Multi-line telephony applications need to be designed to handle the multiple input channels in such a way that one channel doesn't slow down or harm another channel.
The easiest multi-line application has one process running one thread per phone line. Because each line has its own thread, the lines are independent and (generally) one line will not cause another line to slow down. Multi-threaded lines also allow for improved performance on multi-processor machines. However, if one of the lines in a multi-threaded application causes a GP fault, all of the threads will die.
The most stable multi-line telephony design is to have one process per phone line. This ensures that one phone line cannot crash and pull down the other lines. It also parallelizes well, but it is more difficult to code.
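The thread-per-line design can be sketched as follows. The call handling is stubbed; in a real server each worker would drive the telephony hardware for its line.

```python
import threading
import queue

# Thread-per-line sketch: each phone line gets its own worker thread, so a
# slow caller on one line does not block callers on the other lines.
def handle_line(line_id, calls, results):
    while True:
        call = calls.get()
        if call is None:              # shutdown sentinel
            return
        # Stand-in for answering the call and running the dialogue.
        results.put((line_id, f"handled {call}"))

results = queue.Queue()
lines = []
for line_id in range(4):              # a 4-line server
    calls = queue.Queue()
    t = threading.Thread(target=handle_line, args=(line_id, calls, results))
    t.start()
    lines.append((t, calls))

# Deliver one incoming call to each line, then shut the workers down.
for i, (t, calls) in enumerate(lines):
    calls.put(f"call-{i}")
for t, calls in lines:
    calls.put(None)
for t, calls in lines:
    t.join()

print(results.qsize())   # 4
```

The process-per-line variant described above would use separate processes instead of threads, so a crash in one line's handler cannot take down the others, at the cost of more coordination code.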
Design the Application for Speech Recognition
Many applications try to use speech recognition by altering the menus and structure only slightly. Rather than telling the user, "To transfer money press one, to open an account press two," they say, "To transfer money press or say one." While this makes the addition of speech recognition easy, it does not take full advantage of it.
An application using speech recognition should be designed from the start to use speech recognition as a primary input device. DTMF should be available, but only as a back-off for those few users that cannot get speech recognition to work.
Telephones don't have monitors, keyboards, or mice
Telephones don't have monitors, keyboards, or mice. Although this is not a limit of the speech recognition and text-to-speech technologies, it does require that application designers rely heavily on the technologies. Because of this, the technologies must be used even for purposes where they don't perform as well as monitors, keyboards, and mice. The lack of other input and output devices significantly changes the user interface. Application designers should be aware of the following effects:
Telephony Users Often Have Different Motivations
A typical PC user has experience using the computer and the software. He/she has either personally purchased the software or needs to use it to get his/her job done.
Many telephony applications (like those that provide movie listings) are used by people that have completely different motivations than PC users. Telephony users don't want to talk to a computer; they want to talk to a human to get their task completed. The telephony application is merely a substitute for the real thing, and almost always inferior to the human: an annoyance they have to live with. Because of this, telephony users are less forgiving of poor user interface design and are quick to hang up.
Furthermore, telephony users are a more diverse population. They are from all age groups and education levels. Many of them don't even know that they're talking to a computer and just think it's a very dense operator on the other side of the phone. Telephony applications using speech recognition get better "accuracy" if the users know that they're talking to a computer because the users will adjust their speaking style.
Verification and Undo
Because speech recognition makes mistakes, it's important for the application to verify the data with the user. The level of verification can vary. An application might explicitly ask if it heard a response correctly if a misrecognition would cause the user significant problems.
If a misrecognition would only cause minor problems the application should mention the recognition result in passing and allow the user to go back and correct the response. For example, if after asking the time the movie application asked what theater, it could phrase the theater question to include the time. "Where do you want to see the movie, 2001, this evening?"
Getting Digits From the User
Even the best speech recognizer has only 99% accuracy per digit. While this might seem high, the user has only about a 90% chance of getting all ten digits in a row recognized correctly. Because of this, any sequence of digits that is entered should be played back to the user for verification. If the speech engine misrecognized any of them, the application should break the digits into groups and have the user speak and verify each group. Because the groups have fewer digits, they have a higher accuracy. Almost all long sequences of digits are grouped by convention. Example: phone numbers in the United States of America are two groups of three digits followed by a group of four digits.
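The arithmetic behind grouping is worth making explicit. Using the 99% per-digit figure quoted above and the US 3-3-4 phone-number grouping:

```python
# Probability that an n-digit string is recognized with no errors,
# assuming independent per-digit accuracy p.
def string_accuracy(p, n):
    return p ** n

per_digit = 0.99

# All ten digits in one utterance: roughly a 90% chance of a perfect result.
whole = string_accuracy(per_digit, 10)

# US phone-number grouping: 3 + 3 + 4 digits.  Each group is verified
# independently, so only a failed group has to be spoken again.
groups = [3, 3, 4]
group_accuracies = [string_accuracy(per_digit, n) for n in groups]

print(f"whole string: {whole:.3f}")       # 0.904
for n, acc in zip(groups, group_accuracies):
    print(f"group of {n}: {acc:.3f}")     # 0.970, 0.970, 0.961
```

Each group individually succeeds about 96-97% of the time, so on a failure the user typically repeats only three or four digits rather than all ten.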
Telephony applications that use DTMF can only provide twelve options at each state (0 through 9, * and #). Speech recognition can provide hundreds and even thousands of options at a state. Even though the speech recognition can handle hundreds of responses, it's not possible to list all of the responses to the user because of the slowness of speech playback.
DTMF menus play a game of "20 questions" with the user. A financial telephony application might first ask the user if they want to get their balance or transfer. If they transfer it then asks the user from what account, and then to which account, followed by how much. Each menu has a limited number of responses.
Speech recognition menus need to provide the same architecture so that novice users are guided by the menus and know what to do. As a result, novice users will maneuver through a speech enabled telephony application at about the same speed as one that provides only DTMF. However, experienced users will learn shortcuts.
Once the user has completed a task the slow way, a telephony application might give the user a hint and tell him how to accomplish the task quicker the next time. For example, the financial application might say, "Next time you call, you can just say, 'Transfer $500 from checking to savings.'" The user will be able to bypass several menus the next time, saving him/her time.
Detect the User's Experience Level
If possible, an application should determine the user's experience with the system. If the user logs on, the application can keep statistics about him/her individually over several sessions. Alternatively, the system can notice if the user is having difficulty maneuvering through prompts or is whizzing through them. As an application determines the user's experience level it can:
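One simple approach can be sketched as follows. The class and threshold here are hypothetical, purely to illustrate switching from verbose novice prompts to terse expert prompts once the user has completed enough sessions.

```python
# Hypothetical per-user experience tracking: after a few completed
# sessions, the application assumes the caller knows the menus and
# switches to shorter prompts.
class UserProfile:
    def __init__(self, expert_threshold=3):
        self.sessions = 0
        self.expert_threshold = expert_threshold

    def end_session(self):
        self.sessions += 1

    def prompt_for(self, novice_prompt, expert_prompt):
        if self.sessions >= self.expert_threshold:
            return expert_prompt
        return novice_prompt

user = UserProfile()
novice = "To transfer money, say 'transfer'. To hear your balance, say 'balance'."
expert = "What would you like to do?"

print(user.prompt_for(novice, expert))   # verbose prompt for a new caller
for _ in range(3):
    user.end_session()
print(user.prompt_for(novice, expert))   # terse prompt for a repeat caller
```

A real system could also lower the experience estimate again if the statistics show the user stumbling through prompts.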
Traditional GUI applications are intended to be used often and for long periods of time by their users. Any confusing points in the user interface will eventually be overcome through repeated use, and trial and error. The user might even have a manual or help file around to help him/her.
Often, telephony applications are used only occasionally by users, and only for short periods of time. It is likely that large portions of the calls will be from first time users. After all, how many people check the movie listings more than once a week or for more than five minutes at a time?
Because users on average will not have much experience with the application, the user interface must be as simple and self-explanatory as possible. To ensure this, the application must be put through extensive usability studies.
The usability work in a telephony application will typically proceed as follows:
During the initial stages of an application extensive statistics on performance should be kept. The statistics logging might be disabled when the application goes into full usage, or might be continued so that continual improvement can be made. Some statistics and data to keep are:
Tweaking the prompts
Current speech recognition technology has the flaw that users must say one of the phrases that the computer expects or accuracy will fall. If the user speaks something that the speech recognizer isn't expecting then either the speech recognizer will return an "unrecognized" response, or worse, it will think it heard another command and do something completely different than what the user wanted.
An application designer should pay close attention to the wording of questions since the phrasing and vocabulary will significantly affect whether or not the user is likely to give one of the expected responses. For example, if the movie application wants to know what time the user wishes to see a movie, it could ask, "What time do you want to see the movie?" However, this can produce responses ranging from "This evening" to "7:00" to "Sometime tomorrow." If the question is reworded to, "Do you want to see an afternoon showing, evening showing, or late night showing?" then the user's response will be more limited.
An application should anticipate synonymous responses. Users will tend to use the same phrasing that they hear from the prompts. If the prompts don't hint at any vocabulary or phrasing then the responses will be varied. In the case of the movie time, the application should expect responses like "In the afternoon," "afternoon", and "afternoon showing". Prototypes will show what kind of responses are likely.
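Handling those synonymous responses amounts to mapping each phrasing the user might say onto one canonical value. A minimal sketch, with phrase lists that are purely illustrative (a real application would collect them from prototype transcripts, as suggested above):

```python
# Map synonymous phrasings onto one canonical showtime slot value.
# The variant lists here are invented examples, not real transcript data.
SHOWTIME_SYNONYMS = {
    "afternoon": {"afternoon", "in the afternoon", "afternoon showing"},
    "evening": {"evening", "this evening", "evening showing", "tonight"},
    "late night": {"late night", "late night showing"},
}

def normalize_showtime(utterance):
    phrase = utterance.strip().lower()
    for canonical, variants in SHOWTIME_SYNONYMS.items():
        if phrase in variants:
            return canonical
    return None   # unexpected response: re-prompt the user

print(normalize_showtime("In the afternoon"))   # afternoon
print(normalize_showtime("tonight"))            # evening
print(normalize_showtime("sometime tomorrow"))  # None -> re-prompt
```

In a real recognizer these variants would typically be listed in the recognition grammar itself rather than matched after the fact, but the principle is the same: every expected phrasing resolves to one slot value.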
Word spotting might work well for some prompts, where the recognizer listens only for a keyword like "afternoon", "evening", or "late night". If more than a few keywords are possible, accuracy decreases.