What is VoiceXML?
by Susan Smith




Last month we learned all about the different flavors of XML that may be applicable to the GIS community. Hitching its wagon to the XML star is VoiceXML, a development language that shows great promise for GIS and that ultimately will allow users to capture information by voice or telephone.

VoiceXML is a technology standard developed and managed by the VoiceXML Forum to provide a way to interact with services through a voice interface. The VoiceXML specification has been accepted by the World Wide Web Consortium (W3C) (www.w3c.org) as the basis for its dialogue markup language. The VoiceXML Forum (www.voicexml.org) has the 1.0 release available for download.

The work done by the Forum is based upon earlier research by AT&T Labs, IBM, Lucent Technologies' Bell Labs, and Motorola. The Forum was founded by AT&T, Lucent Technologies, Motorola and IBM. GISVision spoke with Dan Furman, President of Lucent Speech Solutions at Lucent Technologies, to find out more about VoiceXML and how it may impact the GIS industry.

Does VoiceXML have any relationship with the XML group?

We based our standard on their standard, that is, the W3C XML standard. VoiceXML is an application of XML.

What are the advantages of creating a VoiceXML standard?

You have HTML that allows people to create web pages in a standard language that web developers can learn, equipment manufacturers can build to and content providers can use as a way to distribute their content. We want to do the same thing with voice--to create a common language, so developers feel if they learn the language there will be people who will want their skills. They can create content in that language that will be available for users of diverse hardware and software platforms. It's a way of trying to reduce the costs of developing services that allow people to get information over the telephone, by talking or using touch-tones for input and by listening to recorded or synthesized speech.

The goal of the language is to make the creation of those kinds of services inexpensive, standardized and uniform.

Will VoiceXML facilitate the voice querying of databases?

Sure, you can call up and ask for the weather. There's a service called Tell Me, one of a growing number of voice portals. You can say "Give me the weather" for a particular city. Services like this let you speak a request over the telephone, and they will then pull the appropriate information. The hope with VoiceXML is that many content providers will put up their information in VoiceXML, then you can go through portals to capture that information.
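
As a rough illustration of what a weather request like the one described above might look like in VoiceXML, the Python sketch below builds a small dialog with the standard library. The element names (vxml, form, field, prompt, grammar, filled, submit) come from VoiceXML; the city list, the inline grammar syntax, and the address http://example.com/weather are hypothetical, and a real voice portal may be built quite differently.

```python
import xml.etree.ElementTree as ET


def build_weather_dialog(cities, submit_url):
    """Return a VoiceXML document (as a string) that asks the caller for a city."""
    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form", id="weather")
    field = ET.SubElement(form, "field", name="city")

    # What the caller hears.
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Which city would you like the weather for?"

    # Words the recognizer should accept. The inline syntax used here
    # (alternatives separated by "|") is an assumption; the accepted
    # grammar format depends on the voice platform.
    grammar = ET.SubElement(field, "grammar")
    grammar.text = " | ".join(cities)

    # Once a city is recognized, send it to the (hypothetical) server.
    filled = ET.SubElement(field, "filled")
    ET.SubElement(filled, "submit", next=submit_url, namelist="city")

    return ET.tostring(vxml, encoding="unicode")


doc = build_weather_dialog(
    ["Boston", "Denver", "Seattle"], "http://example.com/weather"
)
print(doc)
```

The content provider would serve a document like this from a web server, and the voice portal would fetch and interpret it much as a browser fetches an HTML page.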

I see it working in another, quite a different way. I see people creating a personal web page the way they do on Yahoo and basically having access to their personal web page wherever they are. That's where the value comes from in voice access to the Internet -- people are not going to browse web pages over the phone. They want access to info they know they want, using their phones as well as other devices.

How does this have a bearing on the geographic information systems community?

At some point phones will carry geographic information as to where they are located. Then you can tie together the geographic location of the caller with the request -- to find the nearest gas station, for example.
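
To make the "nearest gas station" idea concrete, here is a minimal Python sketch of the lookup a service might run once it knows the caller's position. It is not part of VoiceXML; the station names and coordinates are made up, and the great-circle (haversine) distance is only one reasonable choice, since a real service would more likely use driving distance from a GIS.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest(caller, places):
    """Return the place closest to the caller's (lat, lon) position."""
    lat, lon = caller
    return min(places, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]))


# Hypothetical data: a caller position and two candidate stations.
caller_position = (40.00, -105.00)
stations = [
    {"name": "Station A", "lat": 40.01, "lon": -105.00},
    {"name": "Station B", "lat": 41.00, "lon": -105.00},
]
print(nearest(caller_position, stations)["name"])
```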

You'd like to just say, "Get me restaurant directions" while you're driving, and get that information without having to pick up a phone or use a handheld device.

A big capability for GIS would be a wireless telephone capable of reporting its position. At the moment that's a dream. People want devices to be small, which is in conflict with their desire for greater functionality.

VoiceXML could be used by application developers to quickly and economically write software that does a very specialized task, such as giving you voice access to a database or an interactive service. The difference is that you don't need to be an expert in speech technology, although it helps, and you don't need to know much about how networks operate, in order to use the language to write an application that draws on the power of speech technology and the Internet. With HTML you don't have to know how a router works -- they've really made it easy for a web page developer just to understand web pages. We're trying to do the same thing with voice: you don't have to know how the telephone network works, you don't need to know how the database access works, you just have to know what you want to do.

You want to be able to ask a database questions, and since many people are using the database you probably want to be able to synthesize out what it's saying. VoiceXML makes speech technology, which allows computers to talk and to listen, more accessible to more people - developers, content and service providers, and their customers.

Will VoiceXML require that you use keywords? For example, when you do a search on a computer now, you must enter specific keywords and perhaps add a plus sign.

It takes the public a while to learn to say things that a computer can understand. Understanding natural language is not very easy. It could be like Movie Fone, where the database asks you for the name of a movie and you have a menu choice.
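
A menu-style dialog of the kind mentioned above might be sketched as follows. The Python code uses the standard library to generate a VoiceXML menu; the menu, prompt, enumerate and choice elements come from VoiceXML, while the movie titles and addresses are placeholders, and this is not a description of how any existing movie service is built.

```python
import xml.etree.ElementTree as ET


def build_movie_menu(movies):
    """movies: list of (title, url) pairs. Returns a VoiceXML menu as a string."""
    vxml = ET.Element("vxml", version="1.0")
    menu = ET.SubElement(vxml, "menu")

    # The <enumerate/> element asks the platform to read the choices aloud.
    prompt = ET.SubElement(menu, "prompt")
    prompt.text = "Which movie are you interested in? Say one of: "
    ET.SubElement(prompt, "enumerate")

    # Each choice names what the caller can say and where to go next.
    for title, url in movies:
        choice = ET.SubElement(menu, "choice", next=url)
        choice.text = title

    return ET.tostring(vxml, encoding="unicode")


# Placeholder titles and addresses, for illustration only.
menu_doc = build_movie_menu([
    ("Movie A", "http://example.com/movies/a.vxml"),
    ("Movie B", "http://example.com/movies/b.vxml"),
])
print(menu_doc)
```

Restricting the caller to a short list of choices is what lets the recognizer work without understanding free-form language.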

What are some of the greatest challenges of speech technology?

In order to recognize what the person says you have to tell the computer this is the range of ways people might say it. It's not simple because you have to put in the various phonetic transcriptions of the way people say things. It's part of what's called the grammar specification.
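
The idea of listing "the range of ways people might say it" can be illustrated with a toy example. The Python sketch below works on text that has already been transcribed, which is a large simplification: a real recognizer works on audio and phonetic transcriptions, and the phrase lists here are invented for illustration.

```python
# A toy "grammar": each canonical answer maps to the phrasings we accept.
GRAMMAR = {
    "yes": ["yes", "yeah", "yep", "sure", "that's right"],
    "no": ["no", "nope", "no thanks"],
}


def build_lookup(grammar):
    """Flatten the grammar into a phrase -> canonical answer table."""
    return {
        phrase.lower(): canonical
        for canonical, phrases in grammar.items()
        for phrase in phrases
    }


def interpret(utterance, lookup):
    """Return the canonical answer, or None if the phrase is not in the grammar."""
    return lookup.get(utterance.strip().lower())


lookup = build_lookup(GRAMMAR)
print(interpret("Yeah", lookup))
print(interpret("maybe later", lookup))
```

Anything outside the listed phrasings comes back as None, which is the point at which a dialog would typically ask the caller to repeat the answer.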

We have a lot more people from other countries with accents. How can the technology keep up with those variations?

There has been a lot of progress in recent years in producing speaker-independent recognition systems that can deal with wide variations in regional and other accents.

It helps to keep the problem to a limited domain. For example, if I ask you what movie you are interested in, there are only 100-200 movies that people are really interested in at one time. The computer can be better than a human being in terms of listening to accents--it's stupid but it does well if it has a limited task. The computer will do better recognizing digits in a phone number than a human.

What is Lucent's role?

Lucent is involved because it is a leader in networking, speech technology, and software, and because we believe standards help to grow markets. Lucent founded the VoiceXML Forum with Motorola, AT&T and IBM. The four companies felt that if we worked together to produce an open standard, we and other companies could reap the benefits of an expanding market for speech technology, applications, and communication services.

I think the VoiceXML Forum now has around two hundred supporting companies. There's a momentum building to make this happen. We are working with the World Wide Web Consortium (W3C). We submitted the 1.0 to them to consider for formal standardization and they have accepted it as the basis of their dialogue markup language.

Within the Forum, we are trying to maintain the right balance between working in small groups so we can go fast and keeping it democratic and open so we can use the best ideas that are out there.

Susan Smith is the executive editor of GISVision. She can be reached at susan.smith@ibsystems.com

Links:

For information on VoiceXML, see www.voicexml.org.

For insight into speech technology from Lucent's Bell Labs, see
www.lucent.com/speech/welcome.html
www.bell-labs.com/project/tts/
www.bell-labs.com/org/1133/Research/SpokenDialogSystems/index.html