Author: Mihai Sucan (ROBO Design)
This tutorial aims to help you add voice interactivity to your site, with minimal code changes and maximal browser compatibility.
Along the way, examples will be provided, and at the end you will be able to test a fully working, real-world, voice-enabled site. This tutorial describes the use of a reusable VoiceXML form.
Because the voice capability is included in the browser, you do not need to write your own speech recognition engine or speech synthesizer. This is a great advantage to you and to your Web application users.
You should be familiar with the following:
Adding voice interactivity to your site can be done with any document, but it is recommended that you use clean, semantic markup along with a CSS layout.
It is recommended that you read the existing tutorials from Opera Software: Authoring XHTML+Voice. These are very useful introductory articles.
To make voice work you need to convert your existing HTML documents to well-formed XHTML. Because of this, your server also has to be configured to send the documents with the application/xhtml+xml MIME type (or another XML MIME type). Doing so is actually easy, and not as great a problem as it was years ago.
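For example, with the Apache HTTP server the MIME type can be mapped to a file extension in the server configuration or an .htaccess file. This is a minimal sketch; the .xhtml extension is my choice here, and any content negotiation is left up to you:

```apache
# Minimal sketch for Apache (mod_mime): serve .xhtml files with the
# XML MIME type required for X+V. Adjust to your own server setup.
AddType application/xhtml+xml .xhtml
```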
Recommended reading:
Let us start from a simple document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>PRO-Net - Example 1</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="stylesheet" type="text/css" href="example1.css">
</head>
<body>
<p id="skipnav"><a href="#content">Skip the navigation.</a></p>
<h1 id="header"><a href="site/">PRO-Net</a></h1>
<div id="nav">
<ul id="menus1">
<li><a href="site/offers">Offers</a></li>
<li><a href="site/support/dialup">Support</a></li>
<li><a href="site/contact">Contact</a></li>
</ul>
<ul id="menus2">
<li><a href="site/support/dialup">Dialup</a></li>
<li><a href="site/support/email">Email configuration</a></li>
<li><a href="site/support/connecting">Connecting to PRO-net</a></li>
</ul>
</div>
<h1 id="pagetitle">Example 1</h1>
<div id="content">
<p>Example page.</p>
</div>
<div id="footer">
<a id="backtotop" href="#header">Back to top</a>
<p>Tutorial example</p>
</div>
</body>
</html>
Now let us add the simple VoiceXML form into the <head> section:
<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
<block>
<value expr="voice_ptitle()" />
</block>
</form>
The <form> is the VoiceXML form container. Currently we only have a <block> which reads the page title. We use the expr attribute because we will have a JavaScript function that returns the page title.
Note: It is tempting to make the form automatically read the content of the page, on page load. In practice, this is not as good as it seems. Reading only the title is enough. A voice command for reading the content of the page will be provided.
We will discuss the JavaScript in the following section.
Here we shall add the voice command input <field> in <form>:
<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
<!-- ... snippet ... -->
<field name="usrcmd">
<grammar type="application/srgs" src="example2.gram" />
<prompt timeout="10s">
<ss:break time="5s" />
<value expr="document.voice_msg['prompt']" />
</prompt>
<prompt count="2" timeout="600s">
<ss:break time="300s" />
<value expr="document.voice_msg['prompt']" />
</prompt>
<catch event="help nomatch noinput">
<value expr="document.voice_msg[_event]" />
<reprompt />
</catch>
</field>
</form>
The <grammar /> tag loads an external grammar file. With grammars you control the speech recognition engine, giving it the possible commands the user can say. Further details are provided in the following chapter.
With prompts you can have text synthesized. In this case the prompt reads "Please input your command", which is provided in the JavaScript (see "The script" chapter).
We have one prompt which is heard after 15 seconds (the 5 second break plus the 10 second timeout), and subsequent prompts are played at much larger intervals. We do not want to annoy visitors by repeating the same message too often.
We use the catch event for playing the messages appropriate for each event, and we have to use <reprompt /> to avoid playing the noinput event message forever.
The SSML breaks are used to avoid the automatic reading of the <prompt> message when executing <reprompt /> in the event catcher.
Note: All messages are stored in JavaScript variables because, if we put them inside the VoiceXML form, they would appear in legacy browsers with no CSS support.
The final part, and the most important one, is the <filled> section, which is executed when the user says something that has been recognized:
<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
...
<field name="usrcmd">
...
<filled>
<assign name="voice_result"
expr="voice_done(application.lastresult$);" />
<if cond="voice_result == 'event-nomatch'">
<clear namelist="usrcmd" />
<throw event="nomatch" />
<elseif cond="voice_result.action == 'prompt-element'" />
<prompt xv:expr="voice_result.src" />
<clear namelist="usrcmd" />
<elseif cond="voice_result.action == 'prompt-value'" />
<value expr="voice_result.message" />
<clear namelist="usrcmd" />
</if>
</filled>
</field>
</form>
The assigned function voice_done() is the one that processes the interpretation of what the user said, stored in application.lastresult$. The returned value is assigned to voice_result because this way we can "communicate" from the JavaScript function back to the current VoiceXML form.
In the above code we have 3 "actions":

- voice_result == 'event-nomatch' for simulating the nomatch event. Your script could determine that something does not exist (e.g. news item 10).
- voice_result.action == 'prompt-element' for reading a part of the document, using the #element-id.
- voice_result.action == 'prompt-value' for reading a message. The message can be programmatically generated by the script - it's unlimited.

The grammar tells which user utterances will be matched.
#ABNF 1.0 utf-8;
language en;
mode voice;
root $command;
tag-format <semantics/1.0>;
public $intropage = (go to | visit | jump to | load) [the];
public $introspeak = speak | read | narrate | talk;
$pages = $intropage (
(start | home | first | front) {$ = "site/";} |
offers {$ = "site/offers";} |
support {$ = "site/support/dialup";} |
contact {$ = "site/contact";}
) [page];
$speakers = $introspeak (
((header | navigation | menus | menu | main) [bar]) {$ = "#nav";} |
(page | site | content) {$ = "#content";}
);
public $command =
$pages {$.action = "load-page"; $.page = $$;} |
$speakers {$.action = "prompt-element"; $.src = $$;};
We have the root rule $command with "commands", one for speaking something, and the other for loading another page.
These are represented by the $speakers and $pages rules respectively, which in turn call on other rules. In this grammar, utterances like "visit the support page" or "read menus" will match, but "go to bed" will not.
You can modify the grammar to your liking and to your user testing. You might, for instance, discover that some users say "read the menus" (which will not match) instead of "read menus" (which will). To accommodate them you can change the rule to be "$introspeak = (speak | read | narrate | talk) [the];"
. Or if you later want to extend the pages list, you can modify the $pages rule.
Note: The content inside the curly braces is referred to as semantic interpretation. Here we create the JavaScript objects we are going to use in the following section.
This approach simplifies processing: you do not have to care about what the user actually said ("speak menus" and "read navigation bar" will both read out the available navigation options). You can tailor the grammar to map accepted user input onto the program logic.
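To make the semantic interpretation concrete, here is a hand-written sketch of the objects the curly-brace tags above would produce for a few sample utterances. In the browser these objects are built by the speech engine; this table is mine, written only to show their shape:

```javascript
// Illustrative only: the semantic-interpretation results the grammar
// above would hand to the script for some sample utterances.
var sampleResults = {
  'visit the support page': { action: 'load-page',      page: 'site/support/dialup' },
  'go to offers':           { action: 'load-page',      page: 'site/offers' },
  'read menus':             { action: 'prompt-element', src:  '#nav' },
  'speak content':          { action: 'prompt-element', src:  '#content' }
};

// 'go to bed' matches no rule, so the engine raises a nomatch event
// instead of producing any interpretation object.
console.log(sampleResults['read menus'].src); // '#nav'
```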
Let us start with the messages:
if (document.addEventListener)
  document.addEventListener('load', function ()
  {
    document.voice_msg = {
      'help'    : 'You can say: speak page, speak navigation, speak content.',
      'nomatch' : 'Try again.',
      'noinput' : 'If you need help, ask for help.',
      'prompt'  : 'Please input your command.',
      'notitle' : 'Untitled document'
    };
  }, false);
You may wonder why I code these messages in JavaScript instead of using the native VoiceXML elements for the same events (like <help> and <nomatch>). Any VoiceXML browser must support JavaScript, so speech engine compatibility is not affected, and there are a few benefits: legacy browsers will not end up displaying the VoiceXML contents on the page, and you can have dynamic messages based on any conditions you want.
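As a sketch of such a dynamic message, the help text could be generated from a list of known commands instead of being hard-coded. The voiceCommands array below is illustrative, not part of the tutorial code:

```javascript
// Sketch: build the 'help' message from a list of known commands,
// so adding a new command automatically updates the spoken help text.
var voiceCommands = ['speak page', 'speak navigation', 'speak content'];

var voice_msg = {
  'nomatch' : 'Try again.',
  'noinput' : 'If you need help, ask for help.',
  'prompt'  : 'Please input your command.',
  'notitle' : 'Untitled document',
  // Generated instead of hard-coded:
  'help'    : 'You can say: ' + voiceCommands.join(', ') + '.'
};

console.log(voice_msg.help);
// 'You can say: speak page, speak navigation, speak content.'
```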
Let us continue with the first function:
function voice_ptitle()
{
  var elem = document.getElementsByTagName('title')[0];
  if (!elem || !elem.firstChild)
    return document.voice_msg.notitle;
  else
    return elem.firstChild.data;
}
We use this method as a workaround for the limitation in the (X)HTML standard that does not allow an ID attribute on the <title> element. Thus we cannot use <prompt xv:src="#element-id" /> and still have a valid document.
document.title is not used here, because if the page has no title, the browser sets the page location as the title. Reading long and complex URLs is very unpleasant.
Finally, the voice command handler function:
function voice_done(val)
{
  if (!val || !val.interpretation)
    return 'event-nomatch';

  var si = val.interpretation;

  if (si.action == 'load-page')
  {
    document.location = si.page;
    return '';
  } else if (si.action == 'prompt-element' && si.src)
    return si;
  else
    return 'event-nomatch';
}
As previously mentioned, the voice_done() function is called when the user says something that is recognized. The val argument holds the user utterance along with several extra properties; among the most important are utterance, confidence, inputmode, and interpretation.
For this tutorial we will use the convention that the interpretation object should always have the action property. We can later add properties specific for each type of action.
- action = 'load-page' means go to page. From the grammar semantic interpretation we also get the page property. The above script sets document.location = si.page; which is exactly what we want: to load another page. The return value is an empty string, because we want the VoiceXML form to finish without doing anything else.
- action = 'prompt-element' tells the script to read a part of the document using the element ID. Here we also have a new src property providing the ID of the element we want. As you can see, we return the SI object to the form. Remember that we wanted the form to do the voice-related work: the script itself cannot do any page reading, therefore we return to the VoiceXML form.

That's all! The script is not a big deal.
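The two branches above can be exercised with hand-made input. Below is a rough, standalone walkthrough: the mock document object and the sample result objects are mine, standing in for the browser and the speech engine:

```javascript
// Standalone walkthrough of the voice_done() logic with mock input.
// 'document' here is a mock standing in for the browser.
var document = { location: '' };

function voice_done(val)
{
  if (!val || !val.interpretation)
    return 'event-nomatch';
  var si = val.interpretation;
  if (si.action == 'load-page') {
    document.location = si.page;   // navigate; the form stops here
    return '';
  } else if (si.action == 'prompt-element' && si.src)
    return si;                     // hand the SI object back to the form
  else
    return 'event-nomatch';
}

// A recognized navigation command returns '' and sets the location:
voice_done({ interpretation: { action: 'load-page', page: 'site/offers' } });
console.log(document.location); // 'site/offers'

// A recognized read command returns the SI object itself:
console.log(voice_done({ interpretation: { action: 'prompt-element', src: '#nav' } }).src); // '#nav'

// Anything else simulates nomatch:
console.log(voice_done(null)); // 'event-nomatch'
```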
We will add a small speech style sheet for the text to be spoken.
head form {
display: none;
}
h1, h2, h3, h4, h5, h6 {
voice-family: female;
pause: 2s;
}
div, tr {
pause: 1s;
}
li, th, td {
pause: 500ms;
}
#content {
pause-before: 2s;
}
#menus1 a:before, #menus2:after, #menus2 li:before {
speak: none;
}
Nothing too fancy here. We have added some pauses, made sure the headlines are spoken in a female voice, and avoided speaking the menu numbers. The form used in the examples is designed to have no textual content at all, with all its spoken contents retrieved from the script file, or the page itself. However, it may be given default styles by the browser, and at some point you may wish to put textual content inside it. The rule head form {display: none} ensures the voice form content will not display in CSS-supporting browsers.
You need to add the CSS and the script in the <head> section, change the DTD from XHTML to XHTML+Voice, add the XML namespaces, and add an XML Event to the <body> tag:
<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
"http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:ss="http://www.w3.org/2001/10/synthesis"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
<head>
...
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="example2.css" />
<script type="text/javascript" src="example2.js"></script>
<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
...
</form>
...
</head>
<body ev:event="load" ev:handler="#readpage">
...
</body>
</html>
The XML Event activates the VoiceXML form on page load.
Normal execution goes like this: the page loads, the XML Event activates the #readpage form, the <block> reads the page title, then the <field> waits for a voice command; when the grammar matches an utterance, the <filled> section calls voice_done(), which either loads another page or makes the form read the requested element.
To summarize the changes made to the initial code: the DTD was changed from HTML 4.01 to XHTML+Voice 1.2, the XML namespaces were added, the VoiceXML form, the script, and the speech style sheet were added to the <head> section, and the XML Event was added to the <body> tag.
The code presented here is extensible. For a local experiment I have added page-specific JavaScript actions, page-specific grammars, and page-specific VoiceXML forms. The example above is the "parent" of that voice-enabled site. This is reusable code you can copy/paste into your site, updating the grammar accordingly.
You are now able to add voice to an entire site: easily, quickly, in a reusable manner, and while maintaining compatibility.
You can try the example site. For the front page there is a new action for loading the Nth news page. There is also a voice command for saying the access key, like "press key 2". Take a look at the JavaScript and the grammar file for the front page.
Also, try the administration module which allows you to add new pages and news items in the site. Grammars are dynamically generated by the server-side scripts.
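The tutorial's server-side code is not shown, but generating an ABNF rule is plain string building. A rough sketch, with an invented buildPagesRule() helper and an illustrative page list:

```javascript
// Sketch: generate the $pages alternatives of an ABNF grammar from a
// page list, as a server-side script might. The function name and the
// page list are mine; the tutorial's actual server code is not shown.
function buildPagesRule(pages)
{
  var alts = [];
  for (var name in pages)
    alts.push(name + ' {$ = "' + pages[name] + '";}');
  return '$pages = $intropage (\n  ' + alts.join(' |\n  ') + '\n) [page];';
}

var rule = buildPagesRule({
  offers  : 'site/offers',
  support : 'site/support/dialup',
  contact : 'site/contact'
});

console.log(rule);
```

Adding a page to the list regenerates the rule, so the grammar always matches the site's current structure.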
Voice has been available in normal Opera releases since the technical previews of version 7.6. The early implementation was buggy, but now things are much more stable.
One of the biggest issues is the missing support for the VoiceXML DOM: you cannot modify the VoiceXML forms from JavaScript. The example script works around this by storing the strings in JavaScript. The actual form logic remains the same and cannot be changed by JavaScript, but the strings are no longer hard-coded.
Another, marginal, issue, not related to voice alone, is that you cannot have an ID attribute set on the <title> element. You can use it, but the document will not validate. This is because XHTML is an XML-based reformulation of HTML, which itself did not allow an ID there. Because of this, you cannot have your VoiceXML prompts read the page title directly - you have to work around the limitation using JavaScript.
Lastly, you cannot put the VoiceXML messages in the page itself, because legacy browsers with no CSS support will show them. The workaround, which can actually provide advantages as previously explained, is to use JavaScript.
There is no specific tool for editing X+V; however, there is no great need for one. Any (X)HTML/XML editor is more than satisfactory. For Linux or UNIX you may want to use Quanta+ (this tutorial was made with it); for Windows you could try an editor such as UltraEdit; for Mac, BBEdit is an alternative.
If you are a web developer not using Windows, you can still work with Opera and Voice. You have two choices: WINE may be available for your operating system, or you can use a virtual machine such as VMware Player or Virtual PC. I personally use WINE for most of the work, and at the end I just check the results in VMware, because WINE is noticeably faster and more convenient to use.
Opera supports all of X+V. Note that this XHTML profile does not include support for all of VoiceXML 2.0; specifically, telephony-related features are excluded.
Opera supports all of SRGS 1.0. This means grammars in both forms: ABNF and XML. As expected, there's no support for the DTMF mode - only voice is supported.
One rather important bug is that the $NULL special rule cannot be used for now, since it crashes the browser.
Opera only supports Semantic Interpretation script tags, with no support for string literals. Also, the current implementation is slightly outdated, being based on an older working draft. The most prominent difference caused by this: in Opera you need to use $ instead of out.
This grammar format is also fully supported. However, the <NULL> special rule also crashes Opera.
All CSS 3 Speech properties that are not also available in Aural CSS 2.1 have to be prefixed with -xv-, since the specification is only a working draft - it is not yet a candidate recommendation.
The current Opera implementation uses an older draft of CSS 3 Speech; this means that the rest-* and mark-* properties are not supported.
Aural CSS 2.1 properties also in CSS 3 Speech and supported by Opera are: cue-*, pause-*, speak and voice-family.
One of the most important implementation surprises is that Opera does not yet support generated content in CSS for the speech media.
An interpret-as property would be very much needed. This would allow web developers to tell the speech engine how to read the text: as a date, as a time, and so on.
The current implementation does not allow web developers to apply their CSS styling to VoiceXML forms. One must use SSML for this purpose.
Opera provides support for voice, emphasis, and break.
Support for say-as would be greatly needed (see the interpret-as discussion above).
Opera provides complete support for XML Events.
Specifications:
Tutorials, and documentation:
Note: The documentation found at some of these links is valuable; however, each company has its own extensions, and some of them are not marked as such. Great care should be taken.
Where to ask for help: