Tutorial: How to add voice interactivity to your site, quickly, and maintaining backwards compatibility

Author: Mihai Sucan (ROBO Design)

Introduction
Prerequisites
Converting to XHTML
Your XHTML+Voice document
Final words
Implementation notes
Resources

Introduction

This tutorial aims to help you add voice interactivity to your site, with minimal code changes and maximal browser compatibility.

Along the way, examples will be provided, and at the end, you will be able to test a fully working, real-world, voice-enabled site. This tutorial describes the use of a reusable VoiceXML form.

Because the voice capability is included in the browser, you do not need to write your own speech recognition engine or speech synthesizer. This is a great advantage to you and to your Web application users:

You do not need to learn C or C++, or any other language in which you could develop your own automatic speech recognition (ASR) and text-to-speech (TTS) engines.
If each site provided its own implementation, security would become a concern, since such implementations need greater access to the visitor's system (plugins, ActiveX, extensions, etc.) than normal Web pages have.
Having the user agent implement all of this is convenient for users, who only have to adapt to the behaviour of one browser.
Last, but not least, site-specific engines would be very inconvenient for users, who would hear synthesized voices of varying quality and face varying degrees of speech recognition capability.

Prerequisites

You should be familiar with the following:

Adding voice interactivity to your site can be done with any document, but it is recommended that you use clean, semantic markup along with a CSS layout.

It is recommended that you read the existing tutorials from Opera Software: Authoring XHTML+Voice. These are very useful introductory articles.

Converting to XHTML

To make Voice work you need to convert your existing HTML documents to well-formed XHTML. Because of this, your server also has to be configured to send the documents with the application/xhtml+xml MIME type (or any other XML MIME type). Doing so is actually easy and not as great a problem as it was years ago.

Do not forget to remove the XML declaration (the <?xml ... ?> line, often called the XML prolog), if you have it. It is known to break the rendering in Internet Explorer 6.
Send the document as application/xhtml+xml only if the HTTP Accept header sent by the browser contains this MIME type; otherwise use text/html. Sending application/xhtml+xml unconditionally breaks legacy browsers.
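
The Accept header check described above could be sketched as follows. This is only an illustration, written in JavaScript to match the rest of the tutorial: chooseContentType is a hypothetical helper name, and how you read the request header and set the response header depends on your server environment, which is not shown here.

```javascript
// Hypothetical helper: picks the MIME type to send, based on the value
// of the browser's HTTP Accept header (which may be missing).
function chooseContentType(acceptHeader) {
  var xhtml = 'application/xhtml+xml';
  var ranges = String(acceptHeader || '').toLowerCase().split(',');

  for (var i = 0; i < ranges.length; i++) {
    var parts = ranges[i].split(';');
    if (parts[0].trim() !== xhtml)
      continue;

    // An explicit "q=0" means the type is listed as NOT acceptable.
    var acceptable = true;
    for (var j = 1; j < parts.length; j++) {
      var param = parts[j].trim().split('=');
      if (param[0].trim() === 'q' && parseFloat(param[1]) === 0)
        acceptable = false;
    }

    if (acceptable)
      return xhtml;
  }

  // The type was not listed (or there was no header): stay compatible.
  return 'text/html';
}
```

The returned string is what you would place in the Content-Type response header. The sketch deliberately ignores wildcards such as */*, so only browsers that explicitly list application/xhtml+xml receive it.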

Your XHTML+Voice document

Let us start from a simple document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <title>PRO-Net - Example 1</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <link rel="stylesheet" type="text/css" href="example1.css">
</head>
<body>
  <p id="skipnav"><a href="#content">Skip the navigation.</a></p>
  <h1 id="header"><a href="site/">PRO-Net</a></h1>
  <div id="nav">
    <ul id="menus1">
      <li><a href="site/offers">Offers</a></li>
      <li><a href="site/support/dialup">Support</a></li>
      <li><a href="site/contact">Contact</a></li>
    </ul>
    <ul id="menus2">
      <li><a href="site/support/dialup">Dialup</a></li>
      <li><a href="site/support/email">Email configuration</a></li>
      <li><a href="site/support/connecting">Connecting to PRO-net</a></li>
    </ul>
  </div>
  <h1 id="pagetitle">Example 1</h1>
  <div id="content">
    <p>Example page.</p>
  </div>
  <div id="footer">
    <a id="backtotop" href="#header">Back to top</a>
    <p>Tutorial example</p>
  </div>
</body>
</html>

View example.

The form

Now let us add the simple VoiceXML form into the <head> section:

<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
  <block>
    <value expr="voice_ptitle()" />
  </block>
</form>

The <form> is the VoiceXML form container. Currently we only have a <block> which reads the page title. We use the expr attribute because the page title is returned by a JavaScript function.

Note: It is tempting to make the form automatically read the content of the page, on page load. In practice, this is not as good as it seems. Reading only the title is enough. A voice command for reading the content of the page will be provided.

We will discuss the JavaScript in the following section.

Here we shall add the voice command input <field> in <form>:

<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
  <!-- ... snippet ... -->
  <field name="usrcmd">
    <grammar type="application/srgs" src="example2.gram" />

    <prompt timeout="10s">
      <ss:break time="5s" />
      <value expr="document.voice_msg['prompt']" />
    </prompt>

    <prompt count="2" timeout="600s">
      <ss:break time="300s" />
      <value expr="document.voice_msg['prompt']" />
    </prompt>

    <catch event="help nomatch noinput">
      <value expr="document.voice_msg[_event]" />
      <reprompt />
    </catch>
  </field>
</form>

The <grammar /> tag loads an external grammar file. With grammars you control the speech recognition engine, giving it the possible commands the user can say. Further details are provided in the following chapter.

With prompts you can have texts synthesized. In this case the prompt reads "Please input your command" which is provided in the JavaScript (see "The script" chapter).

We have one prompt which is heard after 15 seconds, and the subsequent prompts are played at much longer intervals. We do not want to annoy visitors by repeating the same message too often.

We use the catch event for playing the messages appropriate for each event, and we have to use <reprompt /> to avoid playing the noinput event message forever.

The SSML breaks are used for avoiding the automatic reading of the <prompt> message when executing <reprompt /> in the event catcher.

Note: All messages are stored in JavaScript variables, because if we put them inside the VoiceXML form, they would appear in legacy browsers with no CSS support.

The final part and the most important one is the <filled> section which gets executed when the user says something that has been recognized:

<form xmlns="http://www.w3.org/2001/vxml" id="readpage">
  ...
  <field name="usrcmd">
    ...
    <filled>
      <assign name="voice_result"
              expr="voice_done(application.lastresult$);" />

      <if cond="voice_result == 'event-nomatch'">
        <clear namelist="usrcmd" />
        <throw event="nomatch" />

      <elseif cond="voice_result.action == 'prompt-element'" />
        <prompt xv:expr="voice_result.src" />
        <clear namelist="usrcmd" />

      <elseif cond="voice_result.action == 'prompt-value'" />
        <value expr="voice_result.message" />
        <clear namelist="usrcmd" />
      </if>
    </filled>
  </field>
</form>

The voice_done() function processes the interpretation of what the user said, which is stored in application.lastresult$.

The returned value is assigned to voice_result because this way we can "communicate" from the JavaScript function with the current VoiceXML form.

In the above code we have 3 "actions":

voice_result == 'event-nomatch' for simulating the nomatch event. Your script could determine that something does not exist (e.g. the user asks for news item 10 when there are fewer items).
voice_result.action == 'prompt-element' for reading a part from the document, using the #element-id.
voice_result.action == 'prompt-value' for reading a message. The message can be programmatically generated by the script, so it can say anything you need.

The grammar

The grammar tells what user utterances will be matched.

#ABNF 1.0 utf-8;

language en;
mode voice;

root $command;
tag-format <semantics/1.0>;

public $intropage = (go to | visit | jump to | load) [the];

public $introspeak = speak | read | narrate | talk;

$pages = $intropage (
  (start | home | first | front) {$ = "site/";} | 
  offers {$ = "site/offers";} | 
  support {$ = "site/support/dialup";} | 
  contact {$ = "site/contact";}
  ) [page];

$speakers = $introspeak (
  ((header | navigation | menus | menu | main) [bar]) {$ = "#nav";} |
  (page | site | content) {$ = "#content";}
  );

public $command =
  $pages {$.action = "load-page"; $.page = $$;} |
  $speakers {$.action = "prompt-element"; $.src = $$;};

We have the root rule $command with two "commands": one for speaking something, and the other for loading another page.

These are represented by the $speakers and $pages rules respectively, which in turn call on other rules. In this grammar, utterances like "visit the support page" or "read menus" will match, but "go to bed" will not.

You can modify the grammar to your liking and according to user testing. You might, for instance, discover that some users say "read the menus" (which will not match) instead of "read menus" (which will). To accommodate them you can change the rule to be "$introspeak = (speak | read | narrate | talk) [the];". Or if you later want to extend the pages list, you can modify the $pages rule.

Note: The content inside the curly braces is referred to as semantic interpretation. Here we create the JavaScript objects we are going to use in the following section.

This approach simplifies processing: you do not have to care about what the user actually said ("speak menus" or "read navigation bar" will both read out the available navigation options). You can tailor the grammar to map accepted user input to the program logic.
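
To make the semantic interpretation more concrete, here is a sketch of the objects the grammar above should produce for a few utterances, as worked out by hand from the rules. The expectedInterpretations table is purely illustrative and is not part of the example site.

```javascript
// Illustration only: the interpretation object expected for each utterance,
// derived by reading the $pages, $speakers and $command rules.
var expectedInterpretations = {
  'visit the support page': { action: 'load-page', page: 'site/support/dialup' },
  'go to home':             { action: 'load-page', page: 'site/' },
  'read menus':             { action: 'prompt-element', src: '#nav' },
  'speak content':          { action: 'prompt-element', src: '#content' }
};
```

Note that different utterances collapse into the same kind of object, which is why the script only needs to look at the action property and never at the words themselves.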

The script

Let us start with the messages:

if(document.addEventListener)
  document.addEventListener('load', function ()
{
  document.voice_msg = {
    'help'    : 'You can say: speak page, speak navigation, speak content.',
    'nomatch' : 'Try again.',
    'noinput' : 'If you need help, ask for help.',
    'prompt'  : 'Please input your command.',
    'notitle' : 'Untitled document'
    };
}, false);

You may wonder why I code these messages in JavaScript instead of using the native VoiceXML elements for the same events (like <help> and <nomatch>). Any VoiceXML browser must support JavaScript, so speech engine compatibility is not affected. However, there are also a few benefits: legacy browsers will not end up displaying the VoiceXML contents on the page, and you can have dynamic messages based on any conditions you want.
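
As an illustration of such dynamic messages, the help text could be assembled from a list of commands instead of being hard-coded. This is a minimal sketch; voice_helpMessage and the idea of keeping the commands in an array are assumptions, not part of the example files.

```javascript
// Hypothetical helper: builds the help message from the commands
// that are available on the current page.
function voice_helpMessage(commands) {
  if (!commands || !commands.length)
    return 'No voice commands are available on this page.';

  return 'You can say: ' + commands.join(', ') + '.';
}

// Produces the same help text as the hard-coded message above.
var voice_help = voice_helpMessage(
  ['speak page', 'speak navigation', 'speak content']);
```

A page with extra commands would then only have to extend the array before the messages object is built.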

Let us continue with the first function:

function voice_ptitle()
{
  var elem = document.getElementsByTagName('title')[0];
  if(!elem || !elem.firstChild)
    return document.voice_msg.notitle;
  else
    return elem.firstChild.data;
}

We use this method as a workaround for the limitation in the (X)HTML standard that does not allow an ID attribute on the <title> element. Thus we cannot use <prompt xv:src="#element-id" /> and still have a valid document.

document.title is not used here, because if the page has no title, the browser sets the page location as the title. Reading long and complex URLs is very unpleasant.

Finally, the voice command handler function:

function voice_done(val)
{
  if(!val || !val.interpretation)
    return 'event-nomatch';

  var si = val.interpretation;

  if(si.action == 'load-page')
  {
    document.location = si.page;
    return '';
  } else if(si.action == 'prompt-element' && si.src)
    return si;
  else
    return 'event-nomatch';
}

As previously mentioned, the voice_done() function is called when the user says something that is recognized. The val argument is the recognition result, an object with several properties. Among the most important ones are:

confidence is the confidence level of the recognized utterance (voice command). This can prove very useful for checking how sure the speech recognition engine is of what it recognized. On a banking site, for example, a web application could ask the user to repeat the answer if the confidence level is below 0.8. The range of the value is between 0.0 (no confidence) and 1.0 (full confidence).
utterance is a string holding the tokens (words) recognized by the speech recognition engine, for example "jump to contact".
interpretation is an object holding the result of semantic interpretation from the grammar.
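
A confidence check like the one in the banking example could look roughly like this. It is a sketch under stated assumptions: voice_isConfident and VOICE_MIN_CONFIDENCE are hypothetical names, and 0.8 is simply the figure used in the example above, not a recommended value.

```javascript
// Assumed threshold, borrowed from the banking example above.
var VOICE_MIN_CONFIDENCE = 0.8;

// Hypothetical helper: true when the result is safe to act upon.
function voice_isConfident(val, threshold) {
  if (threshold === undefined)
    threshold = VOICE_MIN_CONFIDENCE;

  // A missing result or a non-numeric confidence counts as "not sure".
  if (!val || typeof val.confidence != 'number')
    return false;

  return val.confidence >= threshold;
}
```

Inside voice_done() you could then return 'event-nomatch' when voice_isConfident(val) is false, so that the form asks the user to try again.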

For this tutorial we will use the convention that the interpretation object should always have the action property. We can later add properties specific for each type of action.

action = 'load-page' means go to page. From the grammar semantic interpretation we also include the page property. The above script sets document.location = si.page; which is exactly what we want: to load another page. The return value is an empty string, because we want the VoiceXML form to finish without doing anything else.
action = 'prompt-element' tells the script to read a part of the document using the element ID. Here we also have a new src property providing the ID of the element we want. As you can see, we return the SI object to the form. Remember that we wanted to be able to do voice-related things. The script itself cannot do any page reading, therefore we hand control back to the VoiceXML form.
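
The same convention extends to the 'prompt-value' action that the VoiceXML form already handles. The sketch below assumes a hypothetical 'read-news' action with a 1-based index property; neither this action nor the voice_news array exists in the example grammar or script shown in this tutorial.

```javascript
// Hypothetical data: titles of the news items available on the page.
var voice_news = ['Site launched', 'New dialup numbers'];

// Hypothetical handler for an assumed 'read-news' action.
// si is the semantic interpretation object, news is the list of titles.
function voice_readNews(si, news) {
  var title = news[si.index - 1];

  // The requested item does not exist: simulate the nomatch event.
  if (typeof title != 'string')
    return 'event-nomatch';

  // Let the VoiceXML form read out a generated message.
  return {
    action  : 'prompt-value',
    message : 'News item ' + si.index + ': ' + title
  };
}
```

To use it, voice_done() would need one more branch that calls voice_readNews(si, voice_news) when si.action == 'read-news', and the grammar would need a matching rule that sets the action and index properties.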

That's all! The script is not a big deal.

The style

We will add a small speech style sheet for the text to be spoken.

head form {
  display: none;
}

h1, h2, h3, h4, h5, h6 {
  voice-family: female;
  pause: 2s;
}

div, tr {
  pause: 1s;
}

li, th, td {
  pause: 500ms;
}

#content {
  pause-before: 2s;
}

#menus1 a:before, #menus2:after, #menus2 li:before {
  speak: none;
}

Nothing too fancy here. We have added some pauses, made sure the headlines are spoken in a female voice, and avoided speaking the menu numbers. The form used in the examples is designed to have no textual content at all, with all its spoken contents retrieved from the script file, or the page itself. However, it may be given default styles by the browser, and at some point you may wish to put textual content inside it. The rule head form {display: none} ensures the voice form content will not display in CSS-supporting browsers.

Putting it all together

You need to add the CSS and the script in the <head> section, change the DTD from XHTML to XHTML+Voice, add the XML namespaces, and add an XML Event to the <body> tag:

<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
  "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml"
  xmlns:ev="http://www.w3.org/2001/xml-events"
  xmlns:ss="http://www.w3.org/2001/10/synthesis"
  xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    ...
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <link rel="stylesheet" type="text/css" href="example2.css" />
    <script type="text/javascript" src="example2.js"></script>
    <form xmlns="http://www.w3.org/2001/vxml" id="readpage">
      ...
    </form>
    ...
  </head>
  <body ev:event="load" ev:handler="#readpage">
    ...
  </body>
</html>

The XML Event activates the VoiceXML form on page load.

Test the example.

Normal execution goes like this:

The user loads the page.
On load, the page title is read.
The VoiceXML form waits for a voice command.
If the user asks for help, the help message is played.
If the user says nothing at all in the first seconds/minutes, the noinput message is played.
If the user says something, but it's not matched by the grammar, the nomatch message is played.
On success, the resulting utterance and semantic interpretation are sent to a JavaScript function.
Based upon the result, the JavaScript function can do whatever you want. The reason for sending the result to a JavaScript function is simple: you cannot do everything you want from a VoiceXML form.
If there is something you want to do but cannot do from JavaScript, just return an object or a value. From VoiceXML, with some procedural logic elements, you can do whatever you want in relation to voice interactivity.

A summary of the changes made to the initial code:

Converted the HTML document to XHTML.
Switched to the X+V DTD.
Added the needed XML namespaces for XML Events, SSML and X+V.
Added the VoiceXML form with its associated grammar file.
Added a script and a CSS.
Added a single XML event for activating the VoiceXML form on page load.

The code presented here is extensible. For a local experiment I have added page-specific JavaScript actions, page-specific grammars, and page-specific VoiceXML forms. The example above is the "parent" of that voice-enabled site. This is reusable code you can copy and paste into your site, updating the grammar accordingly.

Final words

You are now able to add Voice to an entire site: easily, fast, in a reusable manner, and maintaining compatibility.

You can try the example site. For the front page there is a new action for loading the Nth news page. There is also a voice command for saying the access key, like "press key 2". Take a look at the JavaScript and the grammar file, for the front page.

Also, try the administration module which allows you to add new pages and news items in the site. Grammars are dynamically generated by the server-side scripts.
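
Generating a grammar rule from a list of pages could be sketched as follows. This is only an illustration of the idea, written in JavaScript for consistency with the rest of the tutorial; the server-side code of the example site is not shown here, and buildPagesRule together with the shape of its input are assumptions. The sketch also assumes the words and URLs are trusted and contain no quotes or ABNF special characters.

```javascript
// Hypothetical helper: builds the $pages rule of the ABNF grammar.
// pages is an array of objects like { words: ['offers'], url: 'site/offers' }.
function buildPagesRule(pages) {
  var alternatives = [];

  for (var i = 0; i < pages.length; i++) {
    var words = pages[i].words.join(' | ');

    // Several spoken alternatives for one page need grouping.
    if (pages[i].words.length > 1)
      words = '(' + words + ')';

    alternatives.push('  ' + words + ' {$ = "' + pages[i].url + '";}');
  }

  return '$pages = $intropage (\n' +
    alternatives.join(' |\n') +
    '\n  ) [page];';
}
```

Called with the pages of the tutorial example, this produces a rule equivalent to the hand-written $pages rule shown earlier, so adding a page in the administration module would only require regenerating the grammar file.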

Implementation notes

Voice has been available in normal Opera releases since the technical previews of version 7.6. The early implementation was buggy, but now things are much more stable.

One of the biggest issues is the missing support for the VoiceXML DOM: you cannot modify the VoiceXML forms from JavaScript. The example script works around this by storing the strings in JavaScript. The actual form logic remains the same and cannot be changed by JavaScript, but the strings are no longer hard-coded.

Another marginal issue, not related to Voice alone, is you cannot have an ID attribute set for the <title> element. You can use it, but the document will not validate. This is because XHTML is an XML-based reformulation of HTML which itself did not allow having an ID. Because of this, you cannot have your VoiceXML prompts directly read the page title - you have to work around the limitation using JavaScript.

Lastly, you cannot put VoiceXML messages in the page itself, because legacy browsers with no CSS support will show them. The workaround, which can actually provide advantages as previously explained, is to use JavaScript.

Authoring X+V

There is no specific tool for editing X+V, but there is no great need for one. Any (X)HTML/XML editor is more than satisfactory. For Linux or UNIX, you may want to use Quanta+ (this tutorial was made with it). For Windows you could try an editor such as UltraEdit. For Mac, BBEdit is an alternative.

If you are a web developer not using Windows, you can still work with Opera and Voice. You have two choices: WINE may be available for your operating system, or you can use a virtual computer such as VMWare Player or VirtualPC. I personally use WINE for most of the work, and at the end I just check the results in VMWare. This is because WINE is noticeably faster and more convenient to use.

XHTML+Voice 1.2

Opera supports all of X+V. Note that this XHTML profile does not include support for all of VoiceXML 2.0. Specifically, telephony-related features are excluded.

Speech Grammar

Opera supports all of SRGS 1.0. This means grammars in both forms: ABNF and XML. As expected, there's no support for the DTMF mode - only voice is supported.

One rather important bug is that the $NULL special rule cannot be used for now, since it crashes the browser.

Semantic Interpretation

Opera only supports Semantic Interpretation Script Tags; there is no support for String Literals. Also, the current implementation is slightly outdated, being based on an older working draft. One of the most prominent differences caused by this: in Opera you need to use $ instead of out.

Java Grammar Format

This grammar format is also fully supported. However, the <NULL> special rule also crashes Opera.

CSS 3 Speech module and Aural CSS 2.1

All CSS 3 Speech properties that are not also available in the Aural CSS 2.1 have to be prefixed with -xv-, since this specification is only a working draft - it's not yet a candidate recommendation.

The current Opera implementation uses an older draft of CSS 3 Speech, which means that the rest-* and mark-* properties are not supported.

Aural CSS 2.1 properties also in CSS 3 Speech and supported by Opera are: cue-*, pause-*, speak and voice-family.

One of the most important implementation surprises is that Opera does not yet support generated content in CSS for the speech media.

An interpret-as property would be very much needed. This would allow web developers to tell the speech engine how to read the text, as a date, as time, etc.

The current implementation does not allow web developers to apply their CSS styling to VoiceXML forms. One must use SSML for this purpose.

Speech Synthesis Markup Language (SSML)

Opera provides support for voice, emphasis, and break.

Support for say-as is greatly needed (see the interpret-as discussion above).

XML Events

Opera provides complete support for XML Events.

Resources

Specifications:

XHTML+Voice 1.2 specification
VoiceXML 2.0 specification
Speech Recognition Grammar Specification
Semantic Interpretation for Speech Recognition specification
Java Speech Grammar Format (also supported by the Opera browser)
Aural CSS 2.1
CSS 3 speech module
Speech Synthesis Markup Language
XML Events

Tutorials, and documentation:

Note: The documentation found on some of these links is valuable, however, each company/corporation has its own extensions. Some of them are not marked as such. Great care should be taken.

Where to ask for help:

Accessibility and voice browsing forum from my.opera.com community
IBM's Multimodal Tools Newsgroup. This is an appropriate newsgroup for posting questions related to Opera, since the voice libraries in Opera are provided by IBM.
Yahoo VoiceXML group
irc.opera.com/voice