Tutorial: How to make a Voice controlled SVG game

Introduction
Adding Voice interaction
The result
Implementation notes

Introduction

This tutorial shows how to add Voice control to SVG, using the example of a game.

The game I have built displays several objects on screen, and the aim is to make all the objects have the same shape and color. You do this by issuing voice commands to the page, such as "change the blue squares to red triangles" or "switch the yellow shapes to circles". The first player to do so wins, and the game ends. The game lets you change various parameters, such as the number of objects and the number of players.

For testing purposes during development, the game included a graphical user interface that lets you change the game options while a game is in progress. You can enable it by editing the relevant script: look for the showFormControls variable inside the JavaScript file.

The SVG game works in Firefox 1.5+ and Opera 9+. To control the game using your voice you need to use Opera for Windows.

Adding Voice interaction

The entire voice interaction logic is contained inside the VoiceXML form found inside the the-game.xhtml file, and the grammar file used for voice recognition - the-game.gram. Depending on what is said to the application, application-specific JavaScript functions contained within the JavaScript file (the-game.js) will be called by the VoiceXML form to control the game.

You can add VoiceXML forms in XHTML documents, or directly in standalone SVG documents. In both cases, you have to add the required XML namespaces. For XHTML documents it's recommended that you also switch to the XHTML+Voice DTD.

To better understand some of the basics behind this tutorial, read some of the previously published tutorials about Voice on dev.opera.com.

You can download all the example code and other assets required to run this game here.

The VoiceXML form

The main page for the game is an XHTML document containing an inline SVG image and the VoiceXML form. The place where you put the VoiceXML form does not matter much. However, common practice is to keep your VoiceXML form inside the <head>.

The code:

<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN"
  "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml"
  xmlns:ev="http://www.w3.org/2001/xml-events"
  xmlns:ss="http://www.w3.org/2001/10/synthesis"
  xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
	...
	<form xmlns="http://www.w3.org/2001/vxml" id="vmain">

	  <block>
		<prompt xv:src="#ttl" />
	  </block>

	  <field name="vcmd">
		<grammar type="application/srgs" src="the-game.gram" />

		<prompt timeout="10s">
		  <ss:break time="5s" />
		  <value expr="RD_game.voiceMessage('prompt');" />
		</prompt>

		<prompt count="2" timeout="500s">
		  <ss:break time="200s" />
		  <value expr="RD_game.voiceMessage('prompt');" />
		</prompt>

		<catch event="help nomatch noinput">
		  <value expr="RD_game.voiceMessage(_event);" />
		  <reprompt />
		</catch>

		<filled>
		  <assign name="vres" expr="RD_game.voiceCommand(application.lastresult$);" />
		  <if cond="vres == 'event-nomatch'">
			<clear namelist="vcmd" />
			<throw event="nomatch" />
		  <elseif cond="vres.action == 'prompt-value'" />
			<value expr="vres.message" />
			<clear namelist="vcmd" />
		  <elseif cond="vres == 'clear-cmd'" />
			<clear namelist="vcmd" />
		  </if>
		</filled>

	  </field>
	</form>
  </head>
  <body ev:event="load" ev:handler="#vmain">
  ...
  </body>
</html>

The VoiceXML form is activated by the XML Events listener attached to the <body> element once the document loads. The form first reads out the content of the element with the ID ttl. The prompt of the vcmd field then synthesizes the message returned by the voiceMessage function. This allows the game to announce the name of the current player on each turn.

Once the user inputs a recognized voice command, the resulting object is then passed to the voiceCommand function, which uses the semantic interpretation of the user's voice input to determine what to do next. The function checks if the move requested by the user is permitted given the current game state. Depending on the results of the checks, the move will be made, or a message will be returned telling the user to try again. All the messages are synthesized by the VoiceXML form.
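The branches of the <filled> block in the form listing can be read as a small decision table. The following is a minimal JavaScript sketch of that table, written only to make the flow easier to follow. The function name nextFormStep and the shape of the object it returns are hypothetical; in the game this logic lives in the VoiceXML form, not in JavaScript.

```javascript
// Hypothetical helper mirroring the <if>/<elseif> chain of the <filled> block.
// It reports what the form would do for a given voiceCommand() return value.
function nextFormStep (vres)
{
  // 'event-nomatch': clear the field and throw the nomatch event.
  if (vres == 'event-nomatch')
    return {'clear' : true, 'throwEvent' : 'nomatch', 'speak' : null};

  // An object with action 'prompt-value': speak its message, then clear the field.
  if (vres && vres.action == 'prompt-value')
    return {'clear' : true, 'throwEvent' : null, 'speak' : vres.message};

  // 'clear-cmd': just clear the field so the next command can be heard.
  if (vres == 'clear-cmd')
    return {'clear' : true, 'throwEvent' : null, 'speak' : null};

  // Anything else matches none of the branches.
  return {'clear' : false, 'throwEvent' : null, 'speak' : null};
}
```

Clearing the vcmd field is what makes the form visit the field again, so in every handled case the game goes back to listening for the next command.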

The grammar

The code for the grammar file is as follows:

#ABNF 1.0 utf-8;

language en;
mode voice;
tag-format <semantics/1.0>;
root $command;

$src_shapes =
  object   {$ = -1;} |
  circle   {$ =  0;} |
  square   {$ =  1;} |
  triangle {$ =  2;};

$src_colors =
  red    {$ = 0;} |
  yellow {$ = 1;} |
  blue   {$ = 2;};

$dest_shapes = $src_shapes {$ = $$;};
$dest_colors = $src_colors {$ = $$;};

public $command = change [the]
  [$src_colors  {$.src_color  =  $src_colors;}]
  [$src_shapes  {$.src_shape  =  $src_shapes;}] to [a]
  [$dest_colors {$.dest_color = $dest_colors;}]
  [$dest_shapes {$.dest_shape = $dest_shapes;}];

As seen above, the user can issue voice commands like "change the blue squares to red triangles" or "switch the yellow shapes to circles". The result is a plain JavaScript object whose values are the numeric codes assigned in the grammar. For the first command, for example, the object is {'src_color' : 2, 'src_shape' : 1, 'dest_color' : 0, 'dest_shape' : 2}.
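To make the numeric codes easier to read while debugging, they can be mapped back to the words used in the grammar. The sketch below is only an illustration: the lookup tables copy the codes from the grammar listing, and the names SHAPE_NAMES, COLOR_NAMES and describeInterpretation are hypothetical, not part of the game.

```javascript
// Lookup tables mirroring the numeric codes assigned in the-game.gram.
var SHAPE_NAMES = {'-1' : 'object', '0' : 'circle', '1' : 'square', '2' : 'triangle'};
var COLOR_NAMES = {'0' : 'red', '1' : 'yellow', '2' : 'blue'};

// Hypothetical debugging helper: turns an SI object into a readable string.
function describeInterpretation (si)
{
  var parts = [];

  // A property is absent when the user left out that part of the command.
  if (si.src_color !== undefined)
    parts.push('source color: ' + COLOR_NAMES[si.src_color]);
  if (si.src_shape !== undefined)
    parts.push('source shape: ' + SHAPE_NAMES[si.src_shape]);
  if (si.dest_color !== undefined)
    parts.push('target color: ' + COLOR_NAMES[si.dest_color]);
  if (si.dest_shape !== undefined)
    parts.push('target shape: ' + SHAPE_NAMES[si.dest_shape]);

  return parts.join(', ');
}
```

Calling describeInterpretation with the object shown above yields a string that names blue, square, red and triangle in that order.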

Notice that the semantic interpretation (SI) tags are inside the square brackets, so they are optional as well. If the square brackets only enclosed the rule references (e.g. [$src_shapes]), the SI tags would always execute, even when the user said nothing that matched the referenced grammar rule. Executing the tag in that case would also cause a JavaScript error (variable undefined).

Readers who already have some experience with VoiceXML will notice something interesting in this grammar file: $dest_shapes and $dest_colors are the same as $src_shapes and $src_colors respectively. The obvious question is: why not use just two grammar rules, $shapes and $colors? Reusing rules like that is fine in most scenarios. In this case, however, the SI object must contain two distinct values, one for the source and one for the destination, each of them optional. To see why, we will look at two grammar rules that do not work and examine how they behave.

What follows is an example of a grammar rule that does not work as desired:

public $command = change [the]
  [$colors] {$.src_color  = $$;}
  [$shapes] {$.src_shape  = $$;} to [a]
  [$colors] {$.dest_color = $$;}
  [$shapes] {$.dest_shape = $$;};

The above will not work for voice commands like "change blue to yellow", because using $$ (the latest matching rule) in the SI tag will result in an SI object with $.src_shape = $.src_color = blue and $.dest_shape = $.dest_color = yellow.
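The effect can be reproduced with a toy model in JavaScript. This is not a real semantic interpretation processor, only a sketch of the behavior described above, under the assumption that every tag runs and that $$ always holds the value of the latest rule that matched. The function name simulateSharedRules and its input format are hypothetical.

```javascript
// Toy model of the rule above. "matches" lists, in order, what each of the
// four optional rule references matched, or null when it matched nothing.
function simulateSharedRules (matches)
{
  var slots = ['src_color', 'src_shape', 'dest_color', 'dest_shape'];
  var latest;      // stands in for $$
  var result = {};

  for (var i = 0; i < slots.length; i++)
  {
    if (matches[i] !== null)
      latest = matches[i];

    // The tag sits outside the brackets, so it runs even without a match.
    result[slots[i]] = latest;
  }

  return result;
}

// "change blue to yellow": only the two color references match.
var si = simulateSharedRules(['blue', null, 'yellow', null]);
```

In this model si.src_shape ends up as 'blue' and si.dest_shape as 'yellow', which is the unwanted duplication.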

Another example that does not work is the following:

public $command = change [the]
  [$colors {$.src_color  = $colors;}]
  [$shapes {$.src_shape  = $shapes;}] to [a]
  [$colors {$.dest_color = $colors;}]
  [$shapes {$.dest_shape = $shapes;}];

Here the problem is similar - if the voice command "change the triangle to yellow" is issued, $.dest_shape will be equal to $.src_shape, because $shapes contains whatever it matched during voice recognition (irrespective of its position in the voice command).

Therefore, we need a unique grammar rule for each reference to a color or shape within the root rule $command. This is required in order to generate a correct semantic interpretation object, without duplicated values or other errors.

The JavaScript code

The entire code of the game is contained within a single object: RD_game. In the first part of the code we have several configuration options, followed by the strings used in the game. Having all the strings in a single place allows us to quickly make additions, such as translating the game to any other language.

var RD_game = new (function (no_autoinit)
{
  var _me = this;

  // ... configuration ...

  _me.messages = {
	// ...
	// Voice-related strings
	'voice-prompt' : 'Player %nr% (%name%) make your move!',
	'voice-help' : 'Just say the objects you want to change, and what you want to do with them. For example: change red circles to blue squares.',
	'voice-nomatch' : 'I did not understand what you want.',
	'voice-noinput' : 'If you do not know what to do, ask for help.'};

What follows is an outline of the public methods and private functions defined in the game object:

  // public methods
  _me.init = function () {  ...  };
  _me.restartGame = function () {  ...  };
  _me.updateDisplay = function () {  ...  };
  _me.makeMove = function (src, dest, who) {  ...  };

  // private functions
  function calculateMinVar () {  ...  };
  function calculateScore () {  ...  };
  function renderBonusCombos () {  ...  };
  function syncSVGnHUD () {  ...  };
  function renderPrevMoves () {  ...  };
  function calculateVariations () {  ...  };
  function findSlots (query) {  ...  };
  function onChange_playerName () {  ...  };
  function onChange_nslots () {  ...  };
  function onChange_cols () {  ...  };
  function repositionSvg () {  ...  };
  function drawSlot (row, col, props) {  ...  };
  function getMsg (id, vars) {  ...  };
  function remChilds (elem, skip, clean) {  ...  };
  function arrToStr (arr) {  ...  };
  function formCommand () {  ...  };

The voiceMessage function is called from the VoiceXML form. The purpose is straightforward: it just returns the string to be synthesized. For the voice-prompt string we dynamically update the player number and name.

  _me.voiceMessage = function (type)
  {
	if(!type)
	  return _me.messages['internal-error'] || 'Internal error';

	type = 'voice-' + type;

	var msg = '';
	if(type == 'voice-prompt')
	  msg = getMsg(type, {'nr' : _me.move_by+1, 'name' : _me.players[_me.move_by]});
	else
	  msg = getMsg(type);

	return msg;
  };
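The tutorial does not show getMsg, the private function that fills in the %nr% and %name% placeholders. As a rough idea of how such a substitution could work, here is a minimal sketch. The function name fillPlaceholders and its behavior for unknown placeholders are assumptions, not the game's actual implementation.

```javascript
// Hypothetical stand-in for the substitution step inside getMsg().
// Replaces every %key% in the template with vars[key]; placeholders
// that have no matching key are left untouched.
function fillPlaceholders (template, vars)
{
  return template.replace(/%(\w+)%/g, function (match, key)
  {
    if (vars && Object.prototype.hasOwnProperty.call(vars, key))
      return String(vars[key]);

    return match;
  });
}

var prompt = fillPlaceholders('Player %nr% (%name%) make your move!',
  {'nr' : 1, 'name' : 'Ana'});
```

Leaving unknown placeholders in place, rather than removing them, makes a missing variable easy to spot when the message is synthesized.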

The voiceCommand function is invoked by the VoiceXML form once the user says something that it recognizes. The function checks if the semantic interpretation object contains a valid request, after which it calls the general purpose makeMove function. The Voice-related code does not deal at all with the SVG code, nor with the game logic. The value returned by makeMove is used to determine which message to synthesize.

  _me.voiceCommand = function (vres)
  {
	if(!vres || !vres.interpretation)
	  return false;

	var src = {},
	  dest = {},
	  p, found_src = false, found_dest = false,
	  si = vres.interpretation;

	// construct the src and dest objects for makeMove()
	for(p in _me.props)
	{
	  if((si['src_' + p] || si['src_' + p] == 0) && si['src_' + p] != -1)
	  {
		src[p] = si['src_' + p];
		found_src = true;
	  }

	  if((si['dest_' + p] || si['dest_' + p] == 0) && si['dest_' + p] != -1)
	  {
		dest[p] = si['dest_' + p];
		found_dest = true;
	  }
	}

	if(!found_src || !found_dest)
	  return 'event-nomatch';

	var res = {'action' : 'prompt-value'};

	var gres = _me.makeMove(src, dest);

	if(gres == -1)
	  res['message'] = getMsg('move-not-allowed');
	else if(gres == -2)
	  res['message'] = getMsg('no-objects-found');
	else if(gres == -3)
	  res['message'] = getMsg('previous-move', {'nr' : _me.disallow_nmoves});
	else if(gres == -4)
	  res['message'] = getMsg('false-move');
	else if(gres < -4)
	  res['message'] = _me.messages['internal-error'] || 'Internal error';
	else if(gres == 2)
	  res['message'] = getMsg('player-won', {'nr' : _me.winner+1, 'name' : _me.players[_me.winner]});
	else if(gres == 3)
	  res['message'] = getMsg('game-end-maxMoves', {'nr' : _me.maxMoves});
	else if(gres == 4)
	  res['message'] = getMsg('game-end-nochange', {'nr' : _me.maxMovesNoChange});

	if(!res['message'])
	  return 'clear-cmd';
	else
	  return res;
  };

  if(!no_autoinit)
	window.addEventListener('load', _me.init, false);

  return _me; // the end of the RD_game object
})();

There you have it: we only added two Voice-related functions to the initial code.
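One detail of voiceCommand is worth isolating: the property test has to accept the value 0 (red, or circle) even though 0 is falsy in JavaScript, and has to reject -1 (the generic "object"). The sketch below extracts that test into a standalone function. The name pickProps is hypothetical, and the property names are passed as an array here, whereas the game iterates over the keys of _me.props.

```javascript
// Hypothetical extraction of the property test used in voiceCommand().
// Returns an object with the recognized properties, or null if none were found.
function pickProps (si, prefix, propNames)
{
  var out = {}, found = false, i, val;

  for (i = 0; i < propNames.length; i++)
  {
    val = si[prefix + propNames[i]];

    // "val || val == 0" keeps 0, which a plain truthiness test would drop;
    // "val != -1" skips the generic "object" shape.
    if ((val || val == 0) && val != -1)
    {
      out[propNames[i]] = val;
      found = true;
    }
  }

  return found ? out : null;
}

// "change the red objects to ...": color 0 is kept, shape -1 is skipped.
var src = pickProps({'src_color' : 0, 'src_shape' : -1}, 'src_', ['color', 'shape']);
```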

The result

You can play the game using Voice commands straight away: try it yourself! As you can see, making the game Voice controllable is really easy. It only takes adding a couple of functions and building the VoiceXML form itself.

Implementation notes

Because the SVG game uses animations, making it usable in Firefox as well as Opera required checking whether the SVG DOM methods and properties specific to animations were available. Firefox does not currently support the animation module of the SVG specification. In general, it is important to check for the presence of the methods and properties you want to use, in order to avoid breakage.
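A check of this kind can be as simple as testing for the methods before calling them. The sketch below uses beginElement and endElement, which the SVG specification defines for animation elements. The function names supportsSvgAnimation and startAnimation, and the idea of a fallback callback, are assumptions for illustration and are not taken from the game's code.

```javascript
// Returns true if the element exposes the SVG animation timing methods.
function supportsSvgAnimation (elem)
{
  return !!elem &&
    typeof elem.beginElement == 'function' &&
    typeof elem.endElement == 'function';
}

// Hypothetical wrapper: starts the animation when supported, otherwise
// runs an optional fallback (for example, setting the final state directly).
function startAnimation (animElem, fallback)
{
  if (supportsSvgAnimation(animElem))
  {
    animElem.beginElement();
    return true;
  }

  if (typeof fallback == 'function')
    fallback();

  return false;
}
```

With this approach the browser without animation support skips the animation and still ends up in a usable state, instead of throwing an error on the missing method.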

Additionally, while working on the SVG part of the game I found some bugs associated with the <svg:use> element. These bugs cause odd behavior (see bug reports 265894, 265895 and 269482).

While working on the project I was interested in using SVG filters, which are somewhat supported by Opera. It seems the implementation is not working properly in Opera 9.2 - for example Opera 9.2 for Windows sometimes has problems with redrawing the page when the player makes a move. You have to zoom in/out to force Opera to redraw the page. Beta builds of Opera 9.5 have tons of SVG-related improvements. For example, during testing, SVG filters rendered without any errors.

There were no browser-specific issues with adding Voice to the game.