Go to Table of Contents for 1993 Virtual Reality Conference
Greg Bryant, Russell Eberhart, Erik Frederick, John Gawel and
Stephen Turner Electrical and Computer Engineering
North Carolina State University
Raleigh, North Carolina
Research Triangle Institute
Research Triangle Park, North Carolina
The development of a system to implement hand gesture to speech using a Power Glove and an Echo Speech Synthesizer in a standard 386-based PC is described. The incremental cost of the glove talk equipment is about $100: about $85 for the synthesizer and $15 for the glove. Two versions of the system were developed. The first relies on hand roll and finger position with the hand held relatively still for the formation of a root utterance, and hand movement for the selection of an utterance ending and speaking of the entire utterance. The second involves hand movement for the entire gesture. Both versions were implemented using a neural network. The first version was also implemented using more traditional position sensing software.
Recent work by Fels and Hinton reports the development of a neural network interface between a VPL Data Glove and a speech synthesizer. A DECtalk speech synthesizer was used with five neural networks to implement a hand-gesture to speech system. The system demonstrated that neural networks can be used to develop the complex mappings required in a high-bandwidth interface that adapts to the individual user. The exact cost of the hardware used by Fels and Hinton is unknown to the authors of this paper, but is estimated to be in excess of $20,000. (The Data Glove alone costs about $8,000.)
This paper describes a hand gesture to speech system built from a Mattel Power Glove and a Street Electronics Echo Speech Synthesizer used with a standard 386-based PC. The incremental cost of the equipment is about $100: about $85 for the speech synthesizer and about $15 for the glove.
A system chosen to implement hand gesture to speech can be designed many ways depending on the control resolution desired. For example, each gesture could be mapped to a phoneme. The system user would then theoretically have an unlimited vocabulary, but the sequence of gestures would have to be made rapidly, and the system that recognizes them would have to operate in real-time in order to avoid unacceptable lag time.
At the other extreme, each hand gesture can correspond to a complete word or phrase. This results in a fixed vocabulary, but a relatively small number of utterances can be learned in a reasonably short period of time. Our $100 Glove Talk system uses this second approach.
The system development was carried out as a senior design project in the Electrical and Computer Engineering Department at North Carolina State University. The work was sponsored by Research Triangle Institute. A total of three months was available to interface the glove with the computer and design/test/debug the software.
The following system goals were established:
The Nintendo Power Glove was introduced in 1989 by Mattel for use with the Nintendo Entertainment System. It was designed as an alternative input device for playing video games. It was unpopular, due largely to its cumbersome design, and was discontinued. These gloves are still available by mail order, and from other sources such as flea markets and pawn shops. The project team acquired several gloves at Raleigh-area flea markets for $5-14 each. Some gloves were new, some used. The used gloves were in various conditions. Of the seven gloves acquired, two have ceased to work properly.
The Power Glove has ultrasonic transducers that triangulate spatial position and hand rotation relative to a center position that the user sets by pressing a button on the back of the glove. This position and rotation information is processed by the glove and transmitted serially to the PC via the "select" line of the parallel port. The signals from the glove are interpreted by driver software adapted from REND386, a public-domain 3-D rendering package for the Intel 80386-based PC.
During the initial stages of the project, it was learned that although the glove can recognize four positions for each finger, the two intermediate positions are difficult to sign consistently. It is also difficult for many people to use the ring finger, because the finger is relatively weak. It was therefore decided to create a set of gestures using the thumb, forefinger, and middle fingers in the fully open and closed positions. In addition, five rotations of the wrist were used, from palm straight down to palm straight up (45-degree increments). This method allows for 40 unique gestures, each related to a "root" word or phrase.
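The count of 40 root gestures follows from three fingers with two states each (2 x 2 x 2 = 8) combined with five wrist rotations (8 x 5 = 40). The following is a minimal sketch of one way such gestures could be numbered. The function name, the bit order, and the numbering scheme are illustrative assumptions and are not taken from the project software.

```python
from itertools import product

# Assumed names and ordering; the paper only states which fingers are used.
FINGERS = ("thumb", "forefinger", "middle")
# Five wrist rotations in 45-degree steps, palm down (0) to palm up (180).
ROLL_STEPS_DEG = (0, 45, 90, 135, 180)


def root_gesture_index(closed, roll_deg):
    """Map finger states and a roll step to a root gesture number 0..39.

    closed: dict mapping each name in FINGERS to True (closed) or False (open).
    roll_deg: one of ROLL_STEPS_DEG.
    """
    finger_code = sum(1 << bit for bit, name in enumerate(FINGERS) if closed[name])
    roll_code = ROLL_STEPS_DEG.index(roll_deg)
    return roll_code * 2 ** len(FINGERS) + finger_code


# Enumerate every finger/roll combination to confirm there are 40 distinct roots.
ALL_ROOTS = sorted(
    root_gesture_index(dict(zip(FINGERS, states)), roll)
    for states in product((False, True), repeat=len(FINGERS))
    for roll in ROLL_STEPS_DEG
)
```

Each index could then select one root word or phrase from the vocabulary.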
The X, Y, and Z positions of the glove were used to attach supplemental phrases to the base word selected by finger position and rotation. The negative X, negative Y, negative Z, positive X and positive Y positions each represent a unique word or phrase. The positive Z position, which is in the backward direction, was not used because of motion limitations of a wheelchair occupant. This approach resulted in five endings to add to each root word/phrase, yielding a total of 200 unique utterances.
The system was designed with a minimum of hardware, and uses an 80386-based personal computer. The glove unit consists of a nylon glove with rubberized backing. The backing contains strain gauges. A microcontroller mounted on the back of the glove converts raw finger position information into two bits of data per finger. The Power Glove has sensors in all fingers except the little finger, so that eight bits (one byte) of finger position data are obtained with each sample.
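Since four fingers at two bits each fill exactly one byte, the finger byte can be unpacked with shifts and masks. The sketch below assumes a particular finger order within the byte (thumb in the two most significant bits); the paper does not state the actual bit layout, so the order and the names used here are hypothetical.

```python
# Assumed order, most significant bit pair first. Not confirmed by the paper.
FINGER_ORDER = ("thumb", "index", "middle", "ring")


def decode_fingers(finger_byte):
    """Split one byte into four 2-bit finger positions (each 0..3)."""
    if not 0 <= finger_byte <= 0xFF:
        raise ValueError("finger_byte must fit in one byte")
    positions = {}
    for i, name in enumerate(FINGER_ORDER):
        shift = 6 - 2 * i  # 6, 4, 2, 0
        positions[name] = (finger_byte >> shift) & 0b11
    return positions


example = decode_fingers(0b11000110)
```

Because the intermediate positions are hard to sign consistently, an interpreter might treat only the two extreme values of each field as open and closed.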
X,Y,Z coordinate information and hand rotation are also sampled by three ultrasonic receivers which are built into a plastic L-shaped rod that can fit on a computer monitor. The signals are generated by two ultrasonic transducers mounted on the back of the glove. This method allows the coordinates and roll angle of the glove to be triangulated and tracked in near-real-time. X and Y coordinates are calculated to a precision of about 3 mm; the Z coordinate to a precision of about 14 mm. These values increase with the distance from the glove to the receivers. Roll angle is calculated to a precision of 30 degrees. A centering button on the back of the glove can be used anytime to reset the current glove position to 0,0,0 in a Cartesian system with limits of +128 to -128 for each axis.
A modified Nintendo extension cable was used to interface the glove unit to a personal computer through the parallel printer port. The control line of the printer port is set up to receive the data stream from the glove unit. The software must therefore poll the printer select line to receive glove unit data. This method is a CPU-intensive but inexpensive alternative to an external microcontroller.
Speech output was obtained two ways. The first was a text to speech board called the Echo Speech Synthesizer manufactured by Street Electronics. Two versions of the synthesizer were used: the Echo PC II, which features a plug-in board for a PC and an external speaker; and, the Echo, which is an external unit which connects to the PC's serial port. The second way was a SoundBlaster board manufactured by Creative Labs. The SoundBlaster is more expensive than the Echo, but can provide more versatility in voiced utterances and sound effects.
The only additional hardware required is a five-volt power supply for the glove unit. This supply was obtained one of two ways. On desk-top PCs, a tap was used to obtain the power from the keyboard's voltage lines. On battery-powered notebook PCs, a battery pack of four 1.25-volt rechargeable NiCad AA cells was used, tapped into the glove cable at the computer's parallel port connector. It should be noted that, with the battery pack, and a 9-volt battery providing power to the external Echo unit, the entire system is battery-powered and portable.
The batteries and battery holder cost about five dollars. Given a price of $85 for the Echo unit and $5-14 for the glove, the total price for the system is about $100.
The system software includes four main modules: the glove driver, the gesture interpreter, the user interface and the speech driver. Additionally, a configuration file, named glove.cfg, contains the configuration information.
The glove driver software was adapted from a public domain software package called REND386, which is a 3D rendering package for Intel-based PCs. REND386 is available on the Internet. The output of this module delivers eight bytes of information. The first five bytes, important to this application, are in the following format: Byte 1, glove X position; byte 2, glove Y position; byte 3, glove Z position; byte 4, glove roll angle; and, byte 5, finger positions. The sixth byte is related to key presses on the glove; the seventh and eighth bytes are cryptic "Nintendo" bytes that we neither use nor care about.
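The eight-byte layout described above can be unpacked in one step. This sketch assumes that the X, Y, and Z bytes are signed (consistent with positions centered on zero) and that the roll, finger, and key bytes are unsigned; neither assumption is confirmed by the paper, and the names `GloveSample` and `parse_packet` are illustrative.

```python
import struct
from collections import namedtuple

GloveSample = namedtuple("GloveSample", "x y z roll fingers keys")


def parse_packet(packet):
    """Unpack one eight-byte driver packet into named fields.

    Bytes 1-3: X, Y, Z (assumed signed). Byte 4: roll angle.
    Byte 5: finger positions. Byte 6: key presses.
    Bytes 7-8 are skipped, since the application does not use them.
    """
    if len(packet) != 8:
        raise ValueError("expected exactly 8 bytes")
    x, y, z, roll, fingers, keys = struct.unpack("<3b3B2x", packet)
    return GloveSample(x, y, z, roll, fingers, keys)


sample = parse_packet(bytes([0xFF, 0x05, 0x80, 3, 0xC0, 0, 0, 0]))
```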
Two versions of the interpreter were developed. The first relies on hand roll and finger position with the hand held relatively still for the formation of a root utterance, and hand movement for the selection of an utterance ending and speaking of the entire utterance. The second involves hand movement for the entire gesture. Both versions were implemented using a neural network. The first version was also implemented using more traditional position sensing software.
The neural network is a standard back-propagation network with one hidden layer as described in Eberhart and Dobbins. For the first system version (with the hand held relatively still for the root utterance formation) the network has inputs for finger positions and roll angle. For the second version (with hand movement for the entire gesture), the network has inputs for X, Y, and Z coordinates, finger positions and roll angle for each of 10 time slices separated by 100 milliseconds. The gesture time duration is therefore set at one second.
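To make the input layout for the second version concrete, the sketch below runs a forward pass through a one-hidden-layer network with sigmoid units. It assumes eight inputs per time slice (X, Y, Z, roll, and one value per sensed finger), giving 80 inputs for 10 slices. The per-slice feature count, the hidden and output layer sizes, and the random weights are all assumptions for illustration; the paper gives none of them, and the back-propagation training step is omitted.

```python
import math
import random

SLICES = 10              # ten samples, 100 ms apart (from the paper)
FEATURES_PER_SLICE = 8   # assumed: x, y, z, roll, four finger values
N_HIDDEN = 12            # hypothetical hidden layer size
N_OUTPUT = 5             # hypothetical number of gesture classes


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [
        sigmoid(sum(w * i for w, i in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Input -> hidden -> output; returns one activation per class."""
    return layer(layer(inputs, w_hidden, b_hidden), w_out, b_out)


def random_weights(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in = SLICES * FEATURES_PER_SLICE
w_hidden = random_weights(N_HIDDEN, n_in, rng)
w_out = random_weights(N_OUTPUT, N_HIDDEN, rng)
outputs = forward([0.0] * n_in, w_hidden, [0.0] * N_HIDDEN, w_out, [0.0] * N_OUTPUT)
# The recognized gesture is taken as the output unit with the largest activation.
best_class = max(range(N_OUTPUT), key=outputs.__getitem__)
```

In practice the inputs would be scaled to a common range before being presented to the network, and the weights would come from training on recorded gestures rather than from a random generator.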
A simplified interpreter was written for the first system version that features a direct translation of the glove data received from the driver module mapped onto a user-defined utterance table in the glove.cfg file that contains a list of root utterances and their related endings. The gestures are mapped and distinguishable by the positions of the user's fingers and glove rotation data. This gesture information is then fed into the user interface module.
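A direct-translation interpreter of this kind amounts to a table lookup. The sketch below shows the idea with an in-memory table keyed by a root gesture number and an ending direction. The table contents, the key names, and the function `build_utterance` are invented for illustration and do not reflect the actual glove.cfg format or vocabulary.

```python
# Hypothetical vocabulary: one root phrase with five direction-selected endings.
UTTERANCE_TABLE = {
    0: {
        "root": "I would like",
        "endings": {
            "up": "some water",
            "down": "to rest",
            "left": "to go outside",
            "right": "something to eat",
            "forward": "to talk",
        },
    },
}


def build_utterance(table, root_index, direction):
    """Return 'root ending', or None if either part is not defined."""
    entry = table.get(root_index)
    if entry is None:
        return None
    ending = entry["endings"].get(direction)
    if ending is None:
        return None
    return f"{entry['root']} {ending}"


phrase = build_utterance(UTTERANCE_TABLE, 0, "up")
```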
In the first version of the system, the software waits for the user to either select a new root utterance or choose an utterance ending by moving the glove out of a visual representation of a bounding volume. An ending can be selected by moving the glove approximately 20 cm in one of five directions: up, down, right, left or forward. The backward direction is not available, to better accommodate wheelchair users. When the glove is moved in one of these directions outside of the volume, the root phrase and selected ending phrase are sent to the speech output module.
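The ending selection can be sketched as a test of which face of the bounding volume the glove has crossed. The code below assumes a cube centered on the glove's zero position, with right and up as positive X and Y and forward as negative Z (positive Z being the unused backward direction). The function name, the axis-to-direction naming for X and Y, and the `half_size` parameter in glove units are assumptions.

```python
def select_ending(x, y, z, half_size):
    """Return the direction in which the glove left the box, or None if inside.

    Backward motion (positive z) is deliberately ignored.
    """
    candidates = {
        "right": x,
        "left": -x,
        "up": y,
        "down": -y,
        "forward": -z,
    }
    # Pick the direction with the largest displacement from center.
    direction, distance = max(candidates.items(), key=lambda kv: kv[1])
    return direction if distance > half_size else None
```

For example, with `half_size=40`, a glove position of `(10, -70, 5)` selects the "down" ending, while `(0, 0, 60)` selects nothing because it lies in the backward direction.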
In the second (3D gesture) version of the system, the software waits for a clenched fist. Following the unclenching of the fist, ten sets of glove data (XYZ, roll, and finger positions) are taken at 100-millisecond intervals. These data are then stored, and used as input to the trained neural network, which classifies the gesture. When the user makes a pointing gesture (thumb up, index finger pointed forward), whatever utterances have been recognized by the neural network are spoken (sent to the speech output module). Gestures based loosely upon American Indian Sign Language have been used. American Sign Language was not used for several reasons, the main one being the inability of the Power Glove to distinguish relatively small finger position differences.
The system software was configured so that speech output can be produced by either a Street Electronics Echo Speech Synthesizer or a Creative Labs SoundBlaster. Each of these units is capable of text to speech conversion. The module speaks the phrase gestured, then returns control to the interpreter.
The version of the Echo which has a plug-in board for the PC has two voices available: a "robot" voice with unlimited vocabulary, and a woman's voice with a fixed 1,000-word vocabulary. The woman's voice is easier to understand, but may or may not have the required phrases for a given user. The external version of the Echo that plugs into the serial port has only the robot voice. The SoundBlaster has more flexibility than the Echo. For example, it provides a sound-effects capability and relatively high-quality speech synthesis.
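The capture sequence for the 3D gesture version can be read as a small state machine: wait for a fist, wait for its release, collect ten samples, classify, and speak when a pointing gesture appears. The sketch below assumes samples already arrive at 100-millisecond intervals and that the first captured slice is the sample at which the fist opens. The function name and the `classify`, `is_fist`, and `is_pointing` callables are hypothetical stand-ins for the trained network and the gesture tests.

```python
def run_interpreter(samples, classify, is_fist, is_pointing, slices=10):
    """Process a stream of glove samples; return the list of spoken phrases."""
    state = "wait_fist"
    window, recognized, spoken = [], [], []
    for sample in samples:
        if state == "wait_fist":
            if is_pointing(sample) and recognized:
                # Pointing gesture: speak everything recognized so far.
                spoken.append(" ".join(recognized))
                recognized = []
            elif is_fist(sample):
                state = "wait_release"
        elif state == "wait_release":
            if not is_fist(sample):
                window = [sample]  # first slice taken as the fist opens
                state = "capture"
        else:  # state == "capture"
            window.append(sample)
        if state == "capture" and len(window) == slices:
            recognized.append(classify(window))
            state = "wait_fist"
    return spoken


# Toy stream: "F" = fist, "o" = any other hand shape, "P" = pointing.
stream = ["F", "F"] + ["o"] * 10 + ["F"] + ["o"] * 10 + ["P"]
result = run_interpreter(
    stream,
    classify=lambda window: "hello",
    is_fist=lambda s: s == "F",
    is_pointing=lambda s: s == "P",
)
```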
The configuration file, named glove.cfg, is an ASCII file that can be edited by any text editor. With it, the user can configure the system without rewriting any of the program code. The file is read each time the system is activated. Items that can be specified include the speech output device being used, the communications port used by the speech device, the size of the 3D box within which the glove must be held, and the vocabulary list of utterances. An example listing of the glove.cfg file appears in Appendix A.