Go to Table of Contents for 1993 Virtual Reality Conference
Gregory B. Newby, Ph.D.
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Urbana, IL, 61801
One of the advantages of virtual reality interfaces over traditional interfaces is that they enable the user to interact with virtual objects using gestures. This use of natural hand gestures for computer input provides opportunities for direct manipulation in computing environments, but not without some challenges. The mapping of a human gesture onto a particular computer system function is not nearly so easy as mapping with a keyboard or mouse. Reasons for this difficulty include individual variations in the exact gesture movement, the problem of knowing when a gesture is started and completed, and variation in the relative positions of other body parts which might help to identify a gesture but may not be measured. A further difficulty is the potential limitation on the number of gestures which a person can reliably remember and reproduce. This paper describes work on the automatic recognition of gestures using the sum of squares statistic. A DataGlove(tm) and Polhemus(tm) tracker were employed to measure hand location and finger position to "train" the software to recognize letters and numbers of the manual alphabet of American Sign Language (ASL). Benefits for ASL speakers and VR application designers are discussed, and future directions for gesture recognition research are introduced.
Several aspects of virtual reality (VR) differ from other forms of human-computer interaction. One of the most important is the use of gesture-oriented input devices. Traditional input devices (e.g., keyboards) do not use gestures. Mice and lightpens allow for two-dimensional input (plus a button to depress), and enable a limited set of gestures for input. VR input devices have gone beyond the limits of previous generations of input devices in two ways. First is the enablement of input through body movement: body position, hand and finger position, or the orientation of a particular body part can be assessed. The second and perhaps more important way in which VR devices differ from traditional input devices is that they enable the computer to recognize human gestures. This is in direct contrast to other forms of human-computer interaction, which require the user to translate what he or she wants to do into a set of input sequences which the computer can understand. The use of gestures, especially gestures involving the hands, is the focus of this work. Glove devices which measure finger positions, and trackers which assess the position and orientation of the hands, head, and body, are the basic components of gesture input devices.
Speakers of American Sign Language (ASL) employ gestures to communicate with each other to a degree that far exceeds that of spoken English (hereafter referred to as "English"). ASL is primarily used for communication among deaf people, or between the deaf and non-deaf, and does not make use of sound, which is the primary component of spoken communication among English speakers. In place of sound, ASL speakers use hand movements, augmented with body movement and facial expression, to a far larger extent than English speakers do.
The purpose of this work is to investigate the ability of a computer to recognize gestures such as produced by speakers of ASL. Various methods for recognizing gestures are discussed, and the role of gesture recognition for human-computer interaction and for ASL speech recognition is presented. A small-scale study to recognize letters of the manual alphabet of ASL is described. Future directions for gesture recognition research compose the concluding sections. Although this work is presented in the context of ASL, the use of gestures for interaction in virtual environments is often fundamental. These techniques may be used in virtual environments as well as for the recognition of ASL gestures.
Goals For Gesture Recognition
Two general purposes of gesture recognition for ASL are (1) as input to computer applications and (2) for translation into spoken or written English. The second purpose assumes the first, and is more difficult. The first case is the more general, and applies to both VR and non-VR applications. The most frequently discussed VR applications include a very small number of available input choices -- grabbing and pointing are the only gestures used in many "fly through" virtual worlds (e.g., Newby, 1993). Additionally, graphical scenes may be based on the position of the user's head. In contrast to a typical computer application, the number of discrete types of input which a VR application allows is limited. On a word processor, for example, about 100 keys may be pressed, often in combination, and a mouse may also be used to modify the input. Pull-down menus provide further options. Even a simple program that seems appropriate for implementation in a virtual world -- a paint or draw application, for example -- requires a number of gestures to grab and release tools, apply/use them, select areas, and so forth.
Grabbing and pointing are typically recognized using the fixed-parameter method described below. For more complicated applications, methods for distinguishing among a larger number of gestures are needed. ASL provides a good model for VR interaction as it includes thousands of gestures which are well-suited to measurement with VR input devices. The remainder of this section discusses ASL as a method for communication, starting within the context of the second role from above: the automatic generation of English translations from ASL input.
The grammar of ASL is generally simpler than that of English (American Standard or any other dialect). For example, modifiers for tense (past, present, future) occur before a phrase is signed, and stay in effect until another modifier is given. In this example, the single sign for "to go" could mean "will go," "have/has gone," and "am going," depending on whether a modifier for the future, past, or present tense is in effect.
ASL phrases leave out articles and other parts of speech, e.g., "a," "an," "the," "of." The placement of modifiers, lack of articles, and use of multi-part signs contribute to the lack of isomorphism between ASL phrases and their English counterparts. For example, a gesture-by-gesture translation of an ASL phrase into English might yield: "future I go again eat here."
The equivalent English phrase would be: "I will eat here again." More complicated ideas expressed in ASL may be further removed from their expression in English. Although direct translations are not very difficult for people knowledgeable in both languages, the creation of computer algorithms to translate between the languages, so that a gesture recognition program could be used to produce synthesized English speech, would involve more than simple one-to-one matching of ASL signs and their English counterparts. The remainder of this work is concerned with the identification of ASL signs, not their translation into another language.
There are three components to an ASL sign: the location relative to the body, the finger position, and the movement. Many are made with both hands and arms. A large number of common English words or ideas have single ASL signs, enabling fluent signers to speak at rates somewhat faster than spoken English. When a word does not have a sign or the sign is not known, it is spelled using the manual alphabet. This happens frequently with proper names and technical terms. Many signs include components from other signs, such as the sign for "green" which starts with the hand position for the letter "G" but has added movement.
Signs in ASL are made relative to body reference points: parts of the head and face, the arms, the shoulders, and torso. Some two-handed signs are symmetric; others involve different movements with each hand. Signing also makes great use of facial expression to support the communication or add emphasis. Signers may mouth words silently in English as they speak ASL.
Methods For The Automatic Recognition Of Gestures
For ASL speakers, the dominant communication component is the gestures made while speaking. In virtual environments, gestures are used to navigate through a representation of a physical or physical-like domain, or to interact with applications. Both ASL speakers and VR users can benefit from effective methods for a computer to recognize gestures, and learn to distinguish among gestures. This section discusses two computer methods and their associated issues: fixed-parameter recognition and sum of squares statistical recognition.
"Fixed-parameter" is used here to refer to gesture recognition in virtual environments which sets up a list of parameters for the input measures, and compares the current input list to those parameters. A typical VR application might involve using these gestures: a fist, an open hand, and pointing with one or two fingers. The DataGlove produces two numbers for each finger. Fixed-parameter recognition would set up a list of minimum or maximum values for fingers which distinguish these gestures. So, a fist would involve high values on every finger; an open hand would have low values on every finger; pointing would have high values on some fingers, and low on others.
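The threshold logic described above can be sketched in C. This is an illustrative sketch only: the gesture set, the bend threshold of 1.0, and the 0-2 value range are assumptions, not the parameters of any particular application.

```c
#include <assert.h>

enum gesture { FIST, OPEN_HAND, POINT, UNKNOWN };

/* Fixed-parameter recognition: compare the 10 DataGlove bend values
 * (two per finger, assumed to range 0-2) against fixed thresholds.
 * Finger 0 is the thumb, finger 1 the index finger. */
enum gesture classify_fixed(const double bend[10])
{
    int curled = 0;   /* count of fingers above the bend threshold */
    int i;
    for (i = 0; i < 5; i++)
        if (bend[2 * i] > 1.0 && bend[2 * i + 1] > 1.0)
            curled++;

    if (curled == 5) return FIST;       /* high values on every finger */
    if (curled == 0) return OPEN_HAND;  /* low values on every finger  */
    /* pointing: index finger extended, all four others curled */
    if (curled == 4 && bend[2] < 1.0 && bend[3] < 1.0)
        return POINT;
    return UNKNOWN;
}
```

The conceptual simplicity noted below is visible here: a handful of comparisons per frame, with no stored prototypes and no floating point arithmetic beyond the threshold tests.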
The problems with the fixed-parameter approach are that it is not well suited for large numbers of gestures, and the parameters may differ across users (or even for the same user at different times). The main advantages are its conceptual simplicity and computational efficiency. This is the method used in most VR applications, especially those that involve "flying through" a virtual environment using finger gestures to indicate direction and velocity and grabbing gestures to select items.
A somewhat more sophisticated approach to gesture recognition is the measurement of similarity by the "sum of squares" method. The sum of squares is one of the simplest statistics, used for assessing the similarity of two sets of scores (e.g., Tukey, 1977). A sum of squares score is calculated by squaring the differences between each matched value in the sets, and then summing these squared difference scores.
With the DataGlove, this may be accomplished by obtaining "prototype" gestures for each gesture to be recognized. Then, a gesture to be recognized is compared to each known prototype. The known gesture with the smallest sum of squares score is the one "closest" to the gesture to be recognized. The DataGlove used for this experiment produces 10 values, 2 for each finger. The sum of squares for each known gesture is calculated by taking the difference between the current gesture and the known gesture on every DataGlove value, producing 10 difference scores. Each score is squared, then the squared values are added. The resulting number is the sum of squares for that known gesture and the current gesture.
In the case where the known gesture is identical to the current gesture on all 10 finger values, the sum of squares will be 0. As the two gestures differ, the sum of squares increases. The set of sum of squares scores can be ranked to see which gesture is closest -- and a value for tolerance for error may be used to cull sum of squares scores which are too large to be a "good" match.
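The calculation and ranking just described can be sketched as follows. The function names and the tolerance handling are illustrative, not the software used in the experiment.

```c
#include <assert.h>

#define NVALS 10   /* two bend sensors on each of five fingers */

/* Sum of squared differences between the current gesture and one
 * known prototype, over all 10 DataGlove values. */
double sum_of_squares(const double cur[NVALS], const double proto[NVALS])
{
    double ss = 0.0;
    int i;
    for (i = 0; i < NVALS; i++) {
        double d = cur[i] - proto[i];
        ss += d * d;
    }
    return ss;
}

/* Return the index of the closest prototype, or -1 if even the best
 * match exceeds the tolerance for error. */
int closest_gesture(const double cur[NVALS],
                    const double protos[][NVALS], int nprotos,
                    double tolerance)
{
    int best = -1, g;
    double best_ss = tolerance;
    for (g = 0; g < nprotos; g++) {
        double ss = sum_of_squares(cur, protos[g]);
        if (ss < best_ss) {
            best_ss = ss;
            best = g;
        }
    }
    return best;
}
```

Note that an explicit sort is unnecessary for recognition itself: tracking the running minimum while scanning the prototypes gives the closest gesture in a single pass.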
The benefit of the sum of squares approach is that it can distinguish a far larger number of gestures than the fixed-parameter method. As long as the DataGlove values of new gestures are different from each of the known gestures, new values may be added indefinitely. Different users can provide their own prototype gestures, so that differences across people will not detract from the accuracy (this could be done for fixed-parameter methods as well, but in practice seldom is).
The main problem with the sum of squares method is its computational complexity. Even though this is a far simpler statistic than most, it does involve several dozen floating point operations for each known gesture, followed by ranking and comparison to the tolerance for error. To be effective for gesture recognition in virtual environments, this must be accomplished for each incoming set of DataGlove values as they become available, in "real time." A lower threshold for VR is about 10 "frames" per second -- so the calculations should take no longer than 1/10 second, including any time needed to get current DataGlove position values. The software developed for the experiment described below was able to operate within these parameters. However, it will be desirable to add a movement tracker, as well as finger position measurement, to make the transition beyond the manual alphabet to ASL proper. As a standard Polhemus tracking device produces only about 6 measurements per second, this may limit performance at that stage of the research. Faster tracking devices may eliminate this problem.
The sum of squares approach described above is well-suited for comparing "slices" of time -- comparing one set of DataGlove values to another, where the sets are the same size. ASL signs take place across time, though, which means that a program capable of recognizing ASL during actual use would need to match a set of DataGlove scores representing a sign (say, sampled at 60 measures per second) with an ongoing stream of incoming data. The sum of squares method is suitable for this: the known gestures would be stored in an array of DataGlove values, and compared to an array of the same size taken from the most recent input values. Implementation is not so straightforward as for the static measures, especially in guarding against false matches, which would be likely to occur when the start of one gesture matches the whole of another.
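The across-time comparison extends the same statistic over a window of frames. The sketch below assumes a hypothetical window of 30 samples (half a second at 60 measures per second); in practice the window length would depend on the duration of the signs being matched.

```c
#include <assert.h>

#define NVALS   10   /* DataGlove values per frame */
#define SEQ_LEN 30   /* frames per gesture window; assumed, e.g. 0.5 s at 60 Hz */

/* Sum of squares between a stored gesture sequence and the window of
 * the most recent SEQ_LEN input frames: the same statistic as the
 * static case, summed over both time and finger values. */
double sequence_ss(const double known[SEQ_LEN][NVALS],
                   const double recent[SEQ_LEN][NVALS])
{
    double ss = 0.0;
    int t, i;
    for (t = 0; t < SEQ_LEN; t++)
        for (i = 0; i < NVALS; i++) {
            double d = known[t][i] - recent[t][i];
            ss += d * d;
        }
    return ss;
}
```

Because the window slides one frame at a time, the false-match problem mentioned above appears naturally: a window covering the first frames of one sign can score close to a shorter sign with the same opening hand shape, so some form of confirmation over subsequent frames would be needed.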
Other forms of gesture recognition could be used. One of the most promising may be a neural network approach (cf. Rumelhart & McClelland, 1986). Benefits of this approach may be anticipated based on success in other areas, and may include automatic identification of the critical features for discriminating among similar gestures and increased capability to employ the transition between known gestures for recognizing dynamic gestures. Neural networks do not decrease the difficulties of dealing with gestures across time, though, and they would almost certainly not be as fast as the sum of squares approach. More importantly, the addition of a new gesture would require retraining the entire network (something that takes at least several seconds in most neural network implementations), as would removing a gesture.
Experiments On The Automatic Recognition Of Gestures
Results are presented for Experiment 1. Experiment 2 is pending.
The goal of these experiments is to assess the effectiveness of the statistical approach to gesture recognition described above for signs which do not occur across time.
Setting: The Virtual Reality laboratory of the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.
Equipment: The input device used was a VPL DataGlove model II. The glove has two sensors to measure finger bend on each finger, one over the knuckle and the other over the middle joint of the finger (no sensors measured the palm, wrist, or last finger joint positions).
Software: A Silicon Graphics SkyWriter(tm) graphics supercomputer received input from the DataGlove and performed all calculations. The SkyWriter uses a flavor of the UNIX(tm) operating system. Gesture recognition software was written in the C programming language (an earlier version was written in Fortran). Additional software for interaction with the DataGlove and VR devices was used by the gesture recognition code.
Subject: A native sign language speaker volunteered to assess the performance of the recognition software. She also taught ASL, and was fluent in spoken English.
Method for Experiment 1: The subject generated template gestures for the 26 letters of the ASL manual alphabet and the numbers 1 through 10. The software then attempted to match gestures made with those it knew about. Different values of tolerance for error could be selected so that the software would "guess" a gesture even though a perfect match was not obtained.
Method for Experiment 2 (planned): Instead of gestures, words will be spelled with the manual alphabet. Unlike the gestures investigated in Experiment 1, these words occur across time. Rather than comparing a given gesture to all known gestures, an array of current and past gestures will be compared to known arrays.
Experiment 1 demonstrated the promise of the statistical approach to gesture recognition for gestures which do not occur across time. Of the 36 gestures "taught" to the software, only a few were not recognized reliably and unambiguously. "I" and "J" were confused, as they are identical hand positions but "J" moves. "Z" and "1" are also identical, except that "Z" moves and has a different orientation. The addition of the capability to measure hand orientation and take measurements across time will provide data to disambiguate these problematic pairs, but will introduce other problems discussed below.
Performance of the software was adequate for real-time recognition, suitable for use in VR applications. However, it was only barely fast enough for use by native ASL speakers, who use the manual alphabet at a rate of about ten letters per second. The software could be optimized somewhat, and a dedicated processor used, to remove some of the lag. There will necessarily be some delay for obtaining data from the DataGlove, though, and the methods for fine-tuning the algorithm (below) will result in extra time being taken to disambiguate similar gestures. It seems likely that this approach will require the native ASL speaker to sign somewhat more slowly than normal for optimum recognition using these methods, given current technology.
Enhanced Methods For Gesture Recognition
The sum of squares method used here works very well for static gestures which have distinguishing features in finger position only. It may be expected that the addition of a positional indicator will improve performance for gestures in which finger positions are similar but the orientation is different, even without requiring across-time measurement. A complication caused by current VR equipment is that the scales on the Polhemus tracker and VPL DataGlove are not the same, so the numbers from one would need to be rescaled to match the range of the other (e.g., the Polhemus tracker can produce numbers ranging from 0-180 degrees, while the DataGlove values as used here range from 0-2 per finger).
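A minimal sketch of such rescaling, assuming a linear mapping from the tracker's 0-180 degree range onto the 0-2 range of the glove values; without it, a one-degree orientation error would swamp the finger-bend terms in the sum of squares.

```c
#include <assert.h>

/* Map a Polhemus orientation angle (0-180 degrees) into the 0-2
 * range of the DataGlove bend values used here, so that orientation
 * and finger bend contribute comparably to the sum of squares. */
double scale_angle(double degrees)
{
    return degrees / 180.0 * 2.0;
}
```

The scaled angles would simply be appended to the 10 glove values, extending each gesture prototype to 13 values (10 bends plus 3 orientation angles) with no change to the sum of squares calculation itself.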
The library of computer functions developed for this research may be incorporated into current VR applications with little difficulty, or the simple sum of squares algorithm may be written from scratch. Enhancements to the sum of squares approach will take two directions. First is the identification of relations among gestures, so that similar gestures may be distinguished. Second is the implementation of the method for gestures which occur across time.
Relations among gestures can help augment the monolithic sum of squares statistic generated by the method described above. The basic approach is to assess a similarity score for each known gesture to the current gesture. In practice, though, some gestures are similar. Additionally, the input gesture may be ill-formed and produce measurements that diverge from the prototype "known" measures. In this case, a second pass at data analysis may focus on the factors that distinguish pairs of known gestures from each other. A simple method would be to create a set of between-gestures difference scores for the two candidate gestures for each matched finger value. Then, each difference score between the candidate gestures and the current gesture would be multiplied by the between-gestures score for that finger value, thus inflating scores which helped to distinguish between that pair of candidate gestures. This would need to be completed for every pair of candidate gestures which fell under the threshold for error -- 3 candidates would require 3 passes, 4 candidates would require 6 passes, etc.
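The second-pass weighting described above can be sketched as follows for one pair of candidates. The weighting scheme (squaring the between-gestures difference so it stays positive) is an assumption of this sketch, not a prescribed formula.

```c
#include <assert.h>

#define NVALS 10

/* Second-pass score for choosing between two candidate gestures:
 * each squared difference between the current gesture and a candidate
 * is multiplied by the (squared) between-gestures difference on that
 * value, inflating exactly the values which help distinguish the pair. */
double weighted_ss(const double cur[NVALS],
                   const double cand[NVALS],
                   const double other[NVALS])
{
    double ss = 0.0;
    int i;
    for (i = 0; i < NVALS; i++) {
        double w = cand[i] - other[i];   /* between-gestures difference */
        double d = cur[i] - cand[i];     /* current-to-candidate difference */
        ss += (w * w) * (d * d);
    }
    return ss;
}
```

The candidate with the smaller weighted score would be selected; values on which the two candidates agree contribute nothing, so an ill-formed input is judged only on the discriminating fingers.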
A more sophisticated analysis, but one which may be more computationally efficient, would be to investigate the most common factors in the set of finger scores. Table 1 shows the outcome of one such analysis, in which similarities among letters of the manual alphabet were assessed through a statistical procedure called principal components analysis. The multidimensional space that emerges from such an analysis rates the most important dimensions in the data (which are akin to factors from another statistical procedure, factor analysis). Then, only the relative placement of known gestures on the most important dimensions needs to be taken into account, rather than all possible combinations as with the method described in the previous paragraph. For a set of only a few dozen gestures, the effort involved in completing the principal components or other analysis (which does not occur in real time) would probably not result in greatly increased performance. However, significant computational time may be saved by eliminating, say, half of all comparisons when hundreds of gestures are involved.
Future work on gesture recognition will necessarily move into far more complicated environments than those described in this work: the use of two hands for signing, individual differences in signing, and the tracking of head and body movements are all necessary components for effective human communication with ASL. The outcome of research on the automatic processing of ASL speech is not clear at this time -- it is possible that statistical methods will eventually prove too computationally intensive, or too prone to error, for use in real-time ASL conversations. The outcome for general human-computer interaction in virtual environments shows more assurance of success. VR applications which have previously employed only a small number of gestures, or have had difficulty incorporating gesture recognition tuned to particular users, can benefit immediately from using the sum of squares or similar statistical approaches.