Ben-Gurion University of the Negev
Faculty of Engineering Sciences
Dept. of Industrial Engineering and Management
An Integrated Project
Multimedia, Machine Vision and Intelligent Automation Systems
Hand Gesture Telerobotic System using Fuzzy Clustering Algorithms
1.1 Description of the Problem
2. Literature Review
2.2 Gesture Recognition
3. System Design and Architecture
3.2 Video Capturing Card
3.3 Web Camera
3.4 Video Imager
4.1 Choosing the Programming Language to Build the Interface
4.2 Developing a Hand Gesture Language
4.3 Developing and Using Algorithms
4.4 Building Interface with Full Capability
4.5 Building the User-Robot Communication Link
4.5.1 Background on Communication Architecture and Protocols
the Camera into the System
5. Testing the System
5.1 Task Definition
5.2 Experimental Results
6.1 Summary of Results
6.2 Future Work
In today’s world the Internet plays an important role in everyone’s life. It provides a convenient way of
receiving information, communicating electronically, finding entertainment and conducting business.
Teleoperation is the direct and continuous human control of a teleoperator [Sheridan, 1992]. Robotics
researchers are using the Internet as a tool to provide feedback for teleoperation. Internet-based
teleoperation will inevitably lead to many useful applications in various sectors of society. To understand
the meaning of teleoperation, the definition of robotics is examined first. Robotics is the science of
designing and using robots. A robot is defined as “a reprogrammable multifunctional manipulator designed to
move materials, parts, tools or specialized devices through variable programmed motions for the performance
of a variety of tasks” [Robot Institute of America, 1979]. Robots can also react to changes in the
environment and take corrective actions to perform their tasks successfully [Burdea and Coiffet, 1994].
Furthermore, all electromechanical systems, such as toy trains, may be classified as robots because they
manipulate themselves in an environment.
One of the difficulties associated with teleoperation is that the human operator is remote from the object
being controlled; therefore the feedback data may be insufficient for correct control decisions. Hence, a
telerobot is described as “a form of teleoperation in which a human operator acts as a supervisor,
intermittently communicating to a computer information about goals, constraints, plans, contingencies,
assumptions, suggestions and orders relative to a limited task, getting back information about
accomplishments, difficulties, concerns, and as requested, raw sensory data while the subordinate robot
executes the task based on information received from the human operator plus its own artificial sensing and
intelligence” [Earnshaw et al., 1994]. A teleoperator can be any machine that extends a person’s sensing or
manipulating capability to a location remote from that person. In situations where it is impossible to be
present at the remote location, a teleoperator can be used instead. These situations may be caused by a
hostile environment such as a minefield or the bottom of the sea, or simply by a distant location. A
teleoperator can replace the presence of a human in hazardous environments. To operate a
technology must emerge to take advantage of complex new robotic capabilities while making such systems
Robots are intelligent machines that provide services for human beings and for machines themselves. They
operate in dynamic and unstructured environments and interact with people who are not necessarily skilled in
communicating with robots [Dario et al., 1996]. A friendly and cooperative interface is thus critical for the
development of service robots [Ejiri, 1996, Kawamura et al., 1996]. Gesture-based interfaces hold the
promise of making human-robot interaction more natural and efficient.
Gesture-based interaction was first proposed by M. W. Krueger as a new form of human-computer
interaction in the mid-seventies [Krueger, 1991], and there has been growing interest in it recently.
As a special case of human-computer interaction, human-robot interaction using hand gestures has a number
of characteristics: the background is complex and dynamic; the lighting conditions are variable; the shape of the
human hand is deformable; the implementation is required to be executed in real time and the system is
expected to be user and device independent [Triesch and Malsburg,
Humans naturally use gestures to communicate. It has been demonstrated that children can learn to
communicate with gestures before they learn to talk. Adults use gestures in many situations: to accompany or
substitute for speech, to communicate with pets, and occasionally to express feelings.
, where the user employs hand gestures to trigger a desired manipulation action, are the
most widely used type of direct manipulation interface, and provide an intuitive mapping of gestures to
actions [Pierce et al., 1997, Fels and Hinton, 1995]. For instance, forming a fist while intersecting the hand
with a virtual object might execute a grab action.
Gesture recognition is human interaction with a computer in which human gestures, usually hand motions,
are recognized by the computer. The potential and mutual benefits that
offer each other is great. Robotics is beneficial to
in general by providing haptic
interfaces and human factors know-how. Space limitations precluded other interesting application areas, such
as in medical robotics and in microrobotics.
is a younger technology than
, it will take some time until its benefits are
recognized, and until some
existing technical limitations are solved. Full implementation in
manufacturing and other areas will require more powerful computers than presently exist, faster
communication links and better modeling.
Once better technology is available, usability, ergonomics and other human-factors studies need to be done
in order to gauge the effectiveness of such systems.
The fundamental research objective is to advance the state of the art of Telerobotic Hand Gesture
systems. As for all technologies, but more importantly for a much emphasized and complex technology such
, it is important to choose appropriate applications with well-defined
objectives. It is also important to compare the abilities
with competing technologies
for reaching those objectives. This ensures that the solution can be integrated with standard business. The
main objective of this project is to design and implement a telerobotic system using hand gesture
control, with which an operator may control a real remote robot performing a task. The operator
can use a unique Hand Gesture Language to place the robot as desired.
Specific objectives are to:
Develop and evaluate a
Test and validate new strategies and algorithms for image recognition.
Demonstrate the use of the
in a telerobotic system.
2. Literature Review
A telerobot is defined as a robot controlled at a distance by a human operator [Durlach and Mavor, 1994].
Sheridan [1992] makes a finer distinction, which depends on whether all robot movements are continuously
controlled by the operator (manually controlled teleoperator), or whether the robot has partial autonomy
(telerobot and supervisory control). By this definition, the human interface to a telerobot is distinct and not
part of the telerobot. Telerobotic devices are typically developed for situations or environments that are too
dangerous, uncomfortable, limiting, repetitive, or costly for humans to perform. Some applications are
listed in [Sheridan, 1992], such as:
inspection, maintenance, construction, mining, exploration, search
and recovery, science, surveying;
assembly, maintenance, exploration,
forestry, farming, mining, power line maintenance;
Process control plants
chemical, etc., involving operation, maintenance, emergency;
operations in the air, undersea, and
patient transport, disability aids, surgery [Sorid and Moore, 2000], monitoring, remote
earth moving, building construction, building and structure inspection, cleaning and
protection and security,
fighting, police work, bomb disposal.
Telerobots may be remotely controlled manipulators or vehicles. The distinction between robots and
telerobots is fuzzy and a matter of degree. Although the hardware is the same or similar, robots require less
human involvement for instruction and guidance than telerobots. There is a continuum of human
involvement, from direct control of every aspect of motion, to shared or traded control, to nearly complete
robot autonomy. Yet robots perform poorly when adaptation and intelligence are required. They do
not match the human sensory abilities of vision, audition, and touch, human motor abilities in manipulation
and locomotion, or even the human physical body in terms of a compact and powerful musc
source. Hence, in recent years many robotics researchers have turned to telerobotics. Nevertheless, the long
term goal of robotics is to produce highly autonomous systems that overcome difficult problems in design,
control, and planning. By observing what is required for successful human control of a telerobot, one may
infer what is needed for autonomous control. Furthermore, the human represents a complex mechanical and
dynamic system that must be considered. More generally, telerobots are representative of man-machine
systems that must have sufficient sensory and reactive capability to successfully translate and interact within
their environment. In the future, educators and experimental scientists will be able to work with remote
of taskable machines via a “remote science” paradigm [Cao et al., 1995] that allows: (a) multiple
users in different locations to share collaboratively a single physical resource [Tou et al., 1994], and (b)
enhanced productivity through reduced travel time, enabling one experimenter to participate in multiple,
geographically distributed experiments.
The complexity of kinematics must be hidden from the user while still allowing a flexible operating
envelope. Where possible, extraneous complications should be filtered out. Interface design has a significant
effect on the way people operate the robot. This is borne out by differences in operator habits between the
various operator interfaces. This is consistent with interface design theory, where there are some general
principles that should be followed, but good interface design is largely an iterative process [Preece et al.,
1994]. With the introduction of web computer languages (e.g., Java) there is a temptation to move towards
continuously updating robot information (both images and positions). Where possible, the data transmitted
should be kept to a minimum: low-bandwidth but relevant information is much more useful than high-bandwidth
irrelevant data. Graphical models of the manipulator and scene are known to improve the performance of
telerobotic systems. Browse and Little [1991] found a 57% reduction in the error rate of operators predicting
whether a collision would occur between the manipulator and a block. Future developments will extend this
ability to view moves via a simulated model and to plan and simulate moves before submission.
Telerobotics has also been implemented in a client-server architecture; e.g., one of the first successful World
Wide Web (WWW) based robotic projects was the Mercury project [Goldberg et al., 1995]. This later
evolved into the Telegarden project [Goldberg et al., 1995], which used a similar system of a SCARA
manipulator to uncover objects buried within a defined workspace. Users were able to control the position of
the robot arm and view the scene as a series of periodically updated static images. The University of Western
Australia's Telerobot experiment [Taylor and Dalton, 1997, Taylor and Trevelyan, 1995] provides Internet
control of an industrial ASEA IRB-6 robot arm through the WWW. Users are required to manipulate and
stack wooden blocks and, like the Mercury and Telegarden projects, the view of the work cell is limited to a
sequence of static images captured by cameras located around the workspace. On-line access to mobile
robotics and active vision hardware has also been made available in the form of the Netrolab project [McKee,
Problems with static pictures can be avoided by using video technology, which is becoming more and
more popular in the Internet domain. Video is one of the most expressive multimedia applications, and
provides a natural way of presenting information, which results in a stronger impact than images, static text
and figures [Burger, 1993]. Because transfer of video demands high bandwidth capacity, different methods of
video transmission are used. Lately, streaming video [Ioannides and Stemple, 1998] applications such as
RealVideo, QuickTime Movie and Microsoft Media Player have become common on the Internet. These applications
provide the ability to transmit video recordings over connections with small bandwidths such as telephone
The previously mentioned projects rely on cameras to locate and distribute the robot position and current
environment to the user via the WWW. It is clear that such an approach needs a high-speed network for
on-line control of the robot arm. Data transmission times across the World Wide Web depend heavily
on the transient loading of the network, making direct teleoperation (the use of cameras to obtain robot
position feedback) unsuitable for time-critical interactions. Rather than allowing the users to interact with the
laboratory resources directly, as in many examples, users are required to configure the experiments using a
simulated representation (a virtual robot arm and its environment) of the real-world apparatus. This
configuration data is then downloaded to the real work-cell for verification and execution on the real device,
before the results are returned to the user once the experiment is complete.
2.2 Gesture Recognition
Ever since the “Put-That-There” system, which combined speech and pointing, was demonstrated [Bolt, 1980],
there have been efforts to enable human gesture input. Many of them investigated the visual pattern
interpretation problem. Most of them, however, dealt with the sign language recognition problem [Starner and
Pentland, 1995, Waldron and Kim, 1995, Vogler and Metaxas, 1998]. As one of a few notable exceptions, an
overall framework for recognizing natural gestures is discussed by Wexelblat [Wexelblat, 1995]. His method
finds low-level features, which are fed into a path analysis and temporal integration step. There was, however,
no discussion of a concrete model for the analysis or of the temporal integration method for the features. When the
gestures are complex, that is, continuous and with multiple strokes, constructing a good analyzer is itself a
big problem. Also, the low-level analysis cannot be fully separated from the high-level model or context. On the
other hand, gestures were used to support computer-aided presentations [Baudel, 1993]. The interaction
model was defined by a set of simple rules and predetermined guidelines.
Any serious attempt to interpret hand gestures should begin with an understanding of natural human
gesticulation. The physiology of hand movement from the point of view of computer modeling was examined by
Lee and Kunii [Lee and Kunii, 1995]. They analyzed the range of motion of individual joints and other
characteristics of the physiology of hands. Their goal was to build an accurate internal model whose
parameters could be set from imagery.
Francis Quek of the University of Michigan has studied natural hand gestures in order to gain insights that
may be useful in gesture recognition [Quek, 1993]. For example, he makes the distinction between
obvious (transparent) gestures that need little or no pre-agreed interpretation, and iconic (opaque) gestures.
He observes that most intentional gesticulation generally is, or soon becomes, iconic. Other classifications
are whether a gesture indicates a spatial concept or not, and whether it is intentional or unconscious. He
indicates that humans typically use whole hand or digit motion, but seldom both at once. He suggests that if a
hand moves from a location, gesticulates, then returns to the original position, the gesticulation is likely to
be a gesture rather than an incidental movement of the hand.
Spontaneous gesticulation that accompanies speech was studied by McNeill [McNeill, 1992]. He discusses
how it interacts with speech, and what it can tell us about language. A basic premise is that gestures
complement speech, in that they capture the holistic and imagistic aspects of language that are conveyed
poorly by words alone. Most applicable here, he cites work indicating that gestures have three phases:
preparation, where the hand rises from its resting position and forms the shape that will be used; stroke,
where the meaning is conveyed; and retraction, where the hand returns to its resting position. Cassell [Cassell
et al., 1994] uses these types of observations about spontaneous gesture to drive the gestures of animated
characters in order to test specific theories about gesticulation and communication. She has also been a part
of gesture recognition projects at the MIT Media Lab [Wilson et al., 1996].
Until a few years ago, nearly all work in this area used mechanical devices to sense the shape of the hand.
The most common device was a thin glove mounted with bending sensors of various kinds, such as the
DataGlove. Often a magnetic field sensor was used to detect hand position and orientation. This technique is
known for having problems in the electromagnetically noisy environment of most computer labs.
Researchers reported successful data glove type systems in areas such as CAD (Comput
Manufacturing) design, sign language recognition, and a range of other areas [Sturman and Zeltzer, 1994].
However, in spite of the ready availability of these sensors, interest in glove-based interaction faded. More
recently, the increasing computer power available and advances in computer vision have come together with
the interest in virtual reality and human-centric computing to spark a strong interest in visual recognition of
hand gestures. The interest seems to come from many sectors. The difficulty of the problems involved makes
it a good application domain for vision researchers interested in recognition of complex objects [Cui and
Weng, 1995, Kervrann and Heitz, 1995]. The freedom from the need for interface devices has attracted the
interest of those working on virtual environments [Darrell and Pentland, 1995]. The potential for practical
applications is sparking work on various aspects of interface design [Freeman and Weissman, 1995, Crowley et al., 1995].
A stereo vision and optical flow system for modeling and recognizing people's gestures was developed in
[Huber and Kortenkamp, 1995]. The system is capable of recognizing up to six distinct pose gestures, such as
pointing or hand signals, and then interpreting these gestures within the context of intellig
Gestures are recognized by modeling the person's head, shoulders, elbows, and hands as a set of proximity
spaces. Each proximity space is a small region in the scene measuring stereo disparity and motion. The
proximity spaces are connected with joints/links that constrain their position relative to each other. A robot
recognizes these pose gestures by examining the angles between links that connect the proximity spaces.
Confidence in a gesture is built up logarithmically over time as the angles stay within limits for the gesture.
This stereo vision and optical flow system runs completely onboard the robot and has been integrated into a
Reactive Action Package (RAP). The RAP system takes into consideration the robot's task, its current
and the state of the world. When these considerations are taken care of, the skills that should run to
accomplish the tasks are selected.
The Perseus system [Kahn et al., 1996] addresses the task of recognizing objects that people point at. The
system uses a variety of techniques, called feature maps (such as intensity feature maps, edge feature maps,
motion feature maps, etc.). The objective of these maps is to solve this visual problem in non
worlds reliably. Like the RAP reactive execution system, Perseus provides interfaces for symbolic higher
3. System Design and Architecture
Figure 1 describes the system architecture.
Figure 1. The System Architecture
Figure 2 describes the system flow diagram for performing remote tasks.
Figure 2. System Flow Chart
The behavior of the system is described in eight major steps:
1. A visual gesture recognition language was developed.
2. Robot task execution, guided by a sequence of human hand gestures interpreted through the proposed visual
gesture recognition, was implemented.
3. A training process was implemented in which several hundred hand pose images, covering all possible
system hand gestures, were inserted into a database. Every picture was given an identification number and
other features, such as height and width. At this stage, a feature vector containing 13 parameters was built
for each picture. These vectors contain the height / width (aspect ratio) and the gray scale levels of
sub-blocks.
4. A Fuzzy C-Means clustering process was implemented. During this process, membership functions
were built for each gesture image. These membership functions are relative to the number of signs in
the language. For each gesture image, a membership vector was built.
5. A performance index (cost function) was built. This function estimates how close to optimal the
system is with respect to the way the clusters are built.
6. The user controlled a robot in real time by hand gestures. In this step a real time image processing
process was used.
7. The user got visual feedback from the remote robotic scene.
8. The user performed a task in which a yellow wooden box, placed on a plastic cup structure, was removed
using hand gestures and visual feedback only.
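The 13-parameter feature vector built during the training stage can be sketched as follows. The text states only that each vector holds the height / width aspect ratio and the gray scale levels of sub-blocks; the split into one aspect ratio plus the mean gray level of 12 sub-blocks on a 4 x 3 grid is an assumption made here for illustration, as is the function name `feature_vector`. The project code itself was written in C/C++; Python is used only to keep the sketch short.

```python
def feature_vector(image, rows=4, cols=3):
    """Build [aspect_ratio, mean gray level of each sub-block] for one image.

    `image` is a list of rows, each row a list of gray levels (0-255).
    The 4 x 3 grid (12 blocks + 1 aspect ratio = 13 values) is an assumption.
    """
    height = len(image)
    width = len(image[0])
    if height < rows or width < cols:
        raise ValueError("image is smaller than the sub-block grid")

    # First parameter: height / width, as described in the text.
    features = [height / width]

    # Remaining parameters: mean gray level of each sub-block, row by row.
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            block = [image[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            features.append(sum(block) / len(block))
    return features
```

Calling `feature_vector` on every image in the training database would yield one 13-element vector per image, which is the form of input a clustering stage expects.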
3.1 A255 Robot
Industrial robots are software programmable, general-purpose manipulators that typically perform highly
repetitive tasks in a predictable manner. The function of an industrial robot is determined by the
configuration of the arm, the tool and the motion control hardware and software.
Pick and place, assembly, spot or continuous bead welding, and coating or adhesive application are a few
of the tasks commonly performed by robots.
Industrial robots come in a wide variety of sizes and configurations, and many are configured to perform
specialized functions. Welding and coating application robots are examples of specialty robots; they have
controls and program codes that are specific to these operations. In addition, some robot configurations are
better suited for assembly than for other general-purpose operations, although they could be programmed to
Figure 3. CRS Robotics Model A255 5-Axis Articulated, Human Scaled Robot.
The general-purpose robot used, the A255, is an articulated arm configuration manufactured by CRS
Robotics. The A255 is termed a human scaled robot because the physical dimensions of the robot arm are
very similar to those of a human being. Some key features of the A255 are: 5 degrees of freedom; 2 kilogram
maximum payload; 1.6 second cycle time; 0.05 mm repeatability; and a 560 mm reach. Figure 3 shows the
A255 and controller as depicted on the CRS Robotics Web page.
The A255 robot model is unmatched in its class of small articulated robots for high performance, reliability, and
With five degrees of freedom, the A255 Robot performs much like a human arm. In fact, the A255 Robot can
handle many tasks done by humans. Like the entire family of robots, the A255 is working throughout the
world in industrial applications such as product testing, material handling, machine loading and assembly.
The A255 also has many uses in robotics research, educational programs, and the rapidly expanding field of
The A255 Robot is programmed using the RAPL-3 programming language. The English-like syntax of RAPL-3
is easy to learn, easy to use and yet powerful enough to handle the most complex of tasks. Programming
features include continuous path, joint interpolation, point-to-point relative motions, straight-line motion,
plus an on-line path planner to blend commanded motions in joint or straight-line
The A255 Robot is designed to work with a wide variety of peripherals including rotary tables, bowl feeders,
conveyors, laboratory instruments, host computers and other advanced sensors. For maximum process
flexibility, the A255 robot can be interfaced with third-party machine vision systems.
3.2 Video Capturing Card
The M-Vision 1000 (MV-1000) is a monochrome video digitizer board, which can be plugged into the PCI
(Peripheral Component Interconnect) bus. The MV-1000 product line includes a base board with
camera support, a Memory Expansion module (MV-1200), and a group of specialized acquisition
plug-in modules. In this latter group the Digital Camera Interface module (MV-1100), RGB Color Module
(MV-1300), and NTSC/PAL Y/C Color Module (MV-1350) are available.
In this project, the MV-1300 acquisition module is used. It plugs onto the MV-1000 PCI Bus Video Digitizer
and can connect to two sets of RGB inputs or two monochrome video inputs. Real time display on the VGA
card is possible. The add-on memory module MV-1200 is required to store a full frame of RGB color.
The MV-1000 Video Capture Board (Figure 4) digitizes standard or non-standard analog camera video into 8
bits per pixel at rates up to 40 million samples per second. The digitized video is stored in on-board memory.
The 1 Mbyte board memory can be expanded to 4 Mbytes with the optional MV-1200 Memory Expansion module.
Figure 4. The MV-1000 Video Capture Board
The MuTech M-Vision 1000 frame grabber is a full sized PCI bus circuit board. The PCI bus has a number
of distinct architectural advantages that benefit video image capture when working at high frame rates, high
spatial resolution, or high color resolution.
The PCI bus can transfer data at rates up to 130 Mbytes per second and can run at 33 MHz without
contending with the system processor or other high-speed peripherals, such as SCSI disk controllers. It is also
truly platform independent, and is not rigidly associated with a "PC" containing an Intel processor. Digital
Equipment Corporation offers an “Alpha” processor based workstation with a PCI bus, and Apple offers
the PCI bus on a PowerPC based workstation.
On a suitably configured Pentium processor “PC”, the MuTech M-Vision 1000 is capable of transferring data
to PC or VGA memory at up to 55 Mbytes per second.
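The data rates quoted for the frame grabber and the PCI bus can be checked against each other with a few lines of arithmetic. This is only a sketch: the 8 bits per pixel, 40 million samples per second, 33 MHz and 55 Mbytes per second figures come from the text, while the 32-bit width of the PCI data path is an assumption that the text does not state.

```python
def mbytes_per_second(transfers_per_second, bits_per_transfer):
    """Convert a transfer rate and word size to Mbytes (10**6 bytes) per second."""
    return transfers_per_second * bits_per_transfer / 8 / 1e6

# Digitizer output: 8 bits per pixel at up to 40 million samples per second (from the text).
digitizer_rate = mbytes_per_second(40e6, 8)

# Theoretical PCI peak: 33 MHz clock with an assumed 32-bit data path (not stated in the text).
pci_peak_rate = mbytes_per_second(33e6, 32)

# Transfer rate quoted for a suitably configured Pentium PC (from the text).
quoted_transfer_rate = 55.0

print(f"digitizer peak: {digitizer_rate:.0f} Mbytes/s")
print(f"PCI peak:       {pci_peak_rate:.0f} Mbytes/s")
print("digitizer fits within quoted transfer rate:",
      digitizer_rate <= quoted_transfer_rate)
```

Under these assumptions the peak digitizing rate of 40 Mbytes per second stays below both the quoted 55 Mbytes per second transfer rate and the theoretical bus peak, which is consistent with the roughly 130 Mbytes per second figure given above.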
The MV-1000 Software Development Kit (SDK) enables developers to program the MV-1000 board and create
various applications using the C language. A series of MV-1000 SDKs have been designed and implemented
to work under each of the MS-DOS, Windows 95/98/NT and OS/2 operating environments.
The MV-1000 SDK helps programmers to easily set up the frame grabber to work with various types of
cameras. It provides a set of built-in configurations for standard or widely used cameras and a group of more
than 30 camera configuration files.
The SDK provides a set of standard or widely used camera configurations. A rich set of functional
application programming interfaces is provided to accomplish this goal. It also provides predefined
parameters to set up the MV-1000 so that it will work with the various types of cameras, such as: RS
CCIR and camera configuration files for many Kodak, Pulnix, Dalsa, Dage
, EG&G and DVI.
The SDK includes a group of fine tuned image utility routines. Application developers can use these routines
to save grabbed image frames in various image file formats, such as TIFF, TARGA, JPEG, BMP, and/or to
display image files.
3.3 Web Camera
The 3Com “HomeConnect” PC digital camera connects over USB (Universal Serial Bus) and provides video
snapshots, video email and videophone calls over networks. It adjusts automatically from bright to dim light
and detaches for easy mobility. The camera appears in Figure 5:
Figure 5. 3Com HomeConnect PC Digital Camera
3.4 Video Imager
This Panasonic video imager (Figure 6) is no longer available on the market. Any other quality
analog camera can replace it.
Figure 6. Video Imager
The development steps are as follows:
Obtaining the robot details: Information is gathered regarding the interfaces, both hardware and software,
in the robot controller that could allow it to exchange data with the external environment.
Determining interface techniques and methods of data exchange: Once the technique of tapping data
from and sending data to the robot controller is determined, it is then possible to decide the methods
to be used in exchanging data with the robot controller. Commands are constantly being sent to the
robot controller to update the robotic variables (e.g., joint positions, gripper state, etc.). This
information is in turn accessed over the network. The method of accessing the robotic variables is
determined. This may mean choosing the programming language that allows the model to send and
receive data using the same protocol. Under the Windows platform, common protocols used for
such data exchanges include: TCP/IP (Transmission Control Protocol/Internet Protocol), DDE
(Dynamic Data Exchange), OLE (Object Linking and Embedding) and ActiveX.
Choosing the programming language to build the interface: The language chosen must suit the robotic
application and should be able to use the data exchange protocol to communicate with the robot
as well as the
Building the gesture model and verifying the model behavior: Having chosen the programming
language that will allow exchange of data into and out of the model, the next step is to build and
verify against the physic
channel interface: Having constructed the model, code is
then inserted in the model to enable handshaking and data transfer to and from the interface program.
Code is also inserted into the program to perform a similar function. It is not necessary at this stage to
implement all the data variables; a few are enough to prototype the interface.
channel interface: The interface is tested until it is error free.
Building the channel-robot interface: Commands are sent to the robot controller to enable this
exchange of data. Code is inserted into the program to do the same.
Testing the channel-robot interface: The interface is tested again until it is error free.
Building interfaces with full capability: Once the methodology of data transfer is stabilized, tapping
of all necessary data is implemented.
Developing and using algorithms: scheduling, collision avoidance and algorithms for path planning.
Testing and improving those algorithms.
Full integration testing.
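The exchange of robotic variables over TCP/IP mentioned in the steps above can be illustrated with a minimal sketch. Everything about the wire format is hypothetical: the text does not describe the messages sent to the robot controller, so the newline-terminated ASCII line `NAME VALUE`, the variable name `JOINT1` and the helper names below are assumptions chosen only to show a prototype carrying one variable. Only standard `socket` calls are used.

```python
import socket


def encode_command(name, value):
    """Format one variable update as an ASCII line (hypothetical wire format)."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("variable name must be a single non-empty token")
    return f"{name} {value:.1f}\n".encode("ascii")


def decode_command(line):
    """Parse a line produced by encode_command back into (name, value)."""
    name, value = line.decode("ascii").split()
    return name, float(value)


def send_command(sock, name, value):
    """Send one variable update over a connected socket."""
    sock.sendall(encode_command(name, value))


def recv_command(sock):
    """Read bytes up to the terminating newline and decode one update."""
    buffer = bytearray()
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed before a full command arrived")
        buffer += chunk
    return decode_command(bytes(buffer))
```

A prototype of this kind can be exercised without a robot by connecting the two ends locally, for example with `a, b = socket.socketpair()`, then `send_command(a, "JOINT1", 12.5)` on one end and `recv_command(b)` on the other. This mirrors the advice above to implement only a few data variables while the interface is being prototyped.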
4.1 Choosing the Programming Language to Build the Interface
In general, the work in this project involves heavy use of numerical modeling with vast numbers of
image pixels, so any language chosen should support this sort of processing. Much of the
work also involves the production of image displays in real time. Any solution chosen must provide a way to
However, while C/C++ costs roughly ten times the number of lines of code, the behavior of that code is more
directly under the control of the programmer, leading to fewer surprises in the field. The ultimate cost of a
body of code must be the sum of the creation time and the future maintenance time. Since scientific software
constantly changes, the best hope is that the language itself does not force maintenance activities in addition
to the evolutionary activities.
Finally, in light of the need to be more productive in non-numerical aspects, one would hope for rich and
well-supported data structures in the language. A programming language such as Fortran offers little beyond
numeric arrays. On the other hand, C/C++ can produce essentially anything, as long as the programmer is
prepared to pay the cost of custom development. This situation is improving slightly now in light of the
C++ Standard Template Library. But C/C++ always costs heavily during development and debugging.
As a bonus, if the language promotes the development of generic solutions, these can later be reused at
little cost, thereby allowing their development costs to be amortized over future programs. A well-designed
C++ routine ought to be able to pay for itself in this way, but the language does not promote solid generic
design. Thus, it is harder than one might think to develop fully reusable code in C++.
In this project, all the code is written in C or C++ with the help of the Intel Image Processing Library, which
makes real-time image processing feasible.
The Intel Image Processing Library focuses on taking advantage of the parallelism of the SIMD (single
instruction, multiple data) instructions of the latest generations of Intel processors. These instructions greatly
improve the performance of computation-intensive image processing functions. Most functions in the Image
Processing Library are specially optimized for the latest generations of processors.
The Image Processing Library runs on personal computers that are based on Intel architecture processors and
Microsoft Windows 95, 98, 2000 or Windows NT operating systems. The library integrates into the
customer’s application or library written in C or C++.
4.2 Developing and using the Gesture Recognition Algorithm
In the proposed methodology we suggest the use of fuzzy C-means [Bezdek, 1974] for clustering.
Fuzzy C-means (FCM) clustering is an unsupervised clustering technique and is often used for the
unsupervised segmentation of multivariate images. The segmentation of the image into meaningful regions
with FCM is based on spectral information only. The geometrical relationship between neighboring pixels is
not used [Noordam et al., 2000].
The use of fuzzy C-means clustering to segment a multivariate image into meaningful regions has been
reported in the literature [Park et al., 1998; Stitt et al., 2001; Noordam et al., 2000]. When FCM is applied as a
segmentation technique in image processing, the relationship between pixels in the spatial domain is
completely ignored. The partitioning of the measurement space depends on the spectral information only.
Although multivariate imaging makes it possible to differentiate between objects of similar spectra but
different spatial correlations, plain FCM cannot exploit this property. Adding spatial information during the
spectral clustering has advantages over a spectral segmentation procedure followed by a spatial filter, as the
spatial filter cannot always correct segmentation errors. Furthermore, when two overlapping clusters in the
spectral domain correspond to two different objects in the spatial domain, the use of a priori spatial
information can improve the separation of these two overlapping clusters.
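The fuzzy c-means clustering algorithm is described mathematically as follows. Given a set of n data
patterns $X = \{x_1, \ldots, x_k, \ldots, x_n\}$, the FCM algorithm minimizes the weighted within-group sum of
squared error objective function $J(U, V)$:

\[ J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} \, d^{2}(x_k, v_i) \]

where $x_k$ is the k-th p-dimensional data vector, $v_i$ is the prototype of the center of cluster i,
$u_{ik}$ is the degree of membership of $x_k$ in the i-th cluster, $m$ is a weighting exponent on each fuzzy
membership, $d(x_k, v_i)$ is a distance measure between object $x_k$ and cluster center $v_i$, $n$ is the
number of objects and $c$ is the number of clusters. A solution of the objective function $J(U, V)$ can be
obtained via an iterative process where the degrees of membership $u_{ik}$ and the cluster centers $v_i$
are updated via:

\[ u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d(x_k, v_i)}{d(x_k, v_j)} \right)^{2/(m-1)}}, \qquad
   v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} \, x_k}{\sum_{k=1}^{n} u_{ik}^{m}} \]

with the constraints:

\[ u_{ik} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ik} = 1 \;\; \forall k, \qquad 0 < \sum_{k=1}^{n} u_{ik} < n \;\; \forall i \]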
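The iterative process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the project itself was written in C/C++, and the function name `fcm`, its parameters, the Euclidean distance and the random initialisation are choices made here, not details taken from the project.

```python
import random


def fcm(data, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Fuzzy C-means on a list of equal-length numeric vectors.

    Returns (centers, u), where u[i][k] is the membership of sample k
    in cluster i. Euclidean distance is assumed.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    # Random initial memberships, normalised so that each sample's
    # memberships sum to 1 (the second constraint).
    u = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        s = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= s
    centers = []
    for _ in range(iters):
        # Center update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        centers = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            centers.append([sum(w[k] * data[k][j] for k in range(n)) / total
                            for j in range(p)])
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))
        change = 0.0
        for k in range(n):
            d = [sum((a - b) ** 2 for a, b in zip(data[k], v)) ** 0.5
                 for v in centers]
            if min(d) == 0.0:
                # The sample sits exactly on one or more centers:
                # share its membership among those centers.
                zeros = [i for i in range(c) if d[i] == 0.0]
                new = [1.0 / len(zeros) if i in zeros else 0.0
                       for i in range(c)]
            else:
                exponent = 2.0 / (m - 1.0)
                new = [1.0 / sum((d[i] / d[j]) ** exponent for j in range(c))
                       for i in range(c)]
            for i in range(c):
                change = max(change, abs(new[i] - u[i][k]))
                u[i][k] = new[i]
        if change < tol:  # memberships have stopped moving
            break
    return centers, u
```

In a gesture setting, `data` would hold one feature vector per training frame and `c` would be the number of gestures; each column of `u` is then the membership vector of one frame.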
In the proposed methodology, a classifier is first generated by partitioning the data with a specific fuzzy
C-means algorithm. Once the clusters have been identified through training, they are labeled. Labeling
means assigning a linguistic description to each cluster. In our case, the linguistic description is the name of
the gesture.
During the training process several hundred hand gesture images are inserted into a database, covering
all possible hand gestures of the system. At least 25 frames of each of the 12 hand gestures in the
language are taken. Every picture gets an identification number and a feature vector, which
is also inserted into the database. The feature vector built for each picture contains
13 parameters: the height/width ratio and 12 sub-block gray scale values. The sub-block
gray scale values are obtained by dividing the image into 3 rows and 4 columns and taking the
norm of each cell in the image.
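As a rough sketch of how such a 13-element feature vector might be computed, the Python function below takes a gray scale image given as a list of pixel rows. Several points are assumptions made for illustration: the function name `feature_vector`, the reading of "height / width" as a ratio, the use of a root-mean-square value as the "norm" of a cell (the report does not say which norm was used), and an input that is already cropped to the hand and is at least 3 pixels high and 4 pixels wide.

```python
def feature_vector(image, rows=3, cols=4):
    """Return [height/width, block_0, ..., block_11] for a gray scale image.

    image: list of pixel rows, each a list of gray values (assumed to be
    the cropped hand region, at least rows x cols pixels in size).
    """
    height, width = len(image), len(image[0])
    features = [height / width]
    for r in range(rows):
        r0, r1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            c0, c1 = c * width // cols, (c + 1) * width // cols
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            # Root-mean-square gray value of the cell (assumed norm).
            features.append((sum(v * v for v in block) / len(block)) ** 0.5)
    return features
```

Dividing by the pixel count keeps the block values comparable between frames of different sizes; an unscaled Euclidean norm would grow with the size of the hand in the image.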
The selection of 12 gray scale values for each frame was found empirically to provide enough discriminatory
power for individual gesture classification. Additional experimentation on the optimal block size is left for
future work.
During the fuzzy C-means clustering process, membership values are determined for each gesture type. The
number of membership values equals the number of gestures in the language. For each picture, a membership
vector is built. These membership values are inserted into a database in matrix form. The matrix
columns are the gesture signs (1 to 12), and the rows are the pictures taken (1 to a
few hundred). Each cell
in the matrix gets a numeric value (0 to 1000) that represents how close a picture is to a gesture. 1000 means
that there is a high membership relation, and 0 means that there is no relation at all.
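A minimal sketch of this matrix form is shown below, assuming memberships in the range 0 to 1 are scaled to integers in the range 0 to 1000. The helper names and the three gesture names are hypothetical and stand in for the 12 gestures of the language.

```python
GESTURES = ["left", "right", "stop"]  # hypothetical subset of the 12 gestures


def to_database_row(memberships, scale=1000):
    """Scale fuzzy memberships in [0, 1] to integers in [0, scale]."""
    return [round(u * scale) for u in memberships]


def best_gesture(row, names):
    """Return (name, value) of the column with the highest membership."""
    index = max(range(len(row)), key=lambda i: row[i])
    return names[index], row[index]


# One row per picture, one column per gesture.
matrix = [to_database_row(m) for m in ([0.05, 0.90, 0.05],
                                       [0.70, 0.10, 0.20])]
```

Here `best_gesture(matrix[0], GESTURES)` picks the "right" column for the first picture, since that column holds the largest value in its row.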
In this stage, a performance index (cost function)
is built. This function is built for estimating how optimal
the system is, from the aspect of building the clusters.
Figures 7 and 8 describe the training process.
Figure 7. A Successful Recognition
Figure 8. Inserting Two Training Gestures
4.3 The Hand Gesture Language
Hand gestures are a form of communication among people [Huang and Pavlović, 1995]. The use of hand
gestures in the field of human-computer interaction has attracted new interest in the past several years.
Computer input devices such as keyboards, mice and joysticks may come to be seen as “stone-age”
devices. Only in the last several years has there been increased interest in trying to introduce other
means of human-to-computer interaction.
Every gesture is a physical expression of a mental concept [Thieffry, 1981]. A gesture is motivated by an
intention to perform a certain task: indication, rejection, grasping, drawing a flower or something as simple
as scratching one’s head [Huang and Pavlović, 1995].
accomplished tasks guided by a sequence of human hand gestures through the proposed visual
gesture recognition language can be seen in figure 9.
Figure 9. The A255 arm has five axes of motion (joints): 1 (waist), 2 (shoulder), 3 (elbow), 4 (wrist pitch) and 5 (tool roll)
A visual gesture recognition language was developed for this robot (figure 10).
Figure 10. Visual Gesture Recognition Language
hand gestures control the X
hand gestures control the Y, and the
hand gestures control the Z
axis of the A255 robot arm. The
hand gestures control
joint 5, the
and Close Grip
gestures control the robot gripper. The
hand gesture quits any
action the robot performs. The
hand gesture resets and calibrates the robot joints automatically.
4.4 Building Gesture Control Interface
A gesture control interface was built (figure 11) for controlling a remote robot. The upper left corner shows
the real-time hand gesture picture. The upper right corner shows the segmented hand gesture. The
lower left corner shows the selection of the 12 gray scale values. Below the segmented hand gesture picture
there are thumbnails of the 12 different hand gestures and a bar graph that shows whether a specific gesture is
recognized. The levels of these bars are proportional to the membership level of the current gesture in each of the
gesture classes. Below that there is live video feedback from the remote robot site.
Figure 11. Gesture Control Interface
To control a robot, two control modes are available:
a) Continuous movement commands: If a gesture command is given (e.g. up), the robot moves
continuously until a stop command is given. Every gesture command must follow, and be followed by, a stop
command.
b) Incremental movement commands: A gesture command received by the robot at
time t is carried out over the interval Δt. Then, at time t + Δt, the robot is ready for a new command. For
example, if the left gesture is given, the robot moves during time Δt; if the same left gesture is held, it
moves again for time Δt; if the gesture is then changed to right, it moves right for time Δt.
This way one can easily position the robot over an object by jogging it left, right, left, etc., until the correct
position is observed. In this method, it is not necessary to go through the stop gesture every time a gesture is
changed. In this project, this mode was chosen for controlling a remote robot.
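The incremental mode can be pictured with a short simulation. This is a sketch under assumptions, not the robot controller: the gesture names, the `STEP` table, the two-dimensional position and the fixed step size per interval are all hypothetical.

```python
# Direction of travel for each gesture (hypothetical names and axes).
STEP = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


def run_incremental(gestures, step_size=1.0, start=(0.0, 0.0)):
    """Apply one gesture sample per interval and return the visited positions.

    Each entry of `gestures` is the gesture seen at the start of one
    interval; the robot moves for that interval and is then ready for
    the next command. An unknown or missing gesture leaves it in place.
    """
    x, y = start
    path = [(x, y)]
    for gesture in gestures:
        dx, dy = STEP.get(gesture, (0, 0))
        x += dx * step_size
        y += dy * step_size
        path.append((x, y))
    return path
```

Holding the left gesture for two intervals and then switching to right moves the simulated robot two steps left and one step back, with no stop gesture in between.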
4.5 The User-Robot Communication Link
4.5.1 Background on Communication Architecture and Protocols
The Internet evolved from ARPANET, the U.S. Department of Defense’s network created in the late 1960s.
ARPANET was designed as a network of computers that communicated via a standard protocol, a set of rules
that govern communications between computers. While the original host-to-host protocol limited the
potential size of the original network, the development of TCP/IP (Transmission Control
Protocol/Internet Protocol) enabled the interconnection of a virtually unlimited number of computers.
Every host on the Internet network has a unique Internet Protocol (IP) address and a unique Internet
For the robot server, the Internet address is 126.96.36.199; for the hand gesture client computer it
is 188.8.131.52; and for the web camera server it is 184.108.40.206 (figure 12):
Figure 12. User-Robot Communication Architecture
TCP is responsible for making sure that the commands get through to the other end. It keeps track of what is
sent, and retransmits anything that did not get through. If any message is too large for one datagram, e.g. when the
hand gesture rate is too high, TCP splits it up into several datagrams and makes sure that they all arrive
correctly. Since these functions are needed by many applications, they are put together into a separate
protocol. TCP can be considered as forming a library of routines that applications use when they need reliable
network communications with another computer. Similarly, TCP calls on the services of IP. Although the
services that TCP supplies are needed by many applications, there are still some kinds of applications that do
not need them. However, there are some services that every application needs, so these services are put
together into IP.
As with TCP, IP can also be considered as a library of routines that TCP calls on, but which is also available to
applications that do not use TCP. This strategy of building several levels of protocol is called "layering".
Application programs such as mail, TCP, and IP are considered as being separate "layers", each of which
calls on the services of the layer below it. Generally, TCP/IP applications use 4 layers:
a) An application protocol such as mail.
b) A protocol such as TCP that provides services needed by many applications.
c) IP, which provides the basic service of getting datagrams to their destination.
d) The protocols needed to manage a specific physical medium, such as Ethernet or a point-to-point line.
TCP/IP is based on the "catenet model" (described in more detail in IEN 48). This model assumes that
there are a large number of independent networks connected together by gateways. The user should be able to
access computers or other resources on any of these networks. Datagrams will often pass through a dozen
different networks before getting to their final destination. The routing needed to accomplish this should be
completely invisible to the user. As far as the user is concerned, all that is needed in order to access
another system is an "Internet address". This is an address that looks like 220.127.116.11
4. It is actually a 32-bit number. However, it is normally written as 4 decimal numbers, each representing 8 bits of the address.
Generally the structure of the address gives some information about how to get to the system. For example,
132.72 is a network number assigned to BGU (Ben-Gurion University of the Negev). BGU uses the next
number to indicate which of the campus Ethernets is involved. 132.72.135 happens to be an Ethernet used
by the Department of Industrial Engineering and Management. The last number allows for up to 254 systems
on each Ethernet.
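The relation between the dotted form and the underlying 32-bit number can be checked with Python's standard `ipaddress` module. In this sketch the host number 10 is hypothetical, and treating 132.72.135 as a network with an 8-bit host part (a /24 in current notation) is an assumption that follows the description above.

```python
import ipaddress


def describe(dotted):
    """Return the 32-bit value of an IPv4 address and its four 8-bit parts."""
    addr = ipaddress.IPv4Address(dotted)
    value = int(addr)  # the underlying 32-bit number
    octets = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return value, octets


# 132.72.135 is the departmental Ethernet; host number 10 is hypothetical.
value, octets = describe("132.72.135.10")

# An 8-bit host part gives 256 addresses; the all-zeros (network) and
# all-ones (broadcast) values are reserved, leaving 254 for systems.
department_net = ipaddress.ip_network("132.72.135.0/24")
usable_hosts = department_net.num_addresses - 2
```

The two reserved host values are the reason the last number allows 254 systems rather than 256.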
Gesture commands are sent to the remote robot over TCP/IP in groups of 5 to ensure that the
recognized gesture sent from the operator’s site is the right one.
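The report does not say how the receiving side uses a group of five commands. One possible reading, sketched below in Python, is that the robot side accepts a command only when enough of the five copies agree. The function name `accept_group` and the default of unanimous agreement are assumptions made for illustration.

```python
from collections import Counter


def accept_group(group, size=5, min_agree=5):
    """Return the gesture carried by a group of commands, or None.

    group: the gesture names received in one group.
    The group is rejected if it is incomplete or if fewer than
    `min_agree` of its entries name the same gesture.
    """
    if len(group) != size:
        return None
    gesture, count = Counter(group).most_common(1)[0]
    return gesture if count >= min_agree else None
```

Lowering `min_agree` to 3 would turn the check into a majority vote, which tolerates an occasional misrecognized frame at the cost of a weaker guarantee.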
4.5.2 Integrating the Camera into the System
For grabbing pictures from the USB (Universal Serial Bus) web camera, FTP (File Transfer Protocol)
was used. FTP is an Internet communication protocol that allows uploading and downloading files between
a remote machine and a local machine connected via the Internet. FTP is composed of two parts: an FTP client and an
FTP server. The FTP client (Web Camera Client in figure 10) is the software executed on a local machine to
send or receive files. The FTP server is software that executes on a server machine on which the files are to
be saved or retrieved (Hand Gesture Server in figure 10).
To be able to send files to the FTP server and Web server, three pieces of information should be provided:
a) The name of the FTP server. Each FTP server on the Internet executes on a separate machine.
Machines on the Internet have both DNS (Domain Name System) names (e.g. bgu.ac.il, jobjuba.com, ...)
and TCP/IP addresses (e.g. 18.104.22.168, 22.214.171.124, ...). The name or TCP/IP address of the
machine hosting the FTP service must be provided.
b) The user identification (login information) for the FTP server. FTP servers are
protected in such a way that only authorized users can save and retrieve files. To gain
access to an FTP server, the administrator of the specific server should provide a user identification and
a password.
c) The directory in which to save files. Once connected, the FTP server allows
uploading of files only to particular directories.
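The three pieces of information map directly onto the calls of an FTP client. The sketch below uses Python's standard `ftplib` module; it is an illustration rather than the project's camera software, and the function name `upload_frame`, the `ftp_factory` parameter (added so the function can be exercised without a live server) and any host, user or directory passed to it are hypothetical.

```python
import io
from ftplib import FTP


def upload_frame(host, user, password, directory, filename, data,
                 ftp_factory=FTP):
    """Upload one camera frame (bytes) to an FTP server.

    host:      name or address of the FTP server
    user,
    password:  login information issued by the server administrator
    directory: server directory into which uploads are permitted
    """
    ftp = ftp_factory(host)           # connect to the server
    try:
        ftp.login(user, password)     # authenticate
        ftp.cwd(directory)            # move to the upload directory
        ftp.storbinary("STOR " + filename, io.BytesIO(data))
    finally:
        ftp.quit()                    # end the session politely
```

A call such as `upload_frame("ftp.example.org", "camera", "secret", "/frames", "frame.jpg", jpeg_bytes)` would send one frame; every argument in that call is a placeholder.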
5. Testing the System
5.1 Task Definition
We are given a robot (A255) under the control of an individual and a plastic structure located on a flat
platform as can be seen in figures 13 and 14:
Figure 13. A255 Robot and a Plastic Cup Structure
Figure 14. Close Look at the Plastic Cup Structure
The main task is to push a yellow wooden box into a plastic cup structure using hand gestures and visual
feedback only (figures 15 and 16):
Figure 15. A255 Robot, Plastic Cup Structure, and Yellow Box
Figure 16. Close Look at the Plastic Cup Structure and the Wooden Box
5.2 Experimental Results
An A255 robot under the control of an individual and a plastic cup structure are located on a flat
platform. For testing and evaluating the system, an experiment was defined. An operator had to control the A255
robot and perform a remote task in real time using hand gestures. The task was to push a yellow wooden
cube, located on top of a pillar, into a container adjacent to it. The robot was controlled using hand gestures
and visual feedback. The operator performed a set of ten identical experiments. The time to perform each
experiment was recorded. The resulting learning curve (Figure 17) indicates rapid learning by the operator.
Figure 17. Learning Curve of the Hand Gesture System*
* Note that standard times were reached after four to six trials.
Figure 18. The Overall View of the Experimental Setting
Figure 19. A Typical Control Sequence to Carry out the Task
6.1 Summary of Results
This project has described the design, implementation and testing of a telerobotic gesture-based user interface
system using visual recognition.
Two aspects of the problem have been examined: the technical aspects of visual recognition of hand gestures
in a lab environment, and the usability of such an interface implemented on a remote
robot.
Experimental results showed that the system satisfies the requirements for a robust and user-friendly input
device.
The design of the system has incorporated several advances in visual recognition, including the FCM (fuzzy
c-means) algorithm. While segmentation is not perfect, it is fast, of generally good quality, and sufficiently
reliable to support an interactive user interface.
Because the method is appearance based, it was assumed that the appearance of the hand in the image is relatively
constant. Given the stable environment of the system, this assumption is generally valid. Of course some
variation does occur, especially between people and even between poses formed by one person. This has
been taken into account both by using a representative set of training images and by varying those images
during training. Variations in lighting are handled by preprocessing the image to reduce their effect. A final
contributor to the success of the networks is a step where images that have been misclassified by the net
after initial training are added to the training set. This helps to fine-tune performance on difficult cases.
The path of the hand is smoothed with a novel algorithm which has been designed to compensate for the
types of noise present in the domain, but also to leave a bounding box movement which is easy for the
system to examine for motion features, and appears natural to the user.
Just as with natural gesticulation, motion and pose both play
a role in the meaning of a gesture. Symbolic
features of the motion path are extracted and combined with the classification of the hand's pose at key points
to recognize various types of gestures.
By using the TCP/IP protocol and transferring information as a sequence of "datagrams", reliable communication
between the system computers was assured.
The result is a working system that allows a user to control a remote robot using hand gestures. One can
analyze how gestures are best used to interact with objects on a screen. Problems with integrating gesture into
current interface technology have been pointed out, such as the design of menus and proliferation of small
control icons. This work has also pointed out inherent characteristics of gesture
that must be considered in the
design of any gesture
While this project has shown that gesture can be made to work as a reliable interface modality in the context
of today's graphical user interfaces, it leaves a significant high-level question unanswered: can gesture
provide sufficient benefits to the user to justify replacing current interface devices?
Many people see speech recognition as a major player in the interface of future workstations because of the
freedom it provides the user, but speech by itself cannot make a complete interface. There are some
operations that it simply cannot express well, or which are much more concise with some type of spatial
command. Mice, however, will become increasingly impractical as we move away from traditional screen-and-desk
environments. Gesture provides many of the same benefits for spatial interaction tasks as voice does
for textual tasks.
Together, speech and gesture have the potential to dramatically change how we interact with our machines.
In the near term they can remove many of the restrictions imposed by current interface devices. In the longer
term they offer the potential for machines to operate by observing us rather than by always being told what to
do. It is our sincere hope that this work has contributed to realizing that potential.
We would like to note that all of the system files are presently located at Ben-Gurion University of the Negev,
Department of Industrial Engineering and Management, and if a
copy of these files is desired, it will be provided. A demonstration of the system will also be given upon
request.
6.2 Future Work
The system as implemented is capable of recognizing hand gestures and controlling a remote robot in a
ting in real time. It is very likely that with some engineering it could become a reliable add-on to
current window systems. It would be easy to extend the system's capabilities so that gesture could take on more of
the interface duties. Solutions to the remaining problems of reliability and accuracy appear to be relatively
This work examines what contributed to the success of the system at this level, and what holds it back from better
performance. It also demonstrates the potential of hand gesture as an input device. After an initial
learning period, an experienced user can manipulate objects at a remote site with speed and comfort
comparable to other popular devices. The ability to interact directly with the on-screen robot and objects seems
to be more comfortable to some users than the indirect pointing used with a mouse or joystick.
Several examples for future work are:
1. A comparative evaluation using the following gesture control methods:
Hand pendant control.
Mouse and keyboard control.
ol, if equipment become available.
Joystick control, if equipment becomes available.
2. Use of the hand gesture system to control a mobile robot.
3. Use of the hand gesture system to control a virtual robot (fixed or mobile) in a 3D virtual reality model.
4. More extensive analysis of the learning curve.
5. Optimize the feature vector.
6. An experiment comparing the incremental and continuous control methods.
References
Baudel T. and Rafon M. B. 1993. CHARADE: Remote Control of Objects using Free-Hand Gestures. Communications of the ACM, vol. 36, no. 7, pp. 28
Bezdek J. C. 1974. Cluster Validity with Fuzzy Sets. Cybernetics 3 58
Bezdek J. C. 1981. Pattern Recognition with Fuzzy Objective Functions. Plenum Press, New York.
Bolt R. A. 1980. Put-That-There: Voice and Gesture at the Graphics Interface. Computer Graphics, 14(3), pp. 262
Browse R. A. and Little S. A. 1991. The Effectiveness of Real-Time Graphic Simulation in Telerobotics. IEEE Conference on Decision Aiding for Complex Systems, vol. 2, pp. 895
Burger J. 1993. The Desktop Multimedia Bible. Addison-Wesley Publishing Company.
Cao Y. U., Chen T. W., Harris M. D., Kahng A. B., Lewis M. A. and Stechert A. D. 1995. A Remote Robotics Laboratory on the Internet. Commotion Laboratory, UCLA CS Dept., Los Angeles.
Cassell J., Steedman M., Badler N., Pelachaud C., Stone M., Douville B., Prevost S. and Achorn B. 1994. Modeling the Interaction Between Speech and Gesture. Proceedings of the 16th Annual Conference of the Cognitive Science Society, Georgia Institute of Technology, Atlanta, USA.
Crowley J., Berard F. and Coutaz J. 1995. Finger Tracking as an Input Device for Augmented Reality. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.
Cui Y. and Weng J. 1995. 2D Object Segmentation from Fovea Images Based on Eigen-Subspace Learning. In Proceedings of the IEEE International Symposium on Computer Vision.
Dario P., Guglielmelli E., Genovese V. and Toro M. 1996. Robot Assistants: Applications and Evolution. Robotics and Autonomous Systems, vol. 18, pp. 225
Darrell T. and Pentland A. 1995. Attention-Driven Expression and Gesture Analysis in an Interactive Environment. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition.
N. and Mavor S. N. 1994. Virtual Reality: Scientific and Technological Challenges, pp.
Ejiri M. 1996. Towards Meaningful Robotics for the Future: Are We Headed in the Right Direction. Robotics and Autonomous Systems, vol. 18, pp. 1
Freeman W. and Weissman C. 1995. Television Control by Hand Gestures. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.
Fels S. and Hinton G. 1995. Glove-TalkII: An Adaptive Gesture-to-Formant Interface. Proceedings of the ACM CHI '95 Conference on Human Factors in Computing Systems, pp. 456
Franklin C. and Cho Y. 1995. Virtual Reality Simulation Using Hand Gesture Recognition. University of California at Berkeley, Department of Computer Science.
Goldberg K., Maschna M. and Gentner S. 1995. Desktop Teleoperation via the WWW. Proceedings of the IEEE International Conference on Robotics and Automation, pp. 654
Goldberg K., Santarromana J., Bekey G., Gentner S., Morris R., Wiegley J. and Berger E. 1995. The Telegarden. In Proceedings of ACM SIGGRAPH.
Huang T. S. and Pavlović V. I. 1995. Hand Gesture Modeling, Analysis, and Synthesis. University of Illinois, Beckman Institute.
Ioannides A. and Stemple C. 1998. The Virtual Art Gallery / Streaming Video on the Web. Proceedings of the Conference on SIGGRAPH 98: Conference Abstracts and Applications, pp. 154.
Kahn R. E., Swain M. J., Prokopowicz P. N. and Firby J. 1996. Gesture Recognition Using the Perseus Architecture. Department of Computer Science, University of Chicago.
Kawamura K., Pack R., Bishay M. and Iskarous M. 1996. Design Philosophy for Service Robots. Robotics and Autonomous Systems, vol. 18, pp. 109
Kervrann C. and Heitz F. 1995. Learning Structure and Deformation Modes of Non-Rigid Objects in Long Image Sequences. In Proceedings of the International Workshop on Automatic Face and Gesture Recognition.
D. Huber E. and Bonasso P. 1995. Recognizing and Interpreting Gestures on Mobile Robot. Robotics and Automation Group. Houston. NASA.
Krueger M. W. 1991. Artificial Reality II. Addison-Wesley.
McKee G. 1995. A Virtual Robotics Laboratory for Research. SPIE Proceedings, pp. 162
Lee J. and Kunii T. 1995. Model-Based Analysis of Hand Posture. IEEE Computer Graphics and Applications (SIGGRAPH'95), vol. 15, no. 5, pp. 77
McNeill D. 1992. Hand and Mind. University of Chicago Press.
Noordam J. C., van den Broek W. H. A. M. and Buydens L. M. C. 2000. Geometrically Guided Fuzzy C-means Clustering for Multivariate Image Segmentation. Proceedings of the International Conference on Pattern Recognition.
Park S. H., Yun I. D. and Lee S. U. 1998. Color Image Segmentation Based on 3-D Clustering: Morphological Approach. Pattern Recognition, 21(8):1061
Pierce J., Forsberg A., Conway M., Hong S., Zelesnik R. and Mine M. 1997. Image Plane Interaction Techniques in 3D Immersive Environments. Proceedings of the 1997 Symposium on Interactive 3D Graphics.
Preece J., Rogers Y., Sharp H., Benyon D., Holland S. and Carey T. 1994. Human-Computer Interaction. Addison-Wesley Publishing Company.
Quek F. 1993. Hand Gesture Interface for Human-Machine Interaction. In Proceedings of
Robot Institute of America, 1979.
Sheridan T. B. 1992. Telerobotics, Automation, and Human Supervisory Control. Cambridge: MIT Press.
Sheridan T. B. 1992. Defining Our Terms. Presence: Teleoperators and Virtual Environments, 1:272
Sorid D. and Moore S. K. 2000. The Virtual Surgeon. IEEE Spectrum, July, pp. 26
Starner T. and Pentland A. 1995. Visual Recognition of American Sign Language Using Hidden Markov Models. International Workshop on Automatic Face and Gesture Recognition.
Stitt J. P., Tutwiler R. L. and Lewis A. S. 2001. Synthetic Aperture Sonar Image Segmentation Using the Fuzzy C-Means Clustering Algorithm. Autonomous Control and Intelligent Systems Division, The Pennsylvania State Applied Research Laboratory.
Sturman D. and Zeltzer D. 1994. A Survey of Glove-Based Input. IEEE Computer Graphics and Applications, vol., no. 1, pp. 30
Taylor K. and Dalton B. 1997. Issues in Internet Telerobotics. International Conference on Field and
Taylor K. and Trevelyan J. 1995. Australia's Telerobot on the Web. 26th International Symposium on Industrial Robots, Singapore.
Thieffry S. 1981. “Hand Gestures”, in The Hand (R. Tubiana, ed.), pp. 482-492, Philadelphia, PA:
I., Berson G., Estrin G., Eterovic Y. and Wu E. 1994. Strong Sharing and Prototyping Group Applications. IEEE Computer, 27(5): 48
Waldron M. B. and Kim S. 1995. Isolated ASL Sign Recognition System for Deaf Persons. IEEE Transactions on Rehabilitation Engineering, pp. 261
Vogler C. and Metaxas D. 1998. ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis. Proceedings of ICCV’98, pp. 363
Wexelblat A. 1995. An Approach to Natural Gesture in Virtual Environments. ACM Transactions on Computer Human Interaction, vol. 2, no. 3, pp. 179
Wilson A., Bobic A. and Cassell J. 1996. Recovering the Temporal Structure of Natural Gesture. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition,
Triesch J. and Malsburg C. V. D. 1998. A Gesture Interface for Human-Robot Interaction. In Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recognition, pp.
Developer: Kartoun Uri