arXiv:0908.2656v1  [cs.CV]  19 Aug 2009
Semantic Robot Vision Challenge: Current State and Future Directions
Scott Helmer, David Meger, Pooja Viswanathan, Sancho McCann,
Matthew Dockrey, Pooyan Fazli, Tristram Southey, Marius Muja,
Michael Joya, Jim Little, David Lowe, Alan Mackworth
Department of Computer Science
University of British Columbia
shelmer@cs.ubc.ca
Abstract
The Semantic Robot Vision Challenge provided
an excellent opportunity for our research lab to
integrate our many ideas under one umbrella, in-
spiring both collaboration and new research. The
task, visual search for an unknown object, is rel-
evant to both the vision and robotics communities.
Moreover, since the interplay of robotics and vision
is sometimes ignored, the competition provides a
venue to integrate the two communities. In this paper,
we outline a number of modiﬁcations to the com-
petition to both improve the state-of-the-art and in-
crease participation.
1 Introduction
Current technology (robotic and otherwise) falls well short
of a human’s ability to perceive the world using vision. A
nearly limitless range of applications would be facilitated by
successful embodied object recognition (i.e., the ability of
a mobile platform to perform human-like visual scene and
object understanding). We believe that with several key ad-
vances in the ability of a computer system to interpret visual
imagery, namely robust object recognition of a large num-
ber of object classes and more capable scene understanding,
future robot systems will substantially enhance the lives of
their users. A robot introduced into a home environment will
quickly be able to respond to commands such as “Robot, fetch
my shoes!”; assistive mobility devices will be able to determine
whether a dangerous object is in the user’s path; and
navigation systems will aid travelers by identifying accidents
and construction delays.
In order to accelerate the progress of state-of-the-art re-
search, many ﬁelds in science and engineering have employed
standardized benchmarks or data sets to evaluate similar tech-
niques and provide a means for their comparison. However,
these measures can be detrimental when they do not reﬂect
the reality or complexity of the problem in question. If the
benchmark represents a severe simpliﬁcation of reality, its use
for evaluation of techniques may lead to overconﬁdence in a
system’s accuracy and robustness. In addition, such benchmarks may
discourage research directions that are not aligned with success
on such benchmarks.
Research competitions, while potentially possessing these
same limitations, are more desirable than standard bench-
marks on many points. First, like standard benchmarks they
provide a context in which participants can evaluate their
techniques under uniform conditions and make meaningful
comparisons. Second, their periodic nature allows the com-
petition to evolve with the state-of-the-art, and discourages
techniques tailored to speciﬁcs of a particular benchmark. Fi-
nally, they provide an exciting venue that brings together a
community for collaboration and synthesis. However, this assumes
that those engaged in state-of-the-art research participate
actively in such competitions; otherwise, the events
become merely a venue for displaying known techniques.
Although there have been competitions focused upon a va-
riety of robotic tasks, these have tended to minimize the con-
tribution of vision. Conversely, in vision, particularly ob-
ject recognition, the active acquisition of images for anal-
ysis is generally of secondary concern.
Here, benchmark
datasets and competitions are neither a representative sam-
ple of the real world, nor a sample of how a robot would see
the world. By separating robotics and vision, the majority
of cutting-edge object recognition research has focused only
upon appearance-based approaches, ignoring scene cues that
may prove beneﬁcial for both accuracy and efﬁciency. We be-
lieve that in order to push state-of-the-art methods towards the
challenging goals outlined earlier in this paper, a competition
must bring these communities together by evaluating embod-
ied object recognition systems in realistic environments, and
thus reducing over-simpliﬁcations and erroneous research di-
rections.
A recent competition featuring embodied object recogni-
tion is the Semantic Robot Vision Challenge (SRVC). The
overall task in this contest is similar to a photo-scavenger
hunt in an unknown indoor environment, with information on
the objects typically acquired from the Internet. This setting
brings together numerous sub-ﬁelds of AI, including vision,
robotics, and natural language processing, along with Internet
search technologies. Although this competition does involve
embodied vision and can help stimulate robotics and vision
research, it has yet to gain prominence in the research community
and significantly advance the state-of-the-art.
Drawing upon our experience as a competitor in the SRVC
for the past two years, we have identiﬁed issues in both
robotic competitions and embodied recognition. We provide
an outlook for the future of the SRVC that will allow it to
increase its impact on the community. Our contribution in
this respect is two-fold. Firstly, we discuss the value of the
existing SRVC competition to research in embodied vision,
and how it has pushed our own research in new directions.
Secondly, we review possible modiﬁcations to improve the
competition in terms of the research directions it encourages
and the number of participants it attracts.
2 Robotics and Computer Vision Competitions
Competitions in robotics and computer vision that display
state-of-the-art techniques are a relatively recent phenomenon.
This is partially because, historically, the state-of-the-art
in neither domain was mature enough to handle
compelling tasks. However, beginning with RoboCup, competitions
have become a somewhat regular feature at both academic
conferences and independent venues. It is worth taking
a moment to consider some of the more successful competi-
tions and the features that have made them relevant and vi-
able.
The premier example of a successful competition is
RoboCup [Kitano et al., 1997], pioneered by Alan Mackworth
[Mackworth, 1993], where robots compete against
each other in a soccer-like setting. With an over-arching goal
of having robots compete against humans in the mid-21st century,
RoboCup has proved to be a valuable educational tool and
testbed for many ideas in AI. One of the key features in its
early success was that it offered a variety of leagues for par-
ticipation. Robot and simulation leagues were offered, pro-
viding a venue for state-of-the-art research in robot control as
well as techniques in planning and multi-agent systems. As a
result, the competition has attracted a large number of partic-
ipants, raising the proﬁle of attendant research and providing
a valuable research experience.
More recently, RoboCup@Home has been introduced as a new RoboCup league
which aims to develop service and assistive robots for use in
real-world personal domestic applications. The intent of the
league is to promote the development of robotic technologies
that can assist humans in everyday life. The competition pro-
poses a number of benchmark tasks in a home environment,
where success is determined by the number of tasks which the
entrant’s robot completes. Among the benchmarks is
a task to ﬁnd a speciﬁed object in the environment. Although
this contest does contain some aspects of embodied vision, it
does not offer a sufﬁciently challenging task to attract vision
researchers to a competition that is not held in conjunction
with AI or vision conferences. It does, however, offer oppor-
tunities for teams to attempt a wide variety of tasks, each re-
quiring expertise in different areas of research. For example,
while one task might require speech synthesis and aesthetic
presentation, another might evaluate teams on safe naviga-
tion, tracking and human recognition. This setup provides
teams the ﬂexibility to attempt speciﬁc tasks that they have
research expertise in and opt out of others.
Another wildly successful competition in AI was the
DARPA Grand Challenge [Seetharaman et al., 2006], offer-
ing one million dollars to the ﬁrst team which could au-
tonomously complete a 240 kilometer on- and off-road
course. For the ﬁrst Grand Challenge in 2004 the best com-
petitor traveled just 11 kilometers before ﬂipping over and
catching fire. In the following year, five vehicles
successfully completed the entire course. In
2007, just three and a half years after the first competition, six
teams ﬁnished the Urban Challenge, which mixed robotic and
non-robotic vehicles together in an urban setting and enforced
California trafﬁc laws. The Grand Challenge is the perfect il-
lustration of a competition which pushed the state-of-the-art,
particularly in systems engineering. Prior to the competition
it was widely believed that current technology was simply not
up to the challenge of this difﬁcult task. This success was due
in part to the fact that from the start it was well funded, at-
tracted top-notch research institutions, received wide media
attention, and provided a compelling task.
Another successful competition has arisen in the computer
vision community: the Pascal Visual Object Classes
Challenge (VOC) (http://pascallin.ecs.soton.ac.uk/challenges/VOC/).
This is an EU-funded competition
which began in 2005 with the goals of providing a
yearly competition for object class recognition and localiza-
tion and a set of standards and tools for evaluating algorithm
performance. In contrast to standard benchmark datasets, en-
trants are evaluated on a novel dataset every year, which pre-
vents algorithms from being tailored speciﬁcally to a single
data set. The key features of this competition that led to its
success were that it involved high-profile researchers at the
organizing level, was held in conjunction with major confer-
ences, and was relatively inexpensive to participants.
3 SRVC and Our Experience
Although the previously mentioned competitions have been
successful, they do not address many of the issues of em-
bodied vision. The SRVC is an ideal competition to push
the state-of-the-art in this ﬁeld. This competition was held
for the ﬁrst time at the Association for the Advancement of
Artiﬁcial Intelligence (AAAI) conference in 2007 in Vancou-
ver, and again in conjunction with the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) 2008 in
Alaska. The competition is a visual search task in an un-
known environment. The entrants are given a list of objects
to ﬁnd in the environment, with the environment containing
only a subset of those listed objects along with additional
distractor objects. Using this list, the robots autonomously
acquire data about these objects from the Internet in a ﬁxed
amount of time. Once data collection and learning are com-
plete the robot searches the unknown environment with the
task of ﬁnding the objects using the data acquired from the
Internet. At the end of the exploration phase the robot returns
an image for each object type containing a single bounding
box around the target. The scoring is based on the bounding
box accuracy. In addition, there is a software league, where
the entrants are not responsible for acquiring images of the
environment. Instead they are given a set of images taken of
the environment that include both the objects and other scene
elements.
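The scoring rule is described above only as "bounding box accuracy". One common way to quantify such accuracy, used here purely as an assumption and not necessarily the measure applied by the SRVC organizers, is the intersection-over-union overlap between the returned box and a ground-truth box. The function name and the (x_min, y_min, x_max, y_max) box layout in this sketch are illustrative choices.

```python
def bbox_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples in pixel coordinates.
    This layout is an assumption made for the sketch.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

Under this measure a returned box that exactly matches the ground truth scores 1, and a box that misses the object entirely scores 0.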
Our team entered two versions of our Curious George robot
in the 2007 and 2008 SRVC. We gained a wealth of experience
during our system development process and actual contest
participation, which we describe in the following
subsections.
3.1 Internet Data Collection and Filtering
In the SRVC, all of the training data is acquired from the In-
ternet at contest time with no human intervention. For visual
appearance models, this generally means collecting a dataset
of images via an Internet image search engine like Google.
Given the varied nature of Internet image search results, a
system was needed to ﬁlter the output before training could
be performed. We implemented two phases of training data
ﬁltering. The ﬁrst phase removed all cartoons, illustrations,
technical schematics, and other non-photographic images us-
ing a quality score developed in [Ke et al., 2006]. The second
phase prioritized groups of images that displayed a high de-
gree of similarity, since we determined empirically that these
images were more likely to contain the target object.
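The second filtering phase can be illustrated with a minimal sketch. It assumes that every downloaded image has already been reduced to a fixed-length descriptor vector, and it scores each image by its mean cosine similarity to its k most similar neighbours. The function name, the choice of cosine similarity, and the parameter k are assumptions made for illustration and are not taken from our system.

```python
import numpy as np


def rank_by_mutual_similarity(features, k=3):
    """Order images from most to least supported by similar images.

    features: (n_images, n_dims) array of per-image descriptors (hypothetical).
    k: number of nearest neighbours used to score each image.
    Returns (order, support), where order lists image indices best-first.
    """
    feats = np.asarray(features, dtype=float)
    n = len(feats)
    if n < 2:
        # Nothing to compare against; keep the input order.
        return np.arange(n), np.zeros(n)

    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)

    # Cosine similarity between every pair of images.
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # an image should not vote for itself

    k = min(k, n - 1)
    # Mean similarity to the k most similar other images.
    top_k = np.sort(sim, axis=1)[:, -k:]
    support = top_k.mean(axis=1)

    return np.argsort(-support), support
```

Images that belong to a tight group of near-duplicates or consistent views receive high support and are ranked first, while isolated results fall to the end of the list.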
It is interesting to note that approaches for image ﬁltering
and ranking such as [Fergus et al., 2004] are generally evalu-
ated on a dataset drawn from a similar distribution to the train-
ing data. This is not the case for the SRVC scenario since data
collected by a robot is not likely to be from canonical view-
points in uncluttered backgrounds – two properties that are
common in Internet images. As a result, we did not pre-ﬁlter
results based on generic object categorization techniques.
In addition to acquiring training images, other relevant data
may be found on the Internet, such as contextual clues from
LabelMe [Russell et al., 2008] and size priors from the Wal-
Mart catalog. These all represent important sources of infor-
mation for recognition purposes, but this potential has been
left unaddressed in the literature. Although we were not able
to integrate this information into our system in time for the
SRVC competitions, our participation has encouraged us to pursue this
direction of collecting additional training data aside from the
traditional image datasets.
3.2 Robotics
The nature of a robotics contest demands the construction of
a physical system with numerous abilities ranging from basic
navigation, to the construction of a distributed computation
system, to performance on the task itself (object recognition
in our case). While each of these individual tasks is easily
achieved, their integration within a physical system presents
a high level of complexity for system designers. For example,
from a research perspective, robot navigation and mapping is
largely a solved problem in indoor environments. However, it
is a signiﬁcant practical challenge to prepare a robot to navi-
gate a previously unseen contest environment where it is re-
quired to visit potentially unsafe locations that provide good
views of objects. Similarly, distributing a computational pro-
cess across several networked processors is not a signiﬁcant
challenge in many situations, but when this system must be
mounted on a mobile robot and thus subject to constraints on
weight, power and size, many difﬁculties present themselves.
For the 2008 SRVC, we developed an active exploration
and real-time vision system in order to announce objects that
were discovered in real-time during the run. Our robot’s ar-
chitecture involved the low power, on-board PC mounted in-
side our Pioneer AT3 robot for low-level control, and four net-
worked laptop systems responsible for: i) real-time process-
ing of visual imagery, visual attention, robotic planning, gaze
planning, and overall control; ii) speciﬁc object recognition;
iii+iv) generic category recognition. The distributed nature of
this architecture required captured imagery to be transferred
between computers via a network connection, and required the
associated software components for sending and receiving.
Our visual attention system processed imagery obtained
from a stereo camera system in real-time in order to deter-
mine the locations of interesting objects and structures in the
environment. This represented a signiﬁcant new functionality
when compared with our 2007 contest entry that required the
robot to “stop and shoot” before performing visual attention.
The real-time functionality prevented our robot from “driving
blindly” and allowed it to continuously monitor the periph-
eral view until a sufﬁciently interesting location was seen, at
which point foveal images could be collected. This behaviour
allowed the robot to cover the environment rapidly while ig-
noring uninteresting regions and thus capturing images of a
large number of candidate objects. It is unlikely that any of
our team members would have developed such a behaviour
had it not been for the SRVC, since stationary visual attention
is equally easy to demonstrate in an academic publication.
However, now that such a system has been developed, our
team has the ability to evaluate the behaviour of interactive,
real-time visual attention on a mobile platform, and this continues
to be an interesting research direction for our
group.
3.3 Vision
Images collected by a robot during the embodied object
recognition scenario often capture objects from a non-
standard viewpoint, scale, or orientation. In other cases, the
images do not contain an object at all. In fact, during our
SRVC experience, we found that images collected by the
robot rarely contained any target object. As a result, we de-
signed our classiﬁcation system to have a low false positive
rate. We employed a two-stage object detection approach.
The ﬁrst stage used a speciﬁc object recognition system based
on matching SIFT features and geometric consistency that
generally produced few false positives but provided low re-
call for generic object classes. The second stage employed a
generic object classiﬁer based on the spatial pyramid match
kernel [Lazebnik et al., 2006] to produce detections for those
objects that were not captured by the previous approach.
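The control flow of this two-stage approach can be sketched as follows. The two recognizers are passed in as callables, and their signatures, the result dictionary, and the threshold value are hypothetical placeholders rather than the interfaces of our actual system.

```python
def two_stage_detect(image, target, specific_matcher, generic_classifier,
                     generic_threshold=0.8):
    """Try a high-precision matcher first, then fall back to a generic classifier.

    specific_matcher(image, target) -> bounding box, or None if no match
    generic_classifier(image, target) -> (score in [0, 1], bounding box)
    Both callables and the default threshold are assumptions for this sketch.
    """
    # Stage 1: specific object recognition (few false positives, low recall
    # on generic classes).
    box = specific_matcher(image, target)
    if box is not None:
        return {"target": target, "box": box, "stage": "specific"}

    # Stage 2: generic category classifier for objects missed by stage 1.
    score, box = generic_classifier(image, target)
    # A high threshold keeps the false positive rate low, at the cost of recall.
    if score >= generic_threshold:
        return {"target": target, "box": box, "stage": "generic", "score": score}

    # Report nothing rather than risk a false positive.
    return None
```

Because most robot-collected images contain no target object, returning None for uncertain images is the common case in such a design.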
We designed a peripheral-foveal vision system that at-
tempts to improve the quality of robot-collected imagery by
locating interesting regions of the environment and imag-
ing these regions in high resolution.
This design choice
was inspired by the human visual system, which makes
extensive use of peripheral-foveal vision.
Our peripheral
camera was a Point Grey Research Bumblebee stereo cam-
era with a relatively wide ﬁeld of view. Spectral saliency
[Hou and Zhang, 2007] was fused with stereo depth informa-
tion to locate regions of interest in peripheral images. The
foveal camera was a Canon G7 point-and-shoot camera. We
employed the G7’s high zoom, combined with a pan-tilt unit
to obtain tightly cropped, high resolution images of interest-
ing objects identiﬁed in the peripheral view. We found that
the image quality obtained by our foveal system signiﬁcantly
improved object recognition performance. This is likely due
to the fact that Internet images are also often captured by
high-quality digital cameras.
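As a rough illustration of the peripheral stage, the sketch below computes a saliency map in the spirit of the spectral residual method and then masks it with stereo depth. It makes several simplifying assumptions: a box blur stands in for Gaussian smoothing, the image is not downsampled first, and the fusion rule (discard pixels with invalid or distant depth) is a guess at one reasonable scheme, since the exact fusion used on the robot is not specified here. All function and parameter names are hypothetical.

```python
import numpy as np


def box_filter(img, size=3):
    """Mean filter with edge padding; size must be odd."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)


def spectral_residual_saliency(gray, blur_size=9):
    """Saliency map from the residual of the log amplitude spectrum."""
    gray = np.asarray(gray, dtype=float)
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)

    # Spectral residual: log amplitude minus its local average.
    residual = log_amplitude - box_filter(log_amplitude, 3)

    # Back to the image domain, keeping the original phase.
    recon = np.fft.ifft2(np.exp(residual + 1j * phase))
    saliency = box_filter(np.abs(recon) ** 2, blur_size)

    # Normalize so the most salient pixel has value 1.
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency


def fuse_with_depth(saliency, depth, max_range_m=3.0):
    """Zero out salient pixels with no valid depth or beyond max_range_m."""
    depth = np.asarray(depth, dtype=float)
    reachable = np.isfinite(depth) & (depth > 0) & (depth <= max_range_m)
    return saliency * reachable
```

Peaks in the fused map would then serve as candidate locations at which to point the foveal camera.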
3.4 Benefits to Our Research
UBC’s participation in the SRVC has led us to develop Curious
George, a powerful evaluation platform that enabled
further development of embodied recognition algorithms. In
terms of quantifiable research output, the platform developed
directly for the SRVC contest has led to a number
of publications [Meger et al., 2007; Meger et al., 2008]
and several higher-level algorithms have since been de-
signed which leverage the platform [Forssen et al., 2008;
Viswanathan et al., 2009]. Our resulting research directions
can be grouped into three categories: the effect of
viewpoint in object recognition, the use of existing online
databases for semantic training information, and the use of
additional cues available to an embodied platform during
scene understanding.
Our study of viewpoint in object recognition has examined
the implications of having only a single canonical viewpoint
in the training image dataset (as is often the case with Internet
images). We evaluated several recognition methods (namely
feature matching with and without a geometric constraint) in
terms of their ability to recognize objects from a range of
viewpoints, and reported a range of success for this task.
We showed that annotated datasets such as the LabelMe
[Russell et al., 2008] database can provide semantic information
for tasks other than simply object recognition. We learned
object-place relations from LabelMe (e.g., fridges are likely to be
found in the kitchen) and used the resulting spatial-semantic
model to perform place labeling in simulated environments.
We also described the use of this model to inform
object search [Viswanathan et al., 2009]. Our future plans are
to combine this technology with object recognition, demon-
strated in the SRVC, to construct a successful integrated scene
understanding system.
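One simple way to picture place labeling from object-place relations is a naive Bayes model over object labels. The sketch below is an assumption-laden illustration, not the spatial-semantic model of [Viswanathan et al., 2009]: it estimates P(object | place) from annotated scenes with additive smoothing and labels a location by the most likely place under a uniform prior. The function names and the input format are hypothetical.

```python
import math
from collections import Counter, defaultdict


def learn_object_place_model(annotated_scenes, smoothing=1.0):
    """Estimate P(object | place) from (place, [objects]) pairs.

    Additive smoothing keeps unseen object-place pairs from having zero
    probability. The input format is an assumption for this sketch.
    """
    counts = defaultdict(Counter)
    vocabulary = set()
    for place, objects in annotated_scenes:
        counts[place].update(objects)
        vocabulary.update(objects)

    model = {}
    for place, obj_counts in counts.items():
        total = sum(obj_counts.values()) + smoothing * len(vocabulary)
        model[place] = {obj: (obj_counts[obj] + smoothing) / total
                        for obj in vocabulary}
    return model


def label_place(model, observed_objects):
    """Return the place with the highest log-likelihood (uniform place prior)."""
    def log_likelihood(place):
        probs = model[place]
        # Objects never seen in training are ignored rather than guessed at.
        return sum(math.log(probs[o]) for o in observed_objects if o in probs)

    return max(model, key=log_likelihood)
```

For example, a model trained on scenes in which fridges appear only in kitchens would label a location containing a fridge and a mug as a kitchen, even if mugs also occur in offices.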
Finally, we have employed structure from stereo to register
object locations and construct a 3D object map and demon-
strated how this object map allows a robot to collect multiple
viewpoints of target objects to improve classiﬁcation accu-
racy. One of our team members is currently employing the
raw structure information to utilize scale priors for object
recognition. Overall, the SRVC has stimulated a wide variety
of excellent research in our group by forcing us to examine
object recognition in a realistic setting.
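The idea behind a scale prior can be made concrete with the standard pinhole and stereo relations: depth Z = f B / d for focal length f (in pixels), baseline B, and disparity d, and physical height H = h Z / f for a bounding box of height h pixels. The sketch below applies these relations to test a detection against an expected size range. The function names and the numbers in the usage note are illustrative assumptions, not parameters of our system.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres from stereo disparity, using Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def physical_height(bbox_height_px, depth_m, focal_px):
    """Approximate real-world height in metres, using H = h * Z / f."""
    return bbox_height_px * depth_m / focal_px


def consistent_with_scale_prior(height_m, prior_range_m):
    """True if the estimated height lies inside the expected (low, high) range."""
    low, high = prior_range_m
    return low <= height_m <= high
```

With a hypothetical focal length of 800 pixels, a baseline of 0.12 m, and a disparity of 48 pixels, the depth is about 2 m. A 40-pixel-high box at that depth corresponds to roughly 0.1 m, which is plausible for a mug, whereas a 400-pixel-high box (about 1 m) is not.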
4 Improving Research Outcomes
Research competitions should advance the state-of-the-art by
providing additional training data and context information. In
addition, realistic environments would allow the use of more
advanced learning methods. This section provides potential
modiﬁcations to the SRVC contest that we believe will en-
courage these directions.
4.1 Training
Embodied object recognition systems require a source of
training data from which to learn the appearance and prop-
erties of target objects. In the past, for the SRVC, this data
has been obtained entirely using the Internet at the time of
competition.
However, the vast majority of images from
the Internet are from a single canonical viewpoint, which
implies that the resulting classiﬁer will only be successful
on that viewpoint. Given the paucity of the data, this set-
ting does not encourage 3D recognition, which may be re-
quired for successful embodied recognition. One modiﬁca-
tion would be to allow competitors to know a superset of
the classes beforehand, enabling the use of manually labeled
training data. This is not an unrealistic scenario since most
robots will likely be deployed in known environments where
the set of objects can be carefully catalogued. Also, it still
presents a signiﬁcant challenge, as demonstrated by the VOC
competition where recognition is still very poor. Alternatively,
the types of environments (e.g., office, kitchen, bedroom,
etc.) could be provided. For example, knowing that
the scene was a kitchen would allow researchers to construct
priors on appearance, 3D shape and scale for all objects
that are likely to occur in a kitchen. In either case,
Internet data acquisition would still be allowed at competi-
tion time to augment data provided by system designers. This
would allow for research into the interplay between scene in-
formation like surface orientations and real-world scale and
appearance, similar to works such as [Hoiem et al., 2006;
Gould et al., 2008].
4.2 Environment and Context
The SRVC contest environment has, so far, required the robot
to navigate in an area that is quite small and to locate objects
that were placed on tables covered with white table cloths.
This scenario presents a much simpler segmentation problem
when compared with a realistic home environment, and does
not allow for evaluation of system performance over long dis-
tances and operating durations.
Thus, while object recognition methods that rely on good
segmentation results might succeed in the contest, they are
likely to fail in more realistic environments. This outcome is
misaligned with the objective of pushing research in the di-
rection of improving real-world performance, and we believe
that future SRVC contests should include increasingly realis-
tic environment designs.
To a naïve audience, embedding the competition in a realistic
environment might seem likely to increase difficulty.
However, it can actually lead to better performance if the
additional information available about context is leveraged
by the competition systems.
Respecting the relationships of
co-occurrence and co-location found in natural environments when
placing objects would allow one to exploit these relationships
for object recognition.
It would also help eliminate false
matches by recognizing that an object does not belong in a
particular location.
In addition, it would be interesting to partition the envi-
ronment into places that appear in real environments (e.g.,
kitchen, bedroom, etc.) and to place query objects in the locations
where they are normally found. This would allow competitors
to exploit object-place relations to identify potential
object locations, thus facilitating efﬁcient coverage of the en-
vironment. There are obviously logistical problems in hav-
ing a multi-room environment and allowing for an audience.
However, using dividers, it is possible to create room-like
subdivisions without the need for entirely separate rooms.
This would create an environment similar to many “open con-
cept” homes. It is also possible to create recognizable, logical
locations in a single room by separating these locations with
empty space; however, this should be specified to competitors.
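To illustrate how object-place relations could guide search in such an environment, the following sketch ranks places by the posterior P(place | object), obtained from P(object | place) and a place prior via Bayes' rule. The probability table, the function name, and the fallback behaviour for unknown objects are all assumptions made for the example.

```python
def rank_places_for_object(p_object_given_place, target, place_prior=None):
    """Order places by P(place | target), most promising first.

    p_object_given_place: {place: {object: probability}} (hypothetical table)
    place_prior: {place: probability}; uniform if omitted.
    Returns a list of (place, posterior) pairs.
    """
    places = list(p_object_given_place)
    if place_prior is None:
        place_prior = {p: 1.0 / len(places) for p in places}

    # Bayes' rule numerator: P(object | place) * P(place).
    joint = {p: p_object_given_place[p].get(target, 0.0) * place_prior[p]
             for p in places}
    total = sum(joint.values())
    if total == 0:
        # No information about this object: fall back to the prior.
        return sorted(place_prior.items(), key=lambda kv: kv[1], reverse=True)

    posterior = {p: v / total for p, v in joint.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
```

A robot could visit places in this order and stop once the target is found, which is one way to obtain the efficient coverage described above.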
5 Improving Participation
The purpose of a competition in research is to provide both
an opportunity to exchange ideas as well as a venue to evalu-
ate and encourage state-of-the-art research. A particular chal-
lenge in an embodied recognition competition is to encour-
age participation of both robotics and vision researchers. In
this section, we discuss practical suggestions to increase re-
searcher participation.
5.1 Changes in the Setting and Rules
Object class recognition research currently draws on a variety of
cues, such as colour, contours, and texture. In addition, there is an
active research community [Vogel and Murphy, 2007] that seeks to
utilize scene context
for recognition. We propose varying the difﬁculty and scoring
of the competition in a way that rewards the successes of spe-
ciﬁc methods on certain object types that might be challeng-
ing to recognize using simple object recognition techniques.
Another interesting modiﬁcation might be to provide dif-
ferent levels of information before and during the competi-
tion. For example, the object type “bottle” could be provided
beforehand, and the robot might be required to recognize a
specific object (e.g., Coke bottle, milk bottle, etc.) during the
competition. To make the problem more challenging, the con-
test could allot points for identifying unknown objects (i.e.
those that do not appear on the list). In addition, including
relative location information for some objects (e.g., the book
is beside the TV) can provide context information useful for
recognition of objects that are particularly challenging given
the state-of-the-art.
Additionally, the contest could allow two teams to com-
pete simultaneously. The team which ﬁnds the objects in the
environment faster would receive a higher score. Another in-
teresting case that can push forward the robotics aspect of
the contest would be to allow multiple robots per team to ex-
plore the environment. The robots can cooperatively capture
the images from different viewpoints and share the informa-
tion to recognize the objects more precisely. However, this
change is most likely infeasible to implement in the near fu-
ture due to the complexity and cost of robots currently being
used in SRVC.
5.2 Software League
Although a competition which requires the integration of var-
ious research areas is desirable, such a competition discour-
ages participation from smaller research groups that may not
have the expertise to implement every aspect required for suc-
cess. The software league is an example of separating the
recognition task from the robotics challenges of active vision
and navigation; however, some modifications are needed in
order to improve participation in this league.
The ﬁrst thing to note is that the impact of this competition
is dependent on the signiﬁcance of the results in the compe-
tition. In the object recognition community, techniques are
evaluated on a large number of images, thus ensuring that im-
provements over previous techniques are statistically signiﬁ-
cant. This is the case even in a competition environment like
VOC. Results from the SRVC competition, however, carry
little statistical signiﬁcance due to the small sample size (e.g.
one mug in the environment). One way to address this
is through the software league. Here, image data can be acquired
from real environments instead of the contest setting. This
provides an opportunity to include much more realistic con-
text. Images of the same target objects distributed in a natural
environment, such as a kitchen or ofﬁce, can be taken ahead
of time. In order to incorporate the “embodied vision” aspect
of the contest, additional information such as a map of the
environment, the location and orientation of the camera for
each image, and stereo image pairs can be provided with little
extra effort. This would make the software league a more in-
teresting research problem and help distinguish it from other
object recognition competitions, thus attracting more partici-
pants. In addition, removing the limitations of data collection
by a robot also allows for the creation of data sets composed
of a larger number of objects and environments, thus increas-
ing the statistical signiﬁcance of the results.
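The effect of sample size on significance can be made concrete with a confidence interval on a detection rate. The sketch below uses the Wilson score interval, which is one standard choice for small samples. It is offered as an illustration of the argument and not as the evaluation protocol of either the SRVC or VOC.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    # Clip to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)
```

With a single instance of an object in the environment, one successful detection (1 of 1) gives an interval of roughly 0.21 to 1.0, which says very little about the underlying detection rate. With 100 successes in 100 trials the interval narrows to roughly 0.96 to 1.0.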
5.3 Robot League
As already mentioned, it has been a signiﬁcant challenge for
teams in previous years of the SRVC contest to achieve re-
liable navigation within the contest environment. Since the
primary research problems posed by the SRVC are not in-
tended to focus on low-level robot navigation, it may be use-
ful to consider relieving teams of the navigation burden in the
future.
First, the contest organizers could provide entrants with a
standardized robot platform that has basic navigation abili-
ties. In this case, teams would only be responsible for higher
level task planning and processing of the visual imagery ob-
tained by the robot. While this solution solves many of the
problems posed by navigation, it also unfortunately intro-
duces several complications. Primarily, each team depends on
a slightly different set of sensing modalities. During the 2007
and 2008 SRVC contests, we saw: monocular video
cameras, monocular still cameras, laser rangeﬁnders, sonar
range sensors, binocular stereo cameras, and multi-camera
stereo systems. Any standardized test platform would be re-
quired to provide teams with some subset of these sensors,
and this set would ideally be sufﬁciently large so that it does
not discourage any teams from competing. Another signif-
icant challenge is the ability for each team to practice and
develop on the standard platform. Either numerous platforms
would need to be distributed, or teams would require peri-
odic access to a single platform. Both of these options en-
tail signiﬁcant cost that would need to be minimized. This
could likely be accomplished by employing a standard robot
architecture such as ROS (http://pr.willowgarage.com/wiki/ROS)
that would allow much of the development
to occur in simulation and with surrogate robots for
hardware testing.
A simpler method for reducing navigation challenges is to
provide teams with a more detailed speciﬁcation of the con-
test environment geometry. For example, knowing the exact
size and shape of furniture allows for proper mounting of sen-
sors and tuning of sensor models in mapping algorithms. In
this case, it might be possible for each team to still employ
their own robot.
5.4 Facilitating Code Re-use
Re-usable code is an important output of a successful com-
petition, as it is in any collaborative effort. Since the goal of
competitions is to move the state-of-the-art in a desired direction,
successive solutions to the competition’s problem benefit
from having previous work available as a starting point.
Re-usable code also lowers the barrier to entry for teams new
to the competition, enhancing accessibility of the competi-
tion, and in turn visibility.
Although code sharing is encouraged/required by SRVC, subsequent
re-use of this code appears to be non-existent. This is a result of
the differences in platforms and approaches used by each of the
participants. For example, our robot base, sensor package,
peripheral-foveal vision system, and multi-processor distributed
recognition system together formed a very specific point in the
solution space. This entire setup would likely have to be
replicated in order to re-use our code. However, some elements of a
code base would be generally usable (training set construction and
feature extraction, for example).
One possible solution would be to require the use of an
open source robotics package with a distributed architecture
such as ROS, which allows different components to be easily
chained together. In such a system, the different components
are unaware of each other, so they can be mixed-and-matched
at will. Standardization on a single robotics platform would
be an optimal solution if the funding were available.
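As a rough illustration of this kind of decoupling, the following Python sketch wires two toy components together through named topics. It does not use the ROS API: the MessageBus class, the topic names, and the placeholder "feature" and "label" computations are all hypothetical, chosen only to show that neither component refers to the other.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class MessageBus:
    """Toy in-process publish/subscribe bus (hypothetical; not the ROS API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        # Deliver to every callback registered on the topic. The publisher
        # never learns who (if anyone) is listening.
        for callback in list(self._subscribers[topic]):
            callback(message)


def attach_feature_extractor(bus: MessageBus) -> None:
    """Consumes 'image' messages and publishes 'features' messages."""
    def on_image(image: List[int]) -> None:
        # Placeholder feature: mean pixel value of a flat list of pixels.
        bus.publish("features", {"mean": sum(image) / len(image)})
    bus.subscribe("image", on_image)


def attach_classifier(bus: MessageBus, threshold: float) -> None:
    """Consumes 'features' messages and publishes 'label' messages."""
    def on_features(features: dict) -> None:
        label = "bright" if features["mean"] > threshold else "dark"
        bus.publish("label", label)
    bus.subscribe("features", on_features)


def run_demo() -> List[str]:
    bus = MessageBus()
    attach_feature_extractor(bus)
    attach_classifier(bus, threshold=100.0)
    labels: List[str] = []
    bus.subscribe("label", labels.append)  # a third, independent listener
    bus.publish("image", [200, 220, 180])
    bus.publish("image", [10, 20, 30])
    return labels


if __name__ == "__main__":
    print(run_demo())
```

Because each component only names the topics it reads and writes, one team's classifier could in principle be replaced by another's without touching the feature extractor, provided both agree on the message format.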
Without the logistical problems inherent in hardware, the
software league offers much greater potential for code shar-
ing. One possibility to help encourage this would be to design
the software challenges to be explicitly modular in nature. In-
stead of a single software league challenge, it could be broken
into steps such as download and ﬁltering, classiﬁcation, and
localization.
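One way such a modular challenge could be organized in code is sketched below in Python. The stage signatures, the Detection record, and the three toy stage implementations are assumptions made purely for illustration and are not part of any SRVC specification. The point is only that each stage has a narrow interface and can be replaced independently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical data types for illustration only.
Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (row_min, col_min, row_max, col_max)


@dataclass
class Detection:
    label: str
    box: Box


# One narrow interface per stage, so each can be developed and shared alone.
FilterStage = Callable[[List[Image]], List[Image]]
ClassifyStage = Callable[[Image], str]
LocalizeStage = Callable[[Image], Box]


def run_pipeline(candidates: List[Image], keep: FilterStage,
                 classify: ClassifyStage, localize: LocalizeStage) -> List[Detection]:
    """Chain the three stages; any stage can be swapped for another team's."""
    return [Detection(classify(image), localize(image)) for image in keep(candidates)]


def drop_blank_images(images: List[Image]) -> List[Image]:
    # Toy filtering stage: discard images that contain no non-zero pixel.
    return [image for image in images if any(any(row) for row in image)]


def label_by_peak(image: Image) -> str:
    # Toy classification stage: label by the brightest pixel value.
    return "bright" if max(max(row) for row in image) >= 6 else "dim"


def bounding_box(image: Image) -> Box:
    # Toy localization stage: tight box around all non-zero pixels.
    # Assumes at least one non-zero pixel, which the filter above guarantees.
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for row in image for c, value in enumerate(row) if value]
    return (min(rows), min(cols), max(rows), max(cols))


if __name__ == "__main__":
    blank = [[0, 0], [0, 0]]
    blob = [[0, 0, 0], [0, 5, 7], [0, 0, 0]]
    print(run_pipeline([blank, blob], drop_blank_images, label_by_peak, bounding_box))
```

Under this arrangement, a team strong in localization could enter only that stage and rely on shared implementations of the other two.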
5.5
Funding and Visibility
A significant constraint for both organization and participation is
funding. The contest requires a large amount of exhibition space in
which to set up the environment, and obstacles and target objects
for the environment must be purchased. To encourage participation,
support must be offered to teams to subsidize the cost of shipping
their robots to the competition. In addition, travel costs for the
robot teams can be high, since a large number of team members may
be needed to run and maintain the robotic hardware. Clearly, secure
funding through public and private sponsorship will attract
participation.
Aside from attracting more participation from research groups,
improving the visibility of the contest can attract sponsorship.
One simple change to the 2008 competition that was surprisingly
compelling was the addition of bonus points for teams that provided
a real-time status display showing matches as they were made. This
made the contest more interesting for the audience to watch by
providing a sense of what the robots were doing even when they were
not moving. The crowd was audibly excited when a new match was
displayed, identifying with the robots and responding to the
irregular reinforcement aspect of the display. As an element of
visibility and outreach for a robotic competition, this is a very
powerful lesson. It also pushed research towards techniques for
real-time recognition. In the future, explicitly encouraging
competitors to provide real-time displays of what the robot has
found or is trying to do will draw even more attention to the
contest.
6
Conclusions
Properly designed contests significantly promote the development of
the state of the art. They can comprise realistic and complex
settings not seen in standard benchmark datasets, providing both a
strong test for current solutions and rich context that can be
leveraged to advance research. The Semantic Robot Vision Challenge
is one such competition, offering a venue for embodied recognition.
It has given valuable impetus to our own research, yielding
insights and revealing new directions that need more research.
However, to have significant impact in the future, this contest
needs numerous modifications.