arXiv:0908.2656v1 [cs.CV] 19 Aug 2009

Semantic Robot Vision Challenge: Current State and Future Directions

Scott Helmer, David Meger, Pooja Viswanathan, Sancho McCann, Matthew Dockrey, Pooyan Fazli, Tristram Southey, Marius Muja, Michael Joya, Jim Little, David Lowe, Alan Mackworth
Department of Computer Science
University of British Columbia
shelmer@cs.ubc.ca

Abstract

The Semantic Robot Vision Competition provided an excellent opportunity for our research lab to integrate our many ideas under one umbrella, inspiring both collaboration and new research. The task, visual search for an unknown object, is relevant to both the vision and robotics communities. Moreover, since the interplay of robotics and vision is sometimes ignored, the competition provides a venue to integrate two communities. In this paper, we outline a number of modifications to the competition to both improve the state-of-the-art and increase participation.

1 Introduction

Current technology (robotic and otherwise) falls well short of a human's ability to perceive the world using vision. A nearly limitless range of applications would be facilitated by successful embodied object recognition (i.e., the ability of a mobile platform to perform human-like visual scene and object understanding). We believe that with several key advances in the ability of a computer system to interpret visual imagery, namely robust object recognition of a large number of object classes and more capable scene understanding, future robot systems will substantially enhance the lives of their users. A robot introduced into a home environment will quickly be able to respond to commands such as "Robot, fetch my shoes!", assistive mobility devices will be able to determine whether a dangerous object is in the user's path, and navigation systems will aid travelers by identifying accidents and construction delays.

In order to accelerate the progress of state-of-the-art research, many fields in science and engineering have employed standardized benchmarks or data sets to evaluate similar techniques and provide a means for their comparison. However, these measures can be detrimental when they do not reflect the reality or complexity of the problem in question. If the benchmark represents a severe simplification of reality, its use for evaluation of techniques may lead to overconfidence in a system's accuracy and robustness. In addition, such benchmarks may discourage research directions that are not aligned with success on them.

Research competitions, while potentially possessing these same limitations, are more desirable than standard benchmarks on many points. First, like standard benchmarks they provide a context in which participants can evaluate their techniques under uniform conditions and make meaningful comparisons. Second, their periodic nature allows the competition to evolve with the state-of-the-art, and discourages techniques tailored to specifics of a particular benchmark. Finally, they provide an exciting venue that brings together a community for collaboration and synthesis. However, this assumes that those engaged in state-of-the-art research participate actively in such competitions; otherwise the events become merely a venue for displaying known techniques.

Although there have been competitions focused upon a variety of robotic tasks, these have tended to minimize the contribution of vision. Conversely, in vision, particularly object recognition, the active acquisition of images for analysis is generally of secondary concern. Here, benchmark datasets and competitions are neither a representative sample of the real world, nor a sample of how a robot would see the world.
By separating robotics and vision, the majority of cutting-edge object recognition research has focused only upon appearance-based approaches, ignoring scene cues that may prove beneficial for both accuracy and efficiency. We believe that in order to push state-of-the-art methods towards the challenging goals outlined earlier in this paper, a competition must bring these communities together by evaluating embodied object recognition systems in realistic environments, thus reducing over-simplifications and erroneous research directions.

A recent competition featuring embodied object recognition is the Semantic Robot Vision Challenge (SRVC). The overall task in this contest is similar to a photo-scavenger hunt in an unknown indoor environment, with information on the objects typically acquired from the Internet. This setting brings together numerous sub-fields of AI, including vision, robotics, and natural language processing, along with Internet search technologies. Although this competition does involve embodied vision and can help stimulate robotics and vision research, it has yet to gain prominence in the research community or to significantly advance the state-of-the-art.

Drawing upon our experience as a competitor in the SRVC for the past two years, we have identified issues in both robotic competitions and embodied recognition. We provide an outlook for the future of the SRVC that will allow it to increase its impact on the community. Our contribution in this respect is two-fold. Firstly, we discuss the value of the existing SRVC competition to research in embodied vision, and how it has pushed our own research in new directions. Secondly, we review possible modifications to improve the competition in terms of the research directions it encourages and the number of participants it attracts.

2 Robotics and Computer Vision Competitions

Competitions in robotics and computer vision that display state-of-the-art techniques are a relatively recent phenomenon. This is partially due to the fact that, historically, the state-of-the-art in either domain was not mature enough to handle compelling tasks. However, beginning with RoboCup, competitions have become a somewhat regular feature at both academic conferences and independent venues. It is worth taking a moment to consider some of the more successful competitions and the features that have made them relevant and viable.

The premier example of a successful competition is RoboCup [Kitano et al., 1997], pioneered by Alan Mackworth [Mackworth, 1993], where robots compete against each other in a soccer-like setting. With an over-arching goal of having robots compete against humans in the mid-21st century, RoboCup has proved to be a valuable education tool and testbed for many ideas in AI. One of the key features in its early success was that it offered a variety of leagues for participation. Robot and simulation leagues were offered, providing a venue for state-of-the-art research in robot control as well as techniques in planning and multi-agent systems. As a result, the competition has attracted a large number of participants, raising the profile of attendant research and providing a valuable research experience.

More recently, RoboCup@Home has been introduced as a new RoboCup league which aims to develop service and assistive robots for real-world personal domestic applications. The intent of the league is to promote the development of robotic technologies that can assist humans in everyday life. The competition proposes a number of benchmark tasks in a home environment, where success is determined by the number of tasks which the entrant's robot completes. Among the benchmarks is a task to find a specified object in the environment.
Although this contest does contain some aspects of embodied vision, it does not offer a sufficiently challenging task to attract vision researchers to a competition that is not held in conjunction with AI or vision conferences. It does, however, offer opportunities for teams to attempt a wide variety of tasks, each requiring expertise in different areas of research. For example, while one task might require speech synthesis and aesthetic presentation, another might evaluate teams on safe navigation, tracking and human recognition. This setup provides teams the flexibility to attempt specific tasks that they have research expertise in and opt out of others.

Another wildly successful competition in AI was the DARPA Grand Challenge [Seetharaman et al., 2006], offering one million dollars to the first team which could autonomously complete a 240 kilometer on- and off-road course. In the first Grand Challenge in 2004, the best competitor traveled just 11 kilometers before flipping over and catching fire. In the following year, five vehicles successfully completed the entire course. In 2007, just three and a half years after the first competition, six teams finished the Urban Challenge, which mixed robotic and non-robotic vehicles together in an urban setting and enforced California traffic laws. The Grand Challenge is the perfect illustration of a competition which pushed the state-of-the-art, particularly in systems engineering. Prior to the competition it was widely believed that current technology was simply not up to the challenge of this difficult task. This success was due in part to the fact that from the start it was well funded, attracted top-notch research institutions, received wide media attention, and provided a compelling task.

Another successful competition has arisen in the computer vision community, the Pascal Visual Object Classes Challenge (VOC) (http://pascallin.ecs.soton.ac.uk/challenges/VOC/).
This is an EU-funded competition which began in 2005 with the goals of providing a yearly competition for object class recognition and localization and a set of standards and tools for evaluating algorithm performance. In contrast to standard benchmark datasets, entrants are evaluated on a novel dataset every year, which prevents algorithms from being tailored specifically to a single data set. The key features of this competition that led to its success were that it involved high profile researchers at the organizing level, was held in conjunction with major conferences, and was relatively inexpensive to participants.

3 SRVC and Our Experience

Although the previously mentioned competitions have been successful, they do not address many of the issues of embodied vision. The SRVC is an ideal competition to push the state-of-the-art in this field. This competition was held for the first time at the Association for the Advancement of Artificial Intelligence (AAAI) conference in 2007 in Vancouver, and again in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2008 in Alaska. The competition is a visual search task in an unknown environment. The entrants are given a list of objects to find in the environment, with the environment containing only a subset of those listed objects along with additional distractor objects. Using this list, the robots autonomously acquire data about these objects from the Internet in a fixed amount of time. Once data collection and learning are complete, the robot searches the unknown environment with the task of finding the objects using the data acquired from the Internet. At the end of the exploration phase the robot returns an image for each object type containing a single bounding box around the target. The scoring is based on the bounding box accuracy.

In addition, there is a software league, where the entrants are not responsible for acquiring images of the environment.
Instead they are given a set of images taken of the environment that include both the objects and other scene elements.

Our team entered two versions of our Curious George robot to the 2007 and 2008 SRVC. We gained a wealth of experience during our system development process and actual contest participation, which will be described in the following section.

3.1 Internet Data Collection and Filtering

In the SRVC, all of the training data is acquired from the Internet at contest time with no human intervention. For visual appearance models, this generally means collecting a dataset of images via an Internet image search engine like Google. Given the varied nature of Internet image search results, a system was needed to filter the output before training could be performed. We implemented two phases of training data filtering. The first phase removed all cartoons, illustrations, technical schematics, and other non-photographic images using a quality score developed in [Ke et al., 2006]. The second phase prioritized groups of images that displayed a high degree of similarity, since we determined empirically that these images were more likely to contain the target object.

It is interesting to note that approaches for image filtering and ranking such as [Fergus et al., 2004] are generally evaluated on a dataset drawn from a similar distribution to the training data. This is not the case for the SRVC scenario since data collected by a robot is not likely to be from canonical viewpoints in uncluttered backgrounds – two properties that are common in Internet images. As a result, we did not pre-filter results based on generic object categorization techniques.

In addition to acquiring training images, other relevant data may be found on the Internet, such as contextual clues from LabelMe [Russell et al., 2008] and size priors from the Wal-Mart catalog.
These all represent important sources of information for recognition purposes, but this potential has been left unaddressed in the literature. Although we were not able to integrate this information into our system in time for the SRVC competitions, the competitions have encouraged us to pursue this direction for collecting additional training data aside from the traditional image datasets.

3.2 Robotics

The nature of a robotics contest demands the construction of a physical system with numerous abilities ranging from basic navigation, to the construction of a distributed computation system, to performance on the task itself (object recognition in our case). While each of these individual tasks is easily achieved, their integration within a physical system presents a high level of complexity for system designers. For example, from a research perspective, robot navigation and mapping is largely a solved problem in indoor environments. However, it is a significant practical challenge to prepare a robot to navigate a previously unseen contest environment where it is required to visit potentially unsafe locations that provide good views of objects. Similarly, distributing a computational process across several networked processors is not a significant challenge in many situations, but when this system must be mounted on a mobile robot and is thus subject to constraints on weight, power and size, many difficulties present themselves.

For the 2008 SRVC, we developed an active exploration and real-time vision system in order to announce objects that were discovered in real-time during the run. Our robot's architecture involved the low power, on-board PC mounted inside our Pioneer AT3 robot for low-level control, and four networked laptop systems responsible for: i) real-time processing of visual imagery, visual attention, robotic planning, gaze planning, and overall control; ii) specific object recognition; iii+iv) generic category recognition.
The distributed nature of this architecture required captured imagery to be transferred between computers via a network connection, along with associated software components for sending and receiving it.

Our visual attention system processed imagery obtained from a stereo camera system in real-time in order to determine the locations of interesting objects and structures in the environment. This represented a significant new functionality when compared with our 2007 contest entry, which required the robot to “stop and shoot” before performing visual attention. The real-time functionality prevented our robot from “driving blindly” and allowed it to continuously monitor the peripheral view until a sufficiently interesting location was seen, at which point foveal images could be collected. This behaviour allowed the robot to cover the environment rapidly while ignoring uninteresting regions, thus capturing images of a large number of candidate objects. It is unlikely that any of our team members would have developed such a behaviour had it not been for the SRVC, since stationary visual attention is equally easy to demonstrate in academic publication. However, now that such a system has been developed, our team has the ability to evaluate the behaviour of interactive, real-time visual attention on a mobile platform, and this continues to be an interesting research direction for our research group.

3.3 Vision

Images collected by a robot during the embodied object recognition scenario often capture objects from a non-standard viewpoint, scale, or orientation. In other cases, the images do not contain an object at all. In fact, during our SRVC experience, we found that images collected by the robot rarely contained any target object. As a result, we designed our classification system to have a low false positive rate. We employed a two-stage object detection approach.
The first stage used a specific object recognition system based on matching SIFT features and geometric consistency that generally produced few false positives but provided low recall for generic object classes. The second stage employed a generic object classifier based on the spatial pyramid match kernel [Lazebnik et al., 2006] to produce detections for those objects that were not captured by the previous approach.

We designed a peripheral-foveal vision system that attempts to improve the quality of robot-collected imagery by locating interesting regions of the environment and imaging these regions in high resolution. This design choice was inspired by the human visual system, which makes extensive use of peripheral-foveal vision. Our peripheral camera was a Point Grey Research Bumblebee stereo camera with a relatively wide field of view. Spectral saliency [Hou and Zhang, 2007] was fused with stereo depth information to locate regions of interest in peripheral images. The foveal camera was a Canon G7 point-and-shoot camera. We employed the G7's high zoom, combined with a pan-tilt unit, to obtain tightly cropped, high resolution images of interesting objects identified in the peripheral view. We found that the image quality obtained by our foveal system significantly improved object recognition performance. This is likely due to the fact that Internet images are also often captured by high-quality digital cameras.

3.4 Benefits to Our Research

UBC's participation in the SRVC has led us to develop Curious George, a powerful evaluation platform that enabled further development of embodied recognition algorithms. In terms of quantifiable research output, the platform developed directly for the SRVC contest has led to a number of publications [Meger et al., 2007; Meger et al., 2008], and several higher-level algorithms have since been designed which leverage the platform [Forssen et al., 2008; Viswanathan et al., 2009].
Our resulting research directions can be summarized into three categories: the effect of viewpoint in object recognition, the use of existing online databases for semantic training information, and the use of additional cues available to an embodied platform during scene understanding.

Our study of viewpoint in object recognition has examined the implications of having only a single canonical viewpoint in the training image dataset (as is often the case with Internet images). We evaluated several recognition methods (namely feature matching with and without a geometric constraint) in terms of their ability to recognize objects from a range of viewpoints, and reported a range of success for this task.

We showed that annotated datasets such as the LabelMe [Russell et al., 2008] database can provide semantic information for tasks other than simply object recognition. We learned object-place relations from LabelMe (e.g., fridges are likely to be found in the kitchen) and used this spatial-semantic model to perform place labeling in simulated environments. We also described the use of this model to inform object search [Viswanathan et al., 2009]. Our future plans are to combine this technology with object recognition, as demonstrated in the SRVC, to construct a successful integrated scene understanding system.

Finally, we have employed structure from stereo to register object locations and construct a 3D object map, and demonstrated how this object map allows a robot to collect multiple viewpoints of target objects to improve classification accuracy. One of our team members is currently employing the raw structure information to utilize scale priors for object recognition. Overall, the SRVC has stimulated a wide variety of excellent research in our group by forcing us to examine object recognition in a realistic setting.

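
To make the idea of object-place relations concrete, the following Python sketch shows one simple way such relations could be learned from annotated scenes and then used for place labeling. It is an illustration only and not a description of the model in [Viswanathan et al., 2009]: the function names, the toy scene list, the uniform prior over places, and the add-one smoothing are all assumptions introduced here, and the model is a plain naive Bayes classifier over object labels.

```python
import math
from collections import Counter, defaultdict


def learn_object_place_counts(annotated_scenes):
    """Count how often each object label occurs in each place type.

    annotated_scenes: iterable of (place_label, [object_label, ...]) pairs,
    standing in for annotations from a LabelMe-style collection (hypothetical data).
    """
    counts = defaultdict(Counter)
    for place, objects in annotated_scenes:
        counts[place].update(objects)
    return counts


def place_posterior(observed_objects, counts, alpha=1.0):
    """Return P(place | observed objects) under a naive Bayes model.

    Assumes a uniform prior over places and add-alpha smoothing of the
    per-place object frequencies (both are illustrative choices).
    """
    vocabulary = {obj for c in counts.values() for obj in c}
    log_scores = {}
    for place, c in counts.items():
        total = sum(c.values()) + alpha * len(vocabulary)
        log_scores[place] = sum(
            math.log((c[obj] + alpha) / total) for obj in observed_objects
        )
    # Normalize in log space to avoid underflow with many observations.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {place: math.exp(s - top) / norm for place, s in log_scores.items()}


# Toy annotations (hypothetical), then a query with two recognized objects.
scenes = [
    ("kitchen", ["fridge", "mug", "kettle"]),
    ("kitchen", ["fridge", "sink", "mug"]),
    ("office", ["monitor", "keyboard", "mug"]),
    ("office", ["monitor", "stapler"]),
]
counts = learn_object_place_counts(scenes)
posterior = place_posterior(["fridge", "mug"], counts)
best_place = max(posterior, key=posterior.get)
print(best_place, posterior)
```

The same counts could in principle be read in the other direction, ranking places by how likely they are to contain a query object, which is the sense in which such a model can inform object search.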
4 Improving Research Outcomes

Research competitions should advance the state-of-the-art by providing additional training data and context information. In addition, realistic environments would allow the use of more advanced learning methods. This section provides potential modifications to the SRVC contest that we believe will encourage these directions.

4.1 Training

Embodied object recognition systems require a source of training data from which to learn the appearance and properties of target objects. In the past, for the SRVC, this data has been obtained entirely using the Internet at the time of competition. However, the vast majority of images from the Internet are from a single canonical viewpoint, which implies that the resulting classifier will only be successful on that viewpoint. Given the paucity of the data, this setting does not encourage 3D recognition, which may be required for successful embodied recognition. One modification would be to allow competitors to know a superset of the classes beforehand, enabling the use of manually labeled training data. This is not an unrealistic scenario since most robots will likely be deployed in known environments where the set of objects can be carefully catalogued. Also, it still presents a significant challenge, as demonstrated by the VOC competition where recognition is still very poor. Alternatively, the types of environments (e.g., office, kitchen, bedroom, etc.) could be provided. For example, knowing that the scene was a kitchen would allow researchers to construct priors on appearance, 3D shape and scale for all objects that are likely to occur in a kitchen. In either case, Internet data acquisition would still be allowed at competition time to augment data provided by system designers.
This would allow for research into the interplay between scene information like surface orientations and real-world scale and appearance, similar to works such as [Hoiem et al., 2006; Gould et al., 2008].

4.2 Environment and Context

The SRVC contest environment has, so far, required the robot to navigate in an area that is quite small and to locate objects that were placed on tables covered with white table cloths. This scenario presents a much simpler segmentation problem when compared with a realistic home environment, and does not allow for evaluation of system performance over long distances and operating durations.

Thus, while object recognition methods that rely on good segmentation results might succeed in the contest, they are likely to fail in more realistic environments. This outcome is misaligned with the objective of pushing research in the direction of improving real-world performance, and we believe that future SRVC contests should include increasingly realistic environment designs.

To a naïve audience, embedding the competition in a realistic environment might seem likely to increase difficulty. However, it can actually lead to better performance if the additional information available about context is leveraged by the competing systems. Respecting the co-occurrence and co-location relationships found in natural environments when placing objects would allow competitors to exploit these relationships for object recognition. It would also help eliminate false matches by recognizing that an object does not belong in a particular location.

In addition, it would be interesting to partition the environment into places that appear in real environments (e.g., kitchen, bedroom, etc.) and to place query objects in the locations where they are normally found. This would allow competitors to exploit object-place relations to identify potential object locations, thus facilitating efficient coverage of the environment.
There are obviously logistical problems in having a multi-room environment and allowing for an audience. However, using dividers, it is possible to create room-like subdivisions without the need for entirely separate rooms. This would create an environment similar to many “open concept” homes. It is also possible to create recognizable, logical locations in a single room by separating these locations with empty space; however, this should be specified to competitors.

5 Improving Participation

The purpose of a competition in research is to provide both an opportunity to exchange ideas and a venue to evaluate and encourage state-of-the-art research. A particular challenge in an embodied recognition competition is to encourage participation of both robotics and vision researchers. In this section, we discuss practical suggestions to increase researcher participation.

5.1 Changes in the Setting and Rules

Object class recognition research currently draws on a variety of cues, such as colour, contours, and texture. In addition, there is an active research community [Vogel and Murphy, 2007] that seeks to utilize scene context for recognition. We propose varying the difficulty and scoring of the competition in a way that rewards the successes of specific methods on certain object types that might be challenging to recognize using simple object recognition techniques.

Another interesting modification might be to provide different levels of information before and during the competition. For example, the object type “bottle” could be provided beforehand, and the robot might be required to recognize a specific object (e.g., coke bottle, milk bottle, etc.) during the competition. To make the problem more challenging, the contest could allot points for identifying unknown objects (i.e., those that do not appear on the list). In addition, including relative location information for some objects (e.g., the book is beside the TV) can provide context information useful for recognition of objects that are particularly challenging given the state-of-the-art.

Additionally, the contest could allow two teams to compete simultaneously. The team which finds the objects in the environment faster would receive a higher score. Another interesting case that can push forward the robotics aspect of the contest would be to allow multiple robots per team to explore the environment. The robots can cooperatively capture images from different viewpoints and share the information to recognize the objects more precisely. However, this change is most likely infeasible to implement in the near future due to the complexity and cost of robots currently being used in SRVC.

5.2 Software League

Although a competition which requires the integration of various research areas is desirable, such a competition discourages participation from smaller research groups that may not have the expertise to implement every aspect required for success. The software league is an example of separating the recognition task from the robotics challenges of active vision and navigation; however, some modifications are needed in order to improve participation in this league.

The first thing to note is that the impact of this competition is dependent on the significance of the results in the competition. In the object recognition community, techniques are evaluated on a large number of images, thus ensuring that improvements over previous techniques are statistically significant. This is the case even in a competition environment like VOC. Results from the SRVC competition, however, carry little statistical significance due to the small sample size (e.g., one mug in the environment).

One possibility to address this is in the software league. Here, image data can be acquired from real environments instead of the contest setting.
This provides an opportunity to include much more realistic context. Images of the same target objects distributed in a natural environment, such as a kitchen or office, can be taken ahead of time. In order to incorporate the “embodied vision” aspect of the contest, additional information such as a map of the environment, the location and orientation of the camera for each image, and stereo image pairs can be provided with little extra effort. This would make the software league a more interesting research problem and help distinguish it from other object recognition competitions, thus attracting more participants. In addition, removing the limitations of data collection by a robot also allows for the creation of data sets composed of a larger number of objects and environments, thus increasing the statistical significance of the results.

5.3 Robot League

As already mentioned, it has been a significant challenge for teams in previous years of the SRVC contest to achieve reliable navigation within the contest environment. Since the primary research problems posed by the SRVC are not intended to focus on low-level robot navigation, it may be useful to consider relieving teams of the navigation burden in the future.

First, the contest organizers could provide entrants with a standardized robot platform that has basic navigation abilities. In this case, teams would only be responsible for higher level task planning and processing of the visual imagery obtained by the robot. While this solution solves many of the problems posed by navigation, it also unfortunately introduces several complications. Primarily, each team depends on a slightly different set of sensing modalities. During the 2007 and 2008 SRVC contests, we have seen: monocular video cameras, monocular still cameras, laser rangefinders, sonar range sensors, binocular stereo cameras, and multi-camera stereo systems.
Any standardized test platform would be required to provide teams with some subset of these sensors, and this set would ideally be sufficiently large so that it does not discourage any teams from competing. Another significant challenge is the ability for each team to practice and develop on the standard platform. Either numerous platforms would need to be distributed, or teams would require periodic access to a single platform. Both of these options entail significant cost that would need to be minimized. This could likely be accomplished by employing a standard robot architecture such as ROS (http://pr.willowgarage.com/wiki/ROS) that would allow much of the development to occur in simulation and with surrogate robots for hardware testing.

A simpler method for reducing navigation challenges is to provide teams with a more detailed specification of the contest environment geometry. For example, knowing the exact size and shape of furniture allows for proper mounting of sensors and tuning of sensor models in mapping algorithms. In this case, it might be possible for each team to still employ their own robot.

5.4 Facilitating Code Re-use

Re-usable code is an important output of a successful competition, as it is in any collaborative effort. Since the goal of competitions is to move the state-of-the-art in a desired direction, successive solutions to the competition's problem benefit from having previous work available as a starting point. Re-usable code also lowers the barrier to entry for teams new to the competition, enhancing accessibility of the competition, and in turn visibility.

Although code sharing is encouraged/required by SRVC, subsequent re-use of this code appears to be non-existent. This is a result of the differences in platforms and approaches used by each of the participants.
For example, our robot base, sensor package, peripheral-foveal vision system, and multi-processor distributed recognition system together formed a very specific point in the solution space. This entire setup would likely have to be replicated in order to re-use our code. However, some elements of a code base are generally usable (training set construction and feature extraction, for example). One possible solution would be to require the use of an open source robotics package with a distributed architecture, such as ROS, which allows different components to be easily chained together. In such a system, the components are unaware of each other, so they can be mixed and matched at will. Standardization on a single robotics platform would be an optimal solution if the funding were available.

Without the logistical problems inherent in hardware, the software league offers much greater potential for code sharing. One way to encourage this would be to design the software challenges to be explicitly modular in nature. Instead of a single software league challenge, it could be broken into steps such as download and filtering, classification, and localization.

5.5 Funding and Visibility

A significant constraint for both organization and participation is funding. The contest requires a large amount of exhibition space in which to set up the environment. Obstacles and target objects for the environment must be purchased. To encourage participation, support must be offered to teams to subsidize the cost of shipping their robots to the competition. In addition, travel costs for the robot teams can be high, since a large number of team members may be needed to run and maintain the robotic hardware. Clearly, secure funding through public and private sponsorship will attract participation.

Aside from attracting more participation from research groups, improving the visibility of the contest can attract sponsorship.
One simple change to the 2008 competition that was surprisingly compelling was the addition of bonus points for teams who made a real-time status display showing matches as they were made. This made the contest more interesting for the audience to watch by providing a sense of what the robots were doing even when they were not moving. The crowd was audibly excited when a new match was displayed, identifying with the robots and responding to the irregular reinforcement aspect of the display. As an element of visibility and outreach for a robotic competition, this is a very powerful lesson. In addition, it pushed research towards techniques that provide real-time recognition. In the future, explicitly encouraging competitors to provide real-time displays of what the robot has found or is trying to do will draw even more attention to the contest.

6 Conclusions

Properly designed contests significantly promote the development of the state of the art. They can comprise realistic and complex settings not seen in standard benchmark datasets, providing both a strong test for current solutions and rich context that can be leveraged to advance research. The Semantic Robot Vision Challenge is one such competition, and it provides a venue for embodied recognition. It has given valuable impetus to our own research, yielding insights and revealing new directions that need more research. However, to have significant impact in the future, this contest needs numerous modifications.

References

[Fergus et al., 2004] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proceedings of the 8th European Conference on Computer Vision, Prague, Czech Republic, pages 242–256, May 2004.

[Forssén et al., 2008] P.-E. Forssén, D. Meger, K. Lai, S. Helmer, J. J. Little, and D. G. Lowe. Informed visual search: Combining attention and object recognition. In Proceedings of ICRA, May 2008.

[Gould et al., 2008] Stephen Gould, Paul Baumstarck, Morgan Quigley, Andrew Y. Ng, and Daphne Koller. Integrating visual and range data for robotic object detection. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008.

[Hoiem et al., 2006] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.

[Hou and Zhang, 2007] Xiaodi Hou and Liqing Zhang. Saliency detection: A spectral residual approach. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:1–8, 2007.

[Ke et al., 2006] Yan Ke, Xiaoou Tang, and Feng Jing. The design of high-level features for photo quality assessment. In CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 419–426, Washington, DC, USA, 2006. IEEE Computer Society.

[Kitano et al., 1997] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, and Eiichi Osawa. RoboCup: The robot world cup initiative. In W. Lewis Johnson, editor, Proceedings of the First International Conference on Autonomous Agents, New York, 1997. ACM Press.

[Lazebnik et al., 2006] S. Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.

[Mackworth, 1993] Alan K. Mackworth. On seeing robots. In A. Basu and X. Li, editors, Computer Vision: Systems, Theory and Applications, pages 1–13. World Scientific Press, Singapore, 1993. Reprinted in P. Thagard (ed.), Mind Readings, MIT Press, 1998.

[Meger et al., 2007] David Meger, Per-Erik Forssén, Kevin Lai, Scott Helmer, Tristram Southey, Sancho McCann, Matthew Baumann, James J. Little, David G. Lowe, and Bruce Dow. Curious George: An attentive semantic robot. In IROS 2007 Workshop: From sensors to human spatial concepts, San Diego, CA, USA, November 2007. IEEE.

[Meger et al., 2008] D. Meger, P.-E. Forssén, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. J. Little, and D. G. Lowe. Curious George: An attentive semantic robot. Robotics and Autonomous Systems Journal, Special Issue From Sensors to Human Spatial Concepts, June 2008.

[Russell et al., 2008] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision (special issue on vision and learning), 77(1-3):157–173, 2008.

[Seetharaman et al., 2006] Guna Seetharaman, Arun Lakhotia, and Erik Philip Blasch. Unmanned vehicles come of age: The DARPA grand challenge. Computer, 39(12):26–29, 2006.

[Viswanathan et al., 2009] Pooja Viswanathan, David Meger, Tristram Southey, James J. Little, and Alan Mackworth. Automated spatial-semantic modeling with applications to place labeling and informed search. In Proceedings of Canadian Robot Vision, Kelowna, Canada, 2009.

[Vogel and Murphy, 2007] Julia Vogel and Kevin Murphy. A non-myopic approach to visual search. In Proceedings of the Fourth Canadian Conference on Computer and Robot Vision CRV, Montreal, Canada, May 2007.