Reasonable Perception: Connecting Vision and Language Systems
An image with the accompanying observation “The table has a few books on it, and the table itself is white.” The data was collected from people making observations in VR environments.
Input: "The table has a few books on it, and the table itself is white."
This perception is REASONABLE.
REASONING:
Books are typically found at the same location as a table. So it is reasonable for books to be located on a table.
Furthermore, tables are typically made of materials that can be colored white. So it is reasonable for a table to be white.
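The example above can be sketched in code as a check of each claim in a description against commonsense facts, with an explanation produced for every supported claim. This is only a minimal illustration, not the system described in the paper: the `FACTS` set, the relation names `AtLocation` and `CanHaveColor`, the explanation templates, and the functions `judge` and `explain` are all hypothetical, and the claims are written out by hand rather than parsed from the spoken observation.

```python
from typing import List, NamedTuple, Tuple


class Claim(NamedTuple):
    """One (subject, relation, object) assertion taken from a description."""
    subject: str
    relation: str
    obj: str


# Hypothetical toy knowledge base; a real system would query a much larger
# commonsense resource instead of this hand-written set.
FACTS = {
    ("book", "AtLocation", "table"),
    ("table", "CanHaveColor", "white"),
}

# Hypothetical sentence templates, one per relation, used to word the reasons.
TEMPLATES = {
    "AtLocation": "{s}s are typically found at the same location as a {o}.",
    "CanHaveColor": "{s}s are typically made of materials that can be colored {o}.",
}


def judge(claims: List[Claim]) -> Tuple[bool, List[str]]:
    """Return (is_reasonable, reasons); fails on the first unsupported claim."""
    reasons = []
    for c in claims:
        if (c.subject, c.relation, c.obj) not in FACTS:
            return False, [f"No support found for: {c.subject} {c.relation} {c.obj}."]
        reasons.append(
            TEMPLATES[c.relation].format(s=c.subject.capitalize(), o=c.obj)
        )
    return True, reasons


def explain(claims: List[Claim]) -> str:
    """Format the verdict and reasons in the style of the example output."""
    ok, reasons = judge(claims)
    verdict = "REASONABLE" if ok else "NOT REASONABLE"
    return "\n".join([f"This perception is {verdict}.", "REASONING:", *reasons])


if __name__ == "__main__":
    # Claims hand-extracted from: "The table has a few books on it,
    # and the table itself is white."
    observation = [
        Claim("book", "AtLocation", "table"),
        Claim("table", "CanHaveColor", "white"),
    ]
    print(explain(observation))
```

Keeping the verdict (`judge`) separate from the wording (`explain`) means the same check can back either a yes/no validation or a full explanation.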
Leilani H. Gilpin, Cagri Zaman, Danielle Olson, and Ben Z. Yuan. 2018. Reasonable Perception: Connecting Vision and Language Systems for Validating Scene Descriptions. In Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI '18). Association for Computing Machinery, New York, NY, USA, 115–116. https://doi.org/10.1145/3173386.3176994
Understanding explanations of machine perception is an important step towards developing accountable, trustworthy machines. Furthermore, speech and vision are the primary modalities by which humans collect information about the world, but the linking of visual and natural language domains is a relatively new pursuit in computer vision, and it is difficult to test performance in a safe environment. To couple human visual understanding and machine perception, we present an explanatory system for creating a library of possible context-specific actions associated with 3D objects in immersive virtual worlds. We also contribute a novel scene description dataset, generated natively in virtual reality containing speech, image, gaze, and acceleration data. We discuss the development of a hybrid machine learning algorithm linking vision data with environmental affordances in natural language. Our findings demonstrate that it is possible to develop a model which can generate interpretable verbal descriptions of possible actions associated with recognized 3D objects within immersive VR environments.