NavigAid: AI-Driven Orientation and Mobility System for the Blind
NSF SBIR Phase II Project
Role: Principal Investigator
Year: 2020 - Present
There are 253 million people worldwide, 8 million of them in the US, living with varying levels of visual impairment. Currently, there are no effective, scalable, and affordable solutions to help blind and visually impaired people understand and navigate non-immediate physical spaces, such as a hospital lobby, a school corridor, or a new office environment.
NavigAid is an AI-driven mobile Orientation and Mobility (O&M) system that provides contextually relevant, task-driven solutions to problems such as finding objects, identifying paths of ingress and egress, and understanding the layout of an environment. NavigAid is enabled by our core technical innovation, Ally Networks, a novel neural network architecture that extracts semantically and functionally relevant spatial features from images to create a human-like understanding of physical environments. The ability to obtain useful and timely information about the environment will give our users much-needed spatial independence and encourage them to navigate even extremely cluttered environments such as train stations.
As a result of this project, our solution will give unprecedented independence to one of the most underserved communities. With NavigAid, a visually impaired individual will be able to navigate education and employment spaces more independently. Our solution has the potential to reach the 54-118 million visually impaired people around the world and to help reduce this community's 50% high school dropout rate, 74% unemployment rate, and high rates of depression.
Ally Networks are multimodal neural networks trained to learn a shared spatial code that is robust to changing environmental conditions.[1]
MULTIMODAL ALLY NETWORKS
Solving the spatial and navigational problems of individuals with visual impairment requires an artificial vision system with contextual and spatial awareness. Such a technology should robustly encode spatial information and effectively communicate the pertinent information from the immediate environment. Current technologies fail to provide a satisfactory solution because (1) there is no reliable computer vision algorithm that can encode spatial information in complex visual scenes, and (2) existing deep neural network architectures make unacceptable mistakes in novel conditions. Our spatial artificial intelligence system, NavigAid, will address these problems and enable mobile and embedded devices to be used as efficient and reliable spatial intelligence aids. Enabled by our core technical innovation, Ally Networks, our system will robustly localize functionally and semantically relevant information in an environment, and provide navigational assistance beyond what is provided by guide dogs or by any other existing technology. It will provide contextually relevant, task-driven solutions to problems such as finding objects, identifying paths of ingress and egress, and understanding the salient features in an environment.
Ally Networks is a multimodal neural network architecture that learns robust spatial semantics rather than 2-dimensional feature representations. By propagating signals among different visual modalities, such as depth and surface normals, it can extract spatial constraints that are not biased by the visual features of the training data, making it less prone to errors. The unique multimodal learning strategy behind Ally Networks is a high-risk endeavor that will have broad impact if successful: large-scale multimodal learning is a difficult problem, and the success of our technique would transform the state of the art in neural-network vision. It has the potential to shift neural-network vision from systems that fail in often inscrutable ways to systems that fail only under circumstances in which human vision would also fail.
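The cross-modality idea above can be sketched in a few lines. This is a hypothetical illustration, not the actual Ally Networks implementation: two modality "encoders" (RGB and depth) project their features into a shared spatial code, and an agreement score rewards embeddings that align. In the real architecture the encoders are learned end to end; the fixed linear maps below are stand-ins.

```python
import math

def encode(features, weights):
    """Toy linear 'encoder' mapping modality features into the shared code."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def agreement(a, b):
    """Cosine similarity between two embeddings in the shared code space."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# RGB and depth views of the same scene patch should map to nearby codes;
# the feature vectors and weights here are made up for illustration.
rgb_code = encode([0.9, 0.1, 0.4], [[1, 0, 0], [0, 1, 1]])
depth_code = encode([0.8, 0.2, 0.5], [[1, 0, 0], [0, 1, 1]])
print(agreement(rgb_code, depth_code))  # close to 1.0 when modalities agree
```

During training, an agreement score like this can act as a consistency constraint, penalizing a modality whose embedding drifts away from the others.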
CREATING ROUTES
The route creation system uses the AR services in mobile devices to track the phone's location relative to a set of environmental features. The process of identifying environmental features using AR is called "localization". In this process, we use three-dimensional point features obtained from the environment to generate a map in which we simultaneously locate the device; AR services provide the necessary infrastructure for this process. In addition to localizing the device, we use our object localization system to identify major objects such as doors, tables, and chairs along the path.
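The recording side of this process can be sketched as follows. This is an illustrative sketch with hypothetical types, not the actual AR API: the AR session streams device poses during route creation; we drop a path marker whenever the device has moved far enough from the last one, and attach objects found by the object localizer as labeled landmarks. The one-meter spacing is an assumed parameter.

```python
import math

class RouteMap:
    def __init__(self, min_spacing=1.0):  # marker spacing in meters (assumed)
        self.min_spacing = min_spacing
        self.markers = []    # device poses along the path
        self.landmarks = []  # (label, pose) pairs from the object localizer

    def record_pose(self, pose):
        """Record a marker only if far enough from the previous one."""
        if not self.markers:
            self.markers.append(pose)
            return
        if math.dist(pose, self.markers[-1]) >= self.min_spacing:
            self.markers.append(pose)

route = RouteMap()
route.record_pose((0.0, 0.0, 0.0))
route.record_pose((0.3, 0.0, 0.0))   # too close to the last marker: skipped
route.record_pose((1.2, 0.0, 0.0))   # far enough: recorded
route.landmarks.append(("door", (2.0, 0.0, 0.0)))
print(len(route.markers))  # 2
```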
The route tracking system loads an existing map, localizes the device in the environment, and loads the relevant path markers and environmental features. After the map is loaded, the user is guided to follow the path markers. Loading an existing map and localizing the device on it is called "re-localization", and is performed by the AR system. We then match path markers and environmental features relative to the re-localized device.
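A minimal sketch of the guidance step after re-localization, under assumed names and a ground-plane (x, z) coordinate convention: once the AR system has located the device on the stored map, pick the nearest unvisited path marker and compute the heading toward it.

```python
import math

def next_marker(device, markers, visited):
    """Index of the nearest marker not yet visited, or None."""
    candidates = [i for i in range(len(markers)) if i not in visited]
    if not candidates:
        return None
    return min(candidates, key=lambda i: math.dist(device, markers[i]))

def heading(device, target):
    """Bearing (radians) from the device to the target on the ground plane."""
    return math.atan2(target[1] - device[1], target[0] - device[0])

markers = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
device = (0.9, 0.2)
i = next_marker(device, markers, visited={0})
print(i, heading(device, markers[i]))  # guides the user toward marker 1
```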
The Route Tracking feature of NavigAid allows a user to record a route in an environment and retrace it when needed.
NARRATING SCENES
The scene describer provides distance and direction information for objects in the environment.
We have developed an object localization system, which we are currently developing into a Scene Narrative component. The Scene Narrative component uses the object localization system to locate a series of objects in the environment and inform the user. We are designing the interface for this component, which will both visually indicate the location of each object on the screen for low-vision users and verbally announce its presence for blind users. The figure shows an initial UI design for the Scene Narrative component. We are currently building the necessary elements using SwiftUI for iOS, including the visual indicator for identified objects, the navigation layout, and the control mechanism to start and stop the narration.
Once the interface components are complete, we will implement the controller functions that communicate with our neural network architecture, feed the camera image to the network, and fetch the results.
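The narration logic in that controller can be sketched as follows. This is a hypothetical illustration: the field names and the clock-direction convention (a common O&M practice) are assumptions, and the real component runs on-device in Swift, feeding live camera frames to our network rather than hard-coded detections.

```python
def clock_direction(bearing_degrees):
    """Map a bearing (0 = straight ahead, positive = right) to a clock hour."""
    hour = round(bearing_degrees / 30.0) % 12
    return 12 if hour == 0 else hour

def narrate(label, distance_meters, bearing_degrees):
    """Turn one detection from the object localizer into a spoken sentence."""
    return (f"{label}, {distance_meters:.0f} meters, "
            f"at {clock_direction(bearing_degrees)} o'clock")

# Example detection: a door 3.2 m away, 28 degrees to the right.
print(narrate("door", 3.2, 28))  # door, 3 meters, at 1 o'clock
```

On iOS, a string like this would be handed to the speech synthesizer; the visual indicator for low-vision users would use the same detection data.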
[1] Kraft, Adam D. Vision by Alignment. Unpublished doctoral dissertation, Massachusetts Institute of Technology, 2018.