Wayve, an artificial intelligence company based in the United Kingdom, has introduced LINGO-2, a system that brings natural language processing to self-driving and changes how autonomous vehicles perceive, explain, and navigate the world around them. LINGO-2 integrates vision, language, and action to both explain and determine driving behavior, and it accepts driving instructions in natural language, adapting its behavior in response to language prompts. It can respond to language instructions and explain its driving actions in real time, marking a significant step forward for autonomous driving technology.
Wayve LINGO-2 is a driving model that integrates vision, language, and action to explain and determine driving behavior. It is the first closed-loop vision-language-action driving model (VLAM) tested on public roads. The model consists of two modules: the Wayve vision model and an auto-regressive language model. The vision model processes camera images from consecutive timesteps into a sequence of tokens, while the language model is trained to predict a driving trajectory and commentary text. This integration opens up new capabilities for autonomous driving and human-vehicle interaction.
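To make the two-module design concrete, here is a minimal PyTorch sketch of a vision-conditioned autoregressive decoder. Every name and hyperparameter below (`VisionEncoder`, `DrivingVLAM`, the patch-embedding stand-in, the layer sizes) is our own illustrative assumption, not Wayve's published architecture:

```python
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Turns camera frames from consecutive timesteps into one token sequence."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Stand-in patch embedding; Wayve's actual vision model is not public.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        x = self.patch_embed(frames.reshape(b * t, c, h, w))  # (b*t, d, h', w')
        x = x.flatten(2).transpose(1, 2)                      # (b*t, patches, d)
        return x.reshape(b, -1, x.shape[-1])                  # (b, t*patches, d)

class DrivingVLAM(nn.Module):
    """Autoregressive decoder that predicts commentary text and trajectory
    tokens, conditioned on the vision tokens."""
    def __init__(self, vocab_size: int = 32_000, d_model: int = 512):
        super().__init__()
        self.vision = VisionEncoder(d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Text tokens and discretised action tokens share one output vocabulary.
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.vision(frames)        # conditioning context
        x = self.embed(token_ids)
        x = self.decoder(x, memory=vision_tokens)  # cross-attend to the scene
        return self.head(x)                        # next-token logits
```

The key idea the sketch captures is that driving actions and commentary come out of the same next-token prediction head, so one model both drives and talks.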
Wayve LINGO-2 uniquely allows driving instruction through natural language: by swapping the order of the text tokens and the driving-action tokens, language becomes a prompt for driving behavior rather than a commentary on it. The model's ability to change its behavior in Wayve's neural simulator, Ghost Gym, in response to language prompts during training demonstrates this adaptability.
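A toy token layout shows what this swap means in practice (the token ids below are purely hypothetical):

```python
# Toy token layout only; the ids are purely hypothetical.
text_tokens   = [101, 523, 77]      # e.g. "turn left at the junction"
action_tokens = [9001, 9002, 9003]  # discretised trajectory tokens

# Commentary ordering: the model drives first, then explains what it did.
commentary_sequence = action_tokens + text_tokens

# Prompt ordering: the instruction comes first, so autoregressive decoding
# of the action tokens is conditioned on it -- language steers the driving.
prompt_sequence = text_tokens + action_tokens
```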
By linking vision, language, and action directly, Wayve LINGO-2 exposes how the AI system makes decisions and opens up a new level of control and customization for driving. The model can predict and respond to questions about the scene and its decisions while driving, providing real-time commentary that captures its motion-planning decisions. This combination of vision, language, and action allows for a deeper understanding of the driving model's decision-making process and offers new possibilities for accelerating learning with natural language.
Wayve LINGO-2 represents a significant advancement in autonomous driving. Unlike its predecessor, LINGO-1, which operated as an open-loop system providing commentary based on visual inputs, LINGO-2 functions as a closed-loop system: it receives language and visual data, acts on them, and its actions in turn change what it observes next. This enhancement enables real-time interaction between the vehicle and its environment, making autonomous driving more intuitive and responsive.
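The open-loop/closed-loop distinction can be sketched as two control flows. The model and environment interfaces below (`describe`, `act`, `reset`, `step`) are assumptions made for illustration, not a real API:

```python
def open_loop_commentary(model, recorded_frames):
    """LINGO-1 style: commentary over recorded video; the model's output
    never feeds back into what it sees next."""
    for frames in recorded_frames:
        print(model.describe(frames))  # observe and explain only

def closed_loop_driving(model, environment, steps: int = 100):
    """LINGO-2 style: the predicted trajectory is executed, which changes
    the scene the model observes at the next step."""
    observation = environment.reset()
    for _ in range(steps):
        trajectory, commentary = model.act(observation)  # vision + language in
        observation = environment.step(trajectory)       # action feeds back
```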
With Wayve LINGO-2, passengers can communicate directly with the vehicle using natural language. This interaction allows for a new level of engagement, where passengers can issue commands or ask for changes in the driving plan. For instance, a passenger might say, “Take the next left” or “Find a parking spot nearby.” LINGO-2 processes these instructions, adjusts its driving strategy accordingly, and verbally confirms the action, ensuring the passenger is always in the loop about the car’s actions.
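A hedged sketch of what such a command flow might look like; the `vlam` object and its `act(..., prompt=...)` method are hypothetical stand-ins, not Wayve's interface:

```python
def speak(text: str) -> None:
    print(f"[car]: {text}")  # stand-in for text-to-speech

def handle_command(vlam, camera_frames, command: str):
    """Prompt the model with the passenger's instruction, then confirm aloud."""
    trajectory, reply = vlam.act(camera_frames, prompt=command)
    speak(reply)  # e.g. "Taking the next left."
    return trajectory

# handle_command(vlam, frames, "Take the next left")
# handle_command(vlam, frames, "Find a parking spot nearby")
```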
Wayve LINGO-2 enhances the driving experience by following commands, providing explanations, and answering questions in real time. If a passenger is curious about why the car chose a particular route or asks what the current speed limit is, LINGO-2 can provide immediate and accurate answers. This capability is particularly useful in building trust and understanding between human passengers and the autonomous system, as it demystifies the technology and aligns it more closely with human-like interaction.
While LINGO-2 introduces several innovative features that enhance autonomous driving through language integration, it still has limitations. These challenges stem primarily from the complexity of language processing combined with dynamic driving conditions. Keeping language-based inputs aligned with driving actions remains a crucial area for ongoing development and refinement.
One of the critical challenges LINGO-2 faces is ensuring that the language instructions are perfectly aligned with the vehicle’s actions. This alignment is vital for safety and efficiency but is complicated by the ambiguity and variability of natural language. For example, a command like “take the next right” can be problematic if “next right” isn’t clearly defined by the immediate context or visible landmarks. The model must be trained to interpret such commands accurately within the vast array of possible driving scenarios it encounters.
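One plausible way to ground such a command is to resolve it against upcoming route data, sketched below. The `Junction` structure and the 15 m reaction threshold are illustrative assumptions, not anything from Wayve:

```python
from dataclasses import dataclass

@dataclass
class Junction:
    distance_m: float          # distance ahead along the planned route
    allows_right_turn: bool

def resolve_next_right(junctions, min_react_distance_m: float = 15.0):
    """Return the first junction that permits a right turn and is far enough
    ahead to react to safely; None means the command is ambiguous and the
    system should ask for clarification instead of guessing."""
    candidates = [j for j in junctions
                  if j.allows_right_turn and j.distance_m >= min_react_distance_m]
    return candidates[0] if candidates else None

# The driveway 5 m ahead is too close to act on; the side street wins.
route = [Junction(5.0, True), Junction(40.0, True)]
print(resolve_next_right(route))  # Junction(distance_m=40.0, allows_right_turn=True)
```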
Addressing noise and misinterpretations in commands given to Wayve LINGO-2 is essential for building a reliable copilot. Noise can occur in various forms, such as background sounds or poorly articulated instructions, leading to misinterpretations of the intended commands. These challenges require robust language processing algorithms to distinguish between relevant and irrelevant auditory data. Furthermore, Wayve LINGO-2 must be designed to request clarification when commands are unclear, ensuring that actions are always based on accurate and confirmed inputs. This approach enhances safety and builds trust with users by demonstrating the system’s ability to handle uncertainties intelligently.
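A minimal sketch of such a confirm-before-acting policy, assuming a speech recognizer that reports a transcript together with a confidence score (the 0.85 threshold is an arbitrary illustration):

```python
def interpret_or_clarify(transcript: str, confidence: float,
                         threshold: float = 0.85) -> dict:
    """Act only on high-confidence transcripts; otherwise echo the transcript
    back and ask the passenger to confirm, so the vehicle never acts on an
    uncertain instruction."""
    if confidence >= threshold:
        return {"action": "execute", "command": transcript}
    return {"action": "clarify",
            "prompt": f"Did you say '{transcript}'? Please confirm."}

print(interpret_or_clarify("take the next right", 0.93))  # executes
print(interpret_or_clarify("take the mext bite", 0.41))   # asks to confirm
```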
Example: Navigating a junction
Example of LINGO-2 driving in Ghost Gym and being prompted to turn left on a clear road.
Example of LINGO-2 driving in Ghost Gym and being prompted to turn right on a clear road.
Example of LINGO-2 driving in Ghost Gym and being prompted to stop at the give-way line.
In this post, we introduced Wayve LINGO-2, the first closed-loop vision-language-action driving model to drive on public roads. We are excited to showcase how LINGO-2 can respond to language instructions and explain its driving actions in real time. This is a first step toward building embodied AI that can perform multiple tasks, starting with language and driving.
If you found this article helpful in understanding Wayve LINGO-2, the closed-loop vision-language-action driving model, comment below. Explore our blog section for more articles like this.