Scientific Frontline: "At a Glance" Summary
- Main Discovery: Columbia Engineering researchers developed a robot that autonomously learns to lip-sync to speech and song through observational learning, bypassing traditional rule-based programming.
- Methodology: The system utilizes a "vision-to-action" language model (VLA) where the robot first maps its own facial mechanics by watching its reflection, then correlates these movements with human lip dynamics observed in YouTube videos.
- Specific Detail/Mechanism: The robot features a flexible silicone skin driven by 26 independent motors, allowing it to translate audio signals directly into motor actions without explicit instruction on phoneme shapes.
- Key Statistic or Data: The robot successfully articulated words in multiple languages and performed songs from an AI-generated album, utilizing training data from thousands of random facial expressions and hours of human video footage.
- Context or Comparison: Unlike standard humanoids that use rigid, pre-defined facial choreographies, this data-driven approach aims to resolve the "Uncanny Valley" effect by generating fluid, human-like motion.
- Significance/Future Application: This technology addresses the "missing link" of facial affect in robotics, a critical component for effectively deploying humanoid robots in social roles such as elder care, education, and service industries.
Almost half of our attention during face-to-face conversation focuses on lip motion. Yet, robots still struggle to move their lips correctly. Even the most advanced humanoids make little more than muppet mouth gestures – if they have a face at all.
We humans attribute outsized importance to facial gestures in general, and to lip motion in particular. While we may forgive a funny walking gait or an awkward hand motion, we remain unforgiving of even the slightest facial misstep. This sensitivity is what gives rise to the “Uncanny Valley” effect: robots often look lifeless, even creepy, because their lips don't move. But that is about to change.
A Columbia Engineering team announced today that they have created a robot that, for the first time, is able to learn realistic lip motions for tasks such as speech and singing. In a new study published in Science Robotics, the researchers demonstrate how their robot used this ability to articulate words in a variety of languages, and even sing a song from its AI-generated debut album, “hello world_.”
The robot acquired this ability through observational learning rather than via rules. It first learned how to use its 26 facial motors by watching its own reflection in the mirror before learning to imitate human lip motion by watching hours of YouTube videos.
“The more it interacts with humans, the better it will get,” promised Hod Lipson, James and Sally Scapa Professor of Innovation in the Department of Mechanical Engineering and director of Columbia’s Creative Machines Lab, where the work was done.
Robot watches itself talking
Achieving realistic robot lip motion is challenging for two reasons: First, it requires specialized hardware containing a flexible facial skin actuated by numerous tiny motors that can work quickly and silently in concert. Second, the specific pattern of lip dynamics is a complex function dictated by sequences of vocal sounds and phonemes.
Human faces are animated by dozens of muscles that lie just beneath soft skin and sync naturally with the vocal cords and lip motions. By contrast, humanoid faces are mostly rigid, operating with relatively few degrees of freedom, and their lip movement is choreographed according to rigid, predefined rules. The resulting motion is stilted, unnatural, and uncanny.
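To make the contrast concrete, here is a minimal, hypothetical sketch of the kind of rule-based lip choreography described above, in which each phoneme is mapped to a single hand-tuned mouth pose. The phoneme labels, pose parameters, and values are illustrative assumptions, not taken from the study.

```python
# Hypothetical rule-based lip sync: one fixed mouth pose per phoneme.
# Pose parameters and values are illustrative only.
VISEME_TABLE = {
    "AA": {"jaw_open": 0.8, "lip_round": 0.1, "lip_press": 0.0},  # as in "father"
    "B":  {"jaw_open": 0.0, "lip_round": 0.0, "lip_press": 1.0},  # lips pressed together
    "W":  {"jaw_open": 0.2, "lip_round": 0.9, "lip_press": 0.0},  # lips puckered
}

def rule_based_lip_pose(phoneme: str) -> dict:
    """Return the same mouth pose for a phoneme every time, regardless of context or timing."""
    return VISEME_TABLE.get(phoneme, {"jaw_open": 0.1, "lip_round": 0.0, "lip_press": 0.0})
```

Because every occurrence of a phoneme produces an identical pose, transitions between sounds are abrupt rather than fluid, which is precisely the stilted quality the data-driven approach tries to avoid.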
In this study, the researchers overcame these hurdles by developing a richly actuated, flexible face and then allowing the robot to learn how to use its face directly by observing humans. First, they placed a robotic face equipped with 26 motors in front of a mirror so that the robot could learn how its own face moves in response to motor activity. Like a child making faces in a mirror for the first time, the robot made thousands of random facial expressions and lip gestures. Over time, it learned how to move its motors to achieve particular facial appearances, an approach called a “vision-to-action” language model (VLA).
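The mirror stage can be pictured with a short sketch. The code below is a minimal illustration, assuming a hypothetical robot and camera interface, of how such self-modeling might work: the robot issues random commands to its 26 facial motors, records the resulting mouth shape as facial landmarks, and fits a small neural network that inverts the mapping, from a desired facial appearance back to motor commands. The interfaces, landmark count, network architecture, and training details are assumptions for illustration and are not taken from the paper.

```python
# Minimal self-modeling sketch (illustrative, not the authors' implementation).
import torch
import torch.nn as nn

NUM_MOTORS = 26          # reported number of facial motors
NUM_LANDMARKS = 2 * 68   # assumed: 68 facial landmarks, flattened (x, y) coordinates

def babble(robot, camera, n_samples=5000):
    """Collect (observed landmarks, motor command) pairs from random facial expressions."""
    data = []
    for _ in range(n_samples):
        cmd = torch.rand(NUM_MOTORS)          # random motor activations in [0, 1]
        robot.set_motors(cmd)                 # hypothetical robot interface
        landmarks = camera.read_landmarks()   # hypothetical landmark extraction from the mirror view
        data.append((landmarks, cmd))
    return data

# Inverse model: facial appearance (landmarks) -> motor commands.
inverse_model = nn.Sequential(
    nn.Linear(NUM_LANDMARKS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, NUM_MOTORS), nn.Sigmoid(),  # keep motor commands in [0, 1]
)

def train(data, epochs=50, lr=1e-3):
    """Fit the inverse model on the self-observed (landmarks, command) pairs."""
    opt = torch.optim.Adam(inverse_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    x = torch.stack([torch.as_tensor(l, dtype=torch.float32) for l, _ in data])
    y = torch.stack([c for _, c in data])
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(inverse_model(x), y)
        loss.backward()
        opt.step()
```

Once trained, this kind of inverse model lets the robot request a target facial appearance and receive the motor commands expected to produce it.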
Then, the researchers placed the robot in front of recorded videos of humans talking and singing, giving the AI that drives the robot an opportunity to learn exactly how human mouths move in the context of the various sounds they emit. With these two models in hand, the robot's AI could now translate audio directly into lip motor action.
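Conceptually, the two learned models can then be chained at run time: an audio-to-lip model predicts a sequence of mouth shapes from sound, and the self-model from the mirror stage converts each shape into motor commands. The sketch below shows this composition under the same hypothetical interfaces as the previous example; the function names and frame rate are illustrative assumptions, not details from the study.

```python
# Illustrative run-time pipeline: audio -> predicted mouth shapes -> motor commands.
import torch

def lip_sync(audio_clip, audio_to_lips, inverse_model, robot, fps=30):
    """Drive the robot's facial motors directly from an audio clip, frame by frame."""
    with torch.no_grad():
        lip_frames = audio_to_lips(audio_clip)   # (T, NUM_LANDMARKS) predicted mouth shapes
        for frame in lip_frames:
            cmd = inverse_model(frame)           # landmarks -> 26 motor activations
            robot.set_motors(cmd)                # hypothetical motor interface
            robot.wait(1.0 / fps)                # hold the pose until the next frame
```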
The researchers tested this ability using a variety of sounds, languages, and contexts, as well as some songs. Without any specific knowledge of the audio clips' meaning, the robot was then able to move its lips in sync.
The researchers acknowledge that the lip motion is far from perfect. “We had particular difficulties with hard sounds like ‘B’ and with sounds involving lip puckering, such as ‘W’. But these abilities will likely improve with time and practice,” Lipson said.
More important, however, is seeing lip sync as one part of a more holistic robot communication ability.
“When the lip sync ability is combined with conversational AI such as ChatGPT or Gemini, the effect adds a whole new depth to the connection the robot forms with the human,” explained Yuhang Hu, who led the study for his PhD. “The more the robot watches humans conversing, the better it will get at imitating the nuanced facial gestures we can emotionally connect with.”
“The longer the context window of the conversation, the more context-sensitive these gestures will become,” he added.
The missing link of robotic ability
The researchers believe that facial affect is the “missing link” of robotics.
“Much of humanoid robotics today is focused on leg and hand motion, for activities like walking and grasping,” said Lipson. “But facial affect is equally important for any robotic application involving human interaction.”
Lipson and Hu predict that warm, lifelike faces will become increasingly important as humanoid robots find applications in areas such as entertainment, education, medicine, and even elder care. Some economists predict that over a billion humanoids will be manufactured in the next decade.
“There is no future where all these humanoid robots don’t have a face. And when they finally have a face, they will need to move their eyes and lips properly, or they will forever remain uncanny,” Lipson predicted.
“We humans are just wired that way, and we can’t help it. We are close to crossing the uncanny valley,” added Hu.
Risks and limits
This work is part of Lipson’s decade-long quest to find ways to make robots connect more effectively with humans, through mastering facial gestures such as smiling, gazing, and speaking. He insists that these abilities must be acquired by learning, rather than being programmed using stiff rules.
“Something magical happens when a robot learns to smile or speak just by watching and listening to humans,” he said. “I’m a jaded roboticist, but I can’t help but smile back at a robot that spontaneously smiles at me.”
Hu explained that human faces are the ultimate interface for communication, and we are beginning to unlock their secrets.
“Robots with this ability will clearly have a much better ability to connect with humans because such a significant portion of our communication involves facial body language, and that entire channel is still untapped,” Hu said.
The researchers are aware of the risks and controversies surrounding granting robots greater ability to connect with humans.
“This will be a powerful technology. We have to go slowly and carefully, so we can reap the benefits while minimizing the risks,” Lipson said.
Funding: The study was supported by the US National Science Foundation (NSF) AI Institute for Dynamical Systems (DynamicsAI.org) and a gift from Amazon to the Columbia AI Institute.
Published in journal: Science Robotics
Title: Learning realistic lip motions for humanoid face robots
Authors: Yuhang Hu, Jiong Lin, Judah Allen Goldfeder, Philippe M. Wyder, Yifeng Cao, Steven Tian, Yunzhe Wang, Jingran Wang, Mengmeng Wang, Jie Zeng, Cameron Mehlman, Yingke Wang, Delin Zeng, Boyuan Chen, and Hod Lipson
Source/Credit: Columbia Engineering
Reference Number: eng011426_01