Scientific Frontline: IRL: LLMs Clarify Vague Robot Commands

Friday, June 26, 2026

“Masked IRL” helps a robot understand ambiguous instructions so it does chores safely. An LLM first elaborates on users' prompts based on demonstration data, then another narrows down which details an algorithm should incorporate into a motion plan.
Image Credit: Gabriel Maragaño

Scientific Frontline: Extended "At a Glance" Summary: Masked Inverse Reinforcement Learning (Masked IRL)

The Core Concept: A machine learning approach that uses two large language models (LLMs) to clarify ambiguous human instructions and filter out irrelevant environmental data, enabling robots to safely execute complex tasks.

Key Distinction/Mechanism: Traditional robotic training requires extensive manual coding or exhaustive physical demonstrations. Masked IRL streamlines this by using one LLM to expand vague user prompts based on physical demonstration data, while a second LLM "masks" (ignores) irrelevant environmental details by scoring them "0" and flags critical elements with a "1" so that only those feed into the final algorithmic motion plan.

Origin/History: Developed by researchers at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory (CSAIL) and slated for presentation at the June 2026 IEEE International Conference on Robotics and Automation.

Major Frameworks/Components:

Kinesthetic Demonstration: A physical training method where human operators manually guide a robot's joints to perform a specific action.
Trajectory Comparison: An LLM evaluates the demonstrated sequence of motions against the shortest possible path to deduce the task's underlying intent.
Prompt Disambiguation: The first LLM refines vague commands, translating general requests like "stay close" into precise instructions such as "stay close to the surface of the table."
Environmental Masking: A secondary LLM assesses environmental details and object shapes, assigning binary scores to isolate the data necessary for the algorithm's final action plan.

Branch of Science: Computer Science, Artificial Intelligence, Robotics, and Machine Learning.

Future Application: Researchers intend to integrate camera-based visual data, allowing robots to dynamically identify and ignore irrelevant objects on the fly for deployment in domestic, office, and industrial manufacturing settings.

Why It Matters: The system needs roughly one-fifth as much physical demonstration data and identifies unstated user preferences up to 15 percent more often than comparable baselines, establishing a foundation for intuitive and safe human-robot interaction in unstructured environments.

Imagine working at a warehouse or office sometime in the near future, and you are asked to help a new trainee learn the basics of the job. The catch: it is a robot. To teach the robot, you might want to play a game of “show and tell”—that is, physically demonstrating how to do something in a few different ways while also explaining what you are doing.

Suppose you asked the robot to place some coffee on your desk without disturbing you during a Zoom call. You would prefer that the robot not get too close to you or the laptop, avoiding an interruption to your meeting. To enable this behavior, the robot should be trained with data that clearly demonstrates the full task. Computer scientists have attempted to explain manipulation tasks to robots by recording numerous physical demonstrations or writing extensive directions. If you do not have both, however, the machine is likely to misunderstand what it needs to do.

Because it is laborious for humans to do all that showing and telling, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have automated the process of teaching a robot. This method automatically clarifies instructions and uses approximately one-fifth the demonstration data. Their Masked Inverse Reinforcement Learning (Masked IRL) approach uses a large language model (LLM) to elaborate on ambiguous prompts based on data collected from a user’s demonstration. Another LLM then narrows down which details an algorithm should incorporate into a motion plan so that a robot can safely complete chores in homes, offices, and factories.

“Our approach could come in handy when a human interacts with a robot but doesn’t want to spell out all the details of a task,” says MIT PhD student and CSAIL researcher Minyoung Hwang, a lead author on a paper presenting the project. “We’re minimizing human effort by enabling machines to get to the bottom of what users really want.”

According to Hwang, Masked IRL can help robots safely maneuver in settings containing elements a human might not describe in a prompt but that are nonetheless crucial. For example, a machine grabbing a snack from the kitchen may not know to avoid bumping into a laptop. Likewise, a factory robot placing items into different boxes must carefully navigate around shelves.

To learn new tasks in these situations, Masked IRL uses the robot’s sensors to capture information about its surroundings. These components also log each movement of a kinesthetic demonstration—a training approach in which a human physically moves a robot to perform a specific action. This process is akin to acting as the machine’s physical therapist, bending joints in a particular direction to show a robot how to grab, move, and place objects.

MIT’s system then calls on an LLM to compare this sequence of motions, called a trajectory, to the shortest possible path. The model also elaborates on what might be unclear in a prompt, turning a request like “stay close” into “stay close to the surface of the table.” Using the trajectory comparison and clarified directions, the LLM begins to understand why the motions it was trained on are important to the task.
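To make the trajectory comparison concrete, here is a minimal Python sketch of the kind of summary one could compute before handing a demonstration to an LLM. It is not the researchers' implementation: the article does not say how the comparison is encoded, so the function names, the straight line used as the "shortest possible path," and the two summary statistics are all assumptions made for illustration.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def compare_to_shortest(trajectory):
    """Compare a demonstrated trajectory with the straight line from start to goal.

    Assumption: the shortest possible path is the straight segment between the
    first and last waypoint, ignoring obstacles.
    """
    start, goal = trajectory[0], trajectory[-1]
    demo_len = path_length(trajectory)
    direct_len = math.dist(start, goal)
    # Average height (z) of the demonstration versus the straight line's average height.
    mean_z_demo = sum(p[2] for p in trajectory) / len(trajectory)
    mean_z_direct = (start[2] + goal[2]) / 2
    return {
        "detour_ratio": demo_len / direct_len,
        "mean_height_offset": mean_z_demo - mean_z_direct,
    }


# Hypothetical demonstration: the gripper drops toward the table, slides across, then rises.
demo = [(0.0, 0.0, 0.5), (0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (1.0, 0.0, 0.5)]
summary = compare_to_shortest(demo)
print(summary)
```

A detour ratio well above 1 together with a negative height offset indicates that the demonstrator deliberately stayed low, which is the sort of evidence that could support reading "stay close" as "stay close to the surface of the table."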

A second LLM then evaluates details of the environment, such as the positions of obstacles and the shape of the robot’s target object. During this process, it “masks” (ignores) the elements it deems irrelevant to the task at hand, scoring each feature as either 1 (important) or 0 (unimportant). For example, whether a user was leaning on a table during a demonstration would be scored 0, rendering it irrelevant. An algorithm incorporates any detail scored 1 into the final action plan.
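The masking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: the feature names, weights, and linear reward are hypothetical, and the hand-written mask stands in for the scores the second LLM would produce.

```python
def apply_mask(features, mask):
    """Zero out every feature scored 0; keep features scored 1 unchanged.

    Features missing from the mask are treated as irrelevant (score 0).
    """
    return {name: value * mask.get(name, 0) for name, value in features.items()}


def reward(features, weights, mask):
    """Linear reward over the masked features (an assumed reward form)."""
    masked = apply_mask(features, mask)
    return sum(weights[name] * masked[name] for name in masked)


# Hypothetical environment features observed at one step of a trajectory.
features = {
    "distance_to_laptop": 0.4,
    "height_above_table": 0.1,
    "human_leaning_on_table": 1.0,
}
# Hand-written stand-in for the second LLM's binary relevance scores.
mask = {
    "distance_to_laptop": 1,
    "height_above_table": 1,
    "human_leaning_on_table": 0,
}
# Hypothetical learned weights; the masked feature's weight never matters.
weights = {
    "distance_to_laptop": 2.0,
    "height_above_table": -1.0,
    "human_leaning_on_table": 5.0,
}

score = reward(features, weights, mask)
print(score)
```

Because the masked feature is multiplied by 0, changing whether the user leans on the table leaves the reward unchanged, so a planner optimizing this reward cannot be distracted by that detail.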

These masks provided Masked IRL with a key advantage over comparable baselines in both three-dimensional and real-world demonstrations by teaching the robot which information to prioritize. Through the researchers’ system, virtual and physical robots alike skillfully maneuvered objects around obstacles, such as moving a coffee mug around a laptop to different spots on a table. In these tasks, Masked IRL correctly identified users’ preferences—which were not explicitly stated in their prompts—up to 15 percent more often than comparable baselines.

During simulation experiments, CSAIL researchers found that Masked IRL learned rapidly, requiring fewer demonstrations to understand how to move the mug than baseline models. They also noted that the robots performed better when an LLM clarified instructions rather than leaving the machine to attempt following a vague request.

This highly focused approach translated well to a physical robotic arm executing prompts the system had not seen during its training phase. After being trained on fifty kinesthetic demonstrations, the robot carefully moved a cup toward a human while avoiding a collision with the user’s computer—an obstacle it learned to avoid by elaborating on a general request to “stay away.” It also wiped down a table while “staying close” to it and handed a user a bag of chips while “staying away” from both the human and the table.

Masked IRL senses and explains what users leave unsaid, but soon it might “see” it as well. CSAIL researchers plan to make their approach more dynamic by equipping the system with cameras, allowing the robot to take images of its surroundings. The system could then highlight and focus on specific nearby elements. For example, if asked to pick up a toy, the machine might identify nearby bananas and ignore them before handling the target object.

Additional information: Hwang wrote the paper with three CSAIL colleagues: PhD student Alexandra Forsey-Smerek ’20, SM ’22; postdoctoral researcher Nathaniel Dennler; and MIT Assistant Professor Andreea Bobu, a member of the Department of Aeronautics and Astronautics and CSAIL. They will present the project at the IEEE International Conference on Robotics and Automation in June 2026.

Funding: Their work was supported in part by the Tata Group via the MIT Generative AI Impact Consortium Award and the Department of Defense.

Published in: arXiv (preprint)

Title: Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language

Authors: Minyoung Hwang, Alexandra Forsey-Smerek, Nathaniel Dennler, and Andreea Bobu

Source/Credit: Massachusetts Institute of Technology | Alex Shipps / MIT CSAIL

Edited by: Scientific Frontline

Reference Number: cs062626_01
