In this paper, we build upon two major recent developments in the field, diffusion policies for visuomotor manipulation and large pre-trained multimodal foundational models, to obtain a robotic skill learning system. The system acquires new skills through behavioral cloning with visuomotor diffusion policies, given teleoperated demonstrations. Foundational models perform skill selection given the user's prompt in natural language. Before a skill is executed, the foundational model performs a precondition check given an observation of the workspace. We compare the performance of different foundational models on these two tasks and give a detailed experimental evaluation of the skills taught by the user in simulation and the real world. Finally, we showcase the combined system on a challenging food-serving scenario in the real world. Videos of all experimental executions, as well as the process of teaching new skills in simulation and the real world, are available on the project's website.
Robo Skill Framework
We trained seven different skills (three in simulation, four in the real world). Demonstration times are shown in the histogram:
Real world:
Bottle opening
Lid removal
Rice scooping
Sausage placing
Simulation:
Lid removal
Box pushing
Item placing
We evaluated the four real-world skills on 10 trials and the three simulation skills on 100 trials, using different initial conditions of the object in the workspace.
Real world (10 trials):
Bottle opening 60%
Lid removal 90%
Rice scooping 90%
Sausage placing 90%
Simulation (100 trials):
Lid removal 83%
Box pushing single view 20%
Box pushing multi view 74%
Item placing 96%
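The percentages above are success rates over repeated trials. As a minimal sketch of that arithmetic, the snippet below computes a success rate from a list of per-trial outcomes. The function name `success_rate` and the example outcome list are hypothetical and only illustrate the calculation; they are not the evaluation code or the raw trial data.

```python
def success_rate(outcomes):
    """Return the percentage of successful trials.

    `outcomes` is a list of booleans, one per trial (True = success).
    """
    if not outcomes:
        raise ValueError("at least one trial is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative outcomes only: 9 successes in 10 trials.
example_outcomes = [True] * 9 + [False]
print(f"{success_rate(example_outcomes):.0f}%")
```

With only 10 trials, a single success or failure shifts the rate by 10 percentage points, so the real-world numbers are coarser than the simulation numbers.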
We evaluated the LLM and VLM versions of GPT-4 and Gemini (20/03/2024) on our two specific tasks: skill selection given a user prompt, and precondition checking given an image.
GPT-4 Examples
You are an expert skill selector that has to match skills that are given to a user's request. If none of the skills given to you are fulfilling the users request answer with "NEW SKILL". Your skills are: NAME:SERVE RICE EFFECT:This skill serves rice from a white bowl into a red bowl User request: Serve the rice please. Your answer: ... Structure your answer in this format: [reasoning without metioning the names of skills] [Skill Name]
The user is asking for a skill that allows them to serve rice which is exactly what the skill in question does. SERVE RICE
This skill match is correct
All GPT-4 results (640 prompts)
Gemini Examples
You are an expert skill selector that has to match skills that are given to a user's request. If none of the skills given to you are fulfilling the users request answer with "NEW SKILL". Your skills are: NAME:SERVE RICE EFFECT:This skill serves rice from a white bowl into a red bowl User request: Serve the rice please. Your answer: ... Structure your answer in this format: [reasoning without metioning the names of skills] [Skill Name]
The user wants to serve rice. SERVE RICE
This skill match is correct
All Gemini results (640 prompts)
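The skill-selection examples above follow a fixed pattern: a list of skills with names and effects, a user request, and an answer that ends with a skill name or "NEW SKILL". The sketch below shows one way such a prompt could be assembled and the answer parsed. The `Skill` class, both function names, and the simplified prompt wording are assumptions for illustration and not the project's implementation; the call to the language model itself is left out.

```python
from dataclasses import dataclass

NEW_SKILL = "NEW SKILL"


@dataclass
class Skill:
    name: str    # e.g. "SERVE RICE"
    effect: str  # short natural-language description of what the skill does


def build_skill_selection_prompt(skills, user_request):
    """Assemble a prompt modeled on the examples above (wording simplified)."""
    skill_text = " ".join(f"NAME:{s.name} EFFECT:{s.effect}" for s in skills)
    return (
        "You are an expert skill selector that has to match the given skills "
        "to a user's request. If none of the skills fulfills the request, "
        f'answer with "{NEW_SKILL}". '
        f"Your skills are: {skill_text} "
        f"User request: {user_request} "
        "End your answer with the skill name."
    )


def parse_skill_answer(answer, skills):
    """Return the skill name the answer ends with, NEW_SKILL, or None."""
    candidates = {s.name.upper(): s.name for s in skills}
    candidates[NEW_SKILL] = NEW_SKILL
    text = answer.strip().rstrip(".").strip().upper()
    # Check longer names first so "SERVE RICE" wins over a shorter suffix.
    for key in sorted(candidates, key=len, reverse=True):
        if text.endswith(key):
            return candidates[key]
    return None  # the answer did not follow the requested format


skills = [Skill("SERVE RICE", "This skill serves rice from a white bowl into a red bowl")]
print(build_skill_selection_prompt(skills, "Serve the rice please."))
print(parse_skill_answer("The user wants to serve rice. SERVE RICE", skills))
```

Returning `None` for answers that do not end with a known name keeps format violations separate from genuine "NEW SKILL" responses, which matters when counting correct matches over many prompts.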
GPT-4 Examples
Please check if the following conditions are met in the image: A glass pan cover has to be present and not on the table. Answer format for each precondition: [Short Reasoning] [YES/NO] End the response with a definitive answer (YES/NO) on whether ALL conditions are met on a new line.
Based on the provided images: 1. Glass pan cover is present: In both images, there is a glass pan cover with a green knob visible on the right side of the frame. [YES] 2. Glass pan cover is not on the table: In both images, the glass pan cover is resting on the table surface. [NO] Not all conditions are met. NO
This precondition validation is correct
All GPT-4 results left image (440 prompts)
All GPT-4 results right image (440 prompts)
All GPT-4 results both images (440 prompts)
Gemini Examples
Please check if the following conditions are met in the image: The bottle needs to be closed with a metal cap Answer format for each precondition: [Short Reasoning] [YES/NO] End the response with a definitive answer (YES/NO) on whether ALL conditions are met on a new line.
The bottle is not closed with a metal cap NO NO
This precondition validation is correct
All Gemini results left image (440 prompts)
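The precondition prompts above ask for a per-condition verdict and a final YES/NO on a new line. A minimal sketch of building such a prompt and extracting the final verdict is given below. The function names are hypothetical, and the parsing rule (take the last uppercase YES or NO in the response) is an assumption about how answers might be scored, not the project's method; sending the image and prompt to the vision-language model is omitted.

```python
import re


def build_precondition_prompt(preconditions):
    """Assemble a precondition-check prompt in the style of the examples above."""
    conditions = " ".join(preconditions)
    return (
        "Please check if the following conditions are met in the image: "
        f"{conditions} "
        "Answer format for each precondition: [Short Reasoning] [YES/NO] "
        "End the response with a definitive answer (YES/NO) on whether ALL "
        "conditions are met on a new line."
    )


def parse_precondition_answer(answer):
    """Return True/False for the final verdict, or None if none is found.

    Only uppercase YES/NO tokens are considered, so ordinary words such as
    "not" or "no" inside the reasoning are ignored.
    """
    verdicts = re.findall(r"\b(YES|NO)\b", answer)
    if not verdicts:
        return None
    return verdicts[-1] == "YES"


prompt = build_precondition_prompt(["The bottle needs to be closed with a metal cap"])
print(prompt)
print(parse_precondition_answer("The bottle is not closed with a metal cap NO NO"))
```

A skill would then be executed only when the parsed verdict is `True`; a `None` result signals a malformed response that could be retried or counted as a failure.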