A Robotic Skill Learning System Built Upon Diffusion Policies and Foundation Models

Nils Ingelhag*, Jesper Munkeby*, Jonne van Haastregt*, Anastasia Varava, Michael C. Welle, Danica Kragic

In this paper, we build upon two major recent developments in the field, diffusion policies for visuomotor manipulation and large pre-trained multimodal foundational models, to obtain a robotic skill learning system. The system acquires new skills through behavioral cloning with visuomotor diffusion policies, trained on teleoperated demonstrations. Foundational models perform skill selection given the user's natural-language prompt. Before a skill is executed, the foundational model performs a precondition check given an observation of the workspace. We compare the performance of different foundational models on these tasks and give a detailed experimental evaluation of the skills taught by the user in simulation and the real world. Finally, we showcase the combined system on a challenging food serving scenario in the real world. Videos of all experimental executions, as well as the process of teaching new skills in simulation and the real world, are available on the project's website.

Robo Skill Framework

Training Skills

We trained seven different skills (3 in simulation, 4 in the real world). Demonstration times are shown in the histogram:

User demonstration time histogram

Real world

Bottle opening

Lid removal

Rice scooping

Sausage placing

Simulation

Lid removal

Box pushing

Item placing

Evaluating Skills

We evaluated the 4 skills in the real world on 10 trials and the 3 skills in simulation on 100 trials, using different initial conditions of the object in the workspace.

Real world

Bottle opening 60%

Lid removal 90%

Rice scooping 90%

Sausage placing 90%

Simulation

Lid removal 83%

Box pushing single view 20%

Box pushing multi view 74%

Item placing 96%

Foundational Models Evaluation

We evaluated the LLM and VLM versions of GPT-4 and Gemini (as of 20/03/2024) on our two specific tasks: skill selection given a user prompt, and precondition checking given an image.

Results

LLMs

GPT-4 Examples

Prompt

You are an expert skill selector that has to match skills that are given to a user's request. If none of the skills given to you are fulfilling the users request answer with "NEW SKILL". Your skills are: NAME:SERVE RICE EFFECT:This skill serves rice from a white bowl into a red bowl User request: Serve the rice please. Your answer: ... Structure your answer in this format: [reasoning without metioning the names of skills] [Skill Name]

Response

The user is asking for a skill that allows them to serve rice which is exactly what the skill in question does. SERVE RICE

This skill match is correct

All GPT-4 results (640 prompts)

Gemini Examples

Prompt

You are an expert skill selector that has to match skills that are given to a user's request. If none of the skills given to you are fulfilling the users request answer with "NEW SKILL". Your skills are: NAME:SERVE RICE EFFECT:This skill serves rice from a white bowl into a red bowl User request: Serve the rice please. Your answer: ... Structure your answer in this format: [reasoning without metioning the names of skills] [Skill Name]

Response

The user wants to serve rice. SERVE RICE

This skill match is correct

All Gemini results (640 prompts)
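
The skill selection examples above share one prompt template and an answer that ends with the chosen skill name. The following is a minimal sketch of how such a prompt could be assembled and how the trailing skill name could be extracted. It is not the authors' implementation: the `Skill` class, the function names, the exact whitespace of the template, and the "response ends with the skill name" parsing rule are all assumptions made for illustration. The prompt wording is copied verbatim from the example, including its spelling.

```python
from dataclasses import dataclass
from typing import List, Optional

NEW_SKILL = "NEW SKILL"  # sentinel the prompt asks for when no skill matches


@dataclass
class Skill:
    """Hypothetical container for one entry of the skill library."""
    name: str
    effect: str


def build_skill_selection_prompt(skills: List[Skill], user_request: str) -> str:
    """Assemble a prompt in the style of the example (layout is assumed)."""
    skill_text = " ".join(f"NAME:{s.name} EFFECT:{s.effect}" for s in skills)
    return (
        "You are an expert skill selector that has to match skills that are "
        "given to a user's request. If none of the skills given to you are "
        'fulfilling the users request answer with "NEW SKILL". '
        f"Your skills are: {skill_text} "
        f"User request: {user_request} "
        "Your answer: ... "
        "Structure your answer in this format: "
        "[reasoning without metioning the names of skills] [Skill Name]"
    )


def parse_selected_skill(response: str, skills: List[Skill]) -> Optional[str]:
    """Return the skill name the response ends with, or None if unparseable.

    Assumption: the model follows the requested format, so the skill name
    (or NEW SKILL) is the last thing in the response. Matching is
    case-sensitive so that lowercase reasoning text is not mistaken for a name.
    """
    text = response.strip().rstrip(".").rstrip()
    candidates = [s.name for s in skills] + [NEW_SKILL]
    # Check longer names first so a name that is a suffix of another cannot shadow it.
    for name in sorted(candidates, key=len, reverse=True):
        if text.endswith(name):
            return name
    return None


skills = [Skill("SERVE RICE", "This skill serves rice from a white bowl into a red bowl")]
prompt = build_skill_selection_prompt(skills, "Serve the rice please.")
selected = parse_selected_skill("The user wants to serve rice. SERVE RICE", skills)
```

Returning `None` for a response that matches neither a known skill nor `NEW SKILL` keeps malformed answers separate from a deliberate "no matching skill" answer, which may be useful when tallying results over many prompts.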

VLMs

GPT-4 Examples

Prompt

Please check if the following conditions are met in the image: A glass pan cover has to be present and not on the table. Answer format for each precondition: [Short Reasoning] [YES/NO] End the response with a definitive answer (YES/NO) on whether ALL conditions are met on a new line.

Response

Based on the provided images: 1. Glass pan cover is present: In both images, there is a glass pan cover with a green knob visible on the right side of the frame. [YES] 2. Glass pan cover is not on the table: In both images, the glass pan cover is resting on the table surface. [NO] Not all conditions are met. NO

This precondition validation is correct

All GPT-4 results left image (440 prompts)

All GPT-4 results right image (440 prompts)

All GPT-4 results both images (440 prompts)

Gemini Examples

Prompt

Please check if the following conditions are met in the image: The bottle needs to be closed with a metal cap Answer format for each precondition: [Short Reasoning] [YES/NO] End the response with a definitive answer (YES/NO) on whether ALL conditions are met on a new line.

Response

The bottle is not closed with a metal cap NO NO

This precondition validation is correct

All Gemini results left image (440 prompts)

All Gemini results right image (440 prompts)

All Gemini results both images (440 prompts)
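
The precondition prompts above ask the model to end its response with a definitive YES or NO. The following is a minimal sketch of building such a prompt and reading off that final verdict. The function names and the "last uppercase YES/NO token wins" rule are assumptions for illustration, not the authors' code, and the call that sends the image and prompt to a VLM is deliberately left out.

```python
import re
from typing import List, Optional


def build_precondition_prompt(preconditions: List[str]) -> str:
    """Assemble a prompt in the style of the examples (layout is assumed)."""
    conditions = " ".join(preconditions)
    return (
        "Please check if the following conditions are met in the image: "
        f"{conditions} "
        "Answer format for each precondition: [Short Reasoning] [YES/NO] "
        "End the response with a definitive answer (YES/NO) on whether ALL "
        "conditions are met on a new line."
    )


def parse_precondition_verdict(response: str) -> Optional[bool]:
    """Return True/False from the final YES/NO token, or None if there is none.

    Assumption: the last standalone uppercase YES or NO in the response is the
    overall verdict. The match is case-sensitive, so words such as "Not" or
    "not" in the reasoning are ignored.
    """
    tokens = re.findall(r"\b(YES|NO)\b", response)
    if not tokens:
        return None
    return tokens[-1] == "YES"


prompt = build_precondition_prompt(["The bottle needs to be closed with a metal cap"])
gpt4_style = (
    "1. Glass pan cover is present: ... [YES] "
    "2. Glass pan cover is not on the table: ... [NO] "
    "Not all conditions are met. NO"
)
verdict = parse_precondition_verdict(gpt4_style)
```

Both example responses shown above (the per-condition GPT-4 answer and the short Gemini answer) end in `NO`, so this rule would map each of them to `False`.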

Contact

  • Michael C. Welle; mwelle(at)kth.se; KTH Royal Institute of Technology, Sweden