
  • ♪ (music) ♪

  • Alright, hi, everybody.

  • I'm Alex, from the Brain Robotics team,

  • and in this presentation, I'll be talking about how we use simulation

  • and domain adaptation in some of our real-world robot learning problems.

  • So, first let me start by introducing robot learning.

  • The goal of robot learning is to use machine learning

  • to learn robotic skills

  • that work in general environments.

  • What we've seen so far is that

  • if you control your environment a lot,

  • you can get robots to do pretty impressive things,

  • and where techniques start to break down

  • is when you try to apply these same techniques

  • to more general environments.

  • And the thinking is that if you use machine learning,

  • then you can learn from your environment,

  • and this can help you address these generalization issues.

  • So, as a step in this direction,

  • we've been looking at the problem of robotic grasping.

  • This is a project that we've been working on

  • in collaboration with some people at X.

  • And to explain our problem setup a bit,

  • we're going to have a real robot arm

  • which is learning to pick up objects out of a bin.

  • There is going to be a camera looking down

  • over the shoulder of the arm into the bin,

  • and from this RGB image we're going to train a neural network

  • to learn what commands it should send to the robot

  • to successfully pick up objects.

  • Now, we want to try to solve this task using as few assumptions as possible.

  • So, importantly, we're not going to give any information

  • about the geometry of what kinds of objects we are trying to pick up,

  • and we're also not going to give

  • any information about the depth of the scene.

  • So in order to solve the task,

  • the model needs to learn hand-eye coordination:

  • it needs to see where the arm is within the camera image,

  • figure out where it is in the scene,

  • and then combine these two to figure out how it should move around.
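To make the setup concrete, here is a minimal sketch of the kind of model this implies, written in PyTorch purely for illustration: a CNN encodes the over-the-shoulder RGB image, a small MLP encodes a candidate motor command, and the two are fused to predict grasp success. The architecture, layer sizes, and names are assumptions, not the actual model from this work.

```python
import torch
import torch.nn as nn

class GraspSuccessModel(nn.Module):
    """Sketch: predict P(grasp succeeds) from an RGB image and a motor command."""
    def __init__(self, command_dim=7):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> (B, 64)
        )
        self.command_encoder = nn.Sequential(
            nn.Linear(command_dim, 64), nn.ReLU(),        # -> (B, 64)
        )
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),                             # logit for grasp success
        )

    def forward(self, rgb_image, command):
        img_feat = self.image_encoder(rgb_image)
        cmd_feat = self.command_encoder(command)
        return self.head(torch.cat([img_feat, cmd_feat], dim=1))
```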

  • Now, in order to train this model, we're going to need a lot of data

  • because it's a pretty large scale image model.

  • And our solution at the time for this was to simply use more robots.

  • So this is what we call the "Arm Farm."

  • These are six robots collecting data in parallel.

  • And if you have six robots, you can collect data a lot faster

  • than if you only have one robot.

  • So using these robots, we were able to collect

  • over a million attempted grasps,

  • over a total of thousands of robot hours,

  • and then using this data we were able

  • to successfully train models to learn how to pick up objects.

  • Now, this works, but it still took a lot of time

  • to collect this dataset.

  • So this motivated looking into ways

  • to reduce the amount of real-world data needed

  • to learn these behaviors.

  • One approach for doing this is simulation.

  • So in the left video here,

  • you can see the images that are going into our model

  • in our real world setup,

  • and on the right here you can see

  • our simulated recreation of that setup.

  • Now, the advantage of moving things into simulation

  • is that simulated robots are a lot easier to scale.

  • We've been able to spin up thousands of simulated robots

  • grasping various objects,

  • and using this setup we were able to collect millions of grasps

  • in just over eight hours,

  • instead of the weeks that were required for our original dataset.

  • Now, this is good for getting a lot of data,

  • but unfortunately models trained in simulation

  • tend not to transfer to the actual real world robot.

  • There are a lot of systematic differences between the two.

  • One big one is the visual appearances of different things.

  • And another big one is just physical differences

  • between our real-world physics

  • and our simulated physics.

  • So what we did was train our model in simulation,

  • which very quickly got to around 90% grasp success.

  • We then deployed it to the real robot,

  • where it succeeded just over 20% of the time,

  • which is a very big performance drop.

  • So in order to actually get good performance,

  • we need to do something a bit more clever.

  • So this motivated looking into Sim-to-Real transfer,

  • which is a set of transfer-learning techniques

  • for trying to use simulated data

  • to improve your real-world sample efficiency.

  • Now, there are a few different ways you can do this.

  • One approach for doing this is

  • adding more randomization into your simulator.

  • You can do this by changing around the textures

  • that you apply to different objects,

  • changing around their colors,

  • changing how lighting is interacting with your scene,

  • and you can also play around with changing the geometry of what kinds of objects

  • you're trying to pick up.
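As a hedged illustration of what that randomization might look like in code, the loop below re-randomizes textures, colors, lighting, and object geometry before each simulated grasp. The simulator interface (`sim.bodies`, `sim.set_texture`, and so on) is hypothetical and just stands in for whatever API the actual simulator exposes.

```python
import random

def randomize_scene(sim, textures, object_meshes):
    """Apply domain randomization to a (hypothetical) simulator before a grasp episode."""
    # Random textures and colors on the tray, the arm, and the objects.
    for body in sim.bodies():
        sim.set_texture(body, random.choice(textures))
        sim.set_color(body, [random.uniform(0.2, 1.0) for _ in range(3)])

    # Random lighting direction and intensity.
    sim.set_light(direction=[random.uniform(-1.0, 1.0) for _ in range(3)],
                  intensity=random.uniform(0.5, 1.5))

    # Random object geometry: drop a different set of meshes into the bin.
    sim.spawn_objects(random.sample(object_meshes, k=random.randint(3, 8)))
```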

  • Another way of doing this is domain adaptation,

  • which is a set of techniques for learning

  • when you have two domains of data that have some common structure,

  • but are still somewhat different.

  • In our case the two domains are going to be our simulated robot data

  • and our real robot data.

  • And there are feature-level ways of doing this

  • and there are pixel-level ways of doing this.

  • Now, in this work, we tried all of these approaches,

  • and in this presentation, I'm going to focus primarily

  • on the domain adaptation side of things.

  • So, in feature-level domain adaptation

  • what we're going to do is we're going to take our simulated data,

  • take our real data,

  • train the same model on both datasets,

  • but then at an intermediate feature layer of the network,

  • we're going to attach a similarity loss.

  • And the similarity loss is going to encourage the distribution of features

  • to be the same across both domains.

  • Now, one approach for doing this which has worked well recently

  • is called Domain-Adversarial Neural Networks.

  • And the way these work is that the similarity loss is implemented

  • as a small neural net that tries to predict the domain

  • based on the input features it's receiving,

  • and then the rest of the model is trying

  • to confuse this domain classifier as much as possible.
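A minimal sketch of that idea in PyTorch (simplified and assumed, not the exact setup from this work): a gradient-reversal function sits between the shared feature extractor and a small domain classifier, so the classifier learns to tell simulated features from real ones while the rest of the model is pushed to make them indistinguishable.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Small net that predicts the domain (sim vs. real) from intermediate features."""
    def __init__(self, feature_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),  # logit for P(domain = real)
        )

    def forward(self, features, lambd=1.0):
        # Because of the reversed gradients, training this classifier
        # simultaneously trains the feature extractor to confuse it.
        return self.net(GradientReversal.apply(features, lambd))
```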

  • Now, pixel-level methods approach the problem

  • from a different point of view.

  • Instead of trying to learn domain invariant features,

  • we're going to try to transform our images at the pixel level

  • to look more realistic.

  • So what we do here is we take a generative-adversarial network;

  • we feed it an image from our simulator,

  • and then it's going to output an image that looks more realistic.

  • And then we're going to use the output of this generator

  • to train whatever task model that we want to train.

  • Now we're going to train both

  • the generator and the task model at the same time.

  • We found that in practice, this was useful

  • because it helps ground the generator output

  • to be useful for actually training your downstream task.
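A rough sketch of that joint training step, with assumed names (`generator`, `discriminator`, `task_model`) and a simplified task model that takes only an image: the discriminator is updated to tell real images from generated ones, and then the generator and task model are updated together so the translated images both fool the discriminator and stay useful for the downstream grasping loss.

```python
import torch
import torch.nn.functional as F

def joint_training_step(generator, discriminator, task_model,
                        sim_images, sim_labels, real_images,
                        opt_g, opt_d, opt_task):
    # 1) Translate simulated images toward the real domain.
    fake_real = generator(sim_images)

    # 2) Update the discriminator: real images vs. generator output.
    opt_d.zero_grad()
    d_real = discriminator(real_images)
    d_fake = discriminator(fake_real.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward()
    opt_d.step()

    # 3) Update generator and task model together: fool the discriminator
    #    *and* keep the translated images useful for the downstream task.
    opt_g.zero_grad()
    opt_task.zero_grad()
    g_adv = F.binary_cross_entropy_with_logits(
        discriminator(fake_real), torch.ones_like(d_fake))
    task_loss = F.binary_cross_entropy_with_logits(task_model(fake_real), sim_labels)
    (g_adv + task_loss).backward()
    opt_g.step()
    opt_task.step()
```

One detail worth noting: only the discriminator update uses a detached copy of the generated images, so the generator still receives gradients from both the adversarial loss and the task loss in step 3.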

  • Alright. So taking a step back,

  • feature-level methods can learn domain-invariant features

  • when you have data from related domains

  • that aren't quite identical.

  • Meanwhile, pixel-level methods can transform your data

  • to look more like your real-world data,

  • but in practice they don't work perfectly,

  • and there are still some small artifacts

  • and inaccuracies from the generator output.

  • So our thinking went, "Why don't we simply combine both of these approaches?"

  • We can apply a pixel-level method

  • to try to transform the data as much as possible,

  • and this isn't going to get us all the way there,

  • but then we can attach a feature-level method on top of this

  • to try to close the reality gap even further,

  • and combined, these form what we call GraspGAN,

  • which is a combination of both

  • pixel-level and feature-level domain adaptation.
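Putting both pieces together, here is a condensed sketch of what the combined objective might look like, reusing the assumed components from the sketches above (`task_model` is assumed to return both its prediction and an intermediate feature layer): simulated images are first translated at the pixel level, and the feature-level domain-adversarial loss is then applied on top.

```python
import torch
import torch.nn.functional as F

def combined_adaptation_loss(generator, task_model, domain_classifier,
                             sim_images, sim_labels, real_images, real_labels):
    # Pixel-level adaptation: make simulated images look more realistic first.
    adapted = generator(sim_images)
    sim_pred, sim_feat = task_model(adapted)
    real_pred, real_feat = task_model(real_images)

    # Task loss on both domains (simulated labels are essentially free).
    task_loss = (F.binary_cross_entropy_with_logits(sim_pred, sim_labels) +
                 F.binary_cross_entropy_with_logits(real_pred, real_labels))

    # Feature-level adaptation: the domain classifier (with gradient reversal
    # inside) tries to separate whatever sim/real differences remain.
    d_sim = domain_classifier(sim_feat)
    d_real = domain_classifier(real_feat)
    domain_loss = (F.binary_cross_entropy_with_logits(d_sim, torch.zeros_like(d_sim)) +
                   F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)))

    return task_loss + domain_loss
```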

  • In the left half of the video here

  • you can see a simulated grasp.

  • In the right half you can see the output of our generator.

  • And you can see that it's learning some pretty cool things

  • in terms of drawing what the tray should look like,

  • drawing more realistic textures on the arm,

  • drawing shadows that the objects are casting.

  • It's also learned how to even draw shadows

  • as the arm is moving around in the scene.

  • And it certainly isn't perfect.

  • There are still these little odd splotches of color around,

  • but it's definitely learning something

  • about what it means for an image to look more realistic.

  • Now, this is good for getting a lot of pretty images,

  • but what matters for our problem is whether these images are actually useful

  • for reducing the amount of real-world data required.

  • And we found that they are.

  • So, to explain this chart a bit:

  • On the x-axis is the number of real-world samples used,

  • and we compared the performance of different methods

  • as we vary the amount of real-world data given to the model.

  • The blue bar is our performance when we use only simulated data.

  • The red bar is our performance when we use only real data,

  • and the orange bar is our performance when we use both simulated and real data

  • and the domain adaptation methods that I've been talking about.

  • And what we found is that when we use just 2%

  • of our original real-world dataset,

  • and we apply domain adaptation to it,

  • we're able to get the same level of performance as using all of our real-world data,

  • so this reduces the number of real-world samples we needed

  • by up to 50 times, which is really exciting

  • in terms of not needing to run robots for a large amount of time

  • to learn these grasping behaviors.

  • Additionally, we found that even when we give

  • all of the real-world data to the model,

  • adding simulated data on top

  • still improves performance,

  • so that implies that we haven't hit the data capacity limits

  • for this grasping problem.

  • And finally, there's a way to train this setup

  • without having real-world labels,

  • and when we trained the model in this setting,

  • we found that we were still able to get pretty good performance

  • on the real-world robot.

  • Now, this was the work of a large team

  • across both Brain and X.

  • I'd like to thank all of my collaborators.

  • Here's a link to the original paper.

  • And I believe there is also a blog post,

  • if people are interested in hearing more details.

  • Thanks.

  • (applause)

  • ♪ (music) ♪

