Alright, hi, everybody. I'm Alex, from the Brain Robotics team, and in this presentation, I'll be talking about how we use simulation and domain adaptation in some of our real-world robot learning problems. So, first let me start by introducing robot learning. The goal of robot learning is to use machine learning to learn robotic skills that work in general environments. What we've seen so far is that if you control your environment a lot, you can get robots to do pretty impressive things, and where techniques start to break down is when you try to apply these same techniques to more general environments. And the thinking is that if you use machine learning, then you can learn from your environment, and this can help you address these generalization issues. So, as a step in this direction, we've been looking at the problem of robotic grasping. This is a project that we've been working on in collaboration with some people at X. And to explain our problem setup a bit, we're going to have a real robot arm which is learning to pick up objects out of a bin. There is going to be a camera looking down over the shoulder of the arm into the bin, and from this RGB image we're going to train a neural network to learn what commands it should send to the robot to successfully pick up objects. Now, we want to try to solve this task using as few assumptions as possible. So, importantly, we're not going to give any information about the geometry of what kinds of objects we are trying to pick up, and we're also not going to give any information about the depth of the scene. So in order to solve the task, the model needs to learn hand-eye coordination or it needs to see where it is within the camera image, and then figure out where in the scene it is, and then combine these two to figure out how it should move around.
Now, in order to train this model, we're going to need a lot of data because it's a pretty large scale image model. And our solution at the time for this was to simply use more robots. So this is what we call the "Arm Farm." These are six robots collecting data in parallel. And if you have six robots, you can collect data a lot faster than if you only have one robot. So using these robots, we were able to collect over a million attempted grasps, over a total of thousands of robot hours, and then using this data we were able to successfully train models to learn how to pick up objects. Now, this works, but it still took a lot of time to collect this dataset. So this motivated looking into ways to reduce the amount of real-world data needed to learn these behaviors. One approach for doing this is simulation. So in the left video here, you can see the images that are going into our model in our real world setup, and on the right here you can see our simulated recreation of that setup. Now, the advantage of moving things into simulation is that simulated robots are a lot easier to scale. We've been able to spin up thousands of simulated robots grasping various objects, and using this setup we were able to collect millions of grasps in just over eight hours, instead of the weeks that were required for our original dataset. Now, this is good for getting a lot of data, but unfortunately models trained in simulation tend not to transfer to the actual real world robot. There are a lot of systematic differences between the two. One big one is the visual appearances of different things. And another big one is just physical differences between our real-world physics and our simulated physics. So what we did was, we were able to very quickly train our model on simulation to get to around 90% grasp success. We then deployed to the real robot, and it succeeds just over 20% of the time, which is a very big performance drop. 
So in order to actually get good performance, we need to do something a bit more clever. So this motivated looking into Sim-to-Real transfer, which is a set of transfer-learning techniques for trying to use simulated data to improve your real-world sample efficiency. Now, there are a few different ways you can do this. One approach for doing this is adding more randomization into your simulator. You can do this by changing around the textures that you apply to different objects, changing around their colors, changing how lighting is interacting with your scene, and you can also play around with changing the geometry of what kinds of objects you're trying to pick up. Another way of doing this is domain adaptation, which is a set of techniques for learning when you have two domains of data that have some common structure, but are still somewhat different. In our case the two domains are going to be our simulated robot data and our real robot data. And there are feature-level ways of doing this and there are pixel-level ways of doing this. Now, in this work, we tried all of these approaches, and in this presentation, I'm going to focus primarily on the domain adaptation side of things. So, in feature-level domain adaptation what we're going to do is we're going to take our simulated data, take our real data, train the same model on both datasets, but then at an intermediate feature layer of the network, we're going to attach a similarity loss. And the similarity loss is going to encourage the distribution of features to be the same across both domains. Now, one approach for doing this which has worked well recently is called Domain-Adversarial Neural Networks. And the way these work is that the similarity loss is implemented as a small neural net that tries to predict the domain based on the input features it's receiving, and then the rest of the model is trying to confuse this domain classifier as much as possible. 
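To make the domain-adversarial idea just described concrete, here is a minimal NumPy sketch; it is not the code used in this work. It assumes a logistic-regression domain classifier, random vectors standing in for the intermediate features of simulated and real images, and a hypothetical helper `reverse_gradient` that plays the role of a gradient reversal layer. For brevity the reversed gradient is applied directly to the features; in a real network it would be backpropagated into the feature extractor's weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_loss_and_grads(features, domain_labels, w, b):
    """Binary cross-entropy of a logistic domain classifier.

    Returns the loss, its gradients w.r.t. the classifier parameters (w, b),
    and its gradient w.r.t. the input features.
    """
    probs = sigmoid(features @ w + b)
    eps = 1e-12  # avoids log(0)
    loss = -np.mean(domain_labels * np.log(probs + eps)
                    + (1.0 - domain_labels) * np.log(1.0 - probs + eps))
    dlogits = (probs - domain_labels) / len(domain_labels)
    grad_w = features.T @ dlogits
    grad_b = dlogits.sum()
    grad_features = np.outer(dlogits, w)
    return loss, grad_w, grad_b, grad_features

def reverse_gradient(grad, lam=1.0):
    """Hypothetical helper: what a gradient reversal layer does on the backward pass."""
    return -lam * grad

# Assumption: random vectors stand in for intermediate features of each domain.
rng = np.random.default_rng(0)
sim_features = rng.normal(loc=1.0, size=(8, 4))
real_features = rng.normal(loc=-1.0, size=(8, 4))
features = np.vstack([sim_features, real_features])
domain_labels = np.concatenate([np.zeros(8), np.ones(8)])  # 0 = sim, 1 = real

w = rng.normal(size=4) * 0.1
b = 0.0
lr = 0.5

loss, grad_w, grad_b, grad_features = domain_loss_and_grads(
    features, domain_labels, w, b)

# The domain classifier takes an ordinary descent step: it tries to tell domains apart.
w_new = w - lr * grad_w
b_new = b - lr * grad_b

# The feature side descends on the REVERSED gradient, so it moves in the
# direction that makes the current classifier's job harder.
features_new = features - lr * reverse_gradient(grad_features)
loss_after, _, _, _ = domain_loss_and_grads(features_new, domain_labels, w, b)

print(f"domain loss before: {loss:.4f}, after confusing step: {loss_after:.4f}")
```

In a full model this domain loss would be added to the grasp-prediction loss, with `lam` controlling how strongly the features are pushed toward being indistinguishable across simulated and real data.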
Now, pixel-level methods try to work at the problem from a different point of view. Instead of trying to learn domain-invariant features, we're going to try to transform our images at the pixel level to look more realistic. So what we do here is we take a generative adversarial network; we feed it an image from our simulator, and then it's going to output an image that looks more realistic. And then we're going to use the output of this generator to train whatever task model we want to train. Now we're going to train both the generator and the task model at the same time. We found that in practice, this was useful because it helps ground the generator output to be useful for actually training your downstream task. Alright. So taking a step back, feature-level methods can learn domain-invariant features when you have data from related domains that aren't quite identical. Meanwhile, pixel-level methods can transform your data to look more like your real-world data, but in practice they don't work perfectly, and there are still some small artifacts and inaccuracies from the generator output. So our thinking went, "Why don't we simply combine both of these approaches?" We can apply a pixel-level method to try to transform the data as much as possible, and this isn't going to get us all the way there, but then we can attach a feature-level method on top of this to try to close the reality gap even further. Combined, these form what we call GraspGAN, which is a combination of both pixel-level and feature-level domain adaptation. In the left half of the video here you can see a simulated grasp. In the right half you can see the output of our generator. And you can see that it's learning some pretty cool things in terms of drawing what the tray should look like, drawing more realistic textures on the arm, drawing shadows that the objects are casting. It's also learned how to even draw shadows as the arm is moving around in the scene.
And it certainly isn't perfect. There are still these little odd splotches of color around, but it's definitely learning something about what it means for an image to look more realistic. Now, this is good for getting a lot of pretty images,