[ MUSIC ] [ APPLAUSE ] BENGIO: Thank you. All right. Thank you for being here and participating in this colloquium. So, I'll tell you about some of the things that are happening in deep learning, but I only have 30 minutes, so I'll be going fairly quickly through some subjects and some challenges for scaling up deep learning towards AI. Hopefully you'll have a chance to ask me some questions during the panel that follows. One thing I want to mention is that I'm writing a book. It's called Deep Learning, and you can already download most of the chapters; these are draft versions of the chapters, from my web page. It's going to be an MIT Press book, hopefully next year. So, what is deep learning and why is everybody excited about it? First of all, deep learning is just an approach to machine learning. And what's particular about it, as Terry was saying, is that it's inspired by brains. Inspired in the sense that we're trying to understand some of the principles, the computational and mathematical principles, that could explain the kind of intelligence based on learning that we see in brains. But from a computer science perspective, the idea is that these algorithms learn representations. So, representation is a central concept in deep learning, and, of course, the idea of learning representations is not new. It was part of the deal with the original neural nets, like the Boltzmann machine and back-prop from the '80s. But what's new here, what happened about ten years ago, is a breakthrough that allowed us to train deeper neural networks, meaning networks that have multiple levels of representation. And why is that interesting?
So, I already mentioned that there are some theoretical results showing that you can represent some complicated functions, the ones that are the result of many levels of composition, efficiently with these deep networks, whereas, in general, you won't be able to represent these kinds of functions with a shallow network that doesn't have enough levels. What does it mean to have more depth? It means that you're able to represent more abstract concepts, and these more abstract concepts allow these machines to generalize better. So, that's the essence of what's going on here. All right. So, the breakthrough happened in 2006, when, for the first time, we were able to train these deeper networks, and we used unsupervised learning for that, but it took a few years before these advances made their way to industry and to large-scale applications. So, it started around 2010 with speech recognition. By 2012, if you had an Android phone, like this one, well, you had neural nets doing speech recognition in it. And now, of course, it's everywhere. For speech, it's changed the field of speech recognition. Essentially everything uses it. Then, about two years later, in 2012, there was another breakthrough using convolutional networks, which are a particular kind of deep network that had been around for a long time but that was improved using some of the techniques we discovered in recent years. That really allowed us to make a big impact in the field of computer vision, and object recognition in particular. So, I'm sure Fei-Fei will say a few words later about that event and the role of the ImageNet dataset in it. But what's going on now is that neural nets are going beyond their traditional realm of perception, and people are exploring how to use them for understanding language. Of course, we haven't yet solved that problem. This is where a lot of the action is now, and, of course, there continues to be a lot of research and R&D in computer vision.
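The depth argument here can be made concrete: a deep network is literally a composition of learned functions, with each level re-representing the output of the one below. A minimal NumPy sketch of that idea (the layer sizes and names are my own illustration, not anything from the talk):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def deep_forward(x, layers):
    """Run input x through a stack of (W, b) layers.

    Each layer computes a new representation of the previous one,
    so the whole network is the composition f_L(...f_2(f_1(x))).
    """
    h = x
    for W, b in layers:
        h = relu(W @ h + b)
    return h

# Three levels of representation: 4 -> 8 -> 8 -> 2
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 4)), np.zeros(8)),
          (rng.standard_normal((8, 8)), np.zeros(8)),
          (rng.standard_normal((2, 8)), np.zeros(2))]
h = deep_forward(rng.standard_normal(4), layers)
```

The point of the theoretical results is that some functions built from many such compositions need exponentially many units if you flatten them into one or two layers.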
Now, for example, expanding to video and many other areas. But I'm particularly interested in the extension of this field to natural language. There are other areas. You've heard about reinforcement learning. There is a lot of action there, robotics, control. So, many areas of AI are now more and more seeing the potential gain coming from using these more abstract representations. So, today, I'm going to go through three of the main challenges that I see for bringing deep learning, as we know it today, closer to AI. One of them is computational. Of course, for a company like IBM and other companies that build machines, this is an important challenge. It's an important challenge because what we've observed is that the bigger the models we are able to train, given the amount of data we currently have, the better they are. So, you know, we just keep building bigger models, and hopefully we're going to continue improving. Now, that being said, I think it's not going to be enough, so there are other challenges. One of them, as I mentioned, has to do with understanding language. But understanding language actually requires something more. It requires a form of reasoning. So, people are starting to use these recurrent nets you heard about, recurrent networks that can be very deep, in some sense, when you consider time, in order to combine different pieces of evidence, in order to provide answers to questions, and essentially display different forms of reasoning. So, I'll say a few words about that challenge. And finally, maybe one of the most important challenges, one that's maybe even more fundamental, is the unsupervised learning challenge. Up to now, all of the industrial applications of deep learning have exploited supervised learning, where we have labeled data: we've said, in that image, there's a cat; in that image, there's a desk; and so on.
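A toy sketch of the recurrent computation being described: each time step folds one more piece of evidence into a running state, so the unrolled network is very deep in time. This is a generic NumPy illustration of a vanilla recurrent net, not any specific model from the talk:

```python
import numpy as np

def rnn_encode(xs, Wh, Wx, b):
    """Combine a sequence of evidence vectors xs into one state vector.

    Unrolled over T steps this is a T-layer-deep computation:
    h_t = tanh(Wh @ h_{t-1} + Wx @ x_t + b).
    """
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)
    return h

rng = np.random.default_rng(0)
Wh = 0.1 * rng.standard_normal((16, 16))
Wx = 0.1 * rng.standard_normal((16, 8))
b = np.zeros(16)
# Ten "pieces of evidence", e.g. word vectors in a question
h = rnn_encode(rng.standard_normal((10, 8)), Wh, Wx, b)
```

The final state h summarizes the whole sequence and could then be fed to a layer that produces an answer.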
But there's a lot more data we could take advantage of that's unlabeled, and that's going to be important, because all of the information we need to build these AIs has to come from somewhere, and we need enough data, and most of it is not going to be labeled. Right. So, as I mentioned, and as my colleague Ilya Sutskever from Google keeps saying, bigger is better. At least up to now, we haven't seen the limitations. I do believe that there are obstacles, and bigger is not going to be enough. But clearly, there's an easy path forward with the current algorithms just by making our neural nets a hundred times faster and bigger. So, why is that? Basically, what I see in many experiments with neural nets right now is that they -- I'm going to use some jargon here -- they underfit, meaning that they're not big enough, or we don't train them long enough, for them to exploit all of the information that there is in the data. And so they're not even able to learn the data by heart, right, which is the thing we usually want to avoid in machine learning. But that comes almost for free with these networks, and so we just have to press on the pedal of more capacity and we're almost sure to get an improvement here. All right. Just to illustrate graphically that we have some room to approach the size of human brains, this picture was made by my former student, Ian Goodfellow, where we see the sizes of different organisms and neural nets over the years. So, the DBN here was from 2006. The AlexNet is the breakthrough network of 2012 for computer vision, and the AdamNet is maybe a couple of years old. So, we see that the current technology is maybe between a bee and a frog in terms of size of the networks, for about the same number of synapses per neuron. So, we've almost reached the kind of average number of synapses per neuron you see in natural brains, between a thousand and ten thousand. In terms of number of neurons, we're still several orders of magnitude away.
So, I'm going to tell you a little bit about a stream of research we've been pushing in my lab, which is more connected to the computing challenge and potentially to hardware implementation, which is: can we train neural nets that have very low precision? So, we had a first paper at ICLR. By the way, ICLR is the deep learning conference, and it happens every year now. Yann LeCun and I started it in 2013, and it's been an amazing success that year and every year since then. We're going to have a third version next May. And so we wanted to know how many bits you actually require. Of course, people have been asking these kinds of questions for decades. But using sort of the current state-of-the-art neural nets, we found 12, and I can show you some pictures of how we got these numbers on different datasets, comparing different ways of representing numbers, with fixed point or dynamic fixed point. And also, it depends on where you use those bits: you actually need fewer bits in the activations than in the weights. So, you need more precision in the weights. So, that was the first investigation. But then we thought -- so, for the weights, that's the number of bits you actually need to keep the information that you are accumulating from many examples. But when you actually run your system, during training especially, maybe you don't need all those bits. Maybe you can get the same effect by introducing noise and discretizing those weights randomly to plus one or minus one. So, that's exactly what we did. The cute idea here is that we can replace a real number by a binary number that has the same expected value, by sampling those two values with a probability such that the expected value is the correct one. And now, instead of having a real number to multiply, we have a bit to multiply, which is easy. It's just an addition. And why would we do that? Because we want to get rid of multiplications.
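That sampling trick can be written down in a few lines. A sketch, assuming weights are first clipped to [-1, 1] (NumPy here purely for illustration; the actual experiments used the lab's own training code):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_binarize(w):
    """Replace each real weight by +1 or -1 with the same expected value.

    For w in [-1, 1], choose P(+1) = (1 + w) / 2, so that
    E[b] = (+1) * p + (-1) * (1 - p) = 2p - 1 = w.
    Multiplying an activation by b is then just an add or a subtract,
    so no hardware multiplier is needed.
    """
    w = np.clip(w, -1.0, 1.0)
    p = (w + 1.0) / 2.0
    return np.where(rng.random(np.shape(w)) < p, 1.0, -1.0)

# Sanity check: averaging many samples recovers the real weight
samples = stochastic_binarize(np.full(100_000, 0.5))
```

The noise introduced by the sampling averages out over the many examples seen during training, which is why the expected value being correct is the property that matters.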
Multiplications are what take up most of the surface area on chips for doing neural nets. So, we had a first try at this, and this is going to be presented at the next NIPS, in the next few weeks, in Montreal. And it allows us to get rid of the multiplications in the feed-forward computation and in the backward computation where we compute gradients. But we were still left with one multiplication: even if you discretize the weights, there is another multiplication at the end of back-prop where you don't multiply weights; you multiply activations and gradients. So, if those two things are real-valued, you still need regular multiplication. So, yes, that's going to be in the NIPS paper. But the new thing we did is to get rid of that last multiplication that we need for the update