[ MUSIC ] [ APPLAUSE ] BENGIO: Thank you. All right. Thank you for being here and participating in this colloquium. So, I'll tell you about some of the things that are happening in deep learning, but I only have 30 minutes, so I'll be going fairly quickly through some subjects and some challenges for scaling up deep learning towards AI. Hopefully you'll have chances to ask me some questions during the panel that follows. One thing I want to mention is I'm writing a book. It's called Deep Learning, and you can already download draft versions of most of the chapters from my web page. It's going to be an MIT Press book, hopefully next year. So, what is deep learning and why is everybody excited about it? First of all, deep learning is just an approach to machine learning. And what's particular about it, as Terry was saying, is that it's inspired by brains. Inspired, meaning we're trying to understand some of the computational and mathematical principles that could explain the kind of intelligence based on learning that we see in brains. But from a computer science perspective, the idea is that these algorithms learn representations. So, representation is a central concept in deep learning, and, of course, the idea of learning representations is not new. It was part of the goal of the original neural nets, like the Boltzmann machine and back prop from the '80s. But what's new here, and what happened about ten years ago, is a breakthrough that allowed us to train deeper neural networks, meaning networks that have multiple levels of representation. And why is that interesting?
So, already I mentioned that there are some theoretical results showing that you can represent some complicated functions, those that are the result of many levels of composition, efficiently with these deep networks, whereas in general you won't be able to represent these kinds of functions with a shallow network that doesn't have enough levels. What does it mean to have more depth? It means that you're able to represent more abstract concepts, and these more abstract concepts allow these machines to generalize better. So, that's the essence of what's going on here. All right. So, the breakthrough happened in 2006 where, for the first time, we were able to train these deeper networks, and we used unsupervised learning for that, but it took a few years before these advances made their way to industry and to large scale applications. So, it started around 2010 with speech recognition. By 2012, if you had an Android phone, like this one, well, you had neural nets doing speech recognition in it. And now, of course, it's everywhere. For speech, it's changed the field of speech recognition. Essentially everything uses it. Then about two years later, in 2012, there was another breakthrough using convolutional networks, which are a particular kind of deep network that had been around for a long time but that have been improved using some of the techniques we discovered in recent years. That really allowed us to make a big impact in the field of computer vision, and object recognition in particular. So, I'm sure Fei-Fei will say a few words later about that event and about the role of the ImageNet dataset in this. But what's going on now is that neural nets are going beyond their traditional realm of perception, and people are exploring how to use them for understanding language. Of course, we haven't yet solved that problem. This is where a lot of the action is now and, of course, a lot of research and R&D continues in computer vision.
Now, for example, expanding to video and many other areas. But I'm particularly interested in the extension of this field to natural language. There are other areas. You've heard about reinforcement learning. There is a lot of action there, robotics, control. So, many areas of AI are now more and more seeing the potential gain coming from using these more abstract systems. So, today, I'm going to go through three of the main challenges that I see for bringing deep learning, as we know it today, closer to AI. One of them is computational. Of course, for a company like IBM and other companies that build machines, this is an important challenge. It's an important challenge because what we've observed is that the bigger the models we are able to train, given the amount of data we currently have, the better they are. So, you know, we just keep building bigger models and hopefully we're going to continue improving. Now, that being said, I think it's not going to be enough, so there are other challenges. One of them, as I mentioned, has to do with understanding language. But understanding language actually requires something more. It requires a form of reasoning. So, people are starting to use these recurrent nets you heard about, recurrent networks that can be very deep, in some sense, when you consider time, in order to combine different pieces of evidence and provide answers to questions, and essentially to display different forms of reasoning. So, I'll say a few words about that challenge. And finally, maybe one of the most important challenges, one that's maybe even more fundamental, is the unsupervised learning challenge. Up to now, all of the industrial applications of deep learning have exploited supervised learning, where we have labeled the data: we've said, in that image there's a cat; in that image there's a desk, and so on.
But there's a lot more data we could take advantage of that's unlabeled, and that's going to be important because all of the information we need to build these AIs has to come from somewhere, and we need enough data, and most of it is not going to be labeled. Right. So, as I mentioned, and as my colleague Ilya Sutskever from Google keeps saying, bigger is better. At least up to now, we haven't seen the limitations. I do believe that there are obstacles, and bigger is not going to be enough. But clearly, there's an easy path forward with the current algorithms just by making our neural nets a hundred times faster and bigger. So, why is that? Basically, what I see in many experiments with neural nets right now is that they -- I'm going to use some jargon here -- underfit, meaning that they're not big enough, or we don't train them long enough, for them to exploit all of the information that there is in the data. And so they're not even able to learn the data by heart, right, which is the thing we usually want to avoid in machine learning. But that comes almost for free with these networks, and so we just have to press on the pedal of more capacity and we're almost sure to get an improvement here. All right. Just to illustrate graphically that we have some room to approach the size of human brains, this picture was made by my former student, Ian Goodfellow, where we see the sizes of different organisms and neural nets over the years. So, the DBN here was from 2006. The AlexNet is the breakthrough network of 2012 for computer vision, and the AdamNet is maybe a couple of years old. So, we see that the current technology is maybe between a bee and a frog in terms of the size of the networks, for about the same number of synapses per neuron. So, we've almost reached the kind of average number of synapses per neuron you see in natural brains, between a thousand and ten thousand. In terms of number of neurons, we're several orders of magnitude away.
So, I'm going to tell you a little bit about a stream of research we've been pushing in my lab, which is more connected to the computing challenge and potentially to hardware implementation, which is: can we train neural nets that have very low precision? So, we had a first paper at ICLR. By the way, ICLR is the deep learning conference, and it happens every year now. Yann LeCun and I started it in 2013, and it's been an amazing success that year and every year since then. We're going to have a third edition next May. And so we wanted to know how many bits you actually require. Of course, people have been asking these kinds of questions for decades. But using the current state of the art neural nets, we found 12, and I can show you some pictures of how we got these numbers on different data sets, comparing different ways of representing numbers, with fixed point or dynamic fixed point. And also, it depends on where you use those bits: you actually need fewer bits in the activations than in the weights. So, you need more precision in the weights. So, that was the first investigation. But then we thought -- so that's, for the weights, the number of bits you actually need to keep the information that you are accumulating from many examples. But when you actually run your system, during training especially, maybe you don't need all those bits. Maybe you can get the same effect by introducing noise and discretizing those weights randomly to plus one or minus one. So, that's exactly what we did. The cute idea here is that we can replace a real number by a binary number that has the same expected value, by sampling those two values with a probability such that the expected value is the correct one. And now, instead of having a real number to multiply, we have a bit to multiply, which is easy. It reduces to an addition. And why would we do that? Because we want to get rid of multiplications.
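The expectation-preserving binarization described here can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's implementation; the function name and shapes are my own. A weight w in [-1, 1] is replaced by +1 with probability (1+w)/2 and by -1 otherwise, so the expected value of the binarized weight equals w.

```python
import numpy as np

def stochastic_binarize(w, rng):
    """Replace weights in [-1, 1] by {-1, +1} so that E[w_b] = w.

    P(+1) = (1 + w) / 2 and P(-1) = (1 - w) / 2, hence
    E[w_b] = (+1)(1 + w)/2 + (-1)(1 - w)/2 = w.
    """
    p = (np.clip(w, -1.0, 1.0) + 1.0) / 2.0   # probability of drawing +1
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

rng = np.random.default_rng(0)
w = np.array([0.5, -0.8, 0.0])
# averaging many binarized samples recovers the original weights
samples = np.mean([stochastic_binarize(w, rng) for _ in range(20000)], axis=0)
```

Because each sampled weight is just a sign, multiplying an activation by it reduces to an add or a subtract, which is the point of the trick.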
Multiplication is what takes up most of the surface area on chips for doing neural nets. So, we had a first try at this, and this is going to be presented at the next NIPS, in a few weeks in Montreal. It allows us to get rid of the multiplications in the feed-forward computation and in the backward computation where we compute gradients. But we were left with one multiplication: even if you discretize the weights, there is another multiplication at the end of back prop, where you don't multiply weights -- you multiply activations and gradients. So, if those two things are real valued, you still need regular multiplication. So -- yes, that's going to be in the NIPS paper. But the new thing we did is to get rid of that last multiplication that we need for the update of the weights. So, the delta W is the change in the weights, dC/dA is the gradient that's propagated back, and H is the activations. That's some jargon. But anyway, we have to do this multiplication, and so the only thing we need to do is take one of these two numbers and replace it, again, by a stochastic quantity that is not going to require multiplication. So, instead of binarizing it, we quantize it stochastically to its exponent: we get rid of the mantissa. In other words, we represent it on a log scale. If you do that, you can map the activations to values that are just powers of two. And now multiplication is just addition. This is an old trick -- I mean, the trick of using powers of two is an old trick. The new trick is to do this stochastically so that you actually get the right thing on average, and stochastic gradient descent works perfectly fine. And so we're running some experiments on a few data sets showing that you get a bit of a slowdown because of the extra noise.
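The power-of-two quantization can also be sketched in NumPy. This is my own illustrative formulation of the idea, not the paper's code: each magnitude is rounded stochastically to one of the two nearest powers of two, with probabilities chosen so that the expected value is preserved, which is exactly the same expectation trick as for the binarized weights.

```python
import numpy as np

def stochastic_pow2_quantize(x, rng):
    """Round each |x| stochastically to one of the two bracketing powers
    of two, preserving E[q] = x. Multiplying by q is then just an
    exponent shift plus a sign, i.e. an addition in the log domain."""
    sign = np.sign(x)
    mag = np.abs(x)
    mag = np.where(mag == 0, np.finfo(float).tiny, mag)  # avoid log2(0)
    lo = 2.0 ** np.floor(np.log2(mag))                   # lower power of two
    hi = 2.0 * lo                                        # upper power of two
    p_hi = (mag - lo) / (hi - lo)   # linear interpolation => E[q] = |x|
    q = np.where(rng.random(x.shape) < p_hi, hi, lo)
    return sign * q

rng = np.random.default_rng(0)
x = np.array([0.3, -1.5, 0.7])
q = stochastic_pow2_quantize(x, rng)  # each |q| is an exact power of two
```

The expected value works out because E[q] = lo·(1 − p_hi) + hi·p_hi = lo + (|x| − lo) = |x|, so stochastic gradient descent still sees the right quantity on average.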
But the green and yellow curves here are with binarized weights and stochastically quantized activations. And the good news is, well, it learns even better, actually, because this noise acts as a regularizer. Now, this is pretty good news. Why is this interesting? It's interesting for two reasons. One is that for hardware implementations, this could be useful. The other reason is that it connects with what the brain does -- with spikes, right. So, the idea is, if I go back here, when you replace activations by stochastic binary values that have the right expected value, you're introducing noise, but you're actually not changing the computation of the gradient very much. And so it would be reasonable for brains to use the same trick if they could save on the hardware side. Okay. So now let me move on to my second challenge, which has to do with language and, in particular, language understanding. There's a lot of work to do in this direction, but the progress in the last few years is pretty impressive. Actually, I was part of the beginning of that process of extending the realm of application of neural networks to language. So, in 2000, we had a NIPS paper where we introduced the idea of learning to represent probability distributions over sequences of words -- in other words, being able to generate sequences of words that look like English -- by decomposing the problem into two parts. That's a kind of central element that you find in neural nets and especially in deep learning, which is: think of the problem not as going directly from inputs to outputs, but break the problem into two parts. One is the representation part. So, learning to represent words here by mapping each word to a fixed-size, real-valued vector. And then taking those representations and mapping them to the answers you care about. And here, that's predicting the next word.
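The two-part decomposition just described can be sketched as a toy neural language model. Everything here (sizes, variable names, the single linear output layer) is illustrative, not the 2000 paper's architecture: part one maps word indices to learned real-valued vectors; part two maps the concatenated context vectors to a probability distribution over the next word.

```python
import numpy as np

vocab_size, embed_dim, context = 10, 4, 2
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))            # part 1: word embeddings
W = rng.normal(size=(context * embed_dim, vocab_size))  # part 2: output layer

def next_word_probs(context_ids):
    """Map context word ids to a distribution over the next word."""
    h = np.concatenate([E[i] for i in context_ids])  # representation part
    logits = h @ W                                   # prediction part
    exp = np.exp(logits - logits.max())              # stable softmax
    return exp / exp.sum()

p = next_word_probs([3, 7])  # p is a distribution over the 10-word vocabulary
```

In a real model both E and W are trained jointly by gradient descent to maximize the likelihood of the next word, which is how the word vectors end up capturing semantics.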
It turned out that those representations of words that we learned have incredibly nice properties, and they capture a lot of the semantic aspects of words. And there have been tons and tons of papers analyzing these things and using them in applications. So, these are called word vectors, or word embeddings, and they're used all over the place and have become commonplace in natural language processing. In the last couple of years, there's been a kind of exciting observation about these word embeddings, which is that they capture analogies, even though they were not programmed for that. So, what do I mean? What I mean is that if you take the vectors for the words and do operations on them, like subtracting and adding them, you can get interesting things coming up. So, for example, if you take the vector for queen and you subtract the vector for king, you get a new vector, and that vector is pretty much aligned with the vector that you get from subtracting the representation for man from the representation for woman. So, that means that you could do something like woman minus man, plus king, and get queen, right. So, it can answer the question, you know, what is to king as woman is to man, and it would find queen. So, that's interesting, and there are some nice explanations, so we're starting to understand why this is happening.
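The analogy arithmetic can be shown concretely. The vectors below are hand-made toys, not learned embeddings: I've encoded a "royalty" and a "gender" direction explicitly so the arithmetic comes out exactly; with real word vectors the result is only approximately aligned and you pick the nearest neighbor by cosine similarity.

```python
import numpy as np

# Toy embeddings with explicit [royalty, male] coordinates (illustrative only).
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "car":   np.array([0.2, 0.3]),
}

def nearest(target, exclude):
    """Return the word whose vector has the highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

query = vecs["king"] - vecs["man"] + vecs["woman"]
answer = nearest(query, exclude={"king", "man", "woman"})  # "queen"
```

Excluding the query words is standard practice in analogy evaluation, since the nearest vector to king − man + woman is often king itself in real embedding spaces.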