YOSHUA BENGIO: [INAUDIBLE]. Thank you [INAUDIBLE]. So I'll talk about [INAUDIBLE]. I'll talk about representations and learning representations. And the word "deep" here-- I'll explain what it means. So my goal is to contribute to building intelligent machines, also known as AI. And how do we get a machine to be smart-- to take good decisions? Well, it needs knowledge. Researchers from the early days-- the '50s, '60s, '70s-- tried to give the knowledge to the machine-- the knowledge we have explicitly. And it didn't work quite as well as was hoped. One reason is that a lot of our knowledge is not something we can communicate verbally and write down in a program. So that knowledge has to come from somewhere else. And basically what we have found is that you can get that knowledge by observing the world around us. That means learning. OK-- so we need learning for AI.

What is learning? What is machine learning? It's not about learning things by heart-- that's just storing facts. What it is about is generalizing from the examples you've seen to new examples. And what I like to tell my students is that it's taking the probability mass that is on the training examples and somehow guessing where it should go-- which new configurations of the things we see make sense or are plausible. This is what learning is about. It's guesswork. At first we can measure [INAUDIBLE] we can guess. And I'll mention something about dimensionality and geometry that comes up when we think about this [INAUDIBLE]. And one of the messages will be that we can maybe fight this dimensionality problem by allowing the machine to discover underlying causes-- the underlying factors that explain the data. And this is a little bit like [INAUDIBLE] is about.

So let's start from learning-- an easy [INAUDIBLE] of learning. Let's say we observe x,y pairs, where x is a number and y is a number. And the stars here represent the examples we've seen of x,y configurations. We want to generalize to new configurations. In other words, in this problem, typically we want to predict a y given a new x. And there's an underlying relationship between y and x-- the expected value of y given x-- which is shown by this purple curve. But we don't know it. That's the point of machine learning: we're trying to discover something we don't already know. And we can guess some function. This is the predicted or learned function.

So how could we go about this? One of the most basic principles by which machine learning algorithms are able to do this is to assume something very simple about the world around us-- about the data we're getting or the function we're trying to discover. It's just assuming that the function we're trying to discover is smooth, meaning that if I know the value of the function at some point x, and I want to know the value at some nearby point x prime, then it's reasonable to assume that the value of the function at x prime is close to its value at x. That's it. You can formalize that [INAUDIBLE] in many different ways and exploit it in many ways. And what it means here is that if I ask what y should be at this new point, what I'm going to do is look up the values of y that I observed at nearby points and, combining these, make a reasonable guess like this one. And if I do that on problems like this, it's actually going to work quite well. And a large fraction of the applications that we're seeing use this principle.
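To make that smoothness idea concrete, here is a minimal sketch (not part of the talk; the toy function `f_true`, the data sizes, and `k` are illustrative assumptions) of the kind of local averaging described above: to guess y at a new x, average the observed y values of the nearest training points.

```python
import numpy as np

# Toy 1-D regression problem: x,y pairs from an unknown smooth curve plus noise.
# f_true plays the role of the "purple curve" we are trying to discover.
rng = np.random.default_rng(0)
f_true = lambda x: np.sin(3 * x) + 0.5 * x
x_train = rng.uniform(0.0, 3.0, size=30)
y_train = f_true(x_train) + rng.normal(scale=0.1, size=30)

def predict_local_average(x_new, x_train, y_train, k=5):
    """Guess y at x_new by averaging the y values of the k nearest training x's.

    This is the smoothness assumption in its simplest form: nearby inputs are
    assumed to have nearby outputs, so we interpolate from observed neighbors.
    """
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[nearest].mean()

x_query = 1.7  # a configuration we never observed
print(predict_local_average(x_query, x_train, y_train))  # roughly tracks f_true(1.7)
```

This works well exactly when the training examples are dense enough to cover the ups and downs of the underlying curve, which is the limitation the talk turns to next.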
And in many cases that's enough-- just this principle. But if we only rely on this principle for generalization, we're going to be in trouble. That's one of the messages I want to explain here. So why are we going to be in trouble? Well, basically we're doing some kind of interpolation. So if I see enough examples-- the green stars here-- to cover the ups and downs of the function I'm trying to learn, then I'm going to be fine. But what if the function I want to learn has many more ups and downs than I can possibly observe through data? Because even Google has a finite number of examples. Even if you have millions or billions of examples, the functions we want to learn for AI are not like this one. The number of configurations of the variables of interest may be exponentially large-- maybe bigger than the number of atoms in the universe. So there's no way we're going to have enough examples to cover all the configurations. For example, think of the number of different English sentences, which is something that Google is interested in.

And this problem is illustrated by the so-called curse of dimensionality, where you consider what happens when you have not just one variable but many variables, and all of their configurations. How many configurations of N variables do you have? Well, you have an exponential number of configurations. So if I want to learn about a single variable-- say it takes real values-- I can divide its range into intervals, count how many examples fall into each of those bins in my data, and estimate the probability of different intervals coming up. That's easy, because I only need to know about a small number of different configurations. But if I'm looking at two variables, then the number of configurations is, say, squared-- bigger-- and with three, even more. But typically I'm going to have hundreds of variables-- if you're thinking about images, it's thousands, tens of thousands, hundreds of thousands. So it's crazy how many configurations there are. So how do we possibly generalize to new configurations? We cannot just break up this space into small cells and count how many things happen in each cell, because the new examples that we care about-- the new configurations we're asked about-- might be in some region where we haven't seen any data. So that's the problem of generalizing [INAUDIBLE].

So there's one thing that can help us, but it's not going to be sufficient. It's something that happens with the problems we care about in AI. It comes up very often in vision, [INAUDIBLE] processing and understanding, and many other problems where the set of configurations of variables that are plausible-- that can happen in the real world-- occupies a very small volume of the set of all possible configurations. So let me give an example. In images, if I choose the pixels in an image randomly-- in other words, if I sample an image from a completely uniform distribution-- I'm going to get things like this. Just [INAUDIBLE]. And I can repeat this for eons and eons, and I'm never going to sample something that looks like a face. So what it means is that faces-- images of faces-- are very rare in the space of images. They occupy a very small volume, much less than what this picture would suggest. And so this is a very important hint. It means that actually the task is to find out where this distribution concentrates.
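The counting argument above (divide each variable into intervals, count the examples that fall in each cell) can be sketched in a few lines; the dataset size and bin count below are made-up numbers, just to show how quickly the cells outrun any finite dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 100_000   # even a fairly large dataset...
bins_per_var = 10      # ...with only 10 intervals per variable

for n_vars in (1, 2, 3, 6, 10):
    # The number of cells grows exponentially with the number of variables.
    n_cells = bins_per_var ** n_vars

    # Drop uniformly distributed examples into cells and count distinct cells hit.
    data = rng.uniform(0.0, 1.0, size=(n_examples, n_vars))
    cells_hit = np.unique((data * bins_per_var).astype(int), axis=0)

    frac = len(cells_hit) / n_cells
    print(f"{n_vars:2d} variables: {n_cells:,} cells, "
          f"{len(cells_hit):,} ever observed ({100 * frac:.4f}%)")
```

Even with a hundred thousand examples, by ten variables virtually every cell is empty, so a new configuration almost always lands where we have seen no data-- which is why cell-counting alone cannot generalize.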
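The point that uniformly sampled pixels essentially never produce a natural image can be hinted at with a crude proxy for image structure (the statistic and the toy "structured" image below are illustrative choices, not from the talk): in smoothly varying images neighboring pixels are strongly correlated, while uniformly random pixels are not.

```python
import numpy as np

rng = np.random.default_rng(0)

def neighbor_correlation(img):
    """Correlation between horizontally adjacent pixels -- a crude stand-in for 'looks structured'."""
    left, right = img[:, :-1].ravel(), img[:, 1:].ravel()
    return np.corrcoef(left, right)[0, 1]

# Images whose pixels are chosen uniformly at random: the noise the talk describes.
noise_images = rng.uniform(0.0, 1.0, size=(1000, 64, 64))
print(np.mean([neighbor_correlation(im) for im in noise_images]))  # about 0

# A toy smoothly varying image, standing in for the slow variation of real photographs.
xx, yy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
smooth_image = 0.5 * xx + 0.5 * np.sin(4 * yy)
print(neighbor_correlation(smooth_image))  # close to 1
```

Uniform sampling essentially never lands in the thin region of pixel space where that kind of structure holds, which is the sense in which plausible images occupy a very small volume.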
I have another example here. If you take the image of a four like this one and you apply some geometric transformations to it-- rotating it, scaling it-- you get slightly different images. And if at each point you allow yourself to make any of these transformations, you can create a so-called manifold--