
  • YOSHUA BENGIO: [INAUDIBLE].

  • Thank you [INAUDIBLE].

  • So I'll talk about [INAUDIBLE].

  • I'll talk about representations and learning

  • representations.

  • And the word deep here, I'll explain what it means.

  • So my goal is to contribute to building intelligent machines,

  • also known as AI.

  • And how do we get a machine to be smart--

  • to make good decisions?

  • Well, it needs knowledge.

  • AI researchers

  • from the early days--

  • '50s, '60s, '70s--

  • tried to give the knowledge to the machine--

  • the knowledge we have, explicitly.

  • And it didn't work quite as well as was hoped.

  • One reason is that a lot of our knowledge is not something

  • we can communicate verbally or

  • write down in a program.

  • So that knowledge has to come from somewhere else.

  • And basically what we have found is you can get that

  • knowledge through observing the world around us.

  • That means learning.

  • OK-- so we need learning for AI.

  • What is learning?

  • What is machine learning?

  • It's not about learning things by heart.

  • That's just storing facts.

  • What it is about is generalizing from the examples

  • you've seen to new examples.

  • And what I like to tell my students is it's taking

  • probability mass that is on the training examples and

  • somehow guessing where else it should go-- which new

  • configurations of the things we see make

  • sense or are plausible.

  • This is what learning is about.

  • It's guesswork.

  • And we can measure how well we can guess.

  • And I'll mention something about dimensionality and

  • geometry that comes up when we think about this [INAUDIBLE].

  • And one of the messages will be that we can maybe fight

  • this dimensionality

  • problem by allowing the machine to discover underlying

  • causes-- the underlying factors that explain the data.

  • And this is a little bit what deep learning is about.

  • So let's start from learning-- an easy example

  • of learning.

  • Let's say we observe x,y pairs, where x is a number

  • and y is a number.

  • And the stars here represent the examples we've seen of x,y

  • configurations.

  • So we want to generalize for new configurations.

  • In other words, for example, in this problem, typically we

  • want to predict a y given a new x.

  • And there's an underlying relationship between y and x--

  • the expected value of y given x-- which is shown

  • by this purple curve.

  • But we don't know it.

  • That's the challenge of machine learning.

  • We're trying to discover something

  • we don't know already.

  • And we can guess some function.

  • This is the predicted or learned function.

  • So how could we go about this?

  • One of the most basic principles by which machine

  • learning algorithms are able to do this is to assume something

  • very simple about the world around us-- about the data

  • we're getting or the function we're trying to discover.

  • It's just assuming that the function we're trying to

  • discover is smooth: if I know the value of the function

  • at some point x, and I want to know the

  • value at some nearby point x prime, then it's reasonable to

  • assume that the value of the function at x prime is

  • close to the value at x.

  • That's it.

  • I mean, you can formalize that and [INAUDIBLE] in many

  • different ways and exploit it in many ways.

  • And what it means here is if I ask what y

  • should be at this point,

  • what I'm going to do is look up the values of y that I

  • observed at nearby points.

  • And combining these,

  • I make a reasonable guess like this one.

  • And if I do that on problems like this, it's actually going

  • to work quite well.
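
A minimal sketch of this nearest-neighbor idea, in Python. The data, the underlying function, and the choice of k are all invented for illustration; this is not the speaker's exact method.

    import numpy as np

    # Toy 1-D regression data: noisy samples of an unknown smooth function.
    rng = np.random.default_rng(0)
    x_train = rng.uniform(0.0, 10.0, size=50)
    y_train = np.sin(x_train) + rng.normal(0.0, 0.1, size=50)

    def knn_predict(x_new, k=5):
        # Guess y at x_new by averaging the observed y values at the
        # k nearest training points -- pure reliance on smoothness.
        nearest = np.argsort(np.abs(x_train - x_new))[:k]
        return y_train[nearest].mean()

    print(knn_predict(3.0))  # close to sin(3.0) where the data is dense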

  • And a large fraction of the applications that we're

  • seeing use this principle.

  • And sometimes just this principle is enough.

  • But if we only rely on this principle for generalization,

  • we're going to be in trouble.

  • That's one of the messages I want to explain here.

  • So why are we going to be in trouble?

  • Well, basically we're doing some kind of interpolation.

  • So if I see enough examples--

  • the green stars here-- to cover the ups and downs of the

  • function I'm trying to learn, then I'm going to be fine.

  • But what if the function I want to learn has many more

  • ups and downs than I can possibly observe through data?

  • Because even Google has a finite number of examples.

  • Even if you have millions or billions of examples, the

  • functions we want to learn for AI are not like this one.

  • They have many more variations: the number of

  • configurations of the variables of interest

  • may be exponentially large--

  • maybe bigger than the number of

  • atoms in the universe.

  • So there's no way we're going to have enough examples to

  • cover all the configurations.

  • For example, think of the number of different English

  • sentences, which is something that Google is interested in.

  • And this problem is illustrated by the so-called

  • curse of dimensionality where you consider what happens when

  • you have not just one variable but many variables and all of

  • their configurations.

  • How many configurations of n variables do you have?

  • Well, you have an exponential number of configurations.

  • So if I wanted to learn about a single variable--

  • say it takes real values--

  • I can just divide its range into intervals.

  • And I count how many times each of those bins

  • shows up in my data.

  • So I can estimate the probability of different

  • intervals coming up.

  • That's easy, because I only want to know about a small

  • number of different configurations.

  • But if I'm looking at two variables, then the number of

  • configurations may be that number squared--

  • and with three variables, even more.

  • But typically, I'm going to have hundreds of variables--

  • if you're thinking about images, it's thousands, tens of

  • thousands, hundreds of thousands.

  • So it's crazy how many configurations there are.
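
A back-of-the-envelope version of this counting argument, in Python. The choice of 10 bins per variable is arbitrary.

    # Discretize each variable into 10 bins; the number of joint
    # configurations grows exponentially with the number of variables.
    bins_per_variable = 10
    for n_variables in (1, 2, 3, 100, 1000):
        print(n_variables, bins_per_variable ** n_variables)
    # With 1000 variables there are 10**1000 cells-- vastly more than
    # the roughly 10**80 atoms in the observable universe-- so almost
    # every cell will never contain a single training example.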

  • So how do we possibly generalize to new

  • configurations?

  • We cannot just break up this space into small cells and

  • count how many things happen in each cell, because the new

  • examples that we care about-- the new

  • configurations that we're asked about-- might

  • be in some region where we haven't seen any data.

  • So that's the problem of generalizing to new configurations.

  • So there's one thing that can help us, but it's not going to

  • be sufficient.

  • It's something that happens with AI problems.

  • It's very often the case in vision, in language

  • processing and understanding, and many other problems where

  • the set of configurations of variables that are plausible--

  • that can happen in the real world--

  • occupy a very small volume of all this set of possible

  • configurations.

  • So let me give an example.

  • In images, if I choose the pixels in an image randomly--

  • in other words, if I sample an image from a completely

  • uniform distribution-- I'm going to get things like this.

  • Just noise.

  • And I can repeat this for eons and eons.

  • And I'm never going to assemble something that looks

  • like a face.
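
A rough sketch of this sampling experiment, assuming NumPy and Matplotlib are available; the 64x64 image size is arbitrary.

    import numpy as np
    import matplotlib.pyplot as plt

    # Sample a grayscale "image" whose pixels are drawn independently
    # from a uniform distribution over [0, 1].
    rng = np.random.default_rng(0)
    noise = rng.uniform(0.0, 1.0, size=(64, 64))

    plt.imshow(noise, cmap="gray")
    plt.title("A uniform sample from image space: just noise")
    plt.show()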

  • So what it means is that faces--

  • images of faces--

  • are very rare in the space of images.

  • They occupy a very small volume, much less than what

  • this picture would suggest.

  • And so this is a very important hint.

  • It means that actually the task is to find out where this

  • distribution concentrates.

  • I have another example here.

  • If you take the image of a four like this one and you do

  • some geometric transformations to it, like rotating it or

  • scaling it, you get slightly different images.

  • And if at each point, you allow yourself to make any of

  • these transformations, you can create a so-called manifold--

  • so a surface of possible images.

  • Each point here corresponds to a different image.

  • And the number of different changes that you make is

  • basically the dimensionality of this manifold.

  • So in this case, even though the data lives in a

  • high-dimensional space, the actual variations we care about are

  • of low dimensionality.

  • And knowing that, we can maybe do

  • better in terms of learning.
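
A sketch of how such a manifold can be traced out in code, assuming SciPy; the stand-in "digit" and the grid of angles and scales are invented.

    import numpy as np
    from scipy.ndimage import rotate, zoom

    # A crude 28x28 stand-in for a digit image: a vertical stroke.
    image = np.zeros((28, 28))
    image[6:22, 12:16] = 1.0

    # Two continuous knobs -- rotation angle and scale -- parameterize
    # a 2-D manifold of variants inside the 784-dimensional image space.
    variants = [zoom(rotate(image, angle, reshape=False), scale)
                for angle in (-15.0, 0.0, 15.0)
                for scale in (0.9, 1.0, 1.1)]
    print(len(variants))  # 9 sample points from the manifold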

  • One thing about the curse of dimensionality is I don't like

  • the name curse of dimensionality, because it's

  • not really about dimensionality.

  • You can have many dimensions but have

  • a very simple function.

  • What really matters is how many variations does the

  • function have-- how many ups and downs?

  • So we actually had some fairly cool results about this:

  • the number of examples you would need, if you were only

  • relying on this smoothness assumption, is essentially

  • linear in the number of ups and downs of the

  • function you want to learn.

  • So let's come back to this idea of learning where to put

  • probability mass.

  • So in machine learning, what we have is data.

  • Each example is a configuration of variables.

  • And we know that this configuration occurred

  • in the real world.

  • So we can assign probability to this configuration.

  • So this is the space of configurations

  • I'm showing in 2D.

  • So we know that this configuration is plausible.

  • [INAUDIBLE].

  • So we can just put a peak of

  • probability mass here.

  • And we can put a peak at every example.

  • The question is how do we take this probability mass and sort

  • of give a little bit of that to other places.

  • In particular, we'd like to put mass in between.

  • If there really is a manifold with some structure,

  • and we could discover that structure, that would be great.

  • So the classical machine learning way of doing things

  • is to say that the distribution function-- the function that

  • you're trying to learn in this case-- is smooth.

  • So if it's very probable here, it must also be probable in

  • the neighborhood.

  • So we can just apply some mathematical operation that

  • will shift some mass from here to the different

  • neighbors.

  • Then we get a distribution like this as our model.

  • And that works reasonably well.
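
A minimal sketch of that classical approach: kernel density estimation, which places a Gaussian bump of probability mass on every training example. The data and bandwidth here are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=(100, 2))  # invented 2-D examples

    def kde(query, bandwidth=0.3):
        # Average of Gaussian kernels centered on the training examples.
        # Mass is spread equally in all directions around each point.
        d2 = np.sum((data - query) ** 2, axis=1)
        norm = (2.0 * np.pi * bandwidth ** 2) ** (data.shape[1] / 2.0)
        return np.mean(np.exp(-d2 / (2.0 * bandwidth ** 2))) / norm

    print(kde(np.array([0.0, 0.0])))  # high near the data, low far away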

  • But it's not the right thing to do.

  • It's putting mass in many directions

  • we don't care about.

  • Instead, what we're going to do is to discover that there

  • is something about this data.

  • There is some structure.

  • There is some abstraction that allows us to be very specific

  • about where we're going to put probability mass.

  • And we might end up with something like this, which in

  • 2D doesn't look like a big difference.

  • But in high dimensions, the number of directions you're

  • allowed to move here is very small compared to the number

  • of dimensions here.

  • And the volume grows exponentially with dimension.

  • So you can have a huge gain by guessing

  • properly which directions things are allowed

  • to move in while keeping high probability.
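
One way to see this numerically: take a small neighborhood of the data and count the directions in which it actually varies. The synthetic curve below is invented for illustration.

    import numpy as np

    # Data near a 1-D curve embedded in 10 dimensions, plus slight noise.
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 1.0, size=200)
    basis = rng.normal(size=(2, 10))
    data = (np.outer(np.sin(2.0 * np.pi * t), basis[0])
            + np.outer(t, basis[1])
            + 0.01 * rng.normal(size=(200, 10)))

    # Singular values of a local neighborhood: only the first couple are
    # large, so locally there are few directions of real variation.
    center = data[0]
    order = np.argsort(np.linalg.norm(data - center, axis=1))
    neighbors = data[order[:20]]
    s = np.linalg.svd(neighbors - neighbors.mean(axis=0), compute_uv=False)
    print(np.round(s, 3))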

  • So, now to the core of this presentation which is about

  • representation learning.

  • I've talked about learning in general

  • and some of the issues--

  • some of the challenges with applying learning to AI.

  • Now, when you look at how machine learning is applied in

  • industry, what people do 90% of the time-- where they

  • spend the effort of engineers-- is not really

  • improving machine learning.

  • They use existing machine learning.

  • But to make the machine learning algorithms work

  • well, they do what is called

  • feature engineering.

  • So that means taking the raw data and transforming it--

  • extracting some features-- deciding what matters--

  • throwing away