RUSS ALTMAN: It is now my pleasure to introduce my colleague and tireless co-organizer, Dr. Fei-Fei Li. She's an associate professor of Computer Science and Psychology. She's director of the AI Lab here at Stanford. And she's done breakthrough research in human vision, high-level visual recognition, and computational neuroscience, some of which I think we might hear about now. Fei-Fei.

[APPLAUSE]
FEI-FEI LI: Good evening, everyone. It's quite an honor to be here. And I'm going to share with you some of my recent work in visual intelligence.
So we're going to begin 543 million years ago. Simple organisms lived in the vast ocean of this Precambrian age. They floated around, waiting for food to come by or becoming someone else's food. Life was very simple, and so was the animal kingdom. There were only a few animal species around.

And then something happened. The first trilobites started to develop eyes, and life just changed forever after that. Suddenly, animals could go seek food. Prey had to run from predators. And the number of animal species exploded in an exceedingly short period of time. Evolutionary biologists call this period the Cambrian Explosion, or the big bang of evolution, and consider the development of vision to be the main factor that caused this animal speciation.
So ever since then, vision has played a very important role in animal life: to survive, to seek food, to navigate, to manipulate, and so on. And the same is true for humans. We use vision to live, to work, to communicate, and to understand this world. In fact, after 540 million years of evolution, the visual system is the biggest sensory system in our brain, and more than half of the brain's neurons are involved in vision.
So while animals saw the light of the world 540 million years ago, our machines and computers are still very much in the dark age. We have security cameras everywhere, but they don't alert us when a child is drowning in a swimming pool. Hundreds of hours of video are uploaded every minute to the YouTube servers, yet we do not have the technology to tag and recognize the contents. We have drones flying over massive lands capturing an enormous amount of imagery, but we do not have a method or algorithm to understand the landscape of the earth. So in short, as a society, we're still pretty much collectively blind, because our smartest machines and computers are blind.
So as computer vision scientists, we seek to develop artificial intelligence algorithms that can learn about the visual world and recognize the contents of images and videos. We've got a daunting task: to shine light on our digital world. And we don't have 540 million years to do this.
So the first step towards this goal is to recognize objects, because they are the building blocks of our visual world. In the simplest terms, imagine this process of teaching computers to recognize objects: first we show them a few training images of a particular object, let's say a cat. Then we design a mathematical model that can learn from these training images.
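As a rough illustration of that teaching process, here is a minimal sketch in PyTorch. The tiny linear model, the random stand-in images, and the labels are all assumptions for illustration, not the setup described in the talk.

```python
# A minimal sketch of the teaching process: show the model labeled
# training images, then adjust it to make fewer mistakes on them.
import torch
import torch.nn as nn

# Stand-in "training images": 8 random 32x32 RGB images, flattened.
images = torch.randn(8, 3 * 32 * 32)
labels = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = cat, 0 = not cat

model = nn.Linear(3 * 32 * 32, 2)      # the "mathematical model"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)  # how wrong is the model?
    loss.backward()                        # compute how to improve
    optimizer.step()                       # nudge the model's parameters
```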
How hard could this be? Humans do this effortlessly.
So that's what we tried at the beginning. In a straightforward way, we tried to express objects by designing their parts and the configurations of those parts, such as using simple geometric shapes to define a cat model. Well, there are lots of different cats. So for this one, we cannot use our original model; we have to design another model. Well, what about these cats? So now you get the idea. Even something as simple as a household pet can present an infinite number of variations for us to model. And that's just one object.
But this is what many of us were doing at that time: we kept designing and tuning our algorithms, waiting for that magical algorithm that could model all the variations of an object using just a few training images. Then, about nine years ago, a very simple but profound observation changed my thinking. This is not how children learn. We don't tell kids how to see. They do it by experiencing the real world and by experiencing real-life examples.
If you consider a child's eyes as a pair of biological cameras, they take a picture every 200 milliseconds. So by age three, a child would have seen hundreds of millions of images. And that's the amount of data we're talking about to develop a vision system.
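A quick back-of-the-envelope check of that figure, sketched in Python; the twelve waking hours per day is my own assumption, not a number from the talk.

```python
# One picture every 200 ms, over three years of waking hours.
frames_per_second = 1 / 0.2               # 5 images per second
waking_seconds_per_day = 12 * 3600        # assume ~12 waking hours/day
images_by_age_three = frames_per_second * waking_seconds_per_day * 3 * 365
print(f"{images_by_age_three:,.0f}")      # ~236,520,000, i.e. hundreds of millions
```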
So before we come up with a better algorithm, we should provide our computer algorithms with the kind of data that children experience in their developmental years. And once we realized this, I knew what we needed to do. We needed to collect a data set with far more images than we had ever used before in machine learning and computer vision: thousands of times larger than the standard datasets being used at the time.
So together with my colleague Professor Kai Li and my student Jia Deng, we started the ImageNet project back in 2007. After three years of very hard work, by 2009 the ImageNet project had delivered a database of 15 million images across 22,000 categories of objects and things, organized by everyday English words. In quality and quantity, this was an unprecedented scale for the field of computer vision and machine learning.
So more than ever, we're now poised to tackle the problem of object recognition using ImageNet. This is the first takeaway message I'm going to deliver today: much of learning is about big data. This is a child's perspective.
As it turned out, the wealth of information provided by ImageNet was a perfect match for a particular class of machine learning algorithms called the Convolutional Neural Network, pioneered by computer scientists Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun back in the 1970s and '80s. Just as the brain consists of billions of neurons, the basic operating unit of a Convolutional Neural Network is a neuron-like node that receives input from other nodes and sends output to others. Moreover, hundreds of thousands of these neuron-like nodes are layered together in a hierarchical fashion, also similar to the brain.
This is a typical Convolutional Neural Network model we use in our lab to train our object recognition algorithm. It consists of 24 million nodes, 140 million parameters, and 15 billion connections. With the massive data provided by ImageNet and modern computing hardware like CPUs and GPUs to train this humongous model, the Convolutional Neural Network algorithm blossomed in a way that no one had expected. It became the winning architecture for object recognition.
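To make the architecture concrete, here is a minimal sketch of a convolutional network in PyTorch; the layer sizes are illustrative assumptions, and the real model described above is orders of magnitude larger.

```python
# A minimal convolutional network: stacked convolution layers feeding
# a classifier, each node taking input from a local patch of the layer
# below and sending its output upward in a hierarchy.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)                   # hierarchical feature extraction
        return self.classifier(x.flatten(1))   # one score per object category

model = TinyConvNet()
scores = model(torch.randn(1, 3, 224, 224))    # one 224x224 RGB image
print(scores.shape)                            # torch.Size([1, 1000])
```

Stacking many more of these convolution-and-pooling layers, and making them wider, is what takes a toy model like this toward a network with tens of millions of parameters.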
Here is what the computer tells us: the image contains a cat, and where the cat is. Here is a boy and his teddy bear. A dog on the beach with a person and a kite. So far, what we have seen is teaching computers to recognize objects. This is like a young child learning to utter the first few nouns. It's a very impressive achievement, but it's only the beginning. Children soon hit another developmental milestone: they begin to communicate in sentences and tell stories. So instead of saying--
CHILD 1: That's a cat sitting in a bed.

FEI-FEI LI: Right, this is a three-year-old telling us the story of the scene instead of just labeling it as a cat. Here's one more.

CHILD 2: Those are people. They're going on a airplane. That's a big airplane!
FEI-FEI LI: Very cute. So to train a computer to see a picture and generate a story, the marriage between big data and machine learning algorithms has to take another step, just like our brain integrates vision and language. We use a deep learning algorithm that learns to connect visual snippets with words and phrases to generate sentences.
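A minimal sketch of that vision-plus-language idea, assuming PyTorch: an image feature starts a sequence that a recurrent language model continues, word by word. The stand-in encoder, vocabulary size, and dimensions are illustrative assumptions, not the model from the talk.

```python
# Encoder-decoder captioning sketch: encode the image into a vector,
# then let an LSTM predict the next word at each step.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stand-in for a convolutional encoder: maps an image to a vector.
        self.encoder = nn.Sequential(nn.Flatten(),
                                     nn.Linear(3 * 64 * 64, embed_dim))
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, word_ids):
        v = self.encoder(image).unsqueeze(1)        # (B, 1, E) visual "word"
        w = self.embed(word_ids)                    # (B, T, E) caption so far
        h, _ = self.rnn(torch.cat([v, w], dim=1))   # image starts the sequence
        return self.out(h)                          # next-word scores per step

model = CaptionModel(vocab_size=10000)
scores = model(torch.randn(2, 3, 64, 64), torch.randint(0, 10000, (2, 5)))
print(scores.shape)  # torch.Size([2, 6, 10000])
```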
Now I'm going to show you, for the first time, what a computer would say when it sees a picture.

COMPUTER VOICE: A large airplane sitting on top of an airport runway.

FEI-FEI LI: Not as cute, but still good.

COMPUTER VOICE: A man is standing next to an elephant.

FEI-FEI LI: So this is an algorithm we developed to generate one sentence.
Recently, we've taken the storytelling algorithm a step further and created a deep learning model that can generate multiple sentences and phrases describing the different regions of a picture. Our algorithm is computationally very efficient, so it can process video almost in real time. Here I'm showing you the algorithm generating regions and region descriptions for every frame of this video.
So we have successfully used neural network algorithms to train computer vision models that begin to tell the story of the visual world. This is a brain-inspired perspective. With the availability of big data and the blossoming of powerful neural network models, we begin to see unprecedented advances in all areas of computer vision, both in my own lab and across the field.
Now let me show you a few more examples and their potential applications. Collaborating with Google's YouTube team, we developed a deep learning algorithm that can classify hundreds of sports types. We hope one day this technology can help us manage, index, and search massive amounts of photos and videos in big data repositories.
Working with a European train station, we used hundreds of computer vision sensors to help observe and track the behaviors of millions of travelers and customers. This provided invaluable information for the train station to gather analytics about its customers and to optimize the use of its space.
Furthermore, we developed a reinforcement learning algorithm and deep learning model that perform human activity understanding in an extremely efficient manner, achieving the same results as the state-of-the-art algorithm in action detection while using only 2% of the video frames.
In a different work, we used depth sensors to learn about human movements in great detail. We collaborated with the Stanford hospitals to deploy this technology, helping the hospital improve hand hygiene and workflow practices.
And in this work, we trained a computer vision algorithm that can do better object recognition than humans, at least some of us, by recognizing 3,000 types of cars by make, model, and year. We applied this to 50 million Google Street View images across 200 American cities and learned very interesting social statistics, like a visual census. We learned that the average car price correlates very well with average household incomes in cities. It also correlates very well with crime rates in cities, or even voting patterns. Let's wait till later this year.
So in short, as a technologist, nothing excites me more than seeing the potential of computer vision algorithms to solve real-world problems. This is a technologist's perspective.