
  • RUSS ALTMAN: It is now my pleasure

  • to introduce my colleague and tireless co-organizer,

  • Dr. Fei-Fei Li.

  • She's an associate professor of Computer Science

  • and Psychology.

  • She's director of the AI Lab here at Stanford.

  • And she's done breakthrough research in human vision,

  • high-level visual recognition, and computational neuroscience,

  • some of which I think we might hear about now-- Fei Fei.

  • [APPLAUSE]

  • FEI-FEI LI: Good evening, everyone.

  • It's quite an honor to be here.

  • And I'm going to share with you some

  • of my recent work in visual intelligence.

  • So we're going to begin 543 million years ago.

  • Simple organisms lived in the vast ocean of this Precambrian

  • age.

  • And they floated around, waiting for food to come by

  • or to become someone else's food.

  • So life was very simple and so was the animal kingdom.

  • There were only a few animal species around.

  • And then something happened.

  • The first trilobites started to develop eyes and life

  • just changed forever after that.

  • Suddenly, animals could go seek food.

  • Prey had to run from predators.

  • And the number of animal species just

  • exploded in an exceedingly short period of time.

  • Evolutionary biologists call this period the Cambrian

  • Explosion, or the big bang of evolution,

  • and consider the development of vision

  • to be the main factor behind this animal speciation.

  • So ever since then, vision has played a very important role

  • in helping animals survive, seek food, navigate,

  • manipulate, and so on.

  • And the same is true for humans.

  • We use vision to live, to work, to communicate,

  • and to understand this world.

  • In fact, after 540 million years of evolution,

  • the visual system is the biggest sensory system in our brain.

  • And more than half of the brain neurons are involved in vision.

  • So while animals have seen the light of the world

  • 540 million years ago, our machines and computers

  • are still very much in the dark age.

  • We have security cameras everywhere

  • but they don't alert us when a child is

  • drowning in a swimming pool.

  • Hundreds of hours of video are uploaded every minute

  • to the YouTube servers, yet we do not

  • have the technology to tag and recognize the contents.

  • We have drones flying over massive lands

  • taking an enormous amount of imagery,

  • but we do not have a method or algorithm

  • to understand the landscape of the earth.

  • So in short as a society, we're still pretty much

  • collectively blind because our smartest machines and computers

  • are blind.

  • So as computer vision scientists,

  • we seek to develop artificial intelligence algorithms that

  • can learn about the visual world and recognize the contents

  • in the images and the videos.

  • We've got a daunting task to shine light

  • on our digital world and we don't have

  • 540 million years to do this.

  • So the first step towards this goal

  • is to recognize objects because they are the building

  • blocks of our visual world.

  • In the simplest terms, imagine the process

  • of teaching computers to recognize objects

  • by first showing them a few training

  • images of a particular object-- let's say a cat.

  • And then we design a mathematical model

  • that can learn from these training images.
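
The setup just described, a few labeled training images and a mathematical model fit to them, can be sketched in hypothetical form. This is not any model from the talk; it is a deliberately naive nearest-centroid classifier over raw pixels, just to make the "learn from training images, then predict" pipeline concrete:

```python
import numpy as np

# Hypothetical sketch: "learn" from a few training images by averaging
# each class's pixels, then label a new image by its nearest class
# centroid. Tiny 4x4 grayscale arrays stand in for real photos.
def train(images_by_label):
    """Compute one mean-pixel centroid per object label."""
    return {label: np.mean(imgs, axis=0)
            for label, imgs in images_by_label.items()}

def predict(centroids, image):
    """Assign the label whose centroid is closest in pixel space."""
    return min(centroids,
               key=lambda label: np.linalg.norm(image - centroids[label]))

# Toy data: "cat" images happen to be bright, "dog" images dark.
cats = [np.full((4, 4), 0.9), np.full((4, 4), 0.8)]
dogs = [np.full((4, 4), 0.1), np.full((4, 4), 0.2)]
model = train({"cat": cats, "dog": dogs})
print(predict(model, np.full((4, 4), 0.85)))  # prints "cat"
```

A pixel-distance model like this captures the pipeline but not the problem: real cats vary enormously in pose, lighting, and appearance, which is exactly the difficulty the talk turns to next.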

  • How hard could this be?

  • Humans do this effortlessly.

  • So that's what we tried at the beginning.

  • In a straightforward way, we tried

  • to express objects by modeling their parts and the configurations

  • of those parts, such as using

  • simple geometric shapes to define a cat model.

  • Well, there are lots of different cats.

  • So for this one, we cannot use our original model.

  • We have to design another one.

  • Well, what about these cats?

  • So now you get the idea.

  • Even something as simple as a household pet

  • can pose an infinite number of variations for us to model.

  • And that's just one object.

  • But this is what many of us were doing at that time.

  • We kept designing and tuning our algorithms,

  • and waiting for that magical algorithm

  • to be able to model all the variations of an object using

  • just a few training images.

  • But about nine years ago, a very profound

  • but simple observation changed my thinking.

  • This is not how children learn.

  • We don't tell kids how to see.

  • They do it by experiencing the real world

  • and by experiencing real life examples.

  • If you consider a child's eyes as a pair

  • of biological cameras, they take a picture

  • every 200 milliseconds.

  • So by age three, a child would have seen

  • hundreds of millions of images.
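
That back-of-the-envelope figure can be checked with a quick calculation. The 200-millisecond rate comes from the talk; the assumption of roughly 12 waking hours per day is mine, added purely for illustration:

```python
# Back-of-the-envelope check of the estimate: one "picture" every
# 200 ms over a child's first three years of waking hours.
ms_per_picture = 200
pictures_per_second = 1000 // ms_per_picture  # 5 per second
waking_seconds = 3 * 365 * 12 * 3600          # assumed ~12 waking hours/day
total = pictures_per_second * waking_seconds
print(f"{total:,}")  # prints 236,520,000 -> hundreds of millions
```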

  • And that's the amount of data we're talking about

  • to develop a vision system.

  • So before we come up with a better algorithm,

  • we should provide our computer algorithms with the kind of data

  • that children were experiencing in their developmental years.

  • And once we realized this, I knew what we needed to do.

  • We need to collect a data set that has far more images

  • than we have ever used before in machine learning and computer

  • vision-- thousands of times larger

  • than the standard dataset that was being used at the time.

  • So together with my colleague Professor Kai

  • Li and student Jia Deng, we started this ImageNet project

  • back in 2007.

  • After three years of very hard work,

  • by 2009 the ImageNet project delivered a database

  • of 15 million images organized across 22,000 categories

  • of objects and things, indexed by everyday English words.

  • In quality and quantity, this was an unprecedented scale

  • for the field of computer vision and machine learning.

  • So more than ever we're now poised

  • to tackle the problem of object recognition using ImageNet.

  • This is the first takeaway message

  • I'm going to deliver today: much

  • of learning is about big data.

  • This is a child's perspective.

  • As it turned out, the wealth of information provided

  • by ImageNet was a perfect match for a particular class

  • of machine learning algorithms called the Convolutional Neural

  • Network pioneered by computer scientists

  • Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun,

  • back in the 1970s and 80s.

  • Just as the brain consists of billions of neurons,

  • a basic operating unit of the Convolutional Neural Network

  • is a neuron-like node that gets input from other nodes

  • and sends output to others.

  • Moreover, hundreds of thousands

  • of these neuron-like nodes are layered

  • together in a hierarchical fashion,

  • also similar to the brain.

  • This is a typical Convolutional Neural Network

  • model we use in our lab to train our object recognition

  • algorithm.

  • It consists of 24 million nodes, 140 million parameters,

  • and 15 billion connections.
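
As one illustration (not the actual model from the talk, which is vastly larger), the two ingredients just described can be sketched in plain NumPy: neuron-like nodes that take input from a small patch of nodes below and pass output upward, stacked in layers:

```python
import numpy as np

def conv2d(image, kernel):
    """One convolutional layer: each output node receives input from a
    small patch of nodes in the layer below (valid convolution)."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Nonlinearity applied at each neuron-like node."""
    return np.maximum(x, 0.0)

# Two layers stacked hierarchically, loosely echoing the visual cortex.
rng = np.random.default_rng(0)
image = rng.random((8, 8))                      # toy 8x8 "image"
k1 = rng.standard_normal((3, 3))                # first-layer filter
k2 = rng.standard_normal((3, 3))                # second-layer filter
layer1 = relu(conv2d(image, k1))                # 6x6 feature map
layer2 = relu(conv2d(layer1, k2))               # 4x4 feature map
print(layer2.shape)  # prints (4, 4)
```

A real network applies many filters per layer and learns their weights from data; this sketch only shows the layered, patch-connected structure the talk describes.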

  • With the massive data provided by ImageNet

  • and the modern computing hardware like CPUs and GPUs

  • to train this humongous model, the Convolutional Neural

  • Network algorithm blossomed in a way that no one had expected.

  • It became the winning architecture

  • for object recognition.

  • Here is what the computer tells us: the image contains a cat

  • and where the cat is.

  • Here is a boy and his teddy bear.

  • A dog on the beach with a person and a kite.

  • So far, what we have seen is teaching

  • computers to recognize objects.

  • This is like a young child learning

  • to utter the first few nouns.

  • It's a very impressive achievement,

  • but it's only in the beginning.

  • Children soon hit another developmental milestone.

  • And they begin to communicate in sentences and tell stories.

  • So instead of saying--

  • CHILD 1: That's a cat sitting in a bed.

  • FEI-FEI LI: Right, this is a three-year-old

  • telling us the story of the scene

  • instead of just labeling it as a cat.

  • Here's one more.

  • CHILD 2: Those are people.

  • They're going on a airplane.

  • That's a big airplane!

  • FEI-FEI LI: Very cute-- so to train

  • a computer to see a picture and generate a story,

  • the marriage between big data and machine learning algorithms

  • has to take another step, just like our brain integrates

  • vision and language.

  • We use a deep learning algorithm to learn

  • to connect the visual snippets with the words and phrases

  • to generate sentences.
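
The system in the talk learns this vision-to-language connection with a deep learning model; as a heavily simplified, hypothetical stand-in, the idea of stitching recognized visual snippets into a sentence can be shown with a hand-written template (no learning involved here at all):

```python
# Hypothetical sketch only: real captioning systems learn to map visual
# features to words. Here, (noun, optional adjective) snippets that a
# recognizer might emit are stitched into one sentence by a template.
def caption(snippets):
    """snippets: e.g. [("airplane", "large"), ("runway", None)]."""
    noun, adj = snippets[0]
    subject = f"a {adj} {noun}" if adj else f"a {noun}"
    rest = " and ".join(n for n, _ in snippets[1:])
    return f"{subject} on a {rest}." if rest else f"{subject}."

print(caption([("airplane", "large"), ("runway", None)]))
# prints "a large airplane on a runway."
```

The learned model replaces both the fixed template and the fixed vocabulary, which is why it can describe scenes it has never been hand-programmed for.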

  • Now I'm going to show you, for the first time,

  • what a computer would say when it sees a picture.

  • COMPUTER VOICE: A large airplane sitting

  • on top of an airport runway.

  • FEI-FEI LI: Not as cute, but still good.

  • COMPUTER VOICE: A man is standing next to an elephant.

  • FEI-FEI LI: So this is an algorithm we

  • developed to generate one sentence.

  • Recently, we've taken the storytelling algorithm a step

  • further and created a deep learning

  • model that can generate multiple sentences

  • and phrases describing the different regions of a picture.

  • Our algorithm is computationally very efficient,

  • so it can run almost in real time.

  • Here I'm showing you the algorithm generating

  • regions and region descriptions for every frame of this video.

  • So we have successfully used neural network algorithms

  • to train computer vision models to begin telling

  • the story of the visual world.

  • This is a brain-inspired perspective.

  • With the availability of data and the blossoming

  • of the powerful neural network models,

  • we begin to see unprecedented advances

  • in all areas of computer vision,

  • both in my own lab and across the field.

  • Now let me show you a few more examples

  • and their potential applications.

  • Collaborating with Google's YouTube team,

  • we developed a deep learning algorithm

  • that can classify hundreds of sports types.

  • We hope one day this technology can help us to manage,

  • index, and search massive amounts of photos and videos

  • in big data repositories.

  • Working with a European train station,

  • we used hundreds of computer vision sensors

  • to help observe and track the behaviors of millions

  • of travelers and customers.

  • This provided the train station

  • with invaluable customer analytics

  • and helped it optimize the use of space.

  • Furthermore, we developed a reinforcement learning

  • algorithm and a deep learning model to perform human activity

  • understanding in an extremely efficient manner,

  • achieving the same results as state-of-the-art algorithms

  • in action detection using only 2% of the video frames.

  • In a different work, we used depth sensors

  • to learn about human movements in great detail.

  • We collaborated with the Stanford hospitals

  • to deploy this technology to help the hospitals

  • improve hygiene and workflow practices.

  • And in this work, we trained a computer vision algorithm

  • that can do better object recognition than humans--

  • at least some of us-- by recognizing 3,000 types of cars

  • by make, model, and year.

  • We applied this to 50 million Google Street View

  • images across 200 American cities and learned

  • very interesting social statistics,

  • like a visual census.

  • We learned that the average car price

  • can correlate very well with average household

  • incomes in cities.

  • Or they can correlate very well with crime rates in cities.

  • Or even voting patterns-- let's wait till later this year.
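
The "correlate very well" claims above are ordinary Pearson correlations across cities. As a sketch with made-up, purely illustrative per-city numbers (not the study's data), the computation looks like this:

```python
import numpy as np

# Hypothetical figures for five cities: average detected car price
# vs. average household income. np.corrcoef returns the Pearson
# correlation matrix; entry [0, 1] is the price-income correlation.
car_price = np.array([18000, 22000, 25000, 31000, 40000])
income = np.array([45000, 52000, 60000, 71000, 90000])
r = np.corrcoef(car_price, income)[0, 1]
print(r)  # close to 1 for these made-up values
```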

  • So in short, as a technologist, nothing

  • excites me more than seeing the potential of computer vision

  • algorithms to solve real-world problems.

  • This is a technologist's perspective.