RUSS ALTMAN: It is now my pleasure to introduce my colleague and tireless co-organizer, Dr. Fei-Fei Li. She's an associate professor of Computer Science and Psychology. She's director of the AI Lab here at Stanford. And she's done breakthrough research in human vision, high-level visual recognition, and computational neuroscience, some of which I think we might hear about now-- Fei-Fei.

[APPLAUSE]

FEI-FEI LI: Good evening, everyone. It's quite an honor to be here. And I'm going to share with you some of my recent work in visual intelligence. So we're going to begin 543 million years ago. Simple organisms lived in the vast ocean of this Precambrian age. And they floated around waiting for food to come by or becoming someone else's food. So life was very simple and so was the animal kingdom. There were only a few animal species around. And then something happened. The first trilobites started to develop eyes and life just changed forever after that. Suddenly animals could go seek food. Prey had to run from predators. And the number of animal species just exploded in an exceedingly short period of time. Evolutionary biologists call this period the Cambrian Explosion, or the big bang of evolution, and attribute this animal speciation mainly to the development of vision. So ever since then, vision has played a very important role in animals for them to survive, to seek food, to navigate, to manipulate, and so on. And the same is true for humans. We use vision to live, to work, to communicate, and to understand this world. In fact, after 540 million years of evolution, the visual system is the biggest sensory system in our brain. And more than half of the brain's neurons are involved in vision. So while animals have seen the light of the world since 540 million years ago, our machines and computers are still very much in the dark age.
We have security cameras everywhere but they don't alert us when a child is drowning in a swimming pool. Hundreds of hours of video are uploaded every minute to the YouTube servers, yet we do not have the technology to tag and recognize the contents. We have drones flying over massive lands taking an enormous amount of imagery, but we do not have a method or algorithm to understand the landscape of the earth. So in short, as a society, we're still pretty much collectively blind because our smartest machines and computers are blind. So as computer vision scientists, we seek to develop artificial intelligence algorithms that can learn about the visual world and recognize the contents in the images and the videos. We've got a daunting task to shine light on our digital world and we don't have 540 million years to do this.

So the first step towards this goal is to recognize objects because they are the building blocks of our visual world. In the simplest terms, imagine this process of teaching computers to recognize objects by first showing them a few training images of a particular object-- let's say a cat. And then we design a mathematical model that can learn from these training images. How hard could this be? Humans do this effortlessly. So that's what we tried at the beginning. In a straightforward way, we tried to express objects by designing parts and the configurations of their parts, such as using simple geometric shapes to define a cat model. Well, there are lots of different cats. So for this one, we cannot use our original model. We have to do another model. Well, what about these cats? So now you get the idea. Even something as simple as a household pet can pose an infinite number of variations for us to model. And that's just one object. But this is what many of us were doing at that time.
We kept designing and tuning our algorithms, waiting for that magical algorithm to be able to model all the variations of an object using just a few training images. But about nine years ago, a very profound but simple observation changed my thinking. This is not how children learn. We don't tell kids how to see. They do it by experiencing the real world and by experiencing real-life examples. If you consider a child's eyes as a pair of biological cameras, they take a picture every 200 milliseconds. So by age three, a child would have seen hundreds of millions of images. And that's the amount of data we're talking about to develop a vision system. So before we come up with a better algorithm, we should provide our computer algorithms the kind of data that children were experiencing in their developmental years. And once we realized this, we knew what we needed to do. We needed to collect a dataset that has far more images than we had ever used before in machine learning and computer vision-- thousands of times larger than the standard dataset that was being used at the time. So together with my colleague Professor Kai Li and student Jia Deng, we started this ImageNet project back in 2007. After three years of very hard work, by 2009 the ImageNet project delivered a database of 15 million images organized across 22,000 categories of objects and things organized by everyday English words. In quality and quantity, this was an unprecedented scale for the field of computer vision and machine learning. So more than ever we're now poised to tackle the problem of object recognition using ImageNet. This is the first take of the message I'm going to deliver today-- much of learning is about big data. This is a child's perspective.
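As a rough check on the "hundreds of millions of images" figure, the arithmetic behind one picture every 200 milliseconds can be sketched in a few lines of Python. The 200 ms interval and the age of three come from the talk; the 12 waking hours per day is an assumption added here for illustration, and the function name `images_seen` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def images_seen(years, interval_ms=200, waking_hours_per_day=12):
    """Estimate how many 'pictures' a pair of eyes takes in `years` years.

    interval_ms: time between pictures (200 ms in the talk).
    waking_hours_per_day: ASSUMPTION, not stated in the talk.
    """
    total_ms_awake = years * SECONDS_PER_YEAR * 1000 * waking_hours_per_day // 24
    return total_ms_awake // interval_ms


if __name__ == "__main__":
    # With eyes open half of each day, a three-year-old lands in the
    # hundreds of millions; eyes open around the clock roughly doubles that.
    print(images_seen(3))
    print(images_seen(3, waking_hours_per_day=24))
```

Either assumption about waking hours keeps the estimate in the same order of magnitude as the figure quoted in the talk, which is well above the few training images per object that earlier models relied on.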
As it turned out, the wealth of information provided by ImageNet was a perfect match for a particular class of machine learning algorithms called the Convolutional Neural Network, pioneered by computer scientists Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun back in the 1970s and '80s. Just as the brain consists of billions of neurons, a basic operating unit of the Convolutional Neural Network is a neuron-like node that gets input from other nodes and sends output to others. Moreover, hundreds and thousands of these neuron-like nodes are layered together in a hierarchical fashion, also similar to the brain. This is a typical Convolutional Neural Network model we use in our lab to train our object recognition algorithm. It consists of 24 million nodes, 140 million parameters, and 15 billion connections. With the massive data provided by ImageNet and modern computing hardware like CPUs and GPUs to train this humongous model, the Convolutional Neural Network algorithm blossomed in a way that no one had expected. It became the winning architecture for object recognition. Here is what the computer tells us: the image contains a cat, and where the cat is. Here is a boy and his teddy bear. A dog on the beach with a person and a kite. So far, what we have seen is teaching computers to recognize objects. This is like a young child learning to utter the first few nouns. It's a very impressive achievement, but it's only the beginning. Children soon hit another developmental milestone. And they begin to communicate in sentences and tell stories. So instead of saying--

CHILD 1: That's a cat sitting in a bed.

FEI-FEI LI: Right, this is a three-year-old telling us the story of the scene instead of just labeling it as a cat. Here's one more.

CHILD 2: Those are people. They're going on a airplane. That's a big airplane!
FEI-FEI LI: Very cute. So to train a computer to see a picture and generate a story, the marriage between big data and machine learning algorithms has to take another step, just like our brain integrates vision and language. We use a deep learning algorithm to learn to connect the visual snippets with the words and phrases to generate sentences. Now I'm going to show you what a computer would say for the first time when it sees a picture.

COMPUTER VOICE: A large airplane sitting on top of an airport runway.

FEI-FEI LI: Not as cute, but still good.

COMPUTER VOICE: A man is standing next to an elephant.

FEI-FEI LI: So this is an algorithm we did to generate one sentence. Recently, we've taken the storytelling algorithm a step further and created a deep learning model that can generate multiple sentences and phrases for a picture. Our algorithm is computationally very efficient, so it can process almost in real time. Here I'm showing you the algorithm generating regions and region descriptions for every frame of this video. So we have successfully used neural network algorithms to train computer vision models to begin telling the story of the visual world. This is a brain-inspired perspective.

With the availability of data and the blossoming of powerful neural network models, we begin to see unprecedented advances in all areas of computer vision, both in my own lab and in our field. Now let me show you a few more examples and their potential applications. Collaborating with Google's YouTube team, we developed a deep learning algorithm that can classify hundreds of sports types. We hope one day this technology can help us to manage, index, and search massive amounts of photos and videos in big data repositories. Working with a European train station, we used hundreds of computer vision sensors to help observe and track the behaviors of millions of travelers and customers.
This provided invaluable information for the train station to collect data analytics on their customers and to optimize the use of space. Furthermore, we developed a reinforcement learning algorithm and deep learning model to process human activity understanding in an extremely efficient manner, achieving the same results as a state-of-the-art algorithm in action detection using only 2% of the video frames. In a different work, we used depth sensors to learn about human movements in great detail. We collaborate with the Stanford hospitals to deploy this technology to help the hospital improve health hygiene and workflow practices. And in this work, we trained a computer vision algorithm that can do better object recognition than humans-- at least some of us-- by recognizing 3,000 types of cars by make, model, and year. We applied this to 50 million Google Street View images over 200 American cities and learned very interesting social statistics, like a visual census. We learned that the average car price can correlate very well with average household incomes in cities. Or it can correlate very well with crime rates in cities. Or even voting patterns-- let's wait till later this year. So in short, as a technologist, nothing excites me more than seeing the potential of computer vision algorithms to solve real-world problems. This is a technologist's perspective.
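To make the "visual census" idea concrete, here is a minimal sketch of how one might measure how well average car price tracks average household income across cities, using the Pearson correlation coefficient. The talk does not say which statistic was used, so Pearson is an assumption, and the five city values below are made-up placeholders, not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL placeholder values, one entry per city (not from the study).
avg_car_price = [12_000, 15_000, 18_500, 22_000, 30_000]
avg_household_income = [38_000, 45_000, 52_000, 61_000, 80_000]

r = pearson(avg_car_price, avg_household_income)

if __name__ == "__main__":
    print(f"correlation between car price and income: {r:.3f}")
```

A value of `r` close to 1 would mean that cities where the recognized cars are more expensive also tend to have higher household incomes; the same function could be pointed at crime rates or voting patterns in place of income.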