
If there’s one deep net that has completely dominated the machine vision space in recent years, it’s certainly the convolutional neural net, or CNN. These nets are so influential that they’ve made Deep Learning one of the hottest topics in AI today. But they can be tricky to understand, so let’s take a closer look and see how they work.

CNNs were pioneered by Yann LeCun of New York University, who also serves as the director of Facebook's AI group. It is currently believed that Facebook uses a CNN for its facial recognition software.

A convolutional net has been the go-to solution for machine vision projects in the last few years. Early in 2015, after a series of breakthroughs by Microsoft, Google, and Baidu, a machine was able to beat a human at an object recognition challenge for the first time in the history of AI.

It’s hard to mention a CNN without touching on the ImageNet challenge. ImageNet is a project that was inspired by the growing need for high-quality data in the image processing space. Every year, the top Deep Learning teams in the world compete with each other to create the best possible object recognition software. Going back to 2012, when Geoff Hinton’s team took first place in the challenge, every single winner has used a convolutional net as their model. This isn’t surprising, since the error rate of image detection tasks has dropped significantly with CNNs, as seen in this image.

Have you ever struggled while trying to learn about CNNs? If so, please comment and share your experiences.

We’ll keep our discussion of CNNs high level, but if you’re inclined to learn about the math, be sure to check out Andrej Karpathy’s amazing CS231n course notes on these nets.

There are many component layers to a CNN, and we will explain them one at a time. Let’s start with an analogy that will help describe the first component, which is the convolutional layer.

Imagine that we have a wall, which will represent a digital image. Also imagine that we have a series of flashlights shining at the wall, creating a group of overlapping circles. The purpose of these flashlights is to seek out a certain pattern in the image, like an edge or a color contrast for example. Each flashlight looks for the exact same pattern as all the others, but they all search in a different section of the image, defined by the fixed region created by the circle of light. When combined together, the flashlights form what’s called a filter. A filter is able to determine if the given pattern occurs in the image, and in what regions. What you see in this example is an 8x6 grid of lights, which is all considered to be one filter.
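To make the analogy concrete, here is a minimal NumPy sketch of a single filter "scanning" an image: one small weight matrix is compared against every patch of the image, producing one response per position, just like the grid of flashlights. The image values, the filter values, and the sizes are illustrative assumptions, not details from the video.

```python
import numpy as np

# A toy grayscale "wall" (the digital image); values are made up.
image = np.random.rand(8, 10)

# One 3x3 filter: every "flashlight" applies these same weights.
filter_weights = np.array([[1., 0., -1.],
                           [1., 0., -1.],
                           [1., 0., -1.]])  # a crude vertical-edge pattern

fh, fw = filter_weights.shape
out_h = image.shape[0] - fh + 1
out_w = image.shape[1] - fw + 1
responses = np.zeros((out_h, out_w))

# Slide the filter over the image: each position is one "flashlight".
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + fh, j:j + fw]
        responses[i, j] = np.sum(patch * filter_weights)

print(responses.shape)  # (6, 8): one response per flashlight position, an 8x6 grid
```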

Now let’s take a look from the top. In practice, flashlights from multiple different filters will all be shining at the same spots in parallel, simultaneously detecting a wide array of patterns. In this example, we have four filters all shining at the wall, all looking for a different pattern. So this particular convolutional layer is an 8x6x4, 3-dimensional grid of these flashlights.
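Continuing the sketch above, a convolutional layer with four filters simply repeats the same scan with four different weight matrices and stacks the results into a 3-dimensional output volume. Again, the filter values and sizes below are made up for illustration.

```python
import numpy as np

def scan(image, filter_weights):
    """Slide one filter over the image and return its grid of responses."""
    fh, fw = filter_weights.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filter_weights)
    return out

image = np.random.rand(8, 10)
filters = [np.random.randn(3, 3) for _ in range(4)]  # four different patterns

# Stack the four 6x8 response grids into one volume of flashlights.
layer_output = np.stack([scan(image, f) for f in filters])
print(layer_output.shape)  # (4, 6, 8): filters x height x width
```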

Now let’s connect the dots of our explanation. First, why is it called a convolutional net? The net uses the technical operation of convolution to search for a particular pattern. While the exact definition of convolution is beyond the scope of this video, to keep things simple, just think of it as the process of filtering through the image for a specific pattern. One important note is that the weights and biases of this layer affect how this operation is performed: tweaking these numbers impacts the effectiveness of the filtering process.
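As a hedged illustration of that last point, the same scan with different weights picks out a different pattern, and the bias simply shifts every response. The specific filters below (a vertical-edge detector and its horizontal counterpart) are standard textbook examples, not values from the video.

```python
import numpy as np

def scan(image, w, bias=0.0):
    fh, fw = w.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * w) + bias
    return out

# A tiny image with a bright vertical stripe down the middle.
image = np.zeros((5, 5))
image[:, 2] = 1.0

vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])
horizontal_edge = vertical_edge.T

print(np.abs(scan(image, vertical_edge)).max())    # 3.0: strong response, pattern present
print(np.abs(scan(image, horizontal_edge)).max())  # 0.0: no response, pattern absent
```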

Second, each flashlight represents a neuron in the CNN. Typically, neurons in a layer activate or fire. In the convolutional layer, on the other hand, neurons perform this convolution operation. We're going to draw a box around one set of flashlights to make things look a bit more organized.

Third, unlike the nets we've seen thus far, where every neuron in a layer is connected to every neuron in the adjacent layers, a CNN has the flashlight structure: each neuron is only connected to the input neurons it "shines" upon.
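To see what "only connected to the input neurons it shines upon" means in practice, the snippet below lists the input pixels that a single output neuron depends on, under the same assumed 3x3 filter and stride of 1 used in the earlier sketches.

```python
# For a 3x3 filter with stride 1, the output neuron at position (i, j)
# is connected only to the 3x3 patch of input pixels starting at (i, j).
def receptive_field(i, j, filter_size=3):
    return [(i + di, j + dj)
            for di in range(filter_size)
            for dj in range(filter_size)]

print(receptive_field(2, 4))
# [(2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6), (4, 4), (4, 5), (4, 6)]
# Nine input pixels -- not the whole image, unlike a fully connected layer.
```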

The neurons in a given filter share the same weight and bias parameters. This means that, anywhere on the filter, a given neuron is connected to the same number of input neurons and has the same weights and biases. This is what allows the filter to look for the same pattern in different sections of the image. By arranging these neurons in the same structure as the flashlight grid, we ensure that the entire image is scanned.
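Weight sharing is also why convolutional layers are so economical with parameters. Here is a small back-of-the-envelope comparison, using the same assumed 8x6x4 layer and 3x3 filters from earlier; the fully connected figure assumes a 28x28 input purely for illustration.

```python
# Convolutional layer: 4 filters, each with 3*3 shared weights plus 1 bias.
conv_params = 4 * (3 * 3 + 1)
print(conv_params)  # 40 parameters, regardless of image size

# Fully connected alternative: every one of the 8*6*4 output neurons
# gets its own weight for every pixel of a 28x28 input, plus a bias.
fc_params = (8 * 6 * 4) * (28 * 28 + 1)
print(fc_params)  # 150720 parameters
```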

The next two layers are ReLU and pooling, both of which help to build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires, like in other nets. The activation used is called ReLU, or rectified linear unit. CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue. For reasons that depend on the mathematical definition of ReLU, the gradient is held more or less constant at every layer of the net. So the ReLU activation allows the net to be properly trained, without harmful slowdowns in the crucial early layers.
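For reference, ReLU itself is just max(0, x), and its derivative is 1 for any positive input, which is the informal reason the gradient does not keep shrinking as it is propagated back through many layers. A minimal sketch, with the sigmoid shown only as a point of comparison:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 for positive inputs, 0 otherwise

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)           # never larger than 0.25

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))          # [0.  0.  0.5 2. ]
print(relu_grad(x))     # [0. 0. 1. 1.]
print(sigmoid_grad(x))  # all <= 0.25, so repeated layers keep shrinking the gradient
```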

The pooling layer is used for dimensionality reduction. CNNs tile multiple instances of convolutional layers and ReLU layers together in a sequence, in order to build more and more complex patterns. The problem with this is that the number of possible patterns becomes exceedingly large. By introducing pooling layers, we ensure that the net focuses on only the most relevant patterns discovered by convolution and ReLU. This helps limit both the memory and processing requirements for running a CNN.
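A common way to do this is max pooling, which keeps only the strongest response in each small window. The sketch below assumes 2x2 windows with a stride of 2, which cuts each spatial dimension in half; the window size is an assumption, not something specified in the video.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return pooled

fmap = np.arange(36, dtype=float).reshape(6, 6)  # a made-up 6x6 feature map
print(max_pool_2x2(fmap).shape)  # (3, 3): only a quarter of the values remain
```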

Together, these three layers can discover a host of complex patterns, but the net will have no understanding of what these patterns mean. So a fully connected layer is attached to the end of the net in order to equip the net with the ability to classify data samples.
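In practice this means flattening the final stack of feature maps into a vector and feeding it through an ordinary dense layer whose outputs become class scores. A hedged NumPy sketch with made-up sizes (a 4x6x8 feature volume and ten classes):

```python
import numpy as np

features = np.random.rand(4, 6, 8)           # output of the last pooling layer
flat = features.reshape(-1)                  # 192-dimensional vector

num_classes = 10
W = np.random.randn(num_classes, flat.size)  # fully connected weights
b = np.zeros(num_classes)

scores = W @ flat + b                               # one score per class
exp_scores = np.exp(scores - scores.max())          # softmax (shifted for stability)
probs = exp_scores / exp_scores.sum()
print(probs.argmax())                               # predicted class index
```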

Let’s recap the major components of a CNN. A typical deep CNN has three sets of layers: a convolutional layer, a ReLU layer, and a pooling layer, all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification. Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs.
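Putting the recap together, here is what such a stack might look like as a hedged PyTorch sketch. The layer sizes, the number of repetitions, and the assumption of 28x28 single-channel inputs with 10 classes are all illustrative choices, not details from the video.

```python
import torch
import torch.nn as nn

# Conv -> ReLU -> Pool, repeated twice, then fully connected layers for classification.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1 input channel, 8 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 64),
    nn.ReLU(),
    nn.Linear(64, 10),                           # 10 class scores
)

x = torch.randn(1, 1, 28, 28)  # one fake grayscale image
print(model(x).shape)          # torch.Size([1, 10])
```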

Despite the power of CNNs, these nets have one drawback. Since they are a supervised learning method, they require a large set of labelled data for training, which can be challenging to obtain in a real-world application. In the next video, we’ll shift our attention to another important deep learning model: the Recurrent Net.
