Lesson 1: Practical Deep Learning for Coders

Rachel and I started fast.ai with the idea of making neural networks uncool again.
It is a grand plan, to be sure, because they are apparently terribly cool.
But really there are some things that we want to see improve and that's why we're doing
this course.
We're actually not making any money out of this course.
We're donating our fees both to the diversity fellowships that we're running and also to
the Fred Hollows Foundation.
I would like to briefly give a quick pitch to the Fred Hollows Foundation (for those
of you who aren't aware of it), because as you know, deep-learning is fantastic for computer
vision; it's basically allowing computers to see for the first time.
What you might not realize is that there are something like 3 or 4 million people in the
world who can't see because they have something called "cataract blindness".
Cataract blindness can be cured for $25 per eye and actually the group of people who got
that price down from thousands of dollars to $25 was Fred Hollows, who was the Australian
of the Year some years ago.
He's passed away now, but his legacy is in this foundation where if you donate $25, you
are giving somebody their sight back.
So, as you learn to teach computers how to see, Rachel and I are also donating
our fees from this to helping humans to see.
So we think that's a nice little touch.
So we're doing this both to help the Fred Hollows Foundation but more importantly to
help something we care a lot about, which is making deep-learning more accessible.
It is currently terribly exclusive.
As I'm sure you've noticed, resources for teaching it tend to be quite mathematically
intensive -- they really seem to be focused on a certain kind of ivory-tower-type audience,
so we're trying to create training and examples which are for people who are not machine-learning or math
experts, dealing with small data sets, and giving you models and applications you can develop quickly.
Today we're going to create a real useful piece of deep-learning code in seven lines
of code.
We want to get to the point where it is easy for domain experts to work with deep-learning.
There are a lot of domain experts here -- whether you're working with getting satellites in
the air, or whether you're working with analyzing the results of chemical studies, or whether
you're analyzing fraud at a bank -- all those people are here in this audience.
You are domain experts that we want to enable to use deep-learning.
At this stage, the audience for this course is coders because that's as far as we think
we can get at this point.
We don't need you to be a math expert, but we do need you to be coders.
I know that all of you have been told of that prerequisite.
We do hope that with your help we can get to the point where non-coders will also be
able to participate in it.
The reason why we care about this is that there are problems like improving agricultural
yields in the developing world, or making medical diagnostics accessible to folks that
don't have them or so forth.
These are things that can be solved with deep learning.
But they are not going to be solved by people who are at these kind of more ivory tower
firms on the whole because they are not really that familiar with these problems.
The people who are familiar with these problems are the people who work with them every day.
So for example, I've had a lot to do with these kinds of people at the World Economic
Forum, I know people who are trying to help cure TB and malaria, I know people who are
trying to help with agricultural issues in the developing world and so forth.
These are all people who want to be using deep-learning for things like analyzing crop
imagery from satellites, or my most recent start-up, which was analyzing radiological
studies using deep-learning to deal with things like the fact that in the entire continent
of Africa there are only seven pediatric radiologists.
So most kids in Africa (in fact, in most countries there, all kids) have no access to any radiologists and
have no access to any kind of modern image-based medical diagnostics.
So these are the reasons that we're creating and running this course.
We hope that the feel of this community is going to be very different from the feel
of deep-learning communities before, which have been all about "Let's trim 0.01%
off this academic benchmark."
This is going to be all about "Let's do shit that matters to people as quickly as possible."
[Time: 5 minute mark]
Sometimes to do that we're going to have to push the state-of-the-art of the research.
And where that happens, we won't be afraid to show you the state-of-the-art of the research.
The idea is that by the end of Part 1 of this, you will be able to use all of the current
best practices in the most important deep-learning applications.
If you stick around for Part 2, you'll be at the cutting edge of research in most of
the most important research areas.
So, we are not dumbing this down; we're just re-focusing it.
The reason why we're excited about this is that we have now the three pieces of this
universal learning machine.
We now have the three critical pieces -- an infinitely flexible function, all-purpose
parameter fitting, and a way of doing both that is fast and scalable.
The neural network is the function.
We are going to learn exactly how neural networks work.
But the important thing about a neural network is that they are universal approximation machines.
There's a mathematical proof, the Universal Approximation Theorem, that we're going to
learn all about which tells us that this kind of mathematical function is capable of handling
any kind of problem we can throw at it.
Whether that mathematical function is "How do I translate English into Hungarian", or
whether that mathematical function is "How do I recognize pictures of cats", or whether
that mathematical function is "How do I identify unhealthy crops".
It can handle any of these things.
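To make the "universal approximation" idea a little more concrete, here is a minimal sketch that is not from the lecture. It assumes a toy target function (x squared on the interval 0 to 1) and hypothetical helper names, and it sets the weights of a one-hidden-layer ReLU network by hand rather than learning them. The point is only that adding more hidden units lets the same kind of function follow the target more closely.

```python
def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return z if z > 0.0 else 0.0


def build_relu_net(f, n_units):
    """Build g(x) = f(0) + sum_i w_i * relu(x - knot_i), a one-hidden-layer
    ReLU network on [0, 1] whose weights are chosen by hand (not trained)
    so that g is the piecewise-linear interpolation of f at the knots."""
    h = 1.0 / n_units
    knots = [i * h for i in range(n_units)]
    # Slope of f over each segment between neighbouring knots.
    slopes = [(f(k + h) - f(k)) / h for k in knots]
    # Each hidden unit contributes the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    bias = f(0.0)

    def g(x):
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return g


def max_error(f, g, samples=1001):
    """Largest absolute difference between f and g on an even grid over [0, 1]."""
    points = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in points)


def target(x):
    # Toy target chosen for illustration only.
    return x * x


err_10 = max_error(target, build_relu_net(target, 10))
err_100 = max_error(target, build_relu_net(target, 100))
print("max error with 10 units:", err_10)
print("max error with 100 units:", err_100)
```

In practice nobody sets these weights by hand; the next piece, parameter fitting, is what finds them from data.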
So with that mathematical function, then the second thing you need is some way to fit the
parameters of that function to your particular need.
And there's a very simple way to do that, called "gradient descent" and in particular,
something called "backward propagation" or "back-prop" which we will learn all about
in this lesson and the next lesson.
The important thing is though that these two pieces together allow us to start with a function
that is in theory capable of doing everything and turn it into a function that is in practice
capable of doing whatever you want to do, as long as you have data that shows examples
of what you want to do.
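As a rough illustration of fitting parameters from examples, here is a minimal gradient descent sketch. It is not the course's code: the toy data (points on the line y = 2x + 1), the learning rate and the step count are all assumptions, and the gradients of a two-parameter function are written out by hand. Back-prop is the technique that computes these same gradients automatically for a neural network.

```python
# Toy examples of "what we want the function to do": points on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of mean((a*x + b - y)**2) with respect to a and b.
        grad_a = sum(2.0 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        # Step a little way downhill along each parameter.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


a, b = fit_line(xs, ys)
print("fitted slope:", a, "fitted intercept:", b)
```

The loop never looks at the formula that generated the data; it only sees the examples, which is the property the lecture is pointing at.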
The third piece, which has been missing until very recently, is being able to do this in
a way that actually works with the amount of data that you have in the time you have
available.
And this has all changed thanks particularly to GPUs.
So GPUs are Graphics Processing Units, also called "video cards" (that's kind of an older
term now), also called "graphics cards".
And these are devices inside your computer which were originally designed to play computer
games.
So it's kind of like when you're looking at this alien from the left-hand side and there's
light coming from above, what pixel color do I need for each place.
That's basically a whole bunch of linear algebra operations, a whole bunch of matrix products.
It turns out that those are the same operations we need for deep-learning.
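Here is a small sketch of what "a matrix product" means for a neural network layer, written in plain Python so that nothing is hidden. The numbers and names are made up for illustration; a GPU does the same multiply-and-add work on far larger matrices, many elements at a time.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols)."""
    inner = len(b)
    cols = len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


# Two input examples with three features each (illustrative values).
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

# Layer weights mapping three features to two outputs (illustrative values).
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]

outputs = matmul(inputs, weights)
print(outputs)
```

Every output element is an independent sum of products, which is why this kind of work spreads so well across the many small cores of a graphics card.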
So because of the massive amount of money in the gaming industry that was thrown at
this problem, we now have incredibly cheap, incredibly powerful cards for figuring out
what aliens look like.
And we can now use these, therefore, to figure out medical diagnostics in Africa.
So, it's a nice, handy little side-effect.
GPUs are in all of your computers, but not all of your computers are suitable for deep-learning.
And the reason is that programming a GPU to do deep-learning really requires a particular
kind of GPU, and in practice at the moment, it really requires a GPU from Nvidia, because
Nvidia GPUs support a kind of programming called CUDA (which we will be learning about).
There are other GPUs that do support deep-learning, but they're a bit of a pain, they're not very
widely used.
And so one of the things that we're going to be doing is making sure that all of you
guys have access to an Nvidia GPU.
The good news is that in the last month (I think) Amazon has made available good-quality
Nvidia GPUs for everybody for the first time.
They call them very excitingly their P2 instances.
So I've spent the last month making sure that it's really easy to use these new P2 instances.
I've given you all access to a script to do that.
Unfortunately, we're still at the point where they don't trust people to use these correctly,
so you have to ask permission to use these P2 instances.
[Time: 10 minute mark]
The Data Institute folks, for anybody who does not have an AWS P2 instance or their
own GPU server, they are going to collect all of your AWS IDs, and they have a contact
at Amazon who will go through and get them all approved.
They haven't made any promises, they've just said they will do what they can.
They are aware of how urgent that is, so if you email your AWS ID to Mindy, she will get
that organized.
And we'll come back and look at AWS in more detail very shortly.
The other thing that I have done is on the wiki I have added some information about getting
set up, Installation.
There is actually quite an interesting option called OVH.
I'm sure by the time that this is a MOOC there will be a lot more, but this is the only company
I've come across who will give you a by-the-month server with decent deep-learning graphics
cards on it, and it's only $200.
To give you a sense of how crazily cheap that is, if you go to their page for GPU servers,
you'll see that this GTX970 is $195 per month and their next cheapest is $2000 a month.
It just so happens that this GTX970 is ridiculously cheap for how good it is at deep-learning.
The reason is that deep-learning uses single-precision arithmetic -- it uses less accurate arithmetic.
These higher-end cards are designed for things like fluid simulations, tracking nuclear bombs
and stuff like that, that require double-precision arithmetic.
So it turns out these GTX970s are only good for two things, games and deep-learning.
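To see what "less accurate arithmetic" means, here is a small sketch using only Python's standard `struct` module. Python floats are double precision, so the helper below (a hypothetical name) rounds a value to the nearest single-precision number by packing it into 4 bytes and unpacking it again.

```python
import struct


def to_single(x):
    """Round a double-precision Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1
single = to_single(value)
rounding_error = abs(single - value)

bytes_single = struct.calcsize("f")  # storage for one single-precision number
bytes_double = struct.calcsize("d")  # storage for one double-precision number

print("double:", repr(value))
print("single:", repr(single))
print("difference:", rounding_error)
print("bytes per number:", bytes_single, "vs", bytes_double)
```

The single-precision value is slightly off, which is usually tolerable when fitting a neural network, and each number takes half the memory.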
So the fact that you can get one of these things which has got two GTX970s in it is
a really good deal.
So one of the things you might consider doing in your team is maybe sharing the cost of
one of these things.
$200 per month is pretty good compared to worrying about starting and stopping your
90 cent per hour AWS instance, particularly if AWS takes a while to say yes.
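A quick piece of arithmetic on the two prices quoted above, as a sketch: how many hours of the 90-cent instance cost the same as the $200 monthly server. The 30-day month is an assumption, and real bills also include storage and data transfer, which this ignores.

```python
MONTHLY_SERVER_COST = 200.0   # dollars per month, as quoted in the lecture
HOURLY_INSTANCE_COST = 0.90   # dollars per hour, as quoted in the lecture
DAYS_PER_MONTH = 30           # assumption for the per-day figure


def break_even_hours(monthly_cost, hourly_cost):
    """Hours of hourly-billed use per month that cost the same as the flat fee."""
    return monthly_cost / hourly_cost


hours = break_even_hours(MONTHLY_SERVER_COST, HOURLY_INSTANCE_COST)
hours_per_day = hours / DAYS_PER_MONTH

print("break-even hours per month:", round(hours, 1))
print("break-even hours per day:", round(hours_per_day, 1))
```

Below that usage, starting and stopping the hourly instance is cheaper; above it, the flat monthly server wins.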
How many of you people have used AWS before?
Maybe a third or a half.
AWS is Amazon Web Services.
I'm sure most of you, if not all of you, have heard of it.
It's basically Amazon making their entire back-end infrastructure available to everybody
else to use.
Rather than calling it a server, you get something they call an instance.
You can think of it as basically being the same thing.
It's a little computer that you get to use.
In fact, not necessarily little.
Some of their instances cost $14 or $15 an hour and give you like 8 or 16 graphics cards
and dozens of CPUs and hundreds of gigabytes of RAM.
The cool thing about AWS is that you can do a lot of work on their free instance.
You can get a free instance called a T2.micro and
you can get things set up and working on a really small dataset and then you can switch
it across if you want to then run it on a big dataset, switch it across to one of these
expensive things and have it run and finish in an hour or two.
So that's one of the things that I really like about AWS.
Microsoft also has something a lot like AWS called Azure.
Unfortunately, their GPU instances are not yet publicly available.
I've reached out to Microsoft to see if we can get access to those as well, and I'll
let you know if we hear back from them.
One of the things that Rachel has done today is to start jotting down some of the common
problems that people have found with their AWS installs.
Getting AWS set up is a bit of a pain, so we've created a script that basically will
do everything for you.
But the nice thing is that this script is very easy for you to have a look at and see
what's going on, so over time you can kind of get a sense of how AWS works.
Behind the scenes, the script is using the AWS command-line interface, or CLI, which we've given you instructions
on how to install.
[Time: 15 minute mark]
As well as using the CLI, you can also go to console.aws.amazon.com and use this graphical
interface.
In general, I try to avoid using this graphical interface because everything takes so much
longer and it's so hard to get things to work repeatedly.
But it can be nice to kind of look around and see how things are put together.
Again, we're going to come back and see a lot more about how to use the graphical interface
here, as well as how to create and use scripts.
So these are some of the pieces that we want to show you.
I wanted to talk a bit more before we go into more detail about some of the interesting
things that we've seen happening in deep-learning recently.
And perhaps the thing that I've found most fascinating recently was when one of the leading
folks at Google Brain presented this at a conference at Stanford, which showed the use
of deep-learning at Google.
And you can see from this that from just 2012 to today, or maybe two months ago, it's gone
from nothing to over 2500 projects.
Now the reason I find this interesting is that this is what is going to start happening
to every organization and every industry over the next few months and few years.
So they've kind of described how deep-learning is getting used pretty much everywhere at Google and you
can imagine probably if they redid this now, two months later, it's probably going to be
somewhere up here.
So we've kind of felt that it would be great to kind of kick-start lots of other organizations
to start going up this ramp.
That's another kind of reason we're doing this.
I really like looking at applications and we started seeing some examples of applications
built by deep-learning amateurs -- this is an example of it.
What these guys did is (they're not machine-learning or deep-learning experts) - they downloaded
a copy of Caffe, they ran a pre-existing model.
This is what we're going to learn to do today.
We're going to run a pre-existing model and use the features from that model to do something
interesting.
In their case, the thing that they were doing that was interesting was to take data that
they already had, because they are skin lesion people and analyze skin lesions.
These are the different kind of skin lesions that you can have.
They found, for example, that the previous best for finding this particular kind of skin
lesion was 15.6% accuracy.
When they did this off-the-shelf Caffe pre-existing model with a simple linear thing on top, they
quadrupled it to 60%.
Often when you take a deep-learning model and use the very simple techniques we'll learn
today, you can get extraordinary upticks compared to non-deep-learning approaches.
Another example of that was looking at plant diseases; there have been at least two groups
that have done this in the last few months.
Again, very successful results from people who are not deep-learning or machine-learning
experts.
Similar results in radio modulation.
These folks who are electrical engineering people found that they could double the effective
coverage area of phone networks (this is a massive result), and again they used very
simple approaches.
It's being used in fashion, it's being used to diagnose heart disease, and by hedge-fund
analysts.
There's a particular post which I found really inspiring actually in trying to put this together,
which is that Keras (which is the main library we'll be using), the author of that put together
this post showing how to build powerful models using very little data.
I really just want to give a shout-out to this and say that this work that Francois
has been doing has been very important in a lot of the stuff we're going to be learning
over the next few classes.
[Time: 20 minute mark]
The basic environment that we're going to be working in most of the time is the ipython
notebook or the jupyter notebook.
Let me just give you a sense of what's going on here.
When you have a jupyter notebook open, you will see something which ...
This is a good time to show you about starting and stopping AWS instances.
So I just tried to start going to my notebook on AWS and it says it can't be reached.
So my guess is if we go back to my console you can see that I have zero running instances
- I have zero servers currently running.
So if I click that, I will see all my servers.
Normally I would have one P2 server (or instance) and one T2, because I use the free one for
getting everything set up and then use the paid one once everything's working.
Because I've been fiddling around with things for this class, I just have the P2 at the
moment.
So, having gone here, one way I could start this is by selecting Start here, but I don't
much like using this GUI for stuff because it's so much easier to do things through the
commandline.
So one of things that I showed you guys that you could download today is a bunch of aliases
that you can use for starting and stopping AWS really quick.
If you haven't got them yet, you can find links to them on Slack, or you can just go
to platform.ai/files and there's a bunch of different things here.
This aws-alias.sh is a file that sets up these various aliases.
The easiest way to grab stuff on your AWS instance or server is to use wget, so I would
right-click on this and choose CopyLinkAddress, and then go wget and paste in that and that
will go ahead and download that file (I already had one, so it created a copy of it).
We can take a look at that file, and you'll see that it's basically a bunch of lines that
say "alias something = somethingElse".
And it's created aws-get-p2, aws-get-t2, aws-start, aws-ssh, aws-stop.
I'm going to show you what these things do because I find them pretty convenient.
First of all, I'll say "source aws-alias.sh" and that just runs this file (in bash, that's
how you run a file).
That's now caused all of those names to appear as aliases to my system.
So if I now run aws-get-p2, that's going to go ahead and ask Amazon for the ID of my P2
instance.
And not only does it print it, but it's going to save it into a variable called "instanceId"
and all of my other scripts will use $instanceId.
So I now want to start that instance, so I just type aws-start and that's going to do
the equivalent thing of going to the GUI, right-clicking, choosing Start.
The other nice thing it does is it waits until the instance is running and at the end it
asks, or queries for, the IP address and prints it out.
Now the script that I have given you to set up these instances actually uses an elastic
IP that actually keeps the same IP address every time you run it.
So you should find that the IP address stays the same, which makes it easier, so there
is the IP.
So I then have something called aws-ssh, and aws-ssh will go ahead and ssh into that instance
(ssh ubuntu@$instanceIp).
So all it does is basically use the username "ubuntu" (because that's the default username
for this kind of image on AWS) @$instanceIp (that's the IP address we just got).
[Time: 25 minute mark]
The other thing it does is to use the private key that was created when this was originally
set up.
Now in my case, I've actually moved that private key to be my default key so I don't actually
need that -i option.
But you can just type aws-ssh and you'll see, bang, here we are.
We are now inside that AWS image.
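The steps these aliases perform can be sketched as the following command lines, built here as Python lists without running them. This is not the contents of aws-alias.sh. It is an assumption about what equivalent commands could look like, using AWS CLI subcommands that do exist (`aws ec2 start-instances`, `aws ec2 wait instance-running`, `aws ec2 describe-instances`, `aws ec2 stop-instances`) plus `ssh -i` for the private key. The instance ID and key path below are hypothetical placeholders.

```python
import shlex


def start_commands(instance_id):
    """Commands to start an instance, wait until it runs, then print its public IP."""
    return [
        ["aws", "ec2", "start-instances", "--instance-ids", instance_id],
        ["aws", "ec2", "wait", "instance-running", "--instance-ids", instance_id],
        ["aws", "ec2", "describe-instances", "--instance-ids", instance_id,
         "--query", "Reservations[0].Instances[0].PublicIpAddress",
         "--output", "text"],
    ]


def ssh_command(instance_ip, key_path=None, user="ubuntu"):
    """Command to log in; the -i option is only needed for a non-default key."""
    command = ["ssh"]
    if key_path is not None:
        command += ["-i", key_path]
    command.append(user + "@" + instance_ip)
    return command


def stop_command(instance_id):
    """Command to stop the instance so it no longer bills by the hour."""
    return ["aws", "ec2", "stop-instances", "--instance-ids", instance_id]


def as_shell(command):
    """Render an argument list as a single shell-safe line."""
    return " ".join(shlex.quote(part) for part in command)


example_id = "i-0123456789abcdef0"        # hypothetical instance ID
example_key = "/path/to/private-key.pem"  # hypothetical key path

for command in start_commands(example_id):
    print(as_shell(command))
print(as_shell(ssh_command("52.40.116.111", key_path=example_key)))
print(as_shell(stop_command(example_id)))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; on a machine with the CLI configured, each list could be passed to `subprocess.run(command, check=True)`.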
One of the handy things about AWS is they have this thing called AMIs, Amazon Machine
Images.
An AMI is basically a snapshot of a computer at a particular point in time.
And you can start your own instance using a copy of that snapshot.
So in the script I've given you guys I've created and provided an AMI which has all
the stuff we want installed.
So that's why it is when you use that script and log in to it, you can start running things
straight-away.
So let's do that right now.
I've already created a directory for you called "nbs", for notebooks.
So we can go ahead and type "jupyter notebook", and this is how we ask Amazon to set up a
jupyter notebook server for us.
When it's done, it says "The Jupyter Notebook is running at: http://[all ip addresses on
your system]:8888".
So what is our IP address?
Well, it told us up here when we started it, 52.40.116.111.
So I'm going to go to my instance, 52.40.116.111:8888 (it told me that the port is 8888), and press
Enter.
I've set up a password, it is just "dl_course".
We can look later on at how to change that password but I thought it would be handy to
have a password there for everybody if you want to start looking at your own data.
Actually, by default it is not going to show you anything.
So now we can just go ahead and say New->Notebook and choose Python[condaRoot], and this sets
up a scientific computing environment for you, where you can type Python commands and
get back responses.
The basic idea here is that over there on Amazon, you have your server, and it's running
a program called jupyter notebook.
Jupyter notebook is causing a particular port (which is 8888) to be opened on that server,
where if you access it, it then gives you access to this jupyter notebook environment.
In your team, you guys can all use the same jupyter notebook if you want to.
Or you could run multiple jupyter notebooks on one machine.
It is really pretty flexible.
So now that I've created one, I can rename this to say, "jeremy's nb".
And so then Rachel might come along and be like I want to run something as well, so she
goes New on her computer and it creates a whole new one over here.
And she could say File->Rename and call it "rachel's notebook".
If I now go back here, you can see both of these notebooks are shown to be running.
So the server is running multiple kernels, they're called.
And you can see back here it's saying, "Creating new notebook" ... "Kernel started ..." . So
each of those are totally separate.
So from one of them, I say "name=rachel" and in the other one I say "name=jeremy" and over
here I say "name", you'll see that they are not in any way talking to each other, they
are totally separate.
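As an analogy for that separation, here is a tiny sketch in which each "kernel" is just its own dictionary of variables. Real Jupyter kernels are separate processes, so this is a simplification, and the helper name is hypothetical.

```python
def run_cell(namespace, source):
    """Execute a snippet of code inside one particular namespace."""
    exec(source, namespace)


# One namespace per notebook, standing in for one kernel per notebook.
jeremy_kernel = {}
rachel_kernel = {}

run_cell(jeremy_kernel, "name = 'jeremy'")
run_cell(rachel_kernel, "name = 'rachel'")

# The same variable name holds a different value in each namespace.
print(jeremy_kernel["name"])
print(rachel_kernel["name"])
```

Assigning `name` in one namespace never touches the other, which is the behaviour seen in the two notebooks.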
So that's a super-handy way to do work and the other nice thing is that you can not just
type code, but you can also type Markdown.
[Time: 30 minute mark]
So I could go, "New Section", "I want to talk about something here".
And so that as I do that, it allows me to mix-and-match information and code.
And every piece of code that came out, I can see where it came from.
And, also as you'll see, it allows us to put in visualizations and plots and so forth.
Some of you may have come across this important concept called Literate Programming.
And Literate Programming is the idea that as you code, you are documenting what you
are doing in a very deep way, not just for others, but maybe more importantly for yourself.
So when you're doing data science work, work like a scientist.
How many people here are in some form scientists, or have been scientists?
So you guys will know the importance of your journal notebook.
The greatest scientists, there are all sorts of stories about the kinds of notebooks they
kept and how their lab notebooks or their lab journals worked.
This is critical for data scientists too.
The idea that as you do experiments, you're keeping track -- what did I do, what worked,
what didn't work.
I can see all the people who put their hands up as scientists are all nodding right now.
So this makes it super-easy to do that.
So be helpful to yourself and to your team by taking advantage of this.
Now in order to learn to use this environment, all you have to do is press H. And when you
press H, it brings up all these keyboard shortcuts.
After not very long, you will get to know all of them, because they are all extremely
useful.
But the main ones I find particularly helpful, is you hit M to turn into Markdown mode (that's
the mode where you can enter text rather than code), or Y to switch it back to code again.
And you certainly need to know [SHIFT]Enter, which evaluates the cell and gives you a new
cell to enter into, and you also need to know Escape, which pops you out of entering information
and gets you back into this command mode, and then Enter to get back into enter mode
again.
And you see as I move around, it changes which one is highlighted.
I've started to create some resources on the wiki to help you with jupyter notebook.
It's still really early, but you guys I'm sure can help by adding more information here.
One of the things I particularly mention is that there are some good tutorials.
I thought I had also mentioned my favorite book, "Python for Data Analysis" by Wes McKinney.
It's a little old, it also covers Pandas a lot (which you don't need), but it's a good
book for getting familiar with this basic kind of Python scientific programming.
The last kind of ingredient I want to introduce is Kaggle.
How many people have been to or have done anything with Kaggle at any point?
Anybody who is in the masters program here I'm sure will have used Kaggle or will shortly
use Kaggle.
Mainly because it's a great place to get all kinds of interesting data sets.
So for example, if you wanted to test your ability to automate drug discovery, you could
go to Kaggle and download the files for the Merck Molecular Activity Challenge, run some
models and test them to see how they compare to the state-of-the-art by comparing them
to the leader board.
Kaggle is a place where various organizations run machine-learning competitions, they generally
run for about three months.
It's super-cool because they get archived, potentially forever.
You can download the data for them later on and find out how you would have done in that
competition.
[Time: 35 minute mark]
Generally speaking, if you're in the top 50% that means you have an okay-ish model that
is somewhat worthwhile.
If you're in the top 20%, it means that you have a very good model.
If you're in the top 10%, it means you're at the expert level for these types of problems.
If you're in the top 10, it literally means you're one of the best in the world because
every time I've seen a Kaggle competition (I used to be president of Kaggle, so I'm
very familiar with it) at least the top 10 generally all beat the previous best in the
world and generally are very good machine-learning experts who are going beyond anything that's
been done before.
It seems that the power of competition pushes people way beyond what the previous academic
state-of-the-art was.
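The rules of thumb above can be written down as a small helper. This is only a sketch of the thresholds as stated in the lecture; the function name and the wording of the labels are made up.

```python
def leaderboard_tier(rank, total_entries):
    """Map a leaderboard position to the rough tiers described in the lecture."""
    if rank < 1 or rank > total_entries:
        raise ValueError("rank must be between 1 and total_entries")
    if rank <= 10:
        return "top 10: among the best in the world"
    fraction = rank / total_entries
    if fraction <= 0.10:
        return "top 10%: expert level"
    if fraction <= 0.20:
        return "top 20%: very good model"
    if fraction <= 0.50:
        return "top 50%: okay-ish model"
    return "below the top half"


for rank in (5, 50, 150, 400, 900):
    print(rank, "of 1000 ->", leaderboard_tier(rank, 1000))
```

Since archived competitions keep their leaderboards, the same check works on a late submission to see where it would have placed.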
So Kaggle is a great environment to find interesting data sets and to benchmark your approaches.
So we're going to be using it for both of these purposes.
Our first challenge will be Dogs vs Cats.
Sometimes on Kaggle they run competitions that are not done for lots of money, sometimes
they are done for free or for a bit of fun.
In this case, it was actually done for a particular purpose which was can you create an algorithm
that recognizes the difference between dog photos and cat photos.
The reason why was that this particular organization was using that problem as a CAPTCHA, in other
words, to tell the difference between humans and computers.
It turned out that the state-of-the-art machine classifiers could score 80% accuracy on this
task.
So really this group wanted to know can you surpass the state-of-the-art, is this a useful
CAPTCHA.
And then if you can surpass the state-of-the-art, can they use this in a dogs vs cats recognizer
for their pet finding work.
So really the goal here was to beat 80%.
Now this is a great example of the kind of thing which you could use for a thousand million
different purposes.
For example, the work I did in cancer detection is this -- if you take a CT or an x-ray or
an MRI and you say to a deep-learning algorithm, these people have malignant cancer, these
people don't, then it's the same as cats vs dogs.
If this is a healthy, high crop-yield area from satellite photos, this area is not, then
it's the same as cats vs dogs.
If you say, this is one kind of skin lesion and this is another kind of skin lesion; if
you say that this is an abstract art painting and this is not; this is an extremely valuable
painting and this is not; this is a well-taken photo and this is not.
They're all image analysis problems that are generally classification problems, and these
are all examples of things that people have done with this kind of technology.
So cats vs dogs, it turns out, is a very powerful format and so if we can learn to solve this
well, we can solve all of these kinds of classification problems.
Not just binary, not just this group or that group, but also things like that skin lesion
example, these are 10 different types of skin lesions, which type is it.
Or the crop disease example, which of these 13 crop diseases are we looking at here.
An example of an actual thing that I saw was cucumber analysis.
A Japanese cucumber farmer used this approach to deep-learning, he automated all the logistics
and had a system that would put different grades of cucumbers into different bins automatically
and make the cucumber workflow much more efficient.
So, if that was your idea for a start-up, it's already been done, sorry ... but there's
many more.
These are all of our basic pieces.
To get started, here we are with this AWS server, with this pretty empty looking set
of notebooks here, so we want to go ahead and start getting some work done.
To do that, we need to download the basic files that we need.
[Time: 40 minute mark]
So I've sent you all of this information already, all of the information you need is on our
platform.ai website.
All of the notebooks are in files/nb.
I'm going to show you a cool little trick, I'm going to press [CNTL]C twice, that shuts
down the notebook, the notebook's not running.
Don't worry, it saves itself automatically on a regular basis, or you could just hit
S to save it right now.
After shutting down the notebook, as you'll see, the Python notebook files are still there.
And you can see actually that behind the scenes, they're just big bunches
of JSON text, so you can stick them in Github and they'll work pretty well.
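To show what "big bunches of JSON text" looks like, here is a minimal sketch that builds a two-cell notebook as a plain dictionary and round-trips it through the standard `json` module. The layout follows version 4 of the notebook format as I understand it; a notebook saved by Jupyter carries more metadata (for example which kernel to use), so treat this as an illustration rather than a full specification.

```python
import json

# A tiny notebook: one Markdown cell and one code cell.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# New Section\n", "I want to talk about something here"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["name = 'jeremy'"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

# This string is what would be written to an .ipynb file.
text = json.dumps(notebook, indent=1)

# Reading it back gives the same structure, which is why plain-text tools work on it.
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]

print(text)
print(cell_types)
```

Because the file is ordinary text, it can be committed to version control and diffed like any other source file, although stored outputs can make the diffs noisy.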
What I generally like to do is run something called tmux.
How many of you here have used tmux or screen before?
Those of you who haven't, you're going to love this trick.
Tmux and screen are programs that let you run programs on your server, close your terminal,
come back later and your program will still be running in the exact same way.
I don't remember if tmux is already installed; it is.
To use it, you just go tmux, and it looks like nothing happened, except a little green
bar here at the bottom.
But if I now hit tmux's magic command, which is [CNTL]B, and press [CNTL]B?
(control-B-?), you can see there are lots of keystrokes that tmux has ready for me to
use.
And so one of the ones I like is [CNTL]B" (control-B-doubleQuote), which creates a second
window underneath this one, or [CNTL]B% (control-B-percent), which creates a second window next to this
one.
I seem to like to set up a little tmux session and get it all set up the way I want.
So I'm not going to go into detail about how to do everything I show you, what I really
want to do in the class to make the most of the time, is say here's something that exists,
here's something I recommend you using, here's what it's called, and during the week, you
can play with it.
You can ask questions, you can use it in your team, and so forth.
So here it is, it's called tmux, this is what it does, and I'll show you something cool.
If I now go [CNTL]B and then d for detach, and close out of this altogether, it's all
gone.
So if I now go back into my server ... I wasn't able to return to my session properly because
currently $instanceIp is not defined.
Rather than every time I start, sourcing my aws-alias.sh file, what I should do is go
"vim .bashrc" (.bashrc is a file that is run every time you run bash), and if I edit my
.bashrc file and at the end
I type "source aws-alias.sh", you can see now all the aliases are there.
So before I ssh to $instanceId, I have to find out my correct IP address.
So I can say aws-get-p2 to get my instance ID.
I'm not sure I have something here to just get the IP address.
As you can see, I'm kind of playing with this as I go.
So I'm going to go ahead and show you how to do this.
Right now, the IP address only gets printed out when I start an instance.
[Time: 45 minute mark]
In this case, I've already got an instance running.
I'm going to edit this script and I'll change it later on.
But basically I'm going to create a new alias called aws-ip, and I'm just going to keep the
bit that says instanceIp = somethingSomethingSomething.
I then source aws-alias.sh, and I've now got a new alias called aws-ip, and now I can go
ssh ubuntu@$instanceIp.
Having said all that, because my IP address is going to be the same every time and I couldn't
really be bothered waiting for all that, I'm actually going to manually put my
IP address in here, so that the next time I run this I can just press upArrow and rerun
the command.
I'm kind of showing you lots of ways of doing things so that you can kind of decide what
your own workflow is like, or come up with better ones.
So here's a cool thing, I am back in my box here, and then if I say "tmux attach", I am
exactly back to where I came from.
Whatever I had running, whatever state it was, it is still sitting there.
The particularly cool thing is that any notebooks, the kernels I had running, they are all still
sitting there.
This is particularly helpful if you are running an OVH server, or one of your own servers.
With AWS, it is a little less helpful because you really need to shut it down to avoid paying
the money.
But if you've got something you can keep running.
For all the USF students, you all have or will have access to the GPU server we have
here at the University, particularly helpful for you guys.
So I actually tend to use this little bottom right hand window to permanently have jupyter
notebook running and I tend to use this left hand window to do other things.
In particular I am going to go and grab my notebook.
The easiest way to grab things is to use wget, and if I go "wget http://www.platform.ai/files/nbs/lesson1.ipynb",
I now have a notebook, lesson1 notebook.
And so if I go back to my jupyter notebooks, it is here and if I click on it and here is
our notebook.
If you're using a T2 instance (the free one) generally speaking particularly the first
time you run something, it could take quite a long time to open.
You should find the second time is fast, by the way.
So here is our notebook.
Hopefully quite a few of you have already gotten to the point today that you can see
this.
Those of you that haven't will get plenty of help during the week.
This particular notebook uses two external scripts to help.
Those scripts are called utils and vgg16.
The last thing to do before our break is to grab those (wget), just toss those all in
the notebook directory so they're all in the same place.
Then unzip them.
Then the only other thing that you need is the data.
The data sits in the platform.ai/data directory.
The data is all the dogs and cats.
Now I've taken the Kaggle data and made some changes to it, which I'm going to be showing you.
So rather than downloading it from Kaggle, I suggest you grab it from platform.ai, and
I've sent you this information today as well.
So I'm going to cd into the data directory and wget dogscats.zip as well.
So that's going to run for a few minutes.
[Time: 50 minute mark]
The previous section for some of you was a bit of a fire hose of information, here's
bash, here's AWS, here's Kaggle, here's GPUs.
And for some of you it was probably really boring, most practicing
data scientists probably are using all of these things already.
If you're at one extreme (holy shit that was a fire hose of information), don't worry,
we have all week to get through it.
We'll have the video tomorrow.
And by the time that you're here again next week, I want to make sure that everybody who
has the time and interest to work hard on it has got through all the material.
If you haven't, like it's early in the weekend and you're not going to get there, please
let Rachel and I know.
We will work with you in person to get you there.
Everyone who puts the time in, I'm determined to make sure can get through the material.
If you don't really have the background and you don't really have the time, that's fine.
Maybe you won't get through all the material.
But I really am determined that everybody who's prepared and able to put in the time
can get through everything.
So between the community resources, and the video, and Rachel and I, we will help everybody.
To those of you who are practicing data scientists and you are familiar with all of these pieces,
I apologize that it will be a bit slow for you and hopefully as we move along there will
be more and more new stuff.
I'm kind of hoping that for those of you that have some level of expertise, we will continually
give you ways that you can go further.
So for example, at the moment, I'm thinking, can you help us with these scripts, to make
them better, to make them simpler, to make them more powerful, to create Azure versions
of them.
All this stuff that we're doing to try and make deep-learning as accessible as possible,
can you help contribute to that, can you contribute to the wiki.
So for those of you that already have a high level of expertise, I'm looking to make sure
there's always ways to push yourself.
So if you're ever feeling a bit bored, let me know and I'll try to give you something
to do that you don't know how to do, and then you won't be bored anymore.
So at this point, I downloaded dogscats.zip and I unzipped it (unzip -q dogscats.zip).
If you are wondering about the "-q", it is just because otherwise unzip prints out every
filename as it goes, so "q" is for quiet.
So just about the most important thing for doing this kind of image classification is
how the data directories are structured.
In particular, you'll notice that we have a training set, and a test set.
That's because when we downloaded the data originally from Kaggle, it had a train.zip
and a test.zip.
Keras, which is the library we're going to use, expects that each class of objects that
you're going to recognize is in a different directory.
So the one main thing I did after I downloaded it from Kaggle is that I created two directories,
one called cats and one called dogs, put all the cats in the cats directory and all the
dogs in the dogs directory.
When I downloaded them from Kaggle, they were all in one directory and they were called
cat.1.jpg and dog.1.jpg. (ls train/dogs/dog.1*)
So now if I "ls -l train/dogs/ | wc -l; ls -l train/cats/ | wc -l" there are 11,500 dogs
in there and 11,500 cats in there, so that's the number of dogs and cats we have in our
training set.
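As a rough sketch of that reorganization, the function below moves files named like cat.1.jpg and dog.1.jpg into cats/ and dogs/ subdirectories. The function name and the temporary demo directory are hypothetical; this is not the script that was actually used to prepare the course data.

```python
import os
import shutil
import tempfile

def sort_into_class_dirs(train_dir, classes=("cat", "dog")):
    """Move cat.N.jpg into train_dir/cats/ and dog.N.jpg into train_dir/dogs/."""
    for cls in classes:
        os.makedirs(os.path.join(train_dir, cls + "s"), exist_ok=True)
    for name in sorted(os.listdir(train_dir)):
        src = os.path.join(train_dir, name)
        if not os.path.isfile(src):
            continue  # skip the class directories themselves
        prefix = name.split(".")[0]
        if prefix in classes:
            shutil.move(src, os.path.join(train_dir, prefix + "s", name))

# Demo on empty placeholder files in a temporary directory.
train_dir = tempfile.mkdtemp()
for name in ["cat.1.jpg", "cat.2.jpg", "dog.1.jpg"]:
    open(os.path.join(train_dir, name), "w").close()

sort_into_class_dirs(train_dir)
counts = {d: len(os.listdir(os.path.join(train_dir, d))) for d in ("cats", "dogs")}
print(counts)
```

Counting with os.listdir plays the same role as the "ls | wc -l" check above.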
[Time: 55 minute mark]
So for those of you that haven't done much data science before, there's this really key
concept that there's a training set and a test set.
Kaggle, being a competition, makes this really obvious, the files in the training set tell
you what they are; here is a dog, it's called dog.something.
But if I look in the test set, they don't say anything, they are just numbers.
Why is that?
That's because your job in the Kaggle competition is to say for example, file 43.jpg - is it
a dog or is it a cat?
So there are 12500 images in the test directory for you to score, for you to classify.
Even if you're not doing a Kaggle competition, you should always do this yourself.
Ideally, you should get one of your colleagues to do it without you being involved:
to split the data into a test set and a training set and to not let you look at the test set
until you've promised you're finished.
Kaggle kind of enforces this.
They let you submit to the leader board and find out how you're going but the final score
is given based on a totally separate set of data that is not scored.
So this is like for me, before I started entering Kaggle competitions, I thought my data science
process was reasonably rigorous, but once I really started doing competitions I realized
that that level of enforcing the test/training data set made me a much better data scientist,
you know you can't cheat.
I do suggest that you do this in your own projects as well.
Now because we also want to tune our algorithm, in terms of different architectures and different
parameters and so forth, I'd like to talk about, it's also a good idea to split your
training set further, into a training set and a validation set.
You'll see a lot more about how this works.
But you'll see in this case, I've created another directory, called "valid", which has
dogs and cats subdirectories as well.
Let's check that they're exactly the same.
Here you can see that there are 1000 cats and 1000 dogs.
So when I originally downloaded from Kaggle, there were 12500 cats and dogs in the training
set.
That's why in my training set there are 11500 because I moved 1000 of each of them to the
validation set.
So that's the basic data structure we have.
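Here is a minimal sketch of carving a validation set out of the training set by moving a fixed number of files per class. Picking the files at random (with a fixed seed) is an assumption; the lecture only says that 1000 of each class were moved. The function name and the small demo numbers are hypothetical.

```python
import os
import random
import shutil
import tempfile

def move_to_validation(train_dir, valid_dir, n_per_class, seed=0):
    """Move n_per_class randomly chosen files of every class into valid_dir."""
    rng = random.Random(seed)
    for cls in sorted(os.listdir(train_dir)):
        src_dir = os.path.join(train_dir, cls)
        dst_dir = os.path.join(valid_dir, cls)
        os.makedirs(dst_dir, exist_ok=True)
        chosen = rng.sample(sorted(os.listdir(src_dir)), n_per_class)
        for name in chosen:
            shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))

# Demo: 10 placeholder files per class, 2 of each moved to validation.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
valid_dir = os.path.join(root, "valid")
for cls in ("cats", "dogs"):
    os.makedirs(os.path.join(train_dir, cls))
    for i in range(10):
        open(os.path.join(train_dir, cls, "%s.%d.jpg" % (cls[:-1], i)), "w").close()

move_to_validation(train_dir, valid_dir, n_per_class=2)
train_count = len(os.listdir(os.path.join(train_dir, "dogs")))
valid_count = len(os.listdir(os.path.join(valid_dir, "dogs")))
print(train_count, valid_count)
```

With the real data, n_per_class=1000 would turn 12500 per class into 11500 for training and 1000 for validation.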
Other than splitting things into test, training, and validation sets (that's the most important
advice I have for data scientists), the second most important piece of advice I have for
data scientists is to do nearly all of your work on a sample.
A sample is a very small amount of data that you can run so quickly that everything you
try, you get a nearly immediate answer to.
This allows you to very quickly try things, change things, and you get a basic process
running.
So I always create a sample with 100 or so items to just get started with.
So you'll see I have a directory called sample, and in that I have a whole separate train
and valid.
I did not move things there, I copied them there.
The purpose of this sample directory is to just let me do things really quickly.
So, you'll see inside sample/train, again I have cats and dogs directories, but this
time there are 8 files in each directory.
I probably should have put more in there.
I think more like 100 may have been good, but I think at the time I was using a really
low power computer to do my testing, just enough to check that my script's working.
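A sample directory like this can be built by copying, not moving, a handful of files per class. The sketch below assumes the train/valid layout described above; the function name is hypothetical and taking the first few files alphabetically is just one simple choice.

```python
import os
import shutil
import tempfile

def copy_sample(src_root, sample_root, n_per_class):
    """Copy the first n_per_class files of every class from train/ and valid/."""
    for split in ("train", "valid"):
        split_dir = os.path.join(src_root, split)
        for cls in sorted(os.listdir(split_dir)):
            dst_dir = os.path.join(sample_root, split, cls)
            os.makedirs(dst_dir, exist_ok=True)
            names = sorted(os.listdir(os.path.join(split_dir, cls)))[:n_per_class]
            for name in names:
                shutil.copy(os.path.join(split_dir, cls, name), dst_dir)

# Demo: 5 placeholder files per class and split, 3 of each copied to sample/.
root = tempfile.mkdtemp()
for split in ("train", "valid"):
    for cls in ("cats", "dogs"):
        class_dir = os.path.join(root, split, cls)
        os.makedirs(class_dir)
        for i in range(5):
            open(os.path.join(class_dir, "%d.jpg" % i), "w").close()

sample_root = os.path.join(root, "sample")
copy_sample(root, sample_root, n_per_class=3)
sample_count = len(os.listdir(os.path.join(sample_root, "train", "cats")))
original_count = len(os.listdir(os.path.join(root, "train", "cats")))
print(sample_count, original_count)
```

Because the files are copied, the full training and validation sets stay untouched.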
Now that everything's downloaded, you can see that my jupyter notebook has already
loaded.
I'll get rid of the zip files and the notebooks that I was just playing with, and we're ready
to get started doing some deep learning.
[Time: 1 hour mark]
The goal for you guys during this week will be to replicate everything that I've done,
initially just by making sure that this notebook works for you, but then to replicate it with
another data set.
One of the things we'll do tomorrow is post some ideas of other interesting Kaggle data
sets you can try, and maybe other people can also post other interesting data sets they
found elsewhere.
The idea will be to make sure that during the week, you can run your own classification
process on some data set other than dogs and cats.
But first of all, make sure you can run this.
So as you can see in this notebook, I've used Markdown cells.
How many people have used Markdown before?
So, most of you.
For those of you that don't know, Markdown is what we use both in the notebook, as well
as on the wiki.
It's basically a way of really quickly creating formatted text.
There's not enough of you that aren't familiar with it that I'm going to go over it in detail.
If you're not familiar with it, please google Markdown, and you can experiment with it either
on the wiki or in your notebook.
As you can see though, I've basically created cells with headings and some text.
During the week, you can read through things in detail.
As we mentioned, we're going to try to enter the dogs and cats competition.
So 25,000 labeled dog and cat photos half of each, 12,500 in the test set, and the goal
is to beat 80%.
As we go along, we are going to be learning about quite a few libraries.
Not too many, but enough that for those of you that haven't used Python for data science
before, it's going to seem like quite a bit.
By the end of the seven weeks, hopefully you'll be pretty familiar with all of them.
One of the really important three is matplotlib.
Matplotlib does all of our plotting and visualization.
And on the wiki, we have a section called Python Libraries, and as you can see we have
our top three listed up here.
At the moment, there are just links to where they come from.
I'm hoping that you guys will help us to turn this into a really rich source of information,
about places that you've found lots of helpful stuff, answers to questions.
But for now, if you're not familiar with one of these things, type the word followed by
"tutorial" into google and you'll find lots of resources.
All of these things are widely used, Keras a little bit less so because it's just a deep-learning
library and therefore relatively new.
Numpy and matplotlib and scikit-learn and scipy -- there's lots of books about them,
there's lots of tutorials about them.
Matplotlib creates plots and one of the things we need to do is to tell jupyter notebook
what to do with these plots.
Should it pop up a new window for them, should it save them?
So this "%matplotlib inline" says please show our plots in the actual jupyter notebook.
That's pretty much the first line in every jupyter notebook right here.
And here's the thing I told you about, which is sometimes I want to run stuff on a sample,
sometimes I want to run it on everything.
So I make it really easy for myself by having a single thing called "path" which I can switch
between the sample and the everything.
So for now, let's just do things on the sample, do all of your work on the sample until everything
is working.
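One simple way to set up that switch is sketched below. The directory names and the USE_SAMPLE flag are placeholders, not necessarily what the lesson notebook uses; adjust them to wherever you unzipped the data.

```python
USE_SAMPLE = True  # flip to False once everything works on the sample

# Placeholder location; change to wherever the data was unzipped.
DATA_ROOT = "data/dogscats/"
path = DATA_ROOT + "sample/" if USE_SAMPLE else DATA_ROOT

# Everything later in the notebook builds on `path`, so one flag switches it all.
train_path = path + "train"
valid_path = path + "valid"
print(train_path, valid_path)
```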
As you can see, each time I've done something, I've pressed [SHIFT][ENTER] and it says a
particular number after the In (In [2]:), so this is the second input cell that I've
run.
Like every programming language, a large amount of the power of Python comes from the libraries
that you use.
To use a library in Python, you have to do two things: you have to install it and then
you have to import it.
In Python, I strongly recommend that you use a particular Python distribution called Anaconda.
And if you're using the scripts and the AMIs we provided, you're already using Anaconda.
[Time: 1.05 hour mark]
You can check which Python you're using by typing "which python" and it will show you.
You'll see that I'm not only using Anaconda, but I'm using an Anaconda that was installed
into my home directory.
So no screwing around with sudo or any of that business.
If you use our AMI scripts, this is all being done for you.
With Anaconda, installing anything is as simple as typing "conda install" and the name of
the package.
And on Anaconda, everything's been precompiled, so you don't have to wait for it to compile,
you don't have to worry about dependencies, you don't have to worry about anything, it
just works.
That is why we very highly recommend using Anaconda.
It works on Mac, it works on Windows, and it works on Linux.
Lots of Windows users use it, very few Linux users use it, very few Mac users use it.
I think that's a mistake because lots of Linux and Mac users also have trouble with compiling
dependencies and all that stuff.
I suggest that everybody use it.
From time-to-time, you'll come across something that does not have a conda installer available,
in which case you'll have to use pip instead.
In our case, I think just Theano and Keras are in that situation, but neither of those
need compiling anything at all, so they're very, very easy to install.
So, once you've installed it by typing conda install whatever (and most things are already
installed for you with our AMI), you then have to tell Python that I want to use it
in this particular session, which you do by typing import and the thing you want to look
at.
So I'm not going to go through all these libraries right now (I'll go through them as we use
them), but one of the big three is here, which is numpy.
Numpy is the thing which our wiki page describes that provides all of our linear algebra.
How many people here have some familiarity at all with linear algebra?
Nearly all of you, good.
So, if you're somebody who didn't put up their hand, I would suggest looking at the resources
that Tara added.
Go back to the homepage and go to Linear Algebra for Deep Learning.
Generally speaking for any math stuff, my suggestion is to go to the Khan Academy site.
The Khan Academy has really great videos for introducing these kind of simple topics.
We just need to know these three things, mainly just these first two things (matrix product,
matrix inverse) for this course.
Numpy is the thing that gives you these linear algebra operations in Python, and as you'll
see it makes them extremely easy to use.
Pretty much everybody renames numpy to np, that's what "import numpy as np" does.
You'll find in nearly everybody's script on the Internet, it will be np.something.
In general, we try to stick with the same kind of approaches that everybody else uses,
so that nothing will be too unfamiliar.
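As a quick illustration of the two operations mentioned above, here is the matrix product and matrix inverse in numpy. The matrices are arbitrary small examples chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([[1.0, 0.0],
              [4.0, 1.0]])

product = a.dot(b)          # matrix product: rows of a times columns of b
a_inv = np.linalg.inv(a)    # matrix inverse
identity = a.dot(a_inv)     # a times its inverse is (numerically) the identity

print(product)
print(identity)
```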
Okay, so we've imported the libraries that we need.
We also try to provide additional utilities and scripts for things we think ought to exist
but don't exist, to make things easier.
There's very few of them.
Nearly all of them are in a little script called utils.
There's a cool little trick, if you are using an external script that you've created and
you're changing it quite a bit.
For example, now that you've got utils, feel free to add and change and do what you like
to it.
If you import it like this, "import utils; reload(utils); from utils import plots", you
can go back and rerun that cell later after you've changed utils.py, and all of your changes
will be there, available for you to use.
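The reload trick can be tried in isolation with the sketch below. It assumes Python 3, where reload lives in importlib (in Python 2, as used in the lecture, reload is a builtin). The module name my_utils_demo and its contents are made up so the demo does not touch the real utils.py.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo free of stale cached bytecode

# Create a throwaway module on disk and make its directory importable.
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "my_utils_demo.py")
sys.path.insert(0, module_dir)

with open(module_path, "w") as f:
    f.write("def greeting():\n    return 'first version'\n")

importlib.invalidate_caches()
import my_utils_demo
before = my_utils_demo.greeting()

# Edit the script on disk, as you would in an editor...
with open(module_path, "w") as f:
    f.write("def greeting():\n    return 'second version, after editing'\n")

# ...then reload so the running session picks up the change.
importlib.reload(my_utils_demo)
after = my_utils_demo.greeting()
print(before, "->", after)
```

Note that names pulled in with "from module import name" keep pointing at the old objects until that import line is run again, which is why the cell in the lecture repeats the "from utils import plots" after the reload.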
For now, we're just going to use one thing from our utils library, plots.
So, our first step will be to use a pre-trained model.
What do we mean by a pre-trained model?
What we mean is that somebody has already come along and downloaded millions of images
off the Internet and built a deep-learning model that has learned to recognize the contents
of those images.
Nearly always when people create these pre-trained models they use a particular dataset, called
ImageNet.
One of the key reasons that they tend to use ImageNet is because ImageNet has the most
respected annual computer vision competition.
Nowadays people that win the ImageNet challenge tend to be companies like Google
and Microsoft.
A couple of years ago, it tended to be people who immediately got hired by Google and Microsoft.
[Time: 1.10 hour mark]
ImageNet itself is fun to explore.
If you go to ImageNet and go to Explore (image-net.org/explore), you can check it out.
Basically, there are 32,000 categories.
So, for example, you could go to ImageNet and look at plant->crop->field-crop->field-corn->dent-corn.
So here we have a number of pictures of dent corn; there are 397 of them.
The folks that create these pre-trained networks basically download a large subset of ImageNet,
the competition has 1000 of these 32,000 categories that people compete on.
So nearly always people just build models for these 1000.
I would be remiss if I did not mention the shortcomings of the ImageNet dataset.
Can anybody tell me something that they notice in common about what these photos look like
or how they are structured?
They're just one thing.
Like if you look at an arbitrary photo from my photo album, you'll see there's a person
here and a bridge there, and something else here.
ImageNet is carefully curated, for flint corn there are 312 really good pictures of flint
corn, whatever that is.
This is an easier problem than many problems that you will be facing.
For example, I was talking to Robin from Planet Labs at the break about the work that they're
doing with satellite imagery.
Their satellite imagery is going to have a lot more than just a piece of corn.
Planet Labs photos are pretty big, a couple million pixels, you're
going to have 500 sq km.
So there's going to be tennis courts and swimming pools, and people sunbathing and all kinds
of stuff.
So when Robin takes this stuff to Planet Labs, he's not going to be able to use a pre-trained
network directly.
But we're going to show you how you can use some of the structure of the pre-trained network
even if you are not looking at photos that are this clear.
Having said that, if you remember the slide I showed you earlier of the plant disease
project, each of those plant disease pictures were very clearly just pictures of one thing.
Be aware that when you're using a pre-trained network, you are inheriting the shortcomings
and biases of the data it was trained from, and therefore you should always look at the
data it was trained from.
Being aware of that, I would say for us this is going to be a very suitable dataset and
when we look at the dataset, you'll see why I say that.
So each year, most of the winners of the ImageNet competition make their source code and their
weights available.
So when I say their source code and their weights, the source code is the thing that
defines ... remember when I told you there were three bits that give us modern deep learning
... infinitely flexible function, way to train parameters, fast and scalable.
The particular functional form is what is the neural net architecture.
So that's the source code.
So generally you download the source code from the folks that built it all.
So the second thing is the parameters that were learned.
Generally an ImageNet winner has trained the model for days or weeks, nowadays often on
many GPUs, to find the particular set of parameters, set of weights, that make it really good at
recognizing ImageNet pictures.
So you generally have to get the code and the weights.
And once you have those two things, you can replicate that particular ImageNet winner's
results.
[Time: 1.15 hour mark]
One of the winners of 2014 was the Visual Geometry Group, an Oxford University group,
with a model called VGG.
You'll hear about it lots.
Generally speaking, every year's ImageNet's winners, the particular model they used are
so well-used in the community that people call them by name.
Like the 2012 winner was AlexNet, the 2014 winner was VGG.
The 2015 was Inception, the 2016 was ResNeXt, so they all have names.
VGG is a couple of years old, so it's not quite the best today, but it's special because
it's the last of the really powerful simple architectures.
We will get to the more complex architectures.
Depending on how we go, it might be in this set of classes.
If not, it will be in next year's classes.
Hopefully this year's set of classes.
VGG's simpler approach is not much less accurate and for teaching purposes, we're going to
be looking at something that is pretty state-of-the-art and is really easy to understand, so that's
one of the reasons we're using VGG.
Another reason we're using VGG is it's excellent for the kinds of problems we were just talking
about that Robin with his satellite imagery has, which it's a great network for changing
so that it works for your problem, even if your problem's a little different.
So there's a number of reasons that VGG is a really great thing for us to be using.
My strong preference is to start out by showing you how to do things that you can use tomorrow,
rather than starting with 1+1 and showing you how to do things that are useful in six
years time after you've got your PhD.
So, I'm going to start out by showing you 7 lines of code that do everything you need.
And to get to the punchline, the state-of-the-art for dogs vs cats in academia is 80% accuracy,
this gives you 97% accuracy, and you don't need to do anything else.
For you, after this class to see if you can get everything working, basically your job
will be can you run these 7 lines of code.
And if you can, you can re-run it on your own dataset as long as you structure the directories
the way that I just showed you.
So what I'm going to do is I'm going to go through these 7 lines of code (or something
very similar to them) line by line and show you pictures of what we're doing along the
way.
I wanted to start by showing you these 7 lines of code because we're going to be looking
at all kinds of things along the way in order to really understand what's going on, and
at some point you might start thinking "Gosh, there's a lot to do to do deep-learning."
But there's not.
There's a lot to do to really explain, and talk about and think about deep-learning,
but for you to actually do image classification, you just need these 7 lines of code.
So what does it mean to train a model that's already trained?
Yes, you're getting a little bit ahead of us, but it's great to answer these questions
many times.
In this case, the VGG model has been trained to recognize photos of the 1000 types that
are in the ImageNet competition.
There's a number of reasons why that does not give us dogs vs cats.
Reason #1, is that if we go into the animals section of ImageNet -
dogs -> hunting dogs, sporting dogs, pointers, Vizsla.
They have 2334 pictures of Vizsla.
You could go back and run it and find all the German Short Pointers and Vizsla, but
that's something you have to do.
So that's one shortcoming of the VGG approach compared to what we actually do.
[Time: 1.20 hour mark]
The second shortcoming is that sometimes it's going to get it wrong, and it might get it
wrong for very good reasons.
For example, this one might come back with snow.
But it's not going to come back with just snow, it's going to come back with a probability
for every one of the 1000 categories.
It could be a probability of 0.0003 that it's a mushroom, and 0.0002 that it's an airport,
and 0.4 that it's snow, and 0.3 that it's a German Shepherd.
We want to kind of take advantage of all that information as well.
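To picture what "a probability for every category" looks like, here is a toy sketch using the made-up numbers from the example above. A real model returns one value for each of the 1000 categories; only four are shown here, and the helper top_k is a hypothetical name.

```python
# Made-up probabilities for one image, taken from the spoken example.
# A real prediction has 1000 entries, one per ImageNet competition category.
probabilities = {
    "snow": 0.4,
    "German Shepherd": 0.3,
    "mushroom": 0.0003,
    "airport": 0.0002,
}

def top_k(probs, k):
    """Return the k most probable (label, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

best_label, best_prob = top_k(probabilities, 1)[0]
top_two = top_k(probabilities, 2)
print(best_label, top_two)
```

Looking only at the single best label throws away the fact that "German Shepherd" was a close second, which is exactly the information worth keeping.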
So what this actually does, it does something called fine-tuning, something we're going
to learn a lot about.
Fine-tuning takes that pre-trained image model and it says, use everything you know about
the 1000 categories to figure out which ones are cats and which ones are dogs.
That's a great question and we're going to go back and talk about that a second time.
So this code can work for any image recognition task with any number of categories, regardless
of whether it's in ImageNet or not.
Really the only kind of image processing/recognition that they're not going to do, is something
where you're going to recognize lots of objects.
This is specifically for recognizing a class.
Let's see how it works.
When something's running, it has a little star.
You will probably get this warning, cuDNN is more recent than the one Theano officially
supports.
So this is a good time to talk about some of the layers that we have going on.
In this example, we're using our vgg16 class.
It's sitting on top of Keras, which is the main deep-learning library we're using.
Keras is sitting on top of Theano (which we'll be talking about quite a bit, but less than
Keras).
Theano is the thing that takes Python code and turns it into compiled GPU code.
Theano is sitting on top of a number of things, broadly speaking Nvidia's CUDA programming
environment.
Part of CUDA is the CUDA deep-learning neural-network library (cuDNN).
For most important things in deep-learning, Theano is simply calling a function inside
cuDNN.
So one of the things that we've set up for you in the scripts is to get all of this stuff
stuck together.
Keras is all written in pure Python and what it does is it takes your deep-learning architectures
and code and turns it into Theano code (in our case).
It can also turn it into TensorFlow code.
TensorFlow and Theano are very similar.
They're both libraries that sit on top of CUDA and provide a type of a Python to GPU
mapping and lots of libraries on top of that.
TensorFlow comes out of Google and it is particularly good at things that Google really cares about,
in particular running things on lots and lots of GPUs.
One of the things you will hear a lot is that you can't do anything with deep-learning unless
you have shitloads of data and shitloads of GPUs.
That is totally, totally wrong, as you'll see throughout this course.
It is true that if you want to win ImageNet next year, you'll need lots and lots of GPUs
because you'll be competing for that last 0.1% against Google, against Microsoft.
[Time: 1.25 hour mark]
However, if you're trying to recognize 10 different skin lesions (like the folks I just
showed you were), they were the first people to try to do that with deep-learning and they
quadrupled the previous state-of-the-art using 1 GPU and a very small amount of data that
they had hand-collected.
So the reason you see a lot of stuff about a lot of GPUs and a lot of a data is because
it's part of the thing to try to make neural networks cool, rather than uncool.
Trying to make it exclusive rather than inclusive.
It's like unless you're us, you're not in the club.
And I really don't want you to go into that thing.
You will find again and again that it's not true.
As I've just shown you, with 7 lines of code, you can turn the state-of-the-art 20% error
rate into a 3% error rate, and it takes about 5 minutes to run on a single GPU which costs
90 cents per hour.
So I am not going to be talking much about TensorFlow in this course because it's still
very early, it's still very new.
It does some cool things, but not the kind of cool things that uncool people have access
to.
Theano, on the other hand, has been around quite a lot longer.
It's much easier to use.
It does not do multi-GPUs well, but it does everything else well.
If you build something in Keras and you get to a point where everything is great and you have
a 400% improvement in the state-of-the-art, and you want the extra 5% that comes from running
this on 8 GPUs, it's a simple configuration change to change the back-end to TensorFlow.
Specifically, I want to show you that configuration change.
For those of you that haven't used bash before, when you see tilde "~", that just means your
home directory.
In your home directory, there is a .keras folder and in there is a keras.json file,
this is the configuration file, and you'll see here "backend" : "theano".
If you change it to "backend" : "tensorflow" then TensorFlow will use all of your GPUs.
If you do this, I also recommend changing the "th" (in "image_dim_ordering" : "th")
to "tf".
We may talk about that in the next course; it's a pretty minor detail.
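Since keras.json is just JSON, the change can be sketched in a few lines of Python. This example works on an in-memory string holding only the two keys mentioned above, rather than on the real file in your home directory; a real keras.json may contain additional keys, and the backend names are written in lowercase.

```python
import json

# Only the two keys discussed above; a real keras.json may contain more.
config_text = '{"backend": "theano", "image_dim_ordering": "th"}'

config = json.loads(config_text)
config["backend"] = "tensorflow"        # switch the back-end
config["image_dim_ordering"] = "tf"     # and the matching dimension ordering

new_text = json.dumps(config, indent=4, sort_keys=True)
print(new_text)
```

Editing the file by hand in vim does the same thing; the point is only that both values are changed together.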
The other configuration file to be aware of is the .theanorc.
You'll find a lot of Unix-y things, somethingrc is how they name their configuration files,
or .somethingrc -- here it is .theanorc.
I want to point out that there's a really important line here which is "device=" this is either
"gpu" or "cpu".
If you're using a T2 instance, you'll find that the AMI we created has changed the "gpu"
to "cpu" and that's because the T2 instance does not support gpu.
So if you want to switch from gpu to cpu, just change the "g" to a "c", or "c" to a
"g".
Those are the two configuration pieces that you may need to know about.
For this class, you won't really need to know about those because everything's been set
up for you, but I like to show you what's going on behind the scenes.
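For the curious, here is a sketch of flipping the device setting programmatically. It assumes the .theanorc uses a [global] section with device and floatX options, and it works on an in-memory string rather than your real file, so treat the exact contents as an illustration.

```python
import configparser
import io

# Illustrative contents; your own .theanorc may have more options.
theanorc_text = "[global]\ndevice = gpu\nfloatX = float32\n"

config = configparser.ConfigParser()
config.optionxform = str  # keep option names such as floatX case-sensitive
config.read_string(theanorc_text)

# Switch from GPU to CPU, e.g. for an instance without a GPU.
config["global"]["device"] = "cpu"

buffer = io.StringIO()
config.write(buffer)
new_text = buffer.getvalue()
print(new_text)
```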
This warning says that cuDNN is too recent.
If you see any problems, try updating Theano or downgrading cuDNN to version 5.
I haven't found any problems, so you can ignore that warning.
It just means that we're using a more up-to-date version of cuDNN than the authors have tested.
So we create our VGG object [vgg = vgg16()].
In doing so, there's a whole bunch of stuff going on behind the scenes, we're going to
look at all of it.
By the end of the next lesson, you will understand every line of code in our vgg script.
For now, I would just point out that you can look at it, and inside it, you'll see there
are 100 lines of code - so it's not very big at all.
And we're going to understand all of it by the end of next class.
[Time: 1.30 hour mark]
For now, there's a pre-trained network called vgg16(), we now have a vgg object which gives
us access to that pre-trained network.
With deep-learning, we don't look at images one at a time, and we also don't look at
the whole dataset at a time.
We look at them a few at a time.
That group, the few that we look at, we call either a batch or a mini-batch.
A mini-batch is simply a few things (in this case images) grabbed at a time, with the
computation run on all of them at once.
The number of things we grab is the size of the mini-batch.
Why don't we do one at a time?
The reason that we don't do one at a time is because a GPU needs to do lots of things
at once to be useful.
It loves running on like thousands and thousands of things at the same time because it can
do all of them at the same time.
So a single image is not enough to keep a GPU busy, and it's slow.
Why not do all of it, the whole dataset at once?
First of all, your GPU only has a certain amount of memory.
Generally somewhere between about 2 GB and 12 GB, and your dataset is unlikely to fit in that
amount of memory.
And secondly, there's no need to do the whole lot, anything we want to do we can do with
small amounts at a time.
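A minimal sketch of the mini-batch idea, in plain Python: walk through a dataset a few items at a time. The function name iter_minibatches and the 16 file names are hypothetical stand-ins, not part of the course code.

```python
def iter_minibatches(items, batch_size):
    """Yield successive mini-batches (lists) of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-ins for 16 image files.
images = ["img_%02d.jpg" % i for i in range(16)]
batches = list(iter_minibatches(images, batch_size=4))
print(len(batches), "batches; first batch:", batches[0])
```

A larger batch_size gives the GPU more to do at once but needs more memory, which is the trade-off described above.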
So in this case, I'm just going to show you how we can look at the results of this vgg
model, and we're just going to do 4 at a time.
So there's a get_batches command, which basically says, in our VGG model, let's look inside
the path and grab 4 at a time.
So we're in the samples, and there are 16 images, so let's grab one batch.
We're going to grab 4 images and 4 labels.
Here are the 4 images, and here are the 4 labels.
You can see it's labeled [0,1] if it's a dog and it will be [1,0] if it's a cat.
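Those labels are what is usually called one-hot encoding: a 1 in the position of the true class and 0 everywhere else. Here is a small sketch, assuming the classes are ordered cats first and dogs second; the names CLASSES and one_hot are hypothetical.

```python
CLASSES = ["cats", "dogs"]  # assumed order: index 0 = cat, index 1 = dog


def one_hot(class_name, classes=CLASSES):
    """Return a list with 1 at the class's index and 0 elsewhere."""
    return [1 if name == class_name else 0 for name in classes]


# Hypothetical batch of four true classes.
labels = [one_hot(name) for name in ["dogs", "cats", "dogs", "dogs"]]
print(labels)
```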
Now that we've done that (so that's basically what our data looks like), we can call vgg.predict,
passing in the images.
So that's going to ignore the labels of what it actually is, it's going to use this pre-trained
model and it's going to tell us what it thinks the four things are.
In this case we run it and it thinks they are a Rottweiler, an Egyptian cat, a toy terrier
and a Rottweiler.
So you see it's clearly made a mistake here.
It's very rare that it makes a mistake, it must have been confused by all the stuff going
on in the background.
So it's also shown you for the toy terrier that it's only 24% sure that it's a toy terrier.
So you can see that it knows it's not sure.
Whereas for the Rottweiler, it's very sure it's a Rottweiler.
How come it's not so sure that it's an Egyptian cat?
That's because there's a lot of cats that look a bit like an Egyptian cat, it doesn't
quite know which one it is.
We could have a look at all those details to see which other ones it thought it could
be.
We'll be looking at that in the next lesson.
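To make the idea of "which other ones it thought it could be" concrete, here is a sketch that ranks classes by probability. The numbers and the short class list are made up for illustration (a real prediction has 1000 values), and top_k is a hypothetical helper, not part of the vgg script.

```python
def top_k(probabilities, class_names, k=3):
    """Return the k most probable (class_name, probability) pairs, best first."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical values for one image; only four of the 1000 classes are shown.
class_names = ["toy terrier", "Chihuahua", "miniature pinscher", "Egyptian cat"]
probabilities = [0.24, 0.21, 0.19, 0.05]

best_name, best_prob = top_k(probabilities, class_names, k=1)[0]
print(best_name, best_prob)
print(top_k(probabilities, class_names, k=3))
```

When the best probability is low and the runners-up are close, as in this made-up case, the model is telling you it is unsure.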
So the final thing I'm going to do is to take these probabilities and turn them into a dogs
vs cats model.
I'm going to do it quickly now, and then I'm going to revisit it in the start of the next
class.
[Time: 1.35 hour mark]
So to take those 1000 probabilities (we're just showing one probability from each but
there's actually 1000 probabilities), and turn them into a dog vs cat prediction, we basically
do exactly what we did before, do vgg.get_batches then we call finetune, vgg.finetune(batches).
What finetune's going to do is build a new model, and it's going to replace the 1000
categories with the 2 classes that it's found.
How does it know what the 2 classes are?
That's because we have directories called cats and dogs.
So the finetune command has now created a model that checks for cats and dogs.
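Here is a sketch of how classes can be discovered from directory names. It is not the actual finetune code: find_classes is a hypothetical helper, and the folder tree is a throwaway one built in a temporary directory, with classes assumed to be ordered alphabetically.

```python
import os
import tempfile


def find_classes(data_dir):
    """Return the sorted names of the sub-directories of data_dir, one per class."""
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )


# Build a throwaway tree shaped like a training folder with one directory per class.
with tempfile.TemporaryDirectory() as train_dir:
    for class_name in ("dogs", "cats"):
        os.makedirs(os.path.join(train_dir, class_name))
    classes = find_classes(train_dir)

print(classes, "->", len(classes), "output categories")
```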
Just creating the model is not enough, we have to actually run it, train it.
So if we then go vgg.fit(batches, val_batches, nb_epoch=1), it will then use that gradient
descent method, that back-propagation, that I talked about earlier and it will attempt
to make that model get better at determining cats vs dogs.
Now obviously doing it on just 6 data items is fast, but not very accurate.
I can run it a few times and you can see that the accuracy is getting higher and higher
each time, but the validation accuracy is not getting much higher, and that's because
I'm running it on the sample.
So if I ran it on the full dataset, it would take about 5 minutes to run, and you can try
it when you get home.
Give it a go and see what accuracy you get.
If you want to get the accuracy higher, just rerun this cell a bunch of times.
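As a rough illustration of why rerunning the training cell keeps improving the fit, here is a toy gradient descent loop. It is not the vgg code: it assumes a hypothetical one-weight model y = w * x and made-up data, and each call to run_epoch is one pass over the data in mini-batches.

```python
def run_epoch(w, data, lr=0.01, batch_size=2):
    """One pass over the data in mini-batches, fitting y = w * x by gradient descent."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the batch's mean squared error (w*x - y)^2 with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w


def mean_squared_error(w, data):
    """Average squared difference between the prediction w*x and the target y."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


# Made-up data following y = 2x, so the ideal weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w = 0.0
losses = []
for epoch in range(3):
    w = run_epoch(w, data)
    losses.append(mean_squared_error(w, data))
    print("epoch", epoch + 1, "weight", round(w, 3), "loss", round(losses[-1], 3))
```

Each epoch starts from the weight the previous one ended with, which is why running the cell again continues the training instead of starting over.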
That's the end of today's class.
This first class is kind of like the opening of a novel, when you introduce all the characters
and their back-stories.
A little bit less deep-learning goes on in the first class, a little bit more getting
set up.
Your first week, for many of you, is likely to be the most frustrating and challenging
week because many of you will find that you have some kind of configuration problem, or
you don't understand how some piece of stuff fits together.
Don't worry.
By the end of the 7 weeks, that stuff's going to be straightforward and all of the interesting
bits will be in the deep-learning.
The more time that you can put in this week, making sure that you get all the infrastructure
stuff working and are comfortable with what it is, the better.
Take a look at the things I've introduced today, look at the video, google all the stuff
you're not already familiar with.
Understand how it works.
Anything you're unclear about, ask your colleagues on the Slack Channel or on the Forums.
Teaching is the best way to learn, so go to the wiki and try to explain the things you've
learned.
Make sure that you can run the code that we've seen today, up to here.
For those of you that are pretty familiar with this already, make sure that you can
run this code on a different dataset.
And we'll talk about some different datasets that you can use tomorrow.
Any of you that want to go further, please let Rachel or me know; we have lots of ideas
for ways that you can extend this a long way.