Web Crawler - CS101 - Udacity - VoiceTube: Learn English through videos!

[Sebastian Thrun] So what's your take on how to build a search engine,
you've built one before, right?
[Sergey Brin - Co-Founder, Google] Yes. I think the most important thing
if you're going to build a search engine
is to have a really good corpus to start out with.
In our case we used the World Wide Web, which at the time was certainly smaller than it is today.
But it was also very new and exciting.
There were all sorts of unexpected things there.
[David Evans] So the goal for the first three units for the course is to build that corpus.
And we want to build the corpus for our search engine
by crawling the web and that's what a web crawler does.
What a web crawler is, it's a program that collects content from the web.
If you think of a web page that you see in your browser, you have a page like this.
And we'll use the udacity site as an example web page.
It has lots of content, it has some images, it has some text.
All of this comes into your browser when you request the page.
The important thing that it has is links.
And what a link is, is something that goes to another page.
So we have a link to the frequently asked questions,
we have a link to the CS 101 page.
There's some other links on the page.
And that link may show in your browser with an underline,
it may not, depending on how your browser is set.
But the important thing that it does,
is it's a pointer to some other web page.
And those other web pages may also have links
so we have another link on this page.
Maybe it's to my name, you can follow to my home page.
And all the pages that we can find with our web crawler
are found by following the links.
So it won't necessarily find every page on the web.
If we start with a good seed page
we'll find lots of pages, though.
And what the crawler's gonna do is start with one page,
find all the links on that page, follow them to find other pages
and then on those other pages it will follow the links on those pages
to find other pages and there will be lots more links on those pages.
And eventually we'll have a collection of lots of pages on the web.
So that's what we want to do to build a web crawler.
We want to find some way to start from one seed page,
extract the links on that page,
follow those links to other pages,
then collect the links on those other pages,
follow them, collect all that.
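The crawl loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real pages, and the names `crawl` and `get_links` are made up for this example, not taken from the course.

```python
from collections import deque

# Hypothetical toy "web": each page name maps to the pages it links to.
TOY_WEB = {
    "seed": ["faq", "cs101"],
    "faq": ["cs101", "home"],
    "cs101": ["seed"],
    "home": [],
    "orphan": ["seed"],  # no page links here, so the crawler never finds it
}


def crawl(seed, get_links):
    """Start from one seed page and follow links until no new pages remain."""
    to_crawl = deque([seed])   # pages found but not yet visited
    seen = {seed}              # every page we have ever queued
    crawled = []               # pages in the order we visited them
    while to_crawl:
        page = to_crawl.popleft()
        crawled.append(page)
        for link in get_links(page):
            if link not in seen:   # skip pages we already know about
                seen.add(link)
                to_crawl.append(link)
    return crawled


pages = crawl("seed", lambda page: TOY_WEB.get(page, []))
print(pages)
```

The `orphan` entry shows why a crawler will not necessarily find every page: only pages reachable by links from the seed are collected.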
So that sounds like a lot to do.
We're not going to do all that in this first class.
What we're going to do this first unit, is just extract a link.
So we're going to start with a bunch of text.
It's going to have a link in it with a URL.
What we want to find is that URL,
so we can request the next page.
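One way to picture this step is a small function that finds the first URL in a piece of text. This is a minimal sketch, assuming links are written as `<a href="...">` with double quotes; the function name `get_first_link` and the sample text are hypothetical.

```python
def get_first_link(page):
    """Return the first URL found in an <a href="..."> tag, or None."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None  # no link tag in this text
    start_quote = page.find('"', start_link)
    if start_quote == -1:
        return None  # tag has no opening quote
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None  # tag has no closing quote
    return page[start_quote + 1:end_quote]


sample = 'Some text <a href="http://example.com/faq">FAQ</a> and more text.'
first_url = get_first_link(sample)
print(first_url)
```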
The goal for the second unit
is to be able to keep going.
If there are many links on one page, you will want to be able to find them all.
So that's what we'll do in unit 2,
is to figure out how to keep going to extract all those links.
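As a rough sketch of what "keep going" could look like, the loop below repeats the search from where the previous link ended. It assumes the same double-quoted `<a href="...">` form; `get_all_links` and the sample text are hypothetical names for illustration only.

```python
def get_all_links(page):
    """Collect every URL that appears in an <a href="..."> tag, in order."""
    marker = '<a href="'
    links = []
    position = 0
    while True:
        start_link = page.find(marker, position)
        if start_link == -1:
            break  # no more link tags
        start_url = start_link + len(marker)
        end_quote = page.find('"', start_url)
        if end_quote == -1:
            break  # malformed tag: stop rather than guess
        links.append(page[start_url:end_quote])
        position = end_quote + 1  # continue searching after this link
    return links


sample = ('<a href="http://example.com/faq">FAQ</a> text '
          '<a href="http://example.com/cs101">CS 101</a>')
all_urls = get_all_links(sample)
print(all_urls)
```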
In unit three, well, we want to go beyond just one page.
So by the end of unit two we can print out all the links on one page.
For unit 3 we want to collect all those links, so we can keep going,
and end up with our crawler following them to collect many, many pages.
So by the end of unit three we'll have built a web crawler.
We'll have a way of building our corpus.
Then the remaining three units will look at how to actually respond to queries.
So in unit four we'll figure out how to give a good response.
So if you search for a keyword, you want to get a response that's a list of the pages
where that keyword appears.
And we'll figure out in unit five a way to do that that scales if we have a large corpus.
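To make the idea of responding to a keyword concrete, here is a minimal sketch of an index that maps each word to the list of pages where it appears. The corpus, the names `build_index` and `lookup`, and the choice of a Python dictionary are all assumptions for illustration; the transcript does not say which data structure the course uses.

```python
def build_index(corpus):
    """Map each word to the list of page URLs whose text contains it."""
    index = {}
    for url, text in corpus.items():
        for word in text.split():
            urls = index.setdefault(word, [])
            if url not in urls:  # record each page once per word
                urls.append(url)
    return index


def lookup(index, keyword):
    """Return the pages where the keyword appears (empty list if none)."""
    return index.get(keyword, [])


# Hypothetical two-page corpus: URL -> page text.
corpus = {
    "http://example.com/a": "learn to build a search engine",
    "http://example.com/b": "a web crawler collects pages for a search engine",
}
index = build_index(corpus)
print(lookup(index, "search"))
print(lookup(index, "crawler"))
```

A dictionary lookup does not need to scan every page for each query, which is one reason a structure like this may cope better as the corpus grows.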
And then in unit six what we want to do is, well, we don't just want to find a list,
we want to find the best one.
So we'll figure out how to rank all the pages where that keyword appears.
So we're getting a little ahead of ourselves now,
because all we're going to do for unit one,
is to figure out how to extract a link from the page.
And the search engine that we'll build at the end of this
will be a functional search engine.
It will have the main components that a search engine like Google has.
It certainly won't be as powerful as Google is;
we want to keep things simple.
We want to have a small amount of code to write.
And we should remember that our real goal
is not as much to build a search engine,
but to use the goal of building a search engine as a vehicle
for learning about computer science
and learning about programming
so the things we learn by doing this
will allow us to solve lots and lots of other problems.