  • MATT CUTTS: Hi, everybody.

  • We got a really interesting and very expansive question

  • from RobertvH in Munich.

  • RobertvH wants to know--

  • Hi Matt, could you please explain how Google's ranking

  • and website evaluation process works starting with the

  • crawling and analysis of a site, crawling timelines,

  • frequencies, priorities, indexing and filtering

  • processes within the databases, et cetera?

  • OK.

  • So that's basically just like, tell me

  • everything about Google.

  • Right?

  • That's a really expansive question.

  • It covers a lot of different ground.

  • And in fact, I have given orientation lectures to

  • engineers when they come in.

  • And I can talk for an hour about all those different

  • topics, and even talk for an hour about a very small subset

  • of those topics.

  • So let me talk for a while and see how much of a feel I can

  • give you for how the Google infrastructure works, how it

  • all fits together, how our crawling and indexing and

  • serving pipeline works.

  • Let's dive right in.

  • So there's three things that you really want to do well if

  • you want to be the world's best search engine.

  • You want to crawl the web comprehensively and deeply.

  • You want to index those pages.

  • And then you want to rank or serve those pages and return

  • the most relevant ones first.

  • Crawling is actually more difficult

  • than you might think.

  • Whenever Google started, whenever I joined back in

  • 2000, we didn't manage to crawl the web for something

  • like three or four months.

  • And we had to have a war room.

  • But a good way to think about the mental model is we

  • basically take PageRank as the primary determinant.

  • And the more PageRank you have-- that is, the more

  • people who link to you and the more reputable those people

  • are-- the more likely it is we're going to discover your

  • page relatively early in the crawl.

  • In fact, you could imagine crawling in strict PageRank

  • order, and you'd get the CNNs of the world and The New York

  • Times of the world and really very high PageRank sites.
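As a rough illustration of the "crawl in strict PageRank order" idea, here is a minimal sketch that always visits the highest-scored known page next. The link graph, the `.example` hostnames, and the scores are invented for the example, and the function name is hypothetical; this is not Google's crawler, just the mental model expressed with a priority queue.

```python
import heapq

def crawl_in_priority_order(seeds, links, score, limit):
    """Visit pages highest-score-first, discovering new pages through links."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score.get(target, 0.0), target))
    return order

# Hypothetical toy web: the scores are made up, not real PageRank values.
toy_links = {
    "news.example": ["blog.example", "paper.example"],
    "paper.example": ["tiny.example"],
    "blog.example": ["tiny.example"],
}
toy_scores = {
    "news.example": 0.9,
    "paper.example": 0.7,
    "blog.example": 0.2,
    "tiny.example": 0.1,
}
crawl_order = crawl_in_priority_order(["news.example"], toy_links, toy_scores, limit=10)
print(crawl_order)
```

Pages with more reputation come out of the queue earlier, which matches the point that well-linked sites get discovered relatively early in the crawl.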

  • And if you think about how things used to be, we used to

  • crawl for 30 days.

  • So we'd crawl for several weeks.

  • And then we would index for about a week.

  • And then we would push that data out.

  • And that would take about a week.

  • And so that was what the Google dance was.

  • Sometimes you'd hit one data center that had old data.

  • And sometimes you'd hit a data center that had new data.

  • Now there's various interesting tricks

  • that you can do.

  • For example, after you've crawled for 30 days, you can

  • imagine recrawling the high PageRank guys so you can see

  • if there's anything new or important that's hit on the

  • CNN home page.

  • But for the most part, this is not fantastic.

  • Right?

  • Because if you're trying to crawl the web and it takes you

  • 30 days, you're going to be out-of-date.

  • So eventually, in 2003, I believe, we switched as part

  • of an update called Update Fritz to crawling a fairly

  • interesting, significant chunk of the web every day.

  • And so if you imagine breaking the web into a certain number

  • of segments, you could imagine crawling that part of the web

  • and refreshing it every night.

  • And so at any given point, your main base index would

  • only be so out of date.

  • Because then you'd loop back around and you'd refresh that.

  • And that works very, very well.

  • Instead of waiting for everything to finish, you're

  • incrementally updating your index.
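The segment-by-segment refresh described above can be sketched as a small simulation. It assumes the web is split into a fixed number of equal segments and exactly one segment is recrawled per day; the function names and URLs are hypothetical, and real scheduling is certainly more involved.

```python
def split_into_segments(urls, num_segments):
    """Deal URLs round-robin into num_segments buckets."""
    segments = [[] for _ in range(num_segments)]
    for i, url in enumerate(sorted(urls)):
        segments[i % num_segments].append(url)
    return segments

def simulate_refresh(urls, num_segments, days):
    """Recrawl one segment per day; return the worst staleness seen, in days."""
    segments = split_into_segments(urls, num_segments)
    last_crawled = {}
    worst_staleness = 0
    for day in range(days):
        for url in segments[day % num_segments]:
            last_crawled[url] = day
        # Only measure once every segment has been crawled at least once.
        if day >= num_segments - 1:
            stalest_today = max(day - crawled for crawled in last_crawled.values())
            worst_staleness = max(worst_staleness, stalest_today)
    return worst_staleness

toy_urls = ["a.example", "b.example", "c.example", "d.example", "e.example", "f.example"]
worst = simulate_refresh(toy_urls, num_segments=3, days=10)
print(worst)
```

With three segments, no page in the index is ever more than two days old once the first loop completes, which is the "only so out of date" property of an incrementally updated index.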

  • And we've gotten even better over time.

  • So at this point, we can get very, very fresh.

  • Any time we see updates, we can usually

  • find them very quickly.

  • And in the old days, you would have not just a main or a base

  • index, but you could have what were called supplemental

  • results, or the supplemental index.

  • And that was something that we wouldn't crawl and refresh

  • quite as often.

  • But it was a lot more documents.

  • And so you could almost imagine having really fresh

  • content, a layer of our main index, and then more documents

  • that are not refreshed quite as often, but there's a lot

  • more of them.

  • So that's just a little bit about the crawl and how to

  • crawl comprehensively.

  • What you do then is you pass things around.

  • And you basically say, OK, I have crawled a large fraction

  • of the web.

  • And within that web you have, for example, one document.

  • And indexing is basically taking things in word order.

  • Well, let's just work through an example.

  • Suppose you say Katy Perry.

  • In a document, Katy Perry appears right

  • next to each other.

  • But what you want in an index is which documents does the

  • word Katy appear in, and which documents does the word

  • Perry appear in?

  • So you might say Katy appears in documents 1, and 2, and 89,

  • and 555, and 789.

  • And Perry might appear in documents number 2, and 8, and

  • 73, and 555, and 1,000.

  • And so the whole process of doing the index is reversing,

  • so that instead of having the documents in word order, you

  • have the words, and they have it in document order.

  • So it's, OK, these are all the documents that a

  • word appears in.
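The reversal described above, from documents in word order to words in document order, is usually called an inverted index. Here is a minimal sketch that reproduces the Katy and Perry posting lists from the example; the document texts are invented so that the document numbers line up, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so a word repeated within one document is listed once.
        for word in set(documents[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)

# Invented texts, chosen to match the document numbers in the example.
toy_documents = {
    1: "Katy writes code",
    2: "Katy Perry tour dates",
    8: "Perry Mason reruns",
    73: "fresh perry cider",
    89: "Katy Texas weather",
    555: "new album from Katy Perry",
    789: "hello Katy",
    1000: "Commodore Perry biography",
}
toy_index = build_inverted_index(toy_documents)
print(toy_index["katy"])   # documents containing "katy"
print(toy_index["perry"])  # documents containing "perry"
```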

  • Now when someone comes to Google and they type in Katy

  • Perry, you want to say, OK, what documents might match

  • Katy Perry?

  • Well, document one has Katy, but it doesn't have Perry.

  • So it's out.

  • Document number two has both Katy and Perry, so that's a

  • possibility.

  • Document eight has Perry but not Katy.

  • 89 and 73 are out because they don't have the right

  • combination of words.

  • 555 has both Katy and Perry.

  • And then these two [789 and 1,000] are also out.
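The walk through the two lists above amounts to intersecting two sorted posting lists. A minimal sketch, using the document numbers from the example and a standard two-pointer merge (one common way to do this, not necessarily what Google does):

```python
def intersect_postings(a, b):
    """Return document IDs present in both sorted posting lists."""
    i = j = 0
    both = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            both.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return both

katy_docs = [1, 2, 89, 555, 789]
perry_docs = [2, 8, 73, 555, 1000]
candidates = intersect_postings(katy_docs, perry_docs)
print(candidates)  # [2, 555]
```

Only documents 2 and 555 survive, matching the result worked out by hand.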

  • And so when someone comes to Google and they type in

  • Chicken Little, Britney Spears, Matt Cutts, Katy

  • Perry, whatever it is, we find the documents that we believe

  • have those words, either on the page or maybe in backlinks,

  • in anchor text pointing to that document.

  • Once you've done what's called document selection, you try to

  • figure out, how should you rank those?

  • And that's really tricky.

  • We use PageRank as well as over 200 other factors in our

  • rankings to try to say, OK, maybe this document is really

  • authoritative.

  • It has a lot of reputation because it has

  • a lot of PageRank.

  • But it only has the word Perry once.

  • And it just happens to have the word Katy somewhere else

  • on the page.

  • Whereas here is a document that has the word Katy and

  • Perry right next to each other, so there's proximity.

  • And it's got a lot of reputation.

  • It's got a lot of links pointing to it.

  • So we try to balance that off.

  • You want to find reputable documents that are also about

  • what the user typed in.

  • And that's kind of the secret sauce, trying to figure out a

  • way to combine those 200 different ranking signals in

  • order to find the most relevant document.
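To make the balancing act concrete, here is a toy scorer that combines just two of the signals mentioned: reputation and how close the query words sit to each other. The weights, the formula, and the reputation numbers are all assumptions made up for illustration; the actual combination of 200-plus signals is not described in the talk.

```python
def min_distance(words, term_a, term_b):
    """Smallest gap in word positions between the two terms, or None."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def combined_score(text, reputation, term_a, term_b,
                   w_reputation=0.5, w_proximity=0.5):
    """Hypothetical linear blend of reputation and term proximity."""
    gap = min_distance(text.lower().split(), term_a, term_b)
    if gap is None:
        return 0.0  # a document missing either term is not a match
    proximity = 1.0 / max(gap, 1)  # adjacent terms score 1.0
    return w_reputation * reputation + w_proximity * proximity

# High reputation, but the two words are far apart.
score_far = combined_score(
    "perry hardware store opens and katy cuts the ribbon", 0.9, "katy", "perry")
# Slightly lower reputation, but the words are adjacent.
score_near = combined_score("katy perry announces tour", 0.8, "katy", "perry")
print(score_far, score_near)
```

Under these made-up weights the second document wins: it is nearly as reputable and is much more clearly about the query.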

  • So at any given time, hundreds of millions of times a day,

  • someone comes to Google.

  • We try to find the closest data center to them.

  • They type in something like Katy Perry.

  • We send that query out to hundreds of different machines

  • all at once, which look through their little tiny

  • fraction of the web that we've indexed.

  • And we find, OK, these are the documents that

  • we think best match.

  • All those machines return their matches.

  • And we say, OK, what's the creme de la creme?

  • What's the needle in the haystack?

  • What's the best page that matches this query across our

  • entire index?
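The fan-out just described is often called scatter-gather: send the query to every shard at once, let each return its best local matches, then merge. Below is a minimal sketch using threads to stand in for machines. The shard contents, scores, and function names are invented, and each "shard" is just a small dictionary.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, k):
    """Return the top-k (score, doc_id) pairs in one shard matching all terms."""
    terms = query.lower().split()
    hits = [
        (score, doc_id)
        for doc_id, (text, score) in shard.items()
        if all(term in text.split() for term in terms)
    ]
    return heapq.nlargest(k, hits)

def scatter_gather(shards, query, k=3):
    """Query every shard in parallel, then keep the global top-k doc IDs."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, query, k), shards)
        all_hits = [hit for partial in partials for hit in partial]
    return [doc_id for _, doc_id in heapq.nlargest(k, all_hits)]

# Hypothetical shards: doc_id -> (text, precomputed score).
toy_shards = [
    {2: ("katy perry tour", 0.9), 1: ("katy writes code", 0.5)},
    {555: ("katy perry album", 0.7), 8: ("perry mason", 0.6)},
    {1000: ("commodore perry", 0.4)},
]
best = scatter_gather(toy_shards, "katy perry")
print(best)  # [2, 555]
```

Each shard only looks at its own small slice, so the total time is roughly that of the slowest shard plus the merge, rather than the sum over all shards.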

  • And then we take that page and we try to show it with a

  • useful snippet.

  • So you show the keywords in the context of the document.

  • And you get it all back in under half a second.
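The snippet step can be pictured as cutting a small window of text around the first place a query word appears. This is only a sketch of that idea under simple assumptions (whitespace tokenization, a fixed window size, a hypothetical `make_snippet` helper); real snippet generation is not described in the talk.

```python
def make_snippet(text, query, window=2):
    """Return a few words of context around the first query-term match."""
    words = text.split()
    terms = {term.lower() for term in query.split()}
    hits = [i for i, w in enumerate(words) if w.lower().strip(".,!?") in terms]
    if not hits:
        # No match in the text: fall back to the opening words.
        return " ".join(words[: 2 * window + 1])
    start = max(hits[0] - window, 0)
    end = min(hits[0] + window + 1, len(words))
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(words[start:end]) + suffix

page_text = "The pop singer Katy Perry announced a new world tour on Monday"
snippet = make_snippet(page_text, "katy perry")
print(snippet)  # ... pop singer Katy Perry announced ...
```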

  • So that's probably about as long as we can go on without

  • straining YouTube.

  • But that just gives you a little bit of a feel about how

  • the crawling system works, how we index documents, how things

  • get returned in under half a second through that massive

  • parallelization.

  • I hope that helps.

  • And if you want to know more, there's a whole bunch of

  • articles and academic papers about Google, and PageRank,

  • and how Google works.

  • But you can also apply to--

  • there's jobs@google.com, I think, or google.com/jobs, if

  • you're interested in learning a lot more about how search

  • engines work.

  • OK.

  • Thanks very much.
