  • FEMALE SPEAKER: Please join me in

  • welcoming Mr. Kenneth Cukier.

  • [APPLAUSE]

  • KENNETH CUKIER: Thank you very much.

  • You can probably appreciate the fact that I've got a lot

  • of trepidation coming here to talk to you folks for the

  • obvious reason that I'm wearing a suit.

  • And the truth is I had a breakfast this morning at the

  • Council on Foreign Relations to talk to them about the

  • international implications and the foreign-policy

  • implications of big data.

  • That leads to the second trepidation and the context of

  • my remarks.

  • So the second trepidation is that this is a sort of

  • homecoming for the book.

  • Because my journey, so to speak, in the world of big

  • data started at Google and started at the

  • Googleplex in 2009.

  • It was you folks who opened up the kimono to what you were

  • doing in very small little slivers.

  • I never got the full picture.

  • But I was able to cobble it all together and see something

  • and then give a label to it.

  • Luckily, there was a couple of labels that we

  • were thinking of.

  • And I reached for one that wasn't a popular term at the

  • time, and the term was big data.

  • And that was really helpful.

  • It was the cover story of "The Economist"

  • in February of 2010.

  • It was called "The Data Deluge", because they thought

  • they would sell it better than saying "big

  • data." But big data--

  • it was basically all about that and about what

  • you guys are doing.

  • And so it brings me great fear to walk into a room, because

  • you guys have been doing it for so long.

  • And that brings me into the context of my

  • conversation today.

  • I want it to be a conversation.

  • I was obviously just at the Council on Foreign Relations

  • thinking about this in ways that I am sure your engineers

  • never thought about it 10 years ago.

  • I may have heard a snort.

  • But here's the thing.

  • Many of you were thinking of it as a technological issue

  • when people around the world think of it in terms of the

  • competitiveness of nations.

  • Our book, which is being released today in America, has

  • already been available in China, where it's been a

  • best-seller.

  • And when we hear questions from Chinese journalists to

  • us, they're all talking about the national project that

  • they're on.

  • Is this the way for us to leapfrog the West?

  • Is this one area of technology, unlike the

  • internet and computing, where we can lead?

  • So the implications of this are vast.

  • And the implications are more than just technological.

  • I'm at a technology company-- in fact, the pioneer, in many

  • respects, of big data.

  • But I want to explain that I'm here as a journalist, as

  • someone who's looked in at your world and now can serve

  • as a sort of a filter.

  • And what I'd like to do is show you that world from a

  • non-engineer's perspective, from someone who just is

  • curious about the world and society and thinks deeply

  • about these issues.

  • Now there's a second disclosure I have to make, and

  • that is not only am I talking about big data, but my

  • presentation is big data.

  • Because there's 70 slides.

  • On top of it, I haven't actually really seen the

  • slides except for once or twice, because they just

  • arrived to my inbox this morning from someone who was

  • putting it together for me.

  • This is actually the recipe for disaster, so please have

  • forbearance.

  • I'm going to go really quickly, and I'm probably

  • going to skip through a couple of these slides.

  • So let me start with a story, and the story is the story of

  • a company called Farecast.

  • And it begins in the year 2003.

  • A guy named Oren Etzioni at the University of Washington

  • is on an airplane.

  • And he asks people how much they paid for the seats.

  • And it turns out, of course, one person paid one fare,

  • and one person paid another fare.

  • But this made Oren Etzioni really, really upset.

  • And the reason why is that he took the time to book his air

  • ticket long in advance, figuring he was going to pay

  • the least amount of money.

  • Because that's the way the system worked.

  • And then he realized actually that that wasn't the case.

  • When he figured that out, he was really upset.

  • And he figured, if only I could know what the meaning is

  • behind airfare madness.

  • How would I know if a price I'm being presented with at an

  • online travel site is a good one or a bad one?

  • And then he came up with the insight.

  • Because he's like you-- he's a computer scientist--

  • he realized actually--

  • that's actually just an information problem.

  • And I bet I can get the information.

  • All I would need is one simple thing--

  • the flight price record of every single flight in

  • commercial aviation in the United States for every single

  • route, every flight, and to identify every seat, and to

  • identify how long in advance the ticket was bought for the

  • departure, and what price was paid, and just run it through

  • a couple computers, and then make a prediction on whether

  • the price is likely to rise or fall, and score my degree of

  • confidence in the prediction.

  • Pretty simple.
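
The prediction step described above can be illustrated with a minimal sketch. It is not Farecast's or the Hamlet paper's actual method, which the talk does not detail. It assumes a simple counting heuristic: for past flights on a route, check whether a cheaper fare appeared closer to departure, and report the majority outcome together with its share as a confidence score. The names `FareObservation` and `predict_direction` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class FareObservation:
    flight_id: str    # hypothetical identifier for one past departure
    days_before: int  # days between the quote and the departure
    price: float      # fare quoted on that day


def predict_direction(
    history: Iterable[FareObservation], days_before: int
) -> Optional[Tuple[str, float]]:
    """Predict whether a fare quoted `days_before` departure will fall.

    Returns ("fall", confidence) if most past flights later offered a
    cheaper fare, otherwise ("rise", confidence). Here "rise" is shorthand
    for "no cheaper fare appeared later". Confidence is the share of past
    flights that agree with the prediction. Returns None without usable data.
    """
    # Group quotes by flight: flight_id -> {days_before: price}
    by_flight: Dict[str, Dict[int, float]] = {}
    for obs in history:
        by_flight.setdefault(obs.flight_id, {})[obs.days_before] = obs.price

    falls = 0
    rises = 0
    for prices in by_flight.values():
        if days_before not in prices:
            continue  # no quote at the horizon we are asking about
        later = [p for d, p in prices.items() if d < days_before]
        if not later:
            continue  # nothing observed closer to departure
        if min(later) < prices[days_before]:
            falls += 1
        else:
            rises += 1

    total = falls + rises
    if total == 0:
        return None
    if falls > rises:
        return "fall", falls / total
    return "rise", rises / total
```

Calling `predict_direction(history, 30)` on a list of past quotes returns a direction and a confidence between 0.5 and 1.0. As the talk goes on to say, more historical records are what make such an estimate more reliable.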

  • So he scraped some data.

  • And it works pretty well.

  • And he runs a system.

  • It's great.

  • The academic paper that he writes is called "Hamlet--

  • To Buy or Not To Buy, That Is the Question." It works well,

  • but then he realizes, hey, this works so well, I'm going

  • to get more data.

  • And he gets more data, until he has 20 billion flight-price

  • records that he's crunching to make his prediction.

  • And now it works really well.

  • Now it's saving customers a lot of money.

  • It gets a little bit of traction, and Microsoft comes

  • knocking on the door.

  • He's in Washington.

  • He sells it for about $100 million--

  • not bad for a couple years' work, and a couple PhDs in

  • computer science who were working with him.

  • But behind this, the key thing is this.

  • He took data that was generated for one purpose and

  • reused it for another.

  • When the Sabre database--

  • at the time probably the airline reservation system and

  • one of the biggest, actually the biggest civilian computer

  • project at its time when it was created in

  • the '50s and '60s--

  • was created by American Airlines and IBM,

  • they never imagined in a million years that the data of

  • the passenger manifest was going to become the raw

  • material for a new business, and a new source of value, and

  • a new form of economic activity.

  • And we're going to be creating markets with this data.

  • And if you want to understand what big data is, at least

  • from a person looking into it--

  • because Google's been doing big data for a long time.

  • What we're seeing across society is what you folks have

  • been doing for years.

  • We're seeing that data is becoming a new

  • raw material of business.

  • It is the oil, if you will, of the information economy.

  • There's a lot of data around in the world today.

  • You know this.

  • The arresting statistics are obvious.

  • Whenever we put on a big new sky survey--

  • telescope for you and me-- goes online.

  • Whenever it goes online, it usually ends up collecting as

  • much data in the first night or two as in the history of

  • astronomy prior to it going online.

  • And obviously, the human genome, et cetera.

  • You all know the data about big data, so I won't spend too

  • much time there.

  • But what we see behind big data are three features of

  • society, or shifts in the way that we think about

  • information in the world--

  • more, messy, and correlations.

  • So more.

  • We're going from an environment where we've always

  • been information-starved--

  • we've never had enough information--

  • to one where we-- that's no longer the operative

  • constraint.

  • It's still a constraint.

  • Of course, we never have all the information.

  • What is information?

  • Is it really the real thing?

  • But what's clear is that instead of having to optimize

  • our tools to presume that we can only have a small sliver

  • of information, when that changes, we

  • can get a lot more.

  • And so what does more mean?

  • Well, think of it as 23andMe.

  • What they do is they actually take a sample of your DNA, and

  • they look for very specific traits.

  • Now that works well, but it's imperfect as well.

  • That's one reason why it's only $100--

  • a couple hundred dollars.

  • When Steve Jobs had cancer, he was one of the first

  • individuals in the world to have his entire genome

  • sequenced and his tumor sequenced as well.

  • So he had personalized medicine, and it was

  • individually tailored to the state of his

  • health at that time.

  • When one drug would work, they'd continue.

  • When the cells mutated and blocked the drug from working,

  • they routed around it and tried something else.

  • They were able to do that because they had all of the

  • data, not just some of the data.

  • And that's one of the shifts that we're seeing

  • from some to more.

  • And in some cases, n equals all the data.

  • We also have messy data.

  • That's another feature as well.

  • In the past, we had highly curated databases--

  • information that we optimized our tools to get in the most

  • pristine way possible.

  • And this was sensible.

  • When there's only a small amount of information that you

  • can bother collecting and processing, because the cost

  • is so high and it's so cumbersome, you have to make

  • sure the information you get is the best

  • possible thing you can.

  • But when you can avail yourself of orders and orders

  • of magnitude more information, that constraint goes away.

  • And suddenly, you can allow for a little bit of messiness.

  • Now, it can't be completely wrong.

  • But messiness is good.

  • You folks are pioneers of this in machine translation.

  • And you know the famous Peter Norvig, and Alon Halevy, and

  • others' paper on the unreasonable

  • effectiveness of data.

  • The idea here is that machine translation worked actually--

  • was a real step up.

  • When IBM tried it in around '56 with 20 Russian phrases

  • and English phrases that they programmed the computer to

  • translate, it looked impressive.

  • It was ridiculous, of course.

  • We now know.

  • It's like a punch card.

  • Then when IBM's project Candide came around in the

  • '90s, actually that was not machine translation.

  • That was statistical machine translation.

  • That was really good, relatively speaking.

  • What they did is they took the Canadian Hansard--

  • the parliamentary transcripts that were translated into both

  • English and into French--

  • and they just let the computer make the inferences of when a

  • word in French would be a useful substitute for the

  • one in English.
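
The idea of letting the computer infer word substitutes from aligned text can be sketched in a few lines. This is an illustration only, not a reconstruction of Candide: it assumes a tiny sentence-aligned corpus and uses a simplified form of the expectation-maximization procedure known as IBM Model 1 (with no NULL word). The function names `train_ibm_model1` and `best_translation` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (french_tokens, english_tokens)


def train_ibm_model1(
    pairs: List[SentencePair], iterations: int = 10
) -> Dict[Tuple[str, str], float]:
    """Estimate t[(e, f)] = P(english word e | french word f) by EM."""
    english_vocab = {e for _, es in pairs for e in es}
    # Start with no preference: every English word is equally likely.
    t: Dict[Tuple[str, str], float] = defaultdict(
        lambda: 1.0 / len(english_vocab)
    )

    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: share each English word among the French words
        # in the same sentence pair, in proportion to current t.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalize the expected counts per French word.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]

    return dict(t)


def best_translation(
    t: Dict[Tuple[str, str], float], french_word: str
) -> Tuple[str, float]:
    """Return the most probable English word for `french_word` and its score."""
    candidates = {e: p for (e, f), p in t.items() if f == french_word}
    best = max(candidates, key=candidates.get)
    return best, candidates[best]
```

No dictionary or grammar rule is supplied here. The only input is which sentences are translations of each other, and the probabilities come from co-occurrence across pairs. With three toy pairs such as "la maison / the house", "la fleur / the flower", and "une fleur / a flower", the procedure should come to prefer "house" for "maison", because "the" is better explained by "la".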

  • They didn't try to presume what was

  • right or what was wrong.

  • They let the computer infer that itself and score the

  • probability that one would be the right word or not in that