FEMALE SPEAKER: Please join me in welcoming Mr. Kenneth Cukier. [APPLAUSE] KENNETH CUKIER: Thank you very much. You can probably appreciate the fact that I've got a lot of trepidation coming here to talk to you folks for the obvious reason that I'm wearing a suit. And the truth is I had a breakfast this morning at the Council on Foreign Relations to talk to them about the international implications and the foreign-policy implications of big data. That leads to the second trepidation and the context of my remarks. So the second trepidation is that this is a sort of homecoming for the book. Because my journey, so to speak, in the world of big data started at Google and started at the Googleplex in 2009. It was you folks who opened up the kimono to what you were doing in very small little slivers. I never got the full picture. But I was able to cobble it all together and see something and then give it a label to it. Luckily, there was a couple of labels that we were thinking of. And I reached for one that wasn't a popular term at the time, and the term was big data. And that was really helpful. It was the cover story of "The Economist" in February of 2010. It was called "The Data Deluge", because they thought they would sell it better than saying "big data." But big data-- it was basically all about that and about what you guys are doing. And so it brings me great fear to walk into a room, because you guys have been doing it for so long. And that brings me into the context of my conversation today. I want it to be a conversation. I was obviously just at the Council on Foreign Relations thinking about this in ways that I am sure your engineers never thought about it 10 years ago. I may have heard a snort. But here's the thing. Many of you were thinking of it as a technological issue when people around the world think of it in terms of the competitivity of nations. 
Our book, which is being released today in America, has already been available in China, where it's been a best-seller. And when we hear questions from Chinese journalists to us, they're all talking about the national project that they're on. Is this the way for us to leapfrog with the West? Is this one area of technology, unlike the internet and computing, where we can lead? So the implications of this are vast. And the implications are more than just technological. I'm at a technology company-- in fact, the pioneer, in many respects, of big data. But I want to explain that I'm here as a journalist, as someone who's looked in at your world and now can serve as a sort of a filter. And what I'd like to do is show you that world from a non-engineer's perspective, from someone who just is curious about the world and society and thinks deeply about these issues. Now there's a second disclosure I have to make, and that is not only am I talking about big data, but my presentation is big data. Because there's 70 slides. On top of it, I haven't actually really seen the slides except for once or twice, because they just arrived to my inbox this morning from someone who was putting it together for me. This is actually the recipe for disaster, so please have forbearance. I'm going to go really quickly, and I'm probably going to skip through a couple of these slides. So let me start with a story, and the story is the story of a company called Farecast. And it begins in the year 2003. A guy named Oren Etzioni at the University of Washington is on an airplane. And he asks people how much they paid for the seats. And it turns out, of course, for one person paid one fare, and one person paid another fare. But this made Oren Etzioni really, really upset. And the reason why is that he took the time to book his air ticket long in advance, figuring he was going to pay the least amount of money. Because that's the way the system worked. 
And then he realized actually that that wasn't the case. When he figured that out, he was really upset. And he figured, if only I could know what is the meaning behind airfare madness. How would I know if a price I'm being presented with at an online travel site is a good one or a bad one? And then he came up with the insight. Because he's like you-- he's a computer scientist-- he realized actually-- that's actually just an information problem. And I bet I can get the information. All I would need is one simple thing-- the flight price record of every single flight in commercial aviation in the United States for every single route, every flight, and to identify every seat, and to identify how long in advance the ticket was bought for the departure, and what price was paid, and just run it through a couple computers, and then make a prediction on whether the price is likely to rise or fall, and score my degree of confidence in the prediction. Pretty simple. So he scraped some data. And it works pretty well. And he runs a system. It's great. The academic paper that he writes is called "Hamlet-- To Buy or Not To Buy, That Is the Question." It works well, but then he realizes, hey, this works so well, I'm going to get more data. And he gets more data, until he has 20 billion flight-price records that he's crunching to make his prediction. And now it works really well. Now it's saving customers a lot of money. It gets a little bit of traction, and Microsoft comes knocking on the door. He's in Washington. He sells it for about $100 million-- not bad for a couple years' work, and a couple PhDs in computer science that were working with him. But behind this, the key thing is this. He took data that was generated for one purpose and reused it for another. 
When the Sabre database-- at the time probably the airline reservation system and one of the biggest, actually the biggest civilian computer project at its time when it was created in the '50s and '60s-- was created by American Airlines and IBM. They never imagined for a million years that the data of the passenger manifest was going to become the raw material for a new business, and a new source of value, and a new form of economic activity. And we're going to be creating markets with this data. And if you want to understand what big data is, at least from a person looking into it-- because Google's been doing big data for a long time. What we're seeing across society is what you folks have been doing for years. We're seeing that data is becoming a new raw material of business. It is the oil, if you will, of the information economy. There's a lot of data around in the world today. You know this. The arresting statistics are obvious. Whenever we put on a big new sky survey-- telescope for you and me-- goes online. Whenever it goes online, it usually ends up collecting as much data in the first night or two as in the history of astronomy prior to it going online. And obviously, the human genome, et cetera. You all know the data about big data, so I won't spend too much time there. But what we see behind big data are three features of society, or shifts in the way that we think about information in the world-- more, messy, and correlations. So more. We're going from an environment where we've always been information-starved-- we've never had enough information-- to one where we-- that's no longer the operative constraint. It's still a constraint. Of course, we never have all the information. What is information? Is it really the real thing? But what's clear is that instead of having to optimize our tools to presume that we can only have a small sliver of information, when that changes, we can get a lot more. And so what does more mean? Well, think of it as 23andMe. 
What they do is they actually take a sample of your DNA, and they look for very specific traits. Now that works well, but it's imperfect as well. That's one reason why it's only $100-- a couple hundred dollars. When Steve Jobs had cancer, he was one of the first individuals in the world to have his entire genome sequenced and his tumor sequenced as well. So he had personalized medicine, and it was individually tailored to the state of his health at that time. When one drug would work, they'd continue. When the cells mutated and blocked the drug from working, they routed around it and tried something else. They were able to do that because they had all of the data, not just some of the data. And that's one of the shifts that we're seeing from some to more. And in some cases, n equals all the data. We also have messy data. That's another feature as well. In the past, we had highly curated databases-- information that we optimized our tools to get in the most pristine way possible. And this was sensible. When there's only a small amount of information that you can bother collecting and processing, because the cost is so high and it's so cumbersome, you have to make sure the information you get is the best possible thing you can. But when you can avail yourself of orders and orders of magnitude more information, that constraint goes away. And suddenly, you can allow for a little bit of messiness. Now, it can't be completely wrong. But messiness is good. You folks are pioneers of this in machine translation. And you know the famous Peter Norvig, and Alon Halevy, and others' paper on the unreasonable effectiveness of data. The idea here is that machine translation worked actually-- was a real step up. When IBM tried it in around '56 with 20 Russian phrases and English phrases that they programmed the computer to translate, it looked impressive. It was ridiculous, of course. We now know. It's like a punch card. 
Then when IBM's project Candide came around in the '90s, actually that was not machine translation. That was statistical machine translation. That was really good, relatively speaking. What they did is they took the Canadian Hansard-- the parliamentary transcripts that were translated into both English and into French-- and they just let the computer make the inferences of when a word in French, and it would be a useful substitute for the one in English. They didn't try to presume what was right or what was wrong. They let the computer infer that itself and score the probability that one would be the right word or not in that particular context, and go forward. C-Change-- they tried to optimize it and make it better. Couldn't. Couldn't at a reasonable way. It just was-- it was a hard problem. Then Google came along. And you guys didn't avail yourself of just the parliamentary transcripts in French and English in Canada. You availed yourself of the World Wide Web. It wasn't 1994. It was 2006. You poured in. You got all of the European Union translations of all 21 languages. Your Google Books project became a signal for what was good and not because of the translations that you could find in the libraries. Now in many instances, the data was far less clean than in the past when we tried to do it with just a small amount of data. But the fact is more data beat clean data. Messiness was good. And the final point, which is obvious, is correlations. We have had a society in which we've always looked for causes behind things, and that made sense in a small-data world as well. In fact, causality is still very useful to know. But for a lot of the problems that we're dealing with these days, just knowing the correlation is good enough. And in fact, what we're finding is that often we think we see causality when we don't. And it's hard to do. So there's going to be cases where we actually still want to know the reasons why. 
But often, just knowing what is good enough, because we can learn the correlation and go with that. So a similar company like Farecast is Decide.com. This is Oren Etzioni's company again that basically looks across the web at all of the prices online, not just of airlines, but of anything that has lots of price data and high variability and just ranks it to say, is this a good price or not? And it leads to new markets. It leads to transparency, which is good for customers. More interestingly is what this means for human health. Premature babies, known as preemies. In the past, when we thought about health care, we would take the vital signs of someone maybe once or twice a day, couple more times if it was important enough. And a doctor would look at the clipboard at the edge of the bed and make a decision on what to do. Feedback loop was really, really long. Very, very imperfect. What we're now able to do-- some researchers in Canada are doing this, is they're looking at the real-time flow of 16 different streams of vital signs of premature babies. And they're able to score it and look for correlations with it. And when they do that, they find that they can spot the onset of infection 24 hours before overt symptoms appear. By doing that, that means that you can have an intervention sooner, see if the intervention's working better, react to it, and save lives. But you learn something else as well. You would have thought-- and you can imagine generations of doctors looking at the clipboard, seeing the kid's vitals stabilizing, and thinking it was safe to go home to supper, that things were OK, and we'll treat the patient tomorrow. Just nurse, call me if there's a problem. And then to get a frantic call at midnight saying something had gone horribly wrong. The fact is, what we're finding is that one of the best predictors that there is going to be an onset of an infection is that the baby's vital signs stabilize. Weird, right? Why? 
We don't actually really know why, what's happening biologically. It kind of seems like the kid's little organs are just battening down the hatches for a rough night ahead. We don't know why, but we know that with that correlation, we can do something better. We can save its life. And we didn't know that before big data. Behind this is we have data. Why do we have data? Well, we're collecting more data for things that we've always collected data on. Weather-- great. That's fantastic. But it also is because we're collecting things that were always informational but we never treated as data before, like you. So you're all sitting. You're sitting down. And you are sitting different than you, and you, and you, and you. You weigh different. Your legs are different. Your posture is different. The distribution of weight. And you know that if I have 20 sensors on your seat and on your seat back that I can probably score with a high degree of accuracy who you are based on the way you sit. Why is that useful? Well, for one purpose, you can imagine that this would be a great anti-theft tool in cars. Put this in, and suddenly you would know that the authorized driver of your Lamborghini is you, and it's not someone else. Or if you have children, likewise, hey, I told you you can't take the car out after 10:00 PM. And so the engine didn't work when you tried to sit down and turn the keys in the ignition. That's great. But what else can you do with it? Well, think about it. If everybody has their car seat instrumentized and you actually datafy posture, suddenly you would be able to, perhaps, identify the telltale signs of a shift in posture 30 seconds prior to an accident. The probability of you getting into an accident by a shift in your posture. Maybe what we've datafied is driver fatigue. And the service here would be the car would send an internal alarm. Maybe the steering wheel would vibrate, or there'd be a chime saying, hey, wake up. 
You have a high likelihood of getting into an accident right now. That's the sort of thing that is left to play for as we datafy society in a world of big data. So what we're seeing is lots of things being datafied as well. Facebook datafies our friendships; Twitter, our stray thoughts, our whispers; LinkedIn, our professional contacts. Google datafies our intentions. So obviously, Google Flu Trends is a wonderful way to have a predictor of what the likelihood of outbreaks of flu are. Now that's great. It's just you don't want to have to know causality. You don't know why. It just is what it is. Now you may recall that there was a little bit of a grumble in the scientific community recently when they said, the CDC this year said that flu was going to be right here-- CDC, the Centers for Disease Control. And Google Flu Trends like this, it didn't work this year. Bullshit. How do we know? Because CDC is reported data. The person came in. Maybe because of the economic crisis, people decided they had to show up at work and didn't go see a doctor. Maybe the Google Flu Trends is accurate, and that's what's real. And CDC is just reported data, not observed behavior, isn't as good. No one thought of that. Big data has been with us for a while. It turns out that there was an American commodore who had datafied all of the old log books inside of the dusty Navy trunks. And with that, he was able to create a whole new form of nautical map that told sailors not just where they were but the patterns of the winds. No one realized that the world, and the winds, and the waves conformed to natural patterns. If you will, that the sea had its own physical geography, and that if you allied yourself with those things, that you could have a safer voyage. And we can do that now. But the problem is it took him a decade and dozens of people to do it, and we do the same sort of thing in about one sixth of a second every day. 
So it's a democratization play of techniques that we have tried to do in the past. We've done sometimes. Obviously, censuses have been around since Jesus was born. But now we're actually doing it in a widespread way. Predictive maintenance is a good example of taking the same idea about premature babies and applying it to machines. When your car is about to break down, it doesn't go kaput all at once. Usually, you can feel it. There's a grumble, or it just doesn't drive right. Well, now what we can do is instrumentize it, see what the data signature of the heat and the vibration is, find out how it correlates with previous incidences of a break down, and know, perhaps, two days in advance that your fan belt is going to break. And that's happening today in fleets of cars by UPS, and it's going to be in your car tomorrow if it's not already there. The value of data is hidden. It's hidden not in the primary purpose for what it was collected for. But now with big-data techniques, it's often uncovered in its multiple secondary uses that are just limited by our imagination. So INRIX is a car company that takes the sat-nav system and makes a prediction on how long it's going to take from one place to another. Sounds great. It's a good service. Use it. It's also used by economists to understand the health of economies. Because they see how cars drive, and the frequency and propensity of cars, and the travel times as a proxy, an indicator, for the health of a local municipal economy. Hedge funds use this information to look at the car circulation in the areas near a retailer on the weekends. And so prior to the quarterly announcements, they have a good indicator whether the sales are going to increase or decrease, and they can short or go long on its shares. No one would have thought of that in the past, that we could do that sort of thing with information. Obviously, everything that we do-- all of our interactions-- give off lots of data exhaust. 
You folks are experienced with that, because you treat all of the interactions of an individual who goes to your website as a signal for something else. You've built your systems and optimized it based on that form of data exhaust, by treating information as a new raw material that you can recycle back into the system to improve it or to create a whole new system altogether. There is going to be winners and losers in this new world. There's three features that seem to be distinguishing who's going to do well. And that's the skills, the mindset, and the data. The skills are kind of obvious. It's the people who have technical knowledge, or it's the vendors who sell you stuff. That's great. The mindset, in some ways, seems to be more important. Because what you need is not just the skills. That gets commoditized first, obviously. History of computing suggested that. The first computer scientists in the '50s and '60s-- actually, not the scientists. But these were the doers, the software programmers. They looked like they were sitting pretty wearing the white lab coats. But by the '70s and '80s, man, just an ordinary rinky-dink software developer had been largely commoditized. And we're going to see that with big data as well. Today, some companies and people are at the high end. It's going to filter through as the PhD programs get forward. The mindset is going to be critical and the creativity. But ultimately, both of those things are going to go to the wayside. Because if you remember, Jeff Bezos had a great dot-com mindset really early in '94 or so, and he executed brilliantly as well. But by the 2000, every executive was thinking about the web. So we're going to have the same thing with big data. That advantage doesn't hold that long. The data, however-- who has access to the data is going to be critical. That's the resource. So weirdly, and ironically, what seems to be abundant today is actually the source of scarcity, and vice versa. 
Now in New York, we have a problem with overcrowded buildings. But before I tell you that story, let me see how much time we have, because I want to get questions as well. I think we're doing OK. So tenements, overcrowded buildings, and the problem of just stuffing 10 times as many occupants into a single dwelling as it was designed for. This is a bad thing, and it's a bad thing because it leads to crime. It leads to drugs. It leads to violence. And it leads to fires, and not just any kind of fires. Basically, these kind of fires kill the occupants. And they also end up injuring and killing at much greater rates the firefighters who go to help it out. So this is a serious problem for the city. The city gets something like 63,000 calls a year for complaints of illegal, overcrowded stuffing in buildings. And there's only 200 inspectors at City Hall. So there you see a problem. But the problem is actually a big-data problem. How can we solve it? Well, the first thing that they do is they take a list, a database of every single building in Manhattan and the five Boroughs, and that's 900,000, give or take. And then they look at everything as a signal, whether it's going to be a predictor that the thing is going to burst into flames, or that it's going to actually improve the model by predicting that it's not going to burst into flames. So they look at things like ambulance calls, utility cuts. Has there been a lien against the property? Are there complaints of rodent infestation? So the number of rats in the building is not datafied, but complaints to the city's 311 line is. So you find out the number of rodent complaints. And all together, you look at it. And you can score with a high degree that the building's going to burst into flame. They looked at weird stuff, from like the Department of Building Works on whether exterior brickwork had been recently done to the building. That improved the model too. 
Because if brickwork was done to the building, even if you had all these other problems that were high indicators of a fire, it went down. When they pressed go on the system, now what happens is an inspector, instead of going in-- and about 15% of the times they would make a visit. In the past, they would issue a vacate order, the stiffest sanction which basically says, everyone out in 24 hours. Before it was 15% of the time-- a high rate. So it tells you there's a big problem. Now it's 70% of the time. And so what that means is that they like it, the inspectors, because it's more effective. The fire department likes it, because their firemen aren't dying as much. And it's just good for all of us in our communities to see that people have good housing and that these buildings don't catch on fire. And that was because they turned the problem into a big data problem and solved it successfully with information. And they gave up trying to figure out the causality and just went with the correlation. There, of course, are serious issues of big data. One is going to be privacy. It's a problem now in the small-data era. It's going to be a bigger problem with big data. But there's going to be something on top of that. And that is not privacy, but propensity-- but prediction. The idea is that we're going to have algorithms predicting our likelihood to do things, our behavior. And it's going to be obvious that we're going to have governments try to-- or our businesses sanction us on the basis of that prediction. It's going to look a little bit like "Minority Report" and the idea of pre-crime. We're going to be denied a loan, because we're going to not have the likelihood to repay. But instead of this profiling universe in which we take us as a big clump-- and we have a small-data problem. Here are the 13 predictors, and this is the explicit rules by which we can tell you the formula. Imagine if we have 1,000 variables. 
It's a machine learning algorithm, and when we try to knock on the door in front of a court and say, I was denied surgery because you said I had a 90% mortality rate after five years with my individual data, I want you to disclose to me how you came to that decision because that seems unfair. They're going to say I don't know. I can show you the formula. I've frozen every instantiation of the data at every moment, because regulators required me to. If I printed out the formula, it would be on 600 pages. You need a PhD to understand it. It's true you have only 40 strong signals, but you've got a long tail of 600 weak signals that all went into it. I can't tell you why. And then the person's going to say, OK, I don't even care about that. What makes you think I'm not going to be part of the 10% that's going to live past 10 years? Why are you denying me this operation because you think I've got a high probability of not surviving longer after it? I want to take the test. And you can imagine with criminal-justice systems, it's going to be the exact same thing. This is the issue that we have. And so what do we have to preserve in this instance? Well, if it's the case of whether we think this fellow's going to shoplift in the next 12 months with a 99-degree percent probability, he can rightly say, I don't even care about the mumbo-jumbo of big data. I'm part of the 1% that's going to exercise free will, moral choice, and do the right thing. Now of course, all 100 of those individuals will say the same argument. But it does mean that it seems like we're going to have to create a new value in our world. Just as the printing press gave us the consciousness of free speech-- prior to the printing press, we didn't have a guarantee or the consciousness of free speech. When Socrates drank the hemlock for corrupting the youth of Athens, in his apologia did he make a free-speech claim? No. It didn't exist. 
It took the printing press to give us this idea that expression was something that needed to be protected. What will need to be protected in a world of big data? Well, maybe human volition, free will, responsibility. We have always had the risks of data. We're going to have to deal with this even more as we become more respectful of data and live with it in more parts of our lives. We've looked at data in lots of ways. America went to war over a statistic in a data point, and we saw the problems there. We're going to need regulators to think about how we can adopt this and get the most benefits of it. Probably one of the biggest is going to be giving us a degree of transparency. When there was an information explosion at the beginning of the 19th century and it was financial information, we created accountants to do the bookkeeping and auditors to do the surveillance function on top of the accountants. And I think that in the future-- and we mention this in the book-- that we're going to have to create a new professional class. And we might as well call them algorithmists, who are going to be trained in big-data techniques and actually serve internal to companies, as well as external in terms of an expert witness and a master to a court, that they can actually understand what's going on and serve as a translation function between the public interest and what's happening in the mathematics. For privacy, the shift is probably going to have to go from regulating the collection of data, as we do now in these preposterous screens of 60 lines of all-capped letters in which you just say, I agree with the terms of service and not read it, to something where we actually regulate use. And luckily, that seems to be an idea that is actually gathering steam. It's a lot harder for regulators, but it's definitely better for businesses. And it's definitely better for consumers. And of course, we're going to have to sanctify human volition. 
There is a role for antitrust regulators, as well. Antitrust turns out to be an extremely fertile and fungible public policy, because it's technology-neutral. It doesn't really make many presumptions about what it's regulating. What it does is it just looks at market concentration. So it looks like it's going to be a very useful tool with which we try to understand what to do and to create an open market. Now there's a problem in this that I'm laying out, which is regulators can understand what scale looks like when it's something tangible, like a widget. Actually, the antitrust came out of the railroad, so we'll say a car, a carriage. And then they applied it to telephones, and they called it common carriage. There's still a Common Carrier Bureau at the FCC. The carrier, if you will, was from the Interstate Commerce Act where the language was taken from. They then applied it to software markets-- Microsoft. What does scale mean in data? What does it mean when the data is doubling every three years or so? What does it mean when the market is changing form, that it's not the same market in five years as it was five years earlier? The businesses look different. It is going to be really difficult to do, but we're going to have to try to do that. Because we're going to need the assurances that we can have challengers as well as incumbents. We need both. This is about the way that we live. We're going to need to act with our humility and our humanity. Thank you very much. There's time for questions, so shout it out. There's microphones as well. Yeah, please. Go to the mike. Thanks. AUDIENCE: Hi, my name is Cynthia Elliott. I have a question. So I can see this data being very useful, like in let's say drafting of an athletic player. Have you ever encountered anything where colleges would want you to use this data to determine who could have the athletic ability to be drafted? 
KENNETH CUKIER: Yeah, well colleges and professional sports are using the data already right now quite a lot. The whole book and movie "Moneyball" was just about that. Partially about new statistics and new ways of examining it, but then partially just applying data to it. And Nate Silver, of course talks a lot about just trusting the data and just doing-- Nate Silver is not doing big data. He's just doing data. He's doing statistics, but the small difference is he's just listening to it. He's just doing it seriously and trusting it. So this is going to actually change lots of the ways that we evaluate people. So when we think about students and education, right now what a teacher does is it scores what every person in the class's grade and tells everyone, this person got a 95, and this person got an 85, and this person got a 75. The teacher doesn't actually look at what is the content of the-- or rarely, what is the content of what was corrected and what was wrong. What if that teacher was to find out that all of the students in a math exam got the exact same-- not all. But let's say 80% of the students got the exact same answer wrong with the same answer. Suddenly, he or she might say, hmm, I mistaught it. They inverted the algebraic equation thinking that it could be a or b and b or a. But in fact, the sequence matters, and I've got to go back and teach that. So not only does the student learn more, but the teacher learns as well. So in terms of drafting, sports is one of the first people to adopt these techniques. And it's actually changing how they think about their game. Certain players-- why would you have a defense for the opposing team? You want a defense for who the player is because of his propensity to score a shot-- if it's basketball-- from one part of the court versus another. If that player on the left side always misses the basket there, let him to take a shot. I'll take it on the rebound and then pass it up. 
Versus oh, don't let this guy get here into the key. Then we're really in trouble. AUDIENCE: With regulation, I think it's a very different situation. Because in the previous antitrust things, it's been actually pro-consumer to limit the amount of data people have. There's a reason people go to Amazon, because they've got more data and better data than anybody else. So in fact if you use antitrust against them, you may in fact make life worse for the consumers rather than better. KENNETH CUKIER: That is absolutely true. Now the question where this is going to take us-- and we need to have a societal conversation about it-- is whose data is it? Whose rights to the data should it be? Does the individual own the data because it's his or her click stream? That sounds logical. But of course, they decided to go to that website, and that website invested in collecting the data and analyzing it. Should they be required to hand it over, particularly if they're going to optimize-- let's take Amazon-- their own algorithms so that they can make great recommendations? Why would they want to be able to give that to the customer so they can hand it off to Barnes and Noble? You're enriching your competitor. That sounds almost like eminent domain. That sounds like a governmental taking. We don't know how to answer these questions. But the point here is that what rights does the individual have to his or her data? Should it be transparent? Should there be data portability? For telephone numbers, we had to create number portability. And that seems to be a very good way to get carriers to actually love us rather than to lock us in. Do we need the same thing in the world of big data? I think that your public-policy people should be thinking about it. They probably already are. And we need to-- and not coming up with answers, but starting the discussion and having the debate. Please. AUDIENCE: One of the three things which you mentioned as important is mindset. 
So do you have some framework or principles that one can follow and improve on that? Because that's one thing which is not as commoditized as with [INAUDIBLE]. KENNETH CUKIER: Yeah, no. I don't have any-- there's no simple list or there's no recommendations I would put forward. Because this is sort of about the spark of creativity. It's a little bit da Vinci-like. So I think what is required is just for very creative individuals to look at what's going on around them and breathe the hurly-burly of humanity and see the filament of it. The whole point of Google was PageRank. But that was just the algorithm. The genius was to understand that every single interaction with the content gave you another signal to improve the search result. And that was, if you will, all you need is one good idea in life. And if you really go full-throttle with it-- and it's a good idea-- it's limitless. So like Oren Etzioni just had this idea that I can take something that nobody knows the answer to. But the answer exists. It's hidden in plain sight. I can get the data. And if I do the right thing with the data, I can actually transform it and get the insight that we need and create new forms of value. There is no simple way to develop that sort of mindset. A lot of it is luck. There are lots of people that I know of prior to Oren Etzioni who had that idea. There was a company called Strong Numbers in Boston about 1998 by Jeff Hyatt. He wanted to be the Blue Book for everything. He thought about all the things that you could do with data. He was a little bit too early. Dot-com bubble burst, and so did his dreams. Went on to build other companies and do very well for himself. But it just shows that there's a lot of factors involved. AUDIENCE: Hi. So my first question was actually already answered by you in terms of who owns the data. But more specifically with the American Express example. So do you have to purchase that data from-- not American Express. American Airlines. 
Did he have to purchase the data, or was he able to gain access to the data? KENNETH CUKIER: Great question. So the point about American Airlines was that was the airline carrier who built the original airline computerized reservation network called Sabre. So when Oren Etzioni wanted to get the data, he wasn't going to go to Sabre. And the reason why is Sabre is probably the biggest airline reservation network, and they have no incentive to sell it. Because that's just not what they do. So he had to go to one of the start-ups, one of sort of the hungrier people that were the challengers in there to get the data. And he found one called ITA Software. OK, you see where this story's going to end. This is great. So he goes to ITA Software and says, will you do it? And man, they've got a problem on their hands. Because on one hand, they need the data from the airlines, and this is going to really screw over the airlines. On the other hand, Oren's going to pay them a little bit of money and give them a commission. So they don't know what to do. So actually, I interviewed-- and they're just a bunch-- are these airline executives? No. These are a bunch of MIT PhDs and stats who did it because they thought it was a really complex problem and a lot of fun. So I interviewed one of the co-founders. And I said, well, what did you do? And he said, well, the truth is we actually kind of came up with that idea ourselves independently before Oren did it. And we did it internally just for fun, but we could never release it as a product. Because we just knew it was going to really harm us. We'd never be able to get the data from people. But this was a way that we could license the data in an arm's-length way and still get a couple guineas for it. And he had the data. So it shows you that there's these competitive interests. So what happened to ITA Software? They were acquired. For how much? Between $700 million and $800 million? By whom? By Google. So why? 
Well, you guys can answer that. I know that when I do my searches, I see the airline listings in, and that's a great feature. But I'm sure you're playing a stronger hand and a lot more long-term one. The regulators walked in. It was essentially one of the first real, substantial antitrust remedies against Google on this. And what it was was sort of a must-license provision with a reporting requirement going back to the companies, and essentially to the FTC. For a period of a couple years-- maybe two or three, maybe five years-- they couldn't actually cut off the license that they had with people like Kayak, et cetera, because they were afraid you were going to become the world's biggest travel agent and dominate everyone. But the point here is this. Think about the sums. Oren Etzioni's Farecast, $100 million. ITA Software, $700 million. The difference here is that the algorithm and the skills are really good, and the service is really good. But he who has the data-- that's the gold. AUDIENCE: Do you see these societal shifts, particularly in America? I think big data and individualism is a really interesting area to think about and whether this changes the way that we think about an individual's role in society. The examples you were giving of a criminal who is 99% likely to recommit versus 1% likely to rebuild his life and be a productive member of society. Hedging for the benefit of society is to put all 100 back into jail and leave them there. How does this impact how we think about individual benefit versus societal benefit? And does this also play into future governmental shifts? KENNETH CUKIER: OK. So in the case of a criminal-justice system, let's have that debate. Because I think it's not an easy one, and anyone who thinks it is doesn't understand the problem. If I can tell you with a 99% degree of accuracy that that man is going to commit a violent crime, I would be remiss not to intervene. 
And it would look like I'm almost anti-science if I said, you know, I just think that 1%-- we got to give them the benefit of the doubt. You just never know. It just doesn't look right. So on the other hand, this is one of the most heinous affronts to the dignity of the individual that we could ever conceive. And we don't have any experience thinking through this issue. It is so essential that we figure it out, that we have the debate, that the debate starts now. Now what about data for individuals? What does that mean? Well, in Athens if you were a male, you served in the military. If you didn't want to serve in the military, leave. Right now, we think that one of the most precious things we have is data about our bodies, our health care, our privacy. Let's change the debate. Let's change the argument entirely. Let's invert the burden of proof. Let's just say that if you're a citizen of a country, you have to share your data on your health care in a global commons so that researchers can learn from it and treat everyone's health better. You don't have to do that. Leave. It might sound draconian, but the fact is do you have a property right, or some sort of moral right to your data? Well, I don't to my image if I'm walking down the street. And we do know that I can learn a lot from the data. And we also know from stats that if we allow some people to back out for whatever reason, it really becomes very imperfect. So suddenly, I think that we should change the debate. And I think the most obvious one would be health care. But often, when you look at these issues-- as you've pointed out in big data-- these are new issues that we have for us. We've had a bit of sloppy thinking about it lately, because we haven't had to deal with it. And because the whole issue of big data has been absconded with by the technology vendors as the latest flavor of chocolate ice cream. But now, let's calm down, and let's think about it. The debate should start now. 
AUDIENCE: My question is kind of a follow-up. So let's assume that all data is public. You don't have any data barons. How does that influence the game? And specifically, humans change their behavior, and how does having data that's correct as of yesterday based on what I can infer-- I'm sorry-- all of us in [INAUDIBLE] can infer and predict what all of us are going to do tomorrow if we are reacting to this data? KENNETH CUKIER: Yeah. So yeah, there's a great circularity that the data is going to be making predictions. We're going to learn about these predictions, and we're going to change our behavior based on those predictions ever thus. Right? This is just going to be the reality that we're going to live with. So in a way, the data will always be fallible. You'll never have the perfect prediction, because you can always-- this individual will learn that the algorithms nailed me for shoplifting before I've even gone into the store. I'm not now going to go into the store, and so the prediction was wrong. Now if we arrest him, the burden of proof is gone. Because we can never actually validate that the prediction was going to be accurate, because we never allowed him to commit the crime. This recursivity-- this weird pernicious circle in which we're constantly reacting to the algorithm, and thereby changing the prediction that we're making-- is going to be a feature of life. And you can imagine that this is another conversation we need to have, another thing we have to think through. Yeah. Good. Thank you very much. It's been a delight.
Viktor Mayer-Schonberger and Kenneth Cukier, "BIG DATA: A Revolution That Will Transform..." Posted by richardwang on 2014/04/02.