LEE FLEMING: Good evening. I am really pleased to welcome you all to "Leaders in Big Data" hosted by Google and the Fung Institute of Engineering Leadership at UC Berkeley. I'm Lee Fleming. I'm director of the Institute and this is Ikhlaq Sidhu, chief scientist and co-founder. The first and most important thing is to thank Google for hosting the event. So thank you very, very much. There's a couple people in particular, Irena Coffman and Gail Hernandez-- thank you-- and also Arnav Anant, our entrepreneur in residence at the Fung Institute. So here's Arnav. AUDIENCE: A lot of work. LEE FLEMING: Huge amount of work. The Fung Institute-- we were founded about two years ago. And the intent is to do research and pedagogical development in topics of engineering leadership. We have our degree, the Master's of Engineering-- professional Master's of Engineering M. Eng. program-- mainly around the Institute. We also have ties though across the campus, as you'll see shortly. This is our intent to have a series of talks on topics of interest to engineering leaders. As it turns out, this Wednesday we have our next talk. It's sponsored by [? Thai ?] and the Fung Institute. And the topic is entrepreneurship-- being an entrepreneur within your firm. And fittingly, we have representatives from Google, and Cisco, and SAP. That's Wednesday. Consult the Fung website or the [? Thai ?] website for details on that. So besides enjoying a good discussion tonight, we have an ulterior motive, as you can probably tell. We're trying to advertise all of our fantastic programs in big data at Cal. Now, whether you're interested in computation, or inference, or application, or some combination of those things, we've got the right program for you. As I mentioned, the professional Master's of Engineering, or M. Eng., across all the different engineering departments-- one-year degree.
We have another one-year degree in the stats department-- a professional degree. There's a two-year degree in the Information School. And finally, there's the Haas MBA. Tonight we've got people from all these programs. You can find their tables, ask them questions, and hopefully we'll see you at Cal soon. And we also have additional executive and other programs associated with each of those departments and schools as well. Ikhlaq will now introduce our speakers. IKHLAQ SIDHU: OK, thanks. So let me see. LEE FLEMING: Just slide this here. IKHLAQ SIDHU: All right. Welcome, I want to also thank a couple of people. One is [? Claus Nickoli ?], who is not here at the moment, but to you in the ether, he's just not at the meeting. But he's our host here, and so thank you. You guys can tell him that I thanked him. And also, many of you I've seen here are basically friends, and so thanks for coming. It's good to see you again. This is an event on big data. And so I'm going to give you a little data on who is speaking today-- who is here. And the way I think of this is, what we've got is three perspectives of big data from leading firms-- from people who represent leading firms in the area. And so let's start with NetApp. We've got Gustav Horn. He is a senior consulting engineer with 25 years of experience. And he's built some of the largest enterprise-class Hadoop systems in the world-- on the planet. And from Google, Theodore Vassilakis, and he's a principal engineer at Google. He's head of the team that works on data analytics. And he's been responsible for numerous contributions to Google in terms [? about ?] search, and the visualization and representation of the results. And from VMware, Charles Fan, who's senior VP of strategic R&D. He co-founded Rainfinity and was CTO of the company prior to its acquisition by EMC in 2005. And our distinguished set of speakers is moderated by our distinguished moderator, Hal Varian. He is chief economist here at Google.
He's an emeritus professor at UC Berkeley and the founding dean of the School of Information. So with that, there's hardly anything more I could possibly say. Come on up Hal and take it away. HAL VARIAN: Thank you. I'm very impressed with the turnout tonight, seeing as you're missing both the debate and the baseball game. But at least it eliminates a difficult choice for many people. I will say that I'm going to follow the same rules as the presidential debates. So no kicking, biting, scratching, or bean balls are allowed during this performance. We're going to talk about foreign policy, wasn't that the agreement? No. All right. In any event, what I thought we'd do is, we'd have each person talk for about five minutes, lay out their theme, where they're coming from, what their perspective is on big data. And I will take some notes, and then ask some questions, get a conversation going. And I think we'll have a little time at the end for some questions from the floor. So, take it away. THEO VASSILAKIS: Sure. So, should I start, Hal? HAL VARIAN: Yes. THEO VASSILAKIS: All right. Well, hey it's a real pleasure to be here. Thank you guys also, and thank you guys for coming. It's a huge, huge audience. Just a couple of words. As you heard, my name is Theo. I lead some of our analytical systems. So I'm responsible-- well, actually up until two weeks ago, I was responsible for a stack that had parallel data warehousing components, query engines, pieces like Dremel and Tenzing, systems that let you query this data, and visualization layers on top. And that's one of the many, many systems at Google that I think, outside, one would think of as big-data type of systems. And so I'll try to give you my perspective at least on the Google view of big data. And hopefully someone will cut me off when it's time. I think I'll probably go for five minutes. This could take a while. AUDIENCE: [INAUDIBLE] THEO VASSILAKIS: All right, sounds good. Thank you.
I think, as you guys know, Google's business is primarily about taking data and organizing the world's information, and making it universally accessible and useful. So a lot of what the company does is really about sucking in data-- whether it be the web, whether it be the imagery from Street View, or satellite imagery, or maps information, or Android pings, or you name it. And then transforming it into usable forms. So really, Google is kind of a big data machine in some sense. And I think the term big data came into currency relatively recently. And we all said, yeah, OK, that speaks to what we do. Because we don't really have a word for it. We just kind of knew that the data was large. But just to try to put maybe more structure onto that, I think the Google view on a lot of "what is big data processing" kind of splits up into probably what I would call ingestion type of processes-- things like the crawlers, things like all those Street View cars running through all the streets of the world. And then goes into transaction processing systems, where perhaps we capture data through interactions on a lot of our web properties, or a lot of the web properties that we partner with. This means people clicking on search, or people interacting with docs, or people interacting with maps. All generate many, many clicks and many, many interactions that then become transactional big data. Of course, that also includes people using let's say Google Analytics on their sites to measure traffic on their properties, which then generates huge volumes of pings into Google-- many tens of thousands of QPS of pings. So that's kind of the second big component. And then probably the third component is the processing side of all of that. The process side includes things like MapReduce, analysis, generating insights from that data-- maybe in the form of building machine learning models.
Maybe in the form of building, for example, Zeitgeist top queries that can then be served out to the world to say, hey here is what people are searching for. Maybe in the form of n-grams of all the books that Google scanned over many, many years of its ingestion processes. But it's really baking all of that information and then presenting it in some usable form, either through a system such as our ad system that takes models and decides what ads to show, or in a more direct form such as the n-grams. Just to say, OK, here are those three broad classes-- ingestion, transaction processing, and analytical processing. To dig a little bit deeper into each of those areas, I would say the ingestion processes, especially the very large scale ingestion processes, are highly custom systems. If you think about our web crawlers, if you think about the Street View cars, if you think about maps stitching, or satellite imagery stitching-- those are very, very custom processes that I think, at least to this date, don't have a clear analog in the general industry. And maybe this is something that you guys might address or might see differently than how I see the version. They're still highly-specialized systems that produce very large images. And they're very high performance, very complex systems that are run by dedicated engineering teams. The transaction processing systems or the storage systems are things like the Google File System. These are things like Bigtable. These are things like Megastore. Those are the ones that we've actually published papers about and that are now reasonably well known in the industry-- have evolved a little bit past the purely custom stage, where they're fairly general purpose. And there was a time at Google where actually most people did their own storage in some form or another, until these GFS-like systems evolved to the point where they were good enough that more than one team could use them.
And actually, that evolution had many steps in which, for example, everybody ran their own GFS. And so maybe the ads team had their own GFS cells, and the search team maybe had their own GFS cells. And in time, the systems matured to the point where actually we could have a centrally-managed file system. And I think recently you may have seen, we've now talked about this global file system called Spanner which takes that to yet another level of transactions and global availability. And then the third step, which is I think still in a relatively immature stage compared to some of the storage systems, is the analysis. And I think a lot of people know about MapReduce and some of the systems that have been built on top of that. So for example, Flume is the way of chaining MapReduces in a more programmer-friendly way so that you don't end up with 50 MapReduce stages that are individually managed. But rather, you end up with one program that can then be pushed down into many MapReduces that are automatically managed. The process there is still very engineering focused and essentially requires engineering teams to process this large data.
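[Editor's illustration: the chaining idea described above for Flume, where one program is pushed down into several MapReduce passes, can be sketched in a few lines. This is a toy, in-memory sketch only. The `map_reduce` function, the `Pipeline` class, and the word-count stages are hypothetical names made up for this example; they are not Flume's API, and a real system would also distribute the work and optimize the plan rather than simply run the stages in sequence.]

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce pass: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted keys keep the output deterministic
        output.extend(reducer(key, groups[key]))
    return output


class Pipeline:
    """Hypothetical helper: collects stages first, then runs them in order."""

    def __init__(self, source):
        self._source = list(source)
        self._stages = []

    def stage(self, mapper, reducer):
        # Nothing runs here; the stage is only recorded.
        self._stages.append((mapper, reducer))
        return self  # returning self allows chained calls

    def run(self):
        # Each recorded stage becomes one MapReduce pass over the prior output.
        data = self._source
        for mapper, reducer in self._stages:
            data = map_reduce(data, mapper, reducer)
        return data


# Stage 1: count how often each word occurs.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, ones):
    yield word, sum(ones)


# Stage 2: group the words by their count.
def key_by_count(pair):
    word, count = pair
    yield count, word


def collect_words(count, words):
    yield count, sorted(words)


lines = ["big data is big", "data about data"]
result = (
    Pipeline(lines)
    .stage(split_words, sum_counts)
    .stage(key_by_count, collect_words)
    .run()
)
print(result)
```

[The point of the sketch is the shape of the program: the two stages are declared together as one pipeline, and the helper decides how to execute them, so the author does not manage each MapReduce pass by hand.]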