SANDEEP PARIKH: Welcome, everybody. Thanks for coming. I know the day's been running a little bit long, so I will hopefully not keep you too long. Welcome to the session between you and happy hour. Hopefully, we'll get you out of here just in time to get some drinks and some snacks. My name is Sandeep Parikh. I run the cloud solutions architecture team for Americas East for Google Cloud Platform, and today, I want to talk to you about building a collaborative data platform.

Effectively, you've got teams of individuals spread across your company who have to work together in some meaningful way, and you need them to share data, talk to each other, and basically build something together. So, how do you enable that? We're going to walk through some of the ways we tell customers, developers, and partners how to do this in the real world.

First, we're going to meet our hypothetical team of individuals, and we're going to learn about the tasks they have to do and the tools they typically use. Then, we'll map the tools they use to the tools available in Google Cloud, so you can start to see how they might get a consistent set of tools and work together. Then, we're going to talk about how we enable collaboration. This is how we do things like set up teams for self-service and easy data discovery. I'm probably going to trivialize a handful of details, but stay with me, it'll make sense as we get through it. Then, I want to cover what the new workflow is like once you get all this stuff up and running, and finally give you a couple of tips on how to get started. Does that make sense to everybody? Head nods? OK, good.

All right, so let's meet the team. First, we've got four individuals: a data scientist, a data engineer, a business analyst, and an app developer. Ultimately, they all have three things in common. They use different data sets, those data sets are all in different formats, and they all use different tools, right? So, there's not a ton of overlap here, and that's kind of the challenge.

So, let's imagine for a second we've got this team, and their job is to build a recommendation engine for a retail site. What are the kinds of things each person on the team is going to need in order to do their job? The first thing we'll cover is the data, and then we'll talk about the tools.

From a data perspective, Alice, the data scientist, needs cleansed, normalized data. Not raw log files or raw rows from a database, but data that actually makes sense. That could be, for example, purchase history, product metadata, click streams, and product reviews: all the stuff it would take to actually craft and model a recommendation problem.

Then, we've got Bob, the data engineer. He has to take all of the raw data and turn it into something useful. So, he's probably going to need things like log data, product transactions or purchase history, and product metadata as well. This is all the stuff coming in straight off the application, which he's got to turn into something useful for other parts of the organization.

Then, there's Ned. Ned is the business analyst, and he needs a lot of aggregated data, things he can use to generate statistics and understanding about how the business is performing. Not just rows or log files, but that next level up from the basics.

And then, finally, we have Jan, the app developer. She needs users and product predictions, or recommendations. She needs to be able to build new microservices that expose things like this recommendation engine to the end user. That can mean a recommendations API, access to wish lists, and some knowledge of similar users or similar products and how things complement each other.

So, you can see there's a little bit of overlap here, but not a ton. They all need something different, and they kind of need it in a slightly different format. Now, I tried to draw a sequence diagram here, and it got a little complicated, so I simplified it a bit. But, ultimately, Alice and Bob have to agree and talk to each other. Bob and Ned have to agree and talk to each other. Ned and Alice have to talk to each other, and Jan and Alice have to talk to each other. There's a lot of communication happening. And, listen, I love talking to my coworkers just as much as the next person, but if you're spending all of your time trying to explain to them what you need, when you need it, where it came from, and what it should look like, it becomes challenging. It slows down their ability to make progress and work quickly.

Ultimately, this is the problem we see a lot of folks having out in the world. They each have to find data sets. They each have to figure out who owns them. They have to figure out how things are related to each other, and they have to manage all that by themselves, which, again, takes time away from their ultimate task. Just by a show of hands, does this sound familiar to anybody? Oh, that's a scary number, actually. Maybe I shouldn't have asked that question. No, but this is great, right? We have to find and understand the problem, and then try to solve it together.

All right, so let's talk about tools. That's the first step. We need to understand what everyone's working with in order to make the next jump. Alice, the data scientist, is a huge fan of Python, like myself. It's like her Swiss army knife; she uses it for just about everything. She typically does a lot of her work in IPython or Jupyter notebooks, pulls data down to her workstation, slices and dices it, and uses a lot of pretty common data science frameworks, things like NumPy, SciPy, scikit-learn, and Pandas, because she's used to working with data frames. That's a typical approach from a data science perspective.
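Just to make that concrete, here's a rough sketch of the kind of notebook cell Alice might write. This isn't from the talk; the file name, columns, and model choice are made up purely for illustration.

```python
# Hypothetical notebook cell: file name, columns, and model are illustrative only.
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Load a cleansed purchase-history extract: one row per (user_id, product_id, quantity).
purchases = pd.read_csv("purchase_history.csv")

# Reshape into a user x product matrix.
matrix = purchases.pivot_table(
    index="user_id", columns="product_id",
    values="quantity", aggfunc="sum", fill_value=0,
)

# A simple item-similarity model: nearest neighbors over product columns.
model = NearestNeighbors(metric="cosine", n_neighbors=5)
model.fit(matrix.T)  # transpose so each row is a product
distances, similar_products = model.kneighbors(matrix.T)
```

Nothing fancy, but it shows the pattern: pull a cleansed data set down, reshape it with Pandas, and hand it to scikit-learn.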
Bob is a huge fan of Java. He does lots of big ETL pipeline work, a lot of orchestrating data from one format to another, or pulling multiple sources of data together into something that looks cohesive on the other end. So, he's typically using tools like MapReduce or Spark. In some cases, he might be using Apache Beam, which you'll actually hear more about over the next couple of days as well.
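As a rough illustration of that kind of ETL work (not anything Bob actually showed), a minimal Beam pipeline that turns raw log lines into structured records might look something like this. The file paths and log format are hypothetical, and it uses Beam's Python SDK rather than Java just to keep the examples in one language.

```python
# Sketch of an ETL pipeline; the paths and log format are hypothetical.
import json
import apache_beam as beam

def parse_log_line(line):
    # Assume each raw line looks like "timestamp,user_id,product_id,action".
    timestamp, user_id, product_id, action = line.strip().split(",")
    return {"timestamp": timestamp, "user_id": user_id,
            "product_id": product_id, "action": action}

with beam.Pipeline() as pipeline:
    (pipeline
     | "ReadRawLogs" >> beam.io.ReadFromText("raw_clickstream.log")
     | "ParseLines" >> beam.Map(parse_log_line)
     | "KeepPurchases" >> beam.Filter(lambda rec: rec["action"] == "purchase")
     | "ToJson" >> beam.Map(json.dumps)
     | "WriteCleansedData" >> beam.io.WriteToText("cleansed_purchases"))
```

The same read, transform, write shape is what Bob would be expressing in Java with MapReduce or Spark; the specifics of the pipeline are less important than the fact that his output becomes everyone else's input.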
Ned is a spreadsheet guy. He loves SQL and he loves spreadsheets. What he likes to do is build custom reports and dashboards. It's all very SQL-driven, so he's got to have some kind of access to a data warehouse where he can do those big, cube-rotation style aggregations. Basically, he's got to be able to prove to the business how products are performing, and why something like a recommendation engine is even necessary. So, it's his responsibility to facilitate that.

And then, we've got Jan, the app developer. Jan's a typical polyglot app developer. She likes writing microservices and delivering features very quickly. She may have a language preference one way or the other, but she has to maintain a lot of the existing infrastructure.
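To round out the picture, here's a purely hypothetical sketch of the sort of microservice Jan might stand up once the predictions exist. The route, port, and lookup function are invented for illustration, and it's in Python only to match the other examples.

```python
# Hypothetical recommendations microservice; route, port, and lookup are invented.
from flask import Flask, jsonify

app = Flask(__name__)

def lookup_recommendations(user_id):
    # Placeholder: in practice this would read whatever store holds
    # the precomputed recommendations for this user.
    return [{"product_id": "p-123", "score": 0.97}]

@app.route("/users/<user_id>/recommendations")
def recommendations(user_id):
    # Serve the precomputed recommendations for a user as JSON.
    return jsonify({"user_id": user_id,
                    "recommendations": lookup_recommendations(user_id)})

if __name__ == "__main__":
    app.run(port=8080)
```

The point isn't the framework; it's that Jan can only ship something like this if the predictions she listed earlier already exist somewhere she can read them, which is exactly where the rest of the team comes in.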