  • SANDEEP PARIKH: Welcome, everybody.

  • Thanks for coming.

  • I know the day's been running a little bit long,

  • so I will hopefully not keep you too long.

  • Welcome to the session between you and happy hour.

  • Hopefully, we'll get you out of here just in time

  • to get some drinks and get some snacks.

  • My name is Sandeep Parikh.

  • I run the cloud solutions architecture

  • team for Americas East, for Google Cloud Platform,

  • and today, I want to talk to you about building

  • a collaborative data platform.

  • Effectively, you've got teams of individuals spread

  • across your company, that have to work together

  • in some meaningful way, and you need

  • them to share data, talk to each other,

  • basically build something together.

  • So, how do you enable that?

  • So, we're going to walk through some of the ways

  • that we tell a lot of customers, and developers, and partners

  • how to do that stuff in the real world.

  • First, we're going to meet our hypothetical team

  • of individuals, and we're going to learn

  • about the tasks they have to do, and the tools that they

  • typically use.

  • And then, we'll jump into mapping

  • the tools they use to the tools available in Google Cloud,

  • so you can start to see how they might

  • get a consistent set of tools and work together.

  • And then, we're going to talk about how

  • we enable collaboration.

  • This is how we do things, like set up teams

  • to do self-service and easy data discovery.

  • I'm probably going to trivialize a handful of details,

  • but stay with me, it'll make sense as we get through it.

  • And then, I want to cover what the new workflow's

  • like once you get all this stuff up and running,

  • and then give you a couple little things

  • and tips on how to get started.

  • Does that make sense to everybody?

  • Head nods?

  • OK, good.

  • All right, so let's meet the team.

  • All right, so first, we've got about four individuals.

  • We've got a data scientist, a data engineer, a business

  • analyst, and an app developer.

  • Ultimately, they all have three things in common.

  • They use different data sets, those data sets

  • are all in different formats, and they all

  • use different tools, right?

  • So, there's not a ton of overlap here,

  • and that's kind of the challenge.

  • So, let's assume for a second, or let's imagine for a second,

  • we've got this team, and their job

  • is to build a recommendation engine for a retail site.

  • So, what are the kinds of things that each person on the team

  • is going to need to be successful

  • in order to do their job?

  • So, the first thing we'll cover is the data,

  • and then we'll talk about the tools.

  • So, from a data perspective, Alice, the data scientist,

  • needs kind of cleansed, normalized data.

  • Things that are no longer log files, right,

  • not just raw rows from a database,

  • but things that actually make sense.

  • And those could be, for example, stuff

  • like purchase history, product history, product metadata,

  • click streams, product reviews.

  • All the stuff that it would take to actually craft and model

  • a recommendation problem.

  • Then, we've got Bob, the data engineer.

  • He's going to have to take all of the raw data

  • and turn it into something useful.

  • So, he's probably going to need things like,

  • log information, product transactions or purchase

  • history, product metadata, as well.

  • This is all the stuff that's coming in

  • straight off the application, that he's

  • got to turn into something useful for other parts

  • of the organization.

  • Then, there's Ned.

  • Ned is the business analyst, and he

  • needs a lot of aggregated data, things

  • that he can use to generate statistics

  • and understanding about how the business is performing.

  • Not just, again, rows or log files, but again,

  • that next level up from, sort of, the basics.

  • And then, finally, we have Jan, the app developer.

  • She's going to need users and product predictions,

  • or recommendations.

  • She's going to be able to-- she needs

  • to be able to build new microservices that

  • expose things like this recommendation engine

  • to the end user.

  • So, that can be things like a recommendations API,

  • access to wish lists, some knowledge around similar users

  • or similar products, how things complement each other.

  • But, you can see there's a little bit of overlap here,

  • but not a ton, right.

  • They all need something different

  • and they kind of need it in a slightly different format.

  • Now, if you think about sort of a--

  • I tried to draw a sequence diagram here,

  • and it got a little complicated, so I simplified it

  • a little bit.

  • But, ultimately, Alice and Bob have to agree and talk

  • to each other.

  • Bob and Ned have to agree and talk to each other.

  • Ned and Alice have to talk to each other,

  • and Jan and Alice have to talk to each other.

  • You know, there's a lot of communication happening.

  • And, listen, I love talking to my coworkers just as much

  • as the next person, but if you're

  • spending all of your time trying to explain to them what you

  • need, when you need it, where it came from,

  • what it should look like, it becomes challenging, right.

  • It slows down the ability for them

  • to make progress and work quickly.

  • Ultimately, this is the problem that we

  • see a lot of folks having out in the world, right?

  • They each have to find data sets.

  • They each have to figure out who owns it.

  • They have to figure out how things are related

  • to each other, and they have to manage all that by themselves,

  • which, again, takes time away from their ultimate task.

  • Just by a show of hands, does this sound familiar to anybody?

  • Oh, that's a scary number, actually.

  • Maybe I shouldn't have asked that question.

  • No, but this is great, right?

  • This is how we have to-- we have to find, understand,

  • what the problem is, then try to solve it together.

  • All right, so let's talk about tools.

  • That's the first step.

  • We need to understand what everyone's

  • working with in order to kind of make the next jump.

  • So, Alice is the data scientist, huge fan of Python,

  • like myself.

  • It's like her Swiss army knife, she uses it

  • for just about everything.

  • Typically, does a lot of her work

  • with IPython or Jupyter notebooks,

  • pulls data down, does it on her workstation,

  • you know, slices and dices data, uses

  • a lot of pretty common data science frameworks,

  • uses things like, NumPy, SciPy, scikit-learn,

  • Pandas, because she has a history,

  • and/or likes data frames.

  • That's a typical approach, from a data science perspective.

  • Bob is a huge fan of Java, right?

  • He does lots of big ETL pipeline work,

  • does a lot of orchestrating data from one format to another,

  • or multiple sources of data together into something

  • that looks cohesive on the other end.

  • So, he's typically using tools like MapReduce or Spark.

  • In some cases, he might be using Apache Beam, which

  • you'll actually hear more about over the next couple days,

  • as well.

  • Ned is like a spreadsheet guy.

  • He loves SQL and he loves spreadsheets.

  • So, what he likes to do is build custom reports

  • and dashboards, very SQL-driven, got to have some kind of access

  • to a data warehouse, so he can do giant--

  • you know, kind of cube rotations and understanding.

  • Basically, he's got to be able to prove to the business

  • how products are performing, why something like a recommendation

  • engine is even necessary.

  • So, it's his responsibility to kind of facilitate that.

  • And then, we've got Jan, the app developer.

  • Jan's a typical kind of polyglot app developer.

  • Likes writing stuff on microservices, likes delivering

  • features very quickly, you know, may have a language preference

  • one way or the other, but has to maintain a lot of the existing

  • infrastructure.