  • SANDEEP PARIKH: Welcome, everybody.

  • Thanks for coming.

  • I know the day's been running a little bit long,

  • so I will hopefully not keep you too long.

  • Welcome to the session between you and happy hour.

  • Hopefully, we'll get you out of here just in time

  • to get some drinks and get some snacks.

  • My name is Sandeep Parikh.

  • I run the cloud solutions architecture

  • team for Americas East, for Google Cloud Platform,

  • and today, I want to talk to you about building

  • a collaborative data platform.

  • Effectively, you've got teams of individuals spread

  • across your company, that have to work together

  • in some meaningful way, and you need

  • them to share data, talk to each other,

  • basically build something together.

  • So, how do you enable that?

  • So, we're going to walk through some of the ways

  • that we tell a lot of customers, and developers, and partners

  • how to do that stuff in the real world.

  • First, we're going to meet our hypothetical team

  • of individuals, and we're going to learn

  • about the tasks that they have to do, and the tools that they

  • typically use.

  • And then, we'll jump into mapping

  • the tools they use to the tools available in Google Cloud,

  • so you can start to see how they might

  • get a consistent set of tools and work together.

  • And then, we're going to talk about how

  • we enable collaboration.

  • This is how we do things, like set up teams

  • to do self-service and easy data discovery.

  • I'm probably going to trivialize a handful of details,

  • but stay with me, it'll make sense as we get through it.

  • And then, I want to cover what the new workflow's

  • like once you get all this stuff up and running,

  • and then give you a couple little things

  • and tips on how to get started.

  • Does that make sense to everybody?

  • Head nods?

  • OK, good.

  • All right, so let's meet the team.

  • All right, so first, we've got about four individuals.

  • We've got a data scientist, a data engineer, a business

  • analyst, and an app developer.

  • Ultimately, they all have three things in common.

  • They use different data sets, those data sets

  • are all in different formats, and they all

  • use different tools, right?

  • So, there's not a ton of overlap here,

  • and that's kind of the challenge.

  • So, let's assume for a second, or let's imagine for a second,

  • we've got this team, and their job

  • is to build a recommendation engine for a retail site.

  • So, what are the kinds of things that each person on the team

  • is going to need to be successful

  • in order to do their job?

  • So, the first thing we'll cover is the data,

  • and then we'll talk about the tools.

  • So, from a data perspective, Alice, the data scientist,

  • needs kind of cleansed, normalized data.

  • Things that are no longer log files, right,

  • not just raw rows from a database,

  • but things that actually make sense.

  • And those could be, for example, stuff

  • like purchase history, product history, product metadata,

  • click streams, product reviews.

  • All the stuff that it would take to actually craft and model

  • a recommendation problem.

  • Then, we've got Bob, the data engineer.

  • He's going to have to take all of the raw data

  • and turn it into something useful.

  • So, he's probably going to need things like,

  • logs information, product transactions or purchase

  • history, product metadata, as well.

  • This is all the stuff that's coming in

  • straight off the application, that he's

  • got to turn into something useful for other parts

  • of the organization.

  • Then, there's Ned.

  • Ned is the business analyst, and he

  • needs a lot of aggregated data, things

  • that he can use to generate statistics

  • and understanding about how the business is performing.

  • Not just, again, rows or log files, but again,

  • that next level up from, sort of, the basics.

  • And then, finally, we have Jan, the app developer.

  • She's going to need users and product predictions,

  • or recommendations.

  • She's going to be able to-- she needs

  • to be able to build new microservices that

  • expose things like this recommendation engine

  • to the end user.

  • So, that can be things like a recommendations API,

  • access to wish lists, some knowledge around similar users

  • or similar products, how things complement each other.

  • But, you can see there's a little bit of overlap here,

  • but not a ton, right.

  • They all need something different

  • and they kind of need it in a slightly different format.

  • Now, if you think about sort of a--

  • I tried to draw a sequence diagram here,

  • and it got a little complicated, so I simplified it

  • a little bit.

  • But, ultimately, Alice and Bob have to agree and talk

  • to each other.

  • Bob and Ned have to agree and talk to each other.

  • Ned and Alice have to talk each other,

  • and Jan and Alice have to talk to each other.

  • You know, there's a lot of communication happening.

  • And, listen, I love talking to my coworkers just as much

  • as the next person, but if you're

  • spending all of your time trying to explain to them what you

  • need, when you need it, where it came from,

  • what it should look like, it becomes challenging, right.

  • It slows down the ability for them

  • to make progress and work quickly.

  • Ultimately, this is the problem that we

  • see a lot of folks having out in the world, right?

  • They each have to find data sets.

  • They each have to figure out who owns it.

  • They have to figure out how things are related

  • to each other, and they have to manage all that by themselves,

  • which, again, takes time away from their ultimate task.

  • Just by a show of hands, does this sound familiar to anybody?

  • Oh, that's a scary number, actually.

  • Maybe I shouldn't have asked that question.

  • No, but this is great, right?

  • This is how we have to-- we have to find, understand,

  • what the problem is, then try to solve it together.

  • All right, so let's talk about tools.

  • That's the first step.

  • We need to understand what everyone's

  • working with in order to kind of make the next jump.

  • So, Alice is the data scientist, huge fan of Python,

  • like myself.

  • It's like her Swiss army knife, she uses it

  • for just about everything.

  • Typically, does a lot of her work

  • with IPython or Jupyter notebooks,

  • pulls data down, does it on her workstation,

  • you know, slices dices data, uses

  • a lot of pretty common data science frameworks,

  • uses things like, NumPy, SciPy, scikit-learn,

  • Pandas, because she has a history,

  • and/or likes data frames.

  • That's a typical approach, from a data science perspective.

  • Bob is a huge fan of Java, right?

  • He does lots of big ETL pipeline work,

  • does a lot of orchestrating data from one format to another,

  • or multiple sources of data together into something

  • that looks cohesive on the other end.

  • So, he's typically using tools like MapReduce or Spark.

  • In some cases, he might be using Apache Beam, which

  • you'll actually hear more about over the next couple days,

  • as well.

  • Ned is like a spreadsheet guy.

  • He loves SQL and he loves spreadsheets.

  • So, what he likes to do is build custom reports

  • and dashboards, very SQL-driven, got to have some kind of access

  • to a data warehouse, so he can do giant--

  • you know, kind of cube rotations and understanding.

  • Basically, he's got to be able to prove to the business

  • how products are performing, why things like a recommendation

  • engine are even necessary.

  • So, it's his responsibility to kind of facilitate that.

  • And then, we've got Jan, the app developer.

  • Jan's a typical kind of polyglot app developer.

  • Likes writing stuff on microservices, likes delivering

  • features very quickly, you know, may have a language preference

  • one way or the other, but has to maintain a lot of the existing

  • infrastructure.

  • So, that could be things like node apps.

  • That could be simple things like Python apps with Flask or Ruby

  • apps with Sinatra, all the way up

  • to like kitchen sink frameworks, like Django or Rails.

  • Just depends, right, whatever the use case is.

  • So, we've gotten that far.

  • So, we understand their task, we know

  • who all the key players are, and we know the things

  • that they like to use.

  • So, the next step for us is, how do we

  • figure out which tools they should

  • be using in Google Cloud?

  • All right, we can skip that slide.

  • So, what we'll do is let's lay out

  • a handful of the tools that are available.

  • And we're not going to cover everything that's

  • in GCP right now, because we don't have enough time,

  • and I want you guys to have snacks at some point soon.

  • So, what we'll do is kind of focus on the core set of things

  • that are critical for each of those individual roles

  • to have access to.

  • So, the first part, you know, for things like applications,

  • we've got Compute.

  • So, with things like virtual machine

  • or virtual instances with Compute Engine,

  • managed Kubernetes clusters with Container Engine,

  • or just kind of managed container deployments,

  • or Docker container deployments with App Engine.

  • It's a good place to start.

  • On the storage front, cloud storage for objects,

  • Cloud SQL for relational data, Cloud Datastore for, you know,

  • NoSQL, or non-relational datasets,

  • and Cloud Bigtable for wide column.

  • And then, on the data and analytics,

  • or the more processing kind of side,

  • Dataproc for managed Hadoop and Spark clusters,

  • Datalab for IPython notebooks, BigQuery for kind

  • of an analytical data warehouse, and Dataflow for running

  • managed Apache Beam pipelines.

  • And then, the last column, or the last kind of group,

  • is machine learning.

  • So we've got cloud machine learning,

  • which is kind of running TensorFlow models,

  • and running them at scale, things

  • like the natural language API, the speech API, or the vision

  • API.

  • As you start to think about the team

  • that we just introduced, and you start

  • looking at this sort of bucket of tools, of this set of puzzle

  • pieces, you can start to see where things

  • are going to kind of fit together.

  • There's a consistent set of tools

  • here that we can map back on to the team.

  • So let's walk through each one of those individually,

  • really quickly.

  • So, for data science workloads, Alice has a couple of options.

  • We already talked about the fact that she's a fan of Python,

  • so that maps really well.

  • So, with Python, she has two options.

  • She can either use Cloud Datalab,

  • or she can use Cloud Dataproc with Jupyter.

  • With Datalab, what she gets is a complete Python dev

  • environment.

  • You know, it's bundled in with like NumPy, SciPy, Matplotlib,

  • so she can kind of kick off her work, and build those charts,

  • and build that kind of understanding

  • of the data set as she would.

  • Additionally, on top of that though,

  • it's also got built-in support for TensorFlow and BigQuery.

  • This means she's got a complete environment to go and start

  • maybe prototyping models that she wants

  • to build with TensorFlow, if she's

  • trying to build a very compelling recommendation

  • mechanism.

  • Or, if she needs data that lives in BigQuery,

  • she can actually do inline SQL statements there, and pull data

  • back out, or offload queries to BigQuery, as well.

  • So, she's got a handful of options there.
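
As a rough illustration of that last point, here is a minimal sketch of pulling BigQuery data into a notebook from Python; the project, dataset, and table names are hypothetical placeholders.

```python
# Minimal sketch: run a BigQuery query from a notebook and pull the results
# into a pandas DataFrame for exploration. All names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")  # hypothetical project

sql = """
    SELECT user_id, product_id, rating
    FROM `my-retail-project.analytics.product_reviews`  -- hypothetical table
    LIMIT 1000
"""

# Execute the query and materialize the result locally.
df = client.query(sql).to_dataframe()
print(df.describe())
```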

  • The nice thing about Datalab is that it's based on Jupyter, so

  • she's got a little bit of background with it.

  • So, it should look very familiar to her.

  • But it is somewhat constrained to just Python.

  • If she's got more specific needs,

  • or wants to include additional frameworks or kernels,

  • then we have to look at something like Cloud Dataproc

  • plus Jupyter.

  • So, what you can do is you can spin up a Cloud Dataproc

  • cluster--

  • again, a managed YARN cluster, effectively--

  • and then have Jupyter pre-installed on there

  • and ready to go.

  • So, it takes about 90 seconds to fire up a Hadoop cluster,

  • anywhere from like three, to like

  • a couple of thousand nodes.

  • In this case, for Alice, I think just three nodes is probably

  • appropriate, and her goal is to just spin this up and get

  • Jupyter pre-installed.

  • Once she's got Jupyter, then she's

  • back to the exact same environment

  • she had on her laptop, support for over 80 different languages

  • and frameworks or kernels.

  • But, the nice thing also is built-in support

  • for PySpark and Spark MLlib.

  • So, you know, if you're kind of trying

  • to figure out where the machine learning line sort of falls,

  • there's definitely a handful of more sessions

  • you can attend around things like TensorFlow or on Dataproc.

  • What I would urge you to do is, if you've

  • got kind of an individual like this in your organization,

  • is have them explore both.

  • TensorFlow might be appropriate.

  • Spark MLlib might be appropriate.

  • And there should be a clean distinction

  • between what they can each do.
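
To make the Spark MLlib path concrete, here is a minimal sketch of the kind of collaborative-filtering prototype Alice might run in a PySpark notebook on Dataproc; the input path and column names are hypothetical.

```python
# Minimal sketch: a collaborative-filtering recommender using Spark MLlib's
# ALS, as it might run in a Jupyter/PySpark kernel on a Dataproc cluster.
# The GCS path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recs-prototype").getOrCreate()

# Hypothetical ratings data with columns: user_id, product_id, rating.
ratings = spark.read.csv("gs://my-bucket/ratings.csv",
                         header=True, inferSchema=True)

als = ALS(userCol="user_id", itemCol="product_id", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Top 10 product recommendations for every user.
model.recommendForAllUsers(10).show(truncate=False)
```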

  • All right, on the data processing front, same thing.

  • Bob's got a couple of different options.

  • He can either use Cloud Dataproc or he can use Cloud Dataflow.

  • With Dataproc, again, we just talked about this briefly,

  • but managed Hadoop and Spark clusters,

  • the nice thing about Dataproc--

  • and I'm only going to spend a minute on this,

  • so, by all means, if you're interested we can do questions

  • afterwards-- but the nice thing about Dataproc

  • is that it turns that kind of mindset of a Hadoop cluster

  • or a YARN cluster, on its head.

  • You end up going away from having a cluster-centric view

  • of the world, and turning it into more of a job-centric view

  • of the world.

  • So instead of firing up a thousand-node cluster

  • that everybody shares, you could have

  • every individual job have their own thousand-node cluster.

  • And as long as those jobs can be done in less than 24 hours,

  • you can actually take advantage of things

  • like preemptible VMs or preemptible instances.

  • Those instances are about 80% off list price.

  • So, you get the ability to do this job dramatically faster

  • than ever before, because you have all this extra compute

  • sitting around, and you get to do it really cheaply,

  • because it only takes a few hours anyway.

  • So Bob's got that option.

  • He can also do things like tune the cluster parameters,

  • or if he's got custom JAR files he needs,

  • easily bootstrap those across the cluster, doesn't matter.

  • So, he has that option.

  • The other approach is to use Dataflow.

  • Dataflow is a managed service that we

  • have for running Apache Beam workloads.

  • Apache Beam is basically a programming model,

  • or an approach, that unifies batch and stream

  • processing into one.

  • When you take Beam workloads or Beam pipelines

  • and you push them to Cloud Dataflow, we run those for you

  • in a totally managed, kind of automatically scalable fashion.

  • So, that's pretty nice.

  • And Apache Beam workload's on Cloud Dataflow,

  • actually support templates for kind

  • of like easy parameterization and staging

  • and things like that.

  • The way I kind of think about the relationship here of which path

  • you go down is really a question of what

  • do you have existing already.

  • If you've got a huge investment in Spark or MapReduce jobs,

  • or just kind of like Oozie workflows around those things,

  • by all means, go down the Cloud Dataproc route.

  • It's a turnkey solution.

  • You can take your job and just push it to a new cluster,

  • just like that.

  • If it's net new and you're starting from scratch,

  • I think Beam is a good approach to look at.

  • It's relatively new, so it's a little bit young.

  • It's definitely not as mature as some of the other components

  • in the Hadoop ecosystem.

  • But, it does have this really, really,

  • really critical advantage where you can take a batch pipeline

  • and turn it into a streaming pipeline

  • just by changing a couple of lines of input.

  • So Bob's got a couple of options here,

  • to kind of match up to what he typically likes to work with.
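
As a sketch of what that looks like in practice, here is a small Apache Beam pipeline (Python SDK) that could run locally or on Cloud Dataflow just by switching the runner; the project and bucket paths are hypothetical.

```python
# Minimal sketch: an Apache Beam pipeline that reads raw clickstream logs,
# aggregates clicks per user, and writes the result back out. Switching the
# runner option moves it between local execution and Cloud Dataflow.
# Project and bucket names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # use "DirectRunner" to test locally
    project="my-retail-project",      # hypothetical project
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadLogs" >> beam.io.ReadFromText("gs://my-bucket/raw/clicks-*.json")
     | "Parse" >> beam.Map(json.loads)
     | "KeyByUser" >> beam.Map(lambda event: (event["user_id"], 1))
     | "CountPerUser" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: "{},{}".format(kv[0], kv[1]))
     | "Write" >> beam.io.WriteToText("gs://my-bucket/cleansed/clicks-per-user"))
```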

  • All right, Ned's use case is actually really simple.

  • He needs a data warehouse.

  • He needs a data warehouse that supports SQL,

  • and that can scale with whatever size of data he needs,

  • or that he's got access to, and he's

  • got to be able to plug it into a whole host of tools,

  • kind of downstream.

  • So, BigQuery's a great fit for him,

  • enterprise cloud analytical data warehouse.

  • It's a fully managed mechanism.

  • We often use the word serverless,

  • though I hate that term, so I apologize.

  • It supports standard SQL and it does scale up

  • to a kind of petabyte scale.

  • So, he has the option of running something that's

  • anywhere from data sets that are about gigabytes all the way up

  • to petabytes, and still get responses back within seconds.

  • BigQuery is great, because it does

  • support kind of batch loading, or streaming inserts, as well.

  • And it's got built in things like security and durability

  • and automatic availability, and all the other good stuff.

  • But, ultimately, the best part is that Ned gets to use SQL,

  • gets to query really, really large data sets,

  • and visualize and explore the data as he's used to,

  • with the typical tools and expertise he's got.
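
For a sense of what Ned's day-to-day looks like, here is a minimal sketch of an aggregate query against a BigQuery warehouse, run from Python; the dataset, table, and column names are hypothetical.

```python
# Minimal sketch: the kind of aggregation an analyst might run against a
# BigQuery data warehouse. Dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")  # hypothetical project

sql = """
    SELECT product_id,
           COUNT(*)       AS purchases,
           SUM(total_usd) AS revenue
    FROM `my-retail-project.warehouse.purchases`  -- hypothetical table
    WHERE purchase_date >= '2017-01-01'
    GROUP BY product_id
    ORDER BY revenue DESC
    LIMIT 20
"""

for row in client.query(sql).result():
    print(row.product_id, row.purchases, row.revenue)
```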

  • When it comes time to create reports and dashboards

  • and things like that, there's a couple of options he has.

  • One, he can use Data Studio, which

  • is built into the platform.

  • So, it's kind of like a Google Doc style approach, where

  • I can create a new report, and then I

  • can share that with everybody in the company

  • without actually having to create multiple copies.

  • And people can edit things like parameters

  • and watch the difference in those reports,

  • and see how they look.

  • But effectively, he has the ability

  • to create those reports and dashboards.

  • Alternatively, he could use things like Tableau or Looker,

  • or other business intelligence tools

  • that he's a fan of.

  • So, he's got a handful of options there.

  • And the nice thing also is that, with this approach,

  • BigQuery also supports other kinds of JDBC

  • infrastructure, so a lot of the tooling that Ned is typically

  • used to can plug right into BigQuery, as well.

  • So, the last one is Jan.

  • We talked about this earlier.

  • Jan likes to deploy and scale microservices,

  • so she's got two easy options there.

  • If she's really, really focused on complex container

  • orchestration, writes them--

  • wants to deploy things that way, Container Engine's a great fit.

  • It's based on open sourced Kubernetes.

  • We've got built-in health checking, monitoring, logging.

  • We've got a private Container Registry,

  • or she can go down the app engine route,

  • and just take her Docker containers, and push them up,

  • and we'll auto-scale them for her.

  • The good thing is that along with kind of the same health

  • checking, monitoring, and logging,

  • App Engine also includes a built-in load balancer

  • up front, has things like version

  • management, traffic splitting, as well,

  • and automatic security scanning.

  • So, again, a couple of options here

  • that kind of make sense for her.
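
As an illustration of the kind of microservice Jan might ship, here is a minimal Flask sketch of a recommendations endpoint; the lookup function is a hypothetical stand-in for calls to a trained model or a backing datastore.

```python
# Minimal sketch: a small recommendations microservice. Containerized, it
# could be deployed to Container Engine or the App Engine flexible
# environment. The lookup logic is a hypothetical placeholder.
from flask import Flask, jsonify

app = Flask(__name__)

def lookup_recommendations(user_id):
    # Hypothetical placeholder: in practice this might call a prediction
    # service or read precomputed results from Datastore or Bigtable.
    return [{"product_id": "p-123", "score": 0.92}]

@app.route("/users/<user_id>/recommendations")
def recommendations(user_id):
    return jsonify(lookup_recommendations(user_id))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```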

  • So, that's right.

  • We're all done.

  • Everybody's got tools, they can all work,

  • and everybody is off--

  • off to the races.

  • That's not true.

  • We've only just kind of scratched the surface.

  • So, they all have a consistent set of tools to work with,

  • but we actually haven't enabled any of the collaboration bits

  • yet.

  • So, we still have to figure out how to get them,

  • not only to work together, but actually

  • to work in a way that makes sense, and scales up.

  • Right.

  • Their team is four people today.

  • It could be eight, 12, 16, you know, in a few weeks.

  • So, we've got to come up with a better

  • approach to managing the data that they need access to.

  • So, if you're trying to enable collaboration,

  • there are a handful of things you're really

  • going to think about.

  • The first is, you've got to find consistency.

  • And it's not just consistency in the tool set, right?

  • That's also important, but you also need consistency

  • in terms of where do you expect data to be coming from,

  • where do you expect data to be used downstream, right?

  • That's really important.

  • If you can't agree on the sources,

  • if you can't agree on what the downstream workloads are,

  • it's hard to really understand how

  • everybody should work together and use the same set of tools.

  • The next thing you want to do, is take a really good hard look

  • at all the data that you're trying to get access to.

  • And this isn't true for every piece

  • of data that lives in your organization, right?

  • It might be things that are focused

  • on certain sets of the company, like certain teams,

  • or certain tasks, but ultimately, you've

  • got to figure out where all this data lives, and take

  • an inventory on it, right?

  • Take an inventory on it, and make

  • sure it's in the right storage medium for everyone

  • to use as time goes on.

  • The next thing you want to do is come up

  • with an approach around metadata, right?

  • This is pretty simple, but if you're

  • trying to figure out who owns a piece of data

  • or where it came from, what other data sets it's related

  • to, what are some of the types of data that are located

  • in this data set without actually having to go query it,

  • that's a really challenging problem to do,

  • when you think about having hundreds or thousands

  • of individual pieces of data spread out

  • across your infrastructure.

  • Then, you want to enable discovery and self-service.

  • You want people to be able to go find these things by themselves

  • and pull them down as needed, without having to go,

  • again, spend all that time arguing

  • with people about formats, and availability,

  • and tooling, and worrying about where you're going to store it,

  • as well.

  • And then the last thing is security,

  • right? Don't leave that on the table.

  • It's certainly important to understand security, and more

  • broadly, identity, right, to make sure we're tracking access

  • to all these things.

  • All right, so how do we start with this?

  • I throw this in there.

  • You have to kind of embrace the concept of a data lake.

  • I'm not suggesting you have to go build one.

  • I know it's a totally loaded buzzword and term,

  • but you have to build--

  • you have to embrace the idea of it, on some level, right?

  • And this is kind of what I said earlier--

  • if you start at the very bottom, you first

  • have to understand what are all the sources of data I've got,

  • right?

  • At least get everybody to agree in the room

  • where data is coming in from, and what it looks like.

  • Once you do that, you want to find some, again, consistency

  • on which tools are we going to use to store data?

  • You're not going to consume every single part

  • of the platform, right?

  • It doesn't make sense to just use every service

  • because it's there.

  • Find the ones that make the most sense for the data

  • that you've got.

  • And we'll go through that a second, as well.

  • And the last thing is, what are the use cases, right?

  • How is the data going to get used over time, right?

  • It's important that you understand what those use

  • cases are, because they're going to drive back to the data--

  • or to the storage mediums.

  • And it's important to pick the storage mediums,

  • because that's going to depend pretty heavily

  • on how the data comes in, where it comes from,

  • and what it looks like.

  • All right, so where should data live, right?

  • We know the sources, we know the workloads,

  • but what do we do in the middle there?

  • How do we figure out which places to put data?

  • So this is kind of a simple decision tree.

  • And I'll go into a little bit more depth on the next slide

  • as well.

  • But some of this you can kind of simplify, right?

  • If it's really structured data, you're

  • kind of down to two options.

  • Is it sort of OLTP data, or is it OLAP data, right?

  • Is it transactional, or is it analytical?

  • If it's transactional, Cloud SQL's a great fit, right?

  • It's a typical relational database, no frills,

  • does what it does, and it does it really well.

  • On the analytical side, you have BigQuery, right?

  • So it depends on what the use case

  • is that's starting to drive it-- not just the structure,

  • but with the use cases.

  • The next column is semi-structured.

  • So, you've got something that you might know the schema for,

  • but it could change.

  • It could adapt in flight.

  • People might be deploying new mobile apps

  • somewhere that customers are going to use.

  • They're going to start capturing data

  • they weren't capturing before.

  • All those things are possible.

  • So, if you've got semi-structured data,

  • then it's a question of, how do I need to query that data?

  • Again, we're back to what is the downstream use case?

  • If it's something where you need to query in any possible field

  • that you write in, Datastore is a good choice.

  • When you write a piece of JSON to Cloud Datastore,

  • we automatically index, by default, every key

  • that's part of that JSON.

  • Now, you can turn that off, if you want to,

  • or you can pick the keys you want, but, ultimately, you

  • have the ability to query anything

  • that you've written in there.
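
A minimal sketch of that behavior, with a hypothetical kind and hypothetical field names:

```python
# Minimal sketch: write a JSON-like entity to Cloud Datastore, then query on
# one of its properties. Properties are indexed by default, which is what
# makes ad hoc queries possible. Kind and field names are hypothetical.
from google.cloud import datastore

client = datastore.Client(project="my-retail-project")  # hypothetical project

entity = datastore.Entity(key=client.key("ProductReview"))
entity.update({
    "product_id": "p-123",
    "user_id": "u-456",
    "rating": 5,
})
client.put(entity)

# Because the keys are indexed, any of them can be used in a filter.
query = client.query(kind="ProductReview")
query.add_filter("product_id", "=", "p-123")
results = list(query.fetch(limit=10))
```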

  • Whereas when you put data in a Bigtable,

  • you're basically stuck with trying to have

  • to query by the row key.

  • And that's actually pretty powerful

  • for a lot of really great use cases,

  • especially like time series data or transactional data.

  • But it's not a great fit, again, if you

  • want to try to query some random column,

  • you know, somewhere buried throughout the dataset.

  • And the last one is objects, images, media,

  • that sort of thing, that just go straight into Cloud Storage,

  • so, it's a pretty simple approach.

  • So, if you break this out a little bit

  • and start getting in a few more examples,

  • you kind of end up with a chart that looks like this.

  • We just covered object storage really quickly a second

  • ago: great for binary data, great for object data,

  • media, backups, that sort of thing.

  • On the non-relational side for Cloud Datastore,

  • it's really good for hierarchical data.

  • Again, think of like JSON as a good example

  • there, to fit in there, obviously,

  • great on like mobile applications,

  • that sort of thing.

  • On the Bigtable side, really, really powerful system

  • for heavy reads and writes, but you have the row key

  • as your only option.

  • There's a little bit of filtering

  • you can apply outbound on a query,

  • but really, it's driven by the row key.

  • So you've got a very specific workload

  • you can use Bigtable for. On the relational side, you

  • have two options.

  • And I mentioned-- and I didn't mention Spanner yet,

  • and it's still a little bit early on Spanner,

  • but I did want to put it into context for kind of,

  • for everybody in the room.

  • Cloud SQL's great for web frameworks.

  • Typical web applications that you're going to build,

  • typical CRUD applications are great.

  • Spanner's really interesting.

  • It's new for us.

  • It's still early.

  • Spanner hasn't gone into general availability quite yet,

  • but it's an interesting thing to keep an eye on,

  • especially as people are building kind

  • of global-facing applications, where their customers could be

  • just about anywhere, and having a mechanism that

  • has a globally distributed SQL infrastructure is really

  • powerful.

  • So, it's something to keep your eye on as you kind of make

  • more progress with GCP.

  • And the last one is warehouse data.

  • Data warehouse, BigQuery, that's the right place to put that.

  • So, this is where it gets interesting.

  • You found the tools, you found the sources,

  • and you found the workloads.

  • How do you continue on this path of enabling self-service

  • and discovery?

  • So, you've taken all the data sets.

  • We've taken inventory of all of them.

  • They're all located in the right places,

  • but there might be a lot of it.

  • So how do we enable people to actually find

  • the things they're looking for over time?

  • And this is where metadata is really important,

  • because I think if you guys can kind of guess where I'm going,

  • ultimately, what we're going to do

  • is we're going to try to build a catalog,

  • and that catalog is what's going to drive everyone's usage

  • downstream, and everyone's ability to work on their own.

  • So, in order to build the catalog,

  • though, you've got to agree on metadata.

  • And actually, as it turns out, fortuitously,

  • or fortunately I should say, the Google research team actually

  • just published an interesting blog post

  • about facilitating discovery of public data sets.

  • And in it they cover things like ownership, provenance,

  • the type of data that's located--

  • that's contained within the data set.

  • The relationships between various data sets,

  • can you get consistent representations,

  • can you standardize some of the descriptive tools

  • that we use to do it.

  • It's basically just JSON.

  • There's nothing fancy about this,

  • but it forces everyone to come up and say,

  • I have a consistent representation

  • of every single data set.

  • So, you start to imagine, if you have 100 data sets strewn

  • across the company, you're only saying to everybody,

  • if you are an owner of a data set,

  • just publish this small little JSON file,

  • and hand it over to us, and now we

  • can start cataloging this data.

  • That's really powerful.
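
The blog post's exact schema isn't reproduced here, but a catalog entry along these lines (all field names illustrative) captures the ownership, provenance, and relationship metadata being described:

```json
{
  "name": "product_reviews",
  "description": "Cleansed product review records from the retail site",
  "owner": "bob@example.com",
  "provenance": "Derived from raw application logs by the nightly pipeline",
  "format": "newline-delimited JSON",
  "location": "gs://my-bucket/cleansed/product_reviews/",
  "related_datasets": ["purchase_history", "product_metadata"],
  "updated": "2017-03-01"
}
```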

  • What's even better is, if you've got that JSON data,

  • we already have a place to put it.

  • So, you have two options here.

  • You can either push that JSON data into Cloud Datastore

  • or into BigQuery.

  • And I'll kind of cover the difference here.

  • Cloud Datastore is a great place for this data,

  • for a variety of reasons.

  • Again, JSON's really well represented.

  • You can query about any possible field there.

  • But the downside is, that Datastore doesn't really have

  • kind of a good user-facing UI.

  • It's very much application-centric.

  • So, if you want to go ahead and build a little CRUD

  • API on top of this thing, Datastore can be a good fit.

  • The other option is, frankly, BigQuery.

  • You know, same idea around Datastore and storing JSON,

  • in fact, this screenshot is an example

  • of what this JSON looks like in BigQuery,

  • because it does support nested columns.

  • BigQuery is great because, you've got a UI attached to it.

  • There's a console right there, so you can actually

  • run SQL queries on it.

  • SQL's relatively universal across a lot of folks.

  • They understand it really easily.

  • So, this might make a little bit more sense.

  • To be totally fair and totally honest with you guys,

  • when I typically recommend this approach,

  • I often will push people to BigQuery,

  • as a place to build this data catalog around,

  • because it just makes sense, and it plugs

  • into a lot of downstream tools.

  • Datastore can make sense, but it's

  • very much dependent on how

  • the team wants to work.

  • For trying to find kind of a good generic solution

  • to start with, BigQuery's a great fit.
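
Here is a minimal sketch of pushing one of those JSON catalog entries into a BigQuery table via a streaming insert; the catalog table and its schema are hypothetical.

```python
# Minimal sketch: stream a metadata record into a BigQuery table acting as
# the data catalog. The catalog table name and schema are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")  # hypothetical project

entry = {
    "name": "product_reviews",
    "owner": "bob@example.com",
    "location": "gs://my-bucket/cleansed/product_reviews/",
    "related_datasets": ["purchase_history", "product_metadata"],
}

errors = client.insert_rows_json("my-retail-project.catalog.datasets", [entry])
if errors:
    print("Insert failed:", errors)
```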

  • So, once you've done that, you've pushed--

  • we've inventoried, we've cataloged, we've got metadata,

  • and we've got everything living inside of BigQuery now.

  • So, how does this change the workflow?

  • I'm only going to pick on one, because they're all

  • going to start looking a little bit similar, if I

  • keep going through it.

  • But, let's talk about the data science workflow.

  • So, for Alice, now instead of having to go talk to Bob,

  • or go talk to Ned, or go talk to Jan,

  • or talk to anybody else in the company

  • about what data is out there that she can use,

  • the first thing she can do is start to query the catalog.

  • So, she can just drop into the BigQuery UI,

  • and run a simple SQL query, and explore the data sets

  • that she has access to.

  • The next thing she can do, because part of what's

  • in there, one of the pieces of metadata, is what is the URL,

  • or what is the download location for this,

  • she can go ahead and pull that data into her environment.

  • Again, whether she's running a Cloud Datalab

  • notebook, or a Jupyter notebook, she

  • can pull that data into her environment.
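
Sketched in code, that first stretch of the workflow might look like this, assuming the hypothetical catalog table and bucket layout from earlier:

```python
# Minimal sketch of the new workflow: query the metadata catalog in BigQuery
# to discover a dataset, then pull the underlying file into the notebook
# environment. Table, field, and bucket names are hypothetical.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my-retail-project")   # hypothetical project
gcs = storage.Client(project="my-retail-project")

# 1. Discover: look up datasets and their download locations in the catalog.
sql = """
    SELECT name, location
    FROM `my-retail-project.catalog.datasets`  -- hypothetical catalog table
    WHERE name LIKE '%reviews%'
"""
for row in bq.query(sql).result():
    print(row.name, row.location)

# 2. Pull: download one of the discovered files for local exploration.
blob = gcs.bucket("my-bucket").blob("cleansed/product_reviews/part-0000.json")
blob.download_to_filename("/tmp/product_reviews.json")
```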

  • And then, she can start to prototype a TensorFlow model,

  • for example.

  • She could start building a little bit of a TensorFlow

  • model and running some early analysis

  • of measuring how well her recommendation engine might

  • be working.

  • Once she does that, she might have created some new datasets.

  • She might have actually taken something and said,

  • now I actually have training data and test

  • data to work against.

  • So, now that she's created two new datasets,

  • the next thing she's going to do is upload those to the catalog.

  • Write the metadata, push that in,

  • so now other people that might want to build this stuff later

  • on, have access to the same training and testing data

  • that she used.

  • So, we're starting to create this kind of lifecycle around it:

  • if you create something new, make sure

  • it gets shared with everybody as quickly as possible.

  • And then, she's going to continue on her work

  • like she normally does.

  • She'll train her machine-learning models.

  • She might push that TensorFlow model

  • to, like, the cloud ML service.

  • And then, because the cloud ML service that she created

  • is actually kind of a data resource,

  • she can actually create a catalog entry for that service.

  • So, now if someone says, I want to find a recommendation

  • service within the catalog, as long as she's

  • tagged and labeled everything appropriately,

  • they could find the URL for that service immediately.

  • And this is a bit of a simplification,

  • or oversimplification, of the amount of work

  • it takes to do all this.

  • But, if you start to extrapolate these steps out,

  • she's been able to work at her own pace

  • without having to worry about trying to find who owns what,

  • who has access to what she has access to, because we've

  • built this data catalog.

  • We've built this approach to metadata.

  • Over time, we're kind of getting to this point where

  • we can start building again toward self-service.

  • So, just like we have a SQL-- a great SQL interface

  • for querying that data catalog, you

  • might want to continue down the self-service path and say,

  • can we make these things even more discoverable?

  • Can we put like a CRUD API on top of this?

  • So, one option is to take a small application or small CRUD

  • API that sits in front of this BigQuery metadata catalog

  • and deploy that in, like, Compute Engine, or App

  • Engine, or Container Engine, and front

  • it with, like, cloud endpoints.

  • The reason you might go down this road

  • is particularly around good API management,

  • and building-- again, consistent and clear access.

  • Because the best way you can enable sort of self-service

  • is, obviously, giving everybody access to it,

  • but also giving everybody access to it

  • in a way that is most beneficial to them.

  • So, if you've got a lot of folks who are very API-driven,

  • or want to build new applications all

  • the time, having an API endpoint that they can

  • hit to learn more about the data catalog

  • is very, very beneficial.

  • And over time, you can actually start

  • to extrapolate this step even further,

  • and start fronting all of the data sources you've got.

  • So, not only do you have a CRUD API in front of the catalog,

  • but you might actually have one in front

  • of every single dataset.

  • So now, again, you're enabling this further, you know,

  • deeper level of access to data.

  • And this might be a little bit overkill

  • for the team of four people, but imagine

  • if that was a team of 400 or 500 people.

  • As you think about this kind of approach

  • permeating the entire organization,

  • everyone wants to have access, and you've

  • got to start building these things for scale over time.

  • So, picking the right tools up front lets you adopt that,

  • and again, lets new teams pick this approach up, as well.

  • Before we finish this, I do want to talk

  • a little bit about security, or really, security and identity,

  • in kind of a broad sense.

  • We've got a great set of identity and access management

  • tools called Cloud IAM.

  • There's a ton of sessions about it

  • throughout the next couple of days,

  • so I urge you guys, if you're interested, to go dig into it.

  • What I really want to cover here though,

  • is this idea that there's policy inheritance that

  • goes from the top all the way to the bottom.

  • That means, at any level, if you set a policy on data access

  • or control, it does filter all the way down.

  • So, if you've got an organization

  • or a single project that's kind of the host

  • project in your project account with GCP,

  • and you've got individual projects for maybe dev,

  • staging, QA, that sort of thing, or production,

  • and then you have individual resources

  • underneath those projects, you can actually

  • control who has access to what, as long as you've set

  • your organization up correctly.

  • I'm not going to go too deep here,

  • but, basically, the idea to walk away

  • with is, we have the ability to control who can see what,

  • and who can pull things down, and that's really

  • what you want to control.

  • And, for example, if you dig into BigQuery a little bit,

  • you've got a handful of roles and different responsibilities,

  • or different access controls that they've

  • got based on that stuff.
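
As one concrete example at the BigQuery layer, dataset-level access can be granted to an individual user; a sketch, with a hypothetical dataset and email:

```python
# Minimal sketch: grant a user read access to a BigQuery dataset, which is
# one of the role-based controls mentioned above. Dataset and email are
# hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-retail-project")  # hypothetical project

dataset = client.get_dataset("my-retail-project.catalog")
entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(role="READER",
                                    entity_type="userByEmail",
                                    entity_id="alice@example.com"))
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```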

  • So, as you look at what kind of what--

  • if you look at and think about what it's

  • going to take to build for the future,

  • I put this slide up again, because it's really important, if you

  • do nothing else, if you build nothing else,

  • you have to get agreement.

  • That's the most important thing.

  • If you can't adopt this idea that we

  • have to have consistency around data sources, data

  • workloads, and then the tools we're

  • going to use to work with that data,

  • this is going to be a long road.

  • As much as I talk about the products and the blue hexagons

  • and stuff, a lot of this is a cultural shift

  • in an organization.

  • It's kind of a lifestyle change.

  • You have to get everybody on the same page.

  • And part of doing that is saying, can we add consistency?

  • Or can we make this a consistent view of the world

  • that we can all agree upon?

  • And this might be a small subset of what you end up with.

  • It could be a much, much larger picture with a lot more pieces

  • to it, but it's important that everybody agrees, again,

  • what the data is going to look like coming in,

  • what it's going to get used for on the way out,

  • and where it's going to live while it's

  • sitting in your infrastructure.

  • As you think about--

  • as you kind of get through that consistency piece--

  • and, you know, that's a tough road,

  • but once you get through that, then the next step

  • is really getting around to building the catalog.

  • Can you catalog all the data sets?

  • Do you know exactly what lives throughout the organization?

  • Can you get everyone to kind of pony up

  • and say, all you have to do is write this little snippet

  • of JSON, and we'll be able to leave you alone

  • for the next few days.

  • If you can build the catalog, that's a great first step.

  • Then the next thing you want to do, is

  • make sure your teams have access to the tools they

  • need to do this work.

  • And this is not just the tools that are in GCP,

  • but it's also, like, the API access, or the SQL

  • access to the datasets.

  • Can they get those things?

  • Have you set up identity and understanding security

  • correctly, so that everybody has access?

  • Because what you want people to be able to do

  • is work on their own, without having to go

  • bother anyone else.

  • If they can do everything they need

  • to do without interacting with somebody else,

  • then when they do interact, and they go have lunch together,

  • it's a lot friendlier conversation, as

  • opposed to arguing about who has access to data.

  • [MUSIC PLAYING]
