  • CLEMENS MEWALD: Hi, everyone.

  • My name is Clemens.

  • I'm a product manager in Google Research.

  • And today I'm going to talk about TensorFlow Extended,

  • which is a machine learning platform that we built

  • around TensorFlow at Google.

  • And I'd like to start this talk with a block

  • diagram and the small yellow box, or orange box.

  • And that box basically represents

  • what most people care about and talk about when they

  • talk about machine learning.

  • It's the machine learning algorithm.

  • It's the structure of the network

  • that you're training, and how you choose

  • what type of machine learning problem you're solving.

  • And that's what you talk about when you talk about TensorFlow

  • and using TensorFlow.

  • However, in addition to the actual machine learning,

  • and to TensorFlow itself, you have

  • to care about so much more.

  • And these are all of these other things

  • around the actual machine learning algorithm

  • that you have to have in place, and that you actually

  • have to nail and get right in order

  • to actually do machine learning in a production setting.

  • So you have to care about where you get your data from,

  • that your data are clean, how you transform them,

  • how you train your model, how to validate your model,

  • how to push it out into a production setting,

  • and deploy it at scale.

  • Now, some of you may be thinking, well,

  • I don't really need all of this.

  • I only have my small machine learning problem.

  • I can live within that small orange box.

  • And I don't really have these production worries as of today.

  • But I'm going to propose that all of you

  • will have that problem at some point in time.

  • Because what I've seen time and time

  • again is that research and experimentation

  • today is production tomorrow.

  • Research and experimentation never

  • just ends there.

  • Eventually it will become a production model.

  • And at that point, you actually have to care about all

  • of these things.

  • Another side of this coin is scale.

  • So some of you may say, well, I do

  • all of my machine learning on a local machine, in a notebook.

  • Everything fits into memory.

  • I don't need all of these heavy tools to get started.

  • But similarly, small scale today is large scale tomorrow.

  • At Google we have this problem all the time.

  • That's why we always design for scale from day one,

  • because we always have product teams that say, well,

  • we have only a small amount of data.

  • It's fine.

  • But then a week later the product picks up.

  • And suddenly they need to distribute the workload

  • to hundreds of machines.

  • And then they have all of these concerns.

  • Now, the good news is that we built something for this.

  • And TFX is the solution to this problem.

  • So this is a block diagram that we published

  • in one of our papers that is a very simplistic view

  • of the platform.

  • But it gives you a broad sense of what

  • the different components are.

  • Now, TFX is a very large platform.

  • And it contains a lot of components

  • and a lot of services.

  • So the paper that we published, and also

  • what I'm going to discuss today, is only a small subset of this.

  • But building TFX and deploying it at Google

  • has had a profound impact on how fast product teams at Google

  • can train machine learning models

  • and deploy them in production, and how ubiquitous machine

  • learning has become at Google.

  • You'll see later I have a slide to give you some sense of how

  • widely TFX is being used.

  • And it really has accelerated all of our efforts

  • to being an AI first company and using machine learning

  • in all of our products.

  • Now, we use TFX broadly at Google.

  • And we are very committed to making

  • all of this available to you by open sourcing it.

  • So the boxes that are just highlighted in blue

  • are the components that we've already open sourced.

  • Now, I want to highlight an important thing.

  • TFX is a real solution for real problems.

  • Sometimes people ask me, well, is this the same code that you

  • use at Google for production?

  • Or did you just build something on the side and open source it?

  • And all of these components are the same code base

  • that we use internally for our production pipelines.

  • Of course, there are some things that

  • are Google-specific for our deployments.

  • But all of the code that we open source

  • is the same code that we actually

  • run in our production systems.

  • So it's really code that solves real problems for Google.

  • The second thing to highlight is that,

  • so far, we've only open sourced libraries:

  • each one of these is a library that you can use.

  • But you still have to glue them together.

  • You still have to write some code

  • to make them work in a joint manner.

  • That's just because we haven't open

  • sourced the full platform yet.

  • We're actively working on this.

  • But I would say so far we're about 50% there.

  • So these blue components are the ones

  • that I'm going to talk about today.

  • But first, let me talk about some of the principles

  • that we followed when we developed TFX.

  • Because I think it's very informative

  • to see how we think about these platforms,

  • and how we think about having impact at Google.

  • The first principle is flexibility.

  • And there's some history behind this.

  • And the short version of that history

  • is that I'm sure at other companies as well there used

  • to be problem specific machine learning platforms.

  • And just to be concrete, we had a platform

  • that was specifically built for large scale linear models.

  • So if you had a linear model that you

  • wanted to train at large scale, you

  • used this piece of infrastructure.

  • We had a different piece of infrastructure

  • for large scale neural networks.

  • But product teams usually don't have one kind of a problem.

  • And they usually want to train multiple types of models.

  • So if they wanted to train linear and deep models,

  • they had to use two entirely different technology stacks.

  • Now, with TensorFlow, as I'm sure you know,

  • we can actually express any kind of machine learning algorithm.

  • So we can train TensorFlow models

  • that are linear, that are deep, unsupervised and supervised.

  • We can train tree models.

  • And any single algorithm that you can think of either

  • has already been implemented in TensorFlow,

  • or can be implemented in TensorFlow.

  • So building on top of that flexibility,

  • we have one platform that supports

  • all of these different use cases from all of our users.

  • And they don't have to switch between platforms just

  • because they want to implement different types of algorithms.
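
As a rough illustration of that flexibility, here is a minimal sketch (feature names are hypothetical) of how both a linear and a deep model can be expressed on the same TensorFlow stack, using the Estimator API of that era:

```python
import tensorflow as tf

# Hypothetical feature columns shared by both models.
age = tf.feature_column.numeric_column("age")
country = tf.feature_column.categorical_column_with_vocabulary_list(
    "country", ["US", "DE", "JP"])

# A large-scale linear model...
linear = tf.estimator.LinearClassifier(feature_columns=[age, country])

# ...and a deep neural network, defined on the same platform rather
# than on two entirely different technology stacks.
deep = tf.estimator.DNNClassifier(
    feature_columns=[age, tf.feature_column.indicator_column(country)],
    hidden_units=[64, 32])
```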

  • Another aspect of this is the input data.

  • Of course, product teams also don't only

  • have image data, or only text data.

  • In some cases, they may even have both.

  • Right.

  • So they have models that take in both images and text,

  • and make a prediction.

  • So we needed to make sure that the platform that we built

  • supports all of these input modalities,

  • and can deal with images, text, sparse data

  • that you will find in logs, videos even.

  • And with a platform as flexible as this,

  • you can ensure that all of the users

  • can represent all of their use cases on the same platform,

  • and don't have to adopt different technologies.
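
To make the multi-modality point concrete, here is a minimal Keras sketch (input names and shapes are hypothetical) of a single model that takes in both an image and a text embedding and makes one prediction:

```python
import tensorflow as tf

# Two input modalities: a small image and a pre-computed text embedding.
image_in = tf.keras.Input(shape=(64, 64, 3), name="image")
text_in = tf.keras.Input(shape=(128,), name="text_embedding")

# A separate tower per modality...
image_feat = tf.keras.layers.GlobalAveragePooling2D()(
    tf.keras.layers.Conv2D(16, 3, activation="relu")(image_in))
text_feat = tf.keras.layers.Dense(32, activation="relu")(text_in)

# ...merged into a single prediction head.
merged = tf.keras.layers.concatenate([image_feat, text_feat])
output = tf.keras.layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[image_in, text_in], outputs=output)
```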

  • The next aspect of flexibility is

  • how you actually run these pipelines

  • and how you train models.

  • So one very basic use case is you have all of your data

  • available.

  • You train your model once, and you're done.

  • This works really well for stationary problems.

  • A good example is always, you want

  • to train a model that classifies an image whether there's

  • a cat or a dog in that image.

  • Cats and dogs have looked the same for quite a while.

  • And they will look the same in 10 years,

  • or very much the same as today.

  • So that same model will probably work well in a couple of years.

  • So you don't need to keep that model fresh.

  • However, if you have a non-stationary problem

  • where data changes over time (recommendation systems

  • have new types of products that you want to recommend,

  • new types of videos get uploaded all the time), you

  • actually have to retrain these models to keep them fresh.

  • So one way of doing this is to train a model

  • on a subset of your data.

  • Once you get new data, you throw that model away.

  • You train a new model, either on the superset

  • (the old and the new data together), or only on the fresh data, and so on.

  • Now, that has a couple of disadvantages.

  • One of them being that you throw away

  • learning from previous models.

  • In some cases, you're wasting resources,

  • because you actually have to retrain over the same data over

  • and over again.

  • And because a lot of these models

  • are actually not deterministic, you

  • may end up with vastly different models every time.

  • Because of the way they're initialized,

  • you may end up in a different optimum

  • every time you train these models.

  • So a more advanced way of doing this

  • is to start training with your data.

  • And then initialize your model with the weights

  • from the previous model and continue training.

  • So we call that warm starting of models.

  • It may seem trivial if you just say, well,

  • this is just a continuation of your training run.

  • You just added more data and you continue.

  • But depending on your model architecture,

  • it's actually non-trivial.

  • For example, in some cases you may only

  • want to warm start embeddings.

  • So you may only want to transfer the weights of the embeddings

  • to a new model and initialize the rest of your network

  • randomly.

  • So there are a lot of different setups

  • that you can achieve with this.

  • But with this you can continuously

  • update your models.

  • You retain the learning from previous versions.

  • You can even, depending on how you set it up,

  • bias your model more toward the more recent data.

  • But you're still not throwing away the old data.

  • And you always have a fresh model that's updated for production.
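
Warm starting is exposed directly in the Estimator API; as a minimal sketch (the checkpoint path and feature names are hypothetical), the following transfers only the embedding weights from a previous model and initializes the rest of the network randomly:

```python
import tensorflow as tf

# A hypothetical sparse feature backed by an embedding.
terms = tf.feature_column.categorical_column_with_hash_bucket("terms", 10000)
columns = [tf.feature_column.embedding_column(terms, dimension=16)]

# Warm start only variables whose names match the embedding pattern;
# hidden and output layers are initialized randomly as usual.
ws = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/tmp/previous_model",  # hypothetical checkpoint
    vars_to_warm_start=".*embedding.*")

estimator = tf.estimator.DNNClassifier(
    feature_columns=columns,
    hidden_units=[64, 32],
    warm_start_from=ws)
```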

  • The second principle is portability.

  • And there's a few aspects to this.

  • The first one is obvious.

  • So because we rely on TensorFlow,

  • we inherit the properties of TensorFlow,

  • which means you can already train your TensorFlow

  • models in different environments and on different machines.

  • So you can train a TensorFlow model locally.

  • You can distribute it in a cloud environment.

  • And by cloud, I mean any setup of multiple clusters.

  • It doesn't have to be a managed cloud.

  • You can train or perform inferences with your TensorFlow

  • models on the devices that you care about today.

  • And you can also train and deploy them on devices

  • that you may care about in the future.
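
As a minimal sketch of that portability (paths are hypothetical, and the model is a trivial stand-in), the same exported SavedModel can be served in a cloud setup or converted for on-device inference:

```python
import tensorflow as tf

# A trivial model, standing in for any TensorFlow model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Export once as a SavedModel; this artifact can be served at scale,
# for example by TensorFlow Serving...
tf.saved_model.save(model, "/tmp/exported_model")

# ...or converted for on-device inference with TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/exported_model")
tflite_bytes = converter.convert()
```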

  • Next is Apache Beam.

  • So when we open sourced a lot of our components,

  • we faced the challenge that internally we

  • use a data processing engine that

  • allows us to run these large-scale

  • data processing pipelines.

  • But in the open source world and in all of your companies,

  • you may use different data processing systems.

  • So we were looking for a portability layer.

  • And Apache Beam provides us with that portability layer.
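
To give a flavor of that portability layer, here is a minimal generic Beam example (my own sketch, not TFX code): the pipeline definition stays the same, and only the runner configuration decides whether it executes locally or on a distributed backend:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# "DirectRunner" executes locally; swapping in another runner
# (e.g. "DataflowRunner") distributes the same pipeline unchanged.
options = PipelineOptions(runner="DirectRunner")

# A word-count-style pipeline as a stand-in for a data processing step.
with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "a"])
     | "PairWithOne" >> beam.Map(lambda x: (x, 1))
     | "CountPerKey" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```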