CLEMENS MEWALD: Hi, everyone. My name is Clemens. I'm a product manager in Google Research. And today I'm going to talk about TensorFlow Extended, which is a machine learning platform that we built around TensorFlow at Google.

I'd like to start this talk with a block diagram and the small orange box in it. That box represents what most people care about and talk about when they talk about machine learning: the machine learning algorithm, the structure of the network you're training, how you frame the type of machine learning problem you're solving. And that's what you talk about when you talk about TensorFlow and using TensorFlow. However, in addition to the actual machine learning, and to TensorFlow itself, you have to care about so much more. These are all the other things around the actual machine learning algorithm that you have to have in place, and that you have to get right, in order to do machine learning in a production setting. You have to care about where you get your data from, that your data are clean, how you transform them, how you train your model, how you validate your model, and how you push it out into a production setting and deploy it at scale.

Now, some of you may be thinking, well, I don't really need all of this. I only have my small machine learning problem. I can live within that small orange box. And I don't really have these production worries as of today. But I'm going to propose that all of you will have that problem at some point in time. Because what I've seen time and time again is that research and experimentation today is production tomorrow. Research and experimentation never just ends there. Eventually it will become a production model. And at that point, you actually have to care about all of these things.

The other side of this coin is scale. Some of you may say, well, I do all of my machine learning on a local machine, in a notebook. Everything fits into memory. I don't need all of these heavy tools to get started. But similarly, small scale today is large scale tomorrow. At Google we have this problem all the time. That's why we always design for scale from day one. We always have product teams that say, well, we only have a small amount of data, it's fine. But then a week later the product picks up, and suddenly they need to distribute the workload to hundreds of machines. And then they have all of these concerns.

Now, the good news is that we built something for this, and TFX is the solution to this problem. This is a block diagram that we published in one of our papers. It's a very simplistic view of the platform, but it gives you a broad sense of what the different components are. TFX is a very large platform, and it contains a lot of components and a lot of services. So the paper that we published, and also what I'm going to discuss today, covers only a small subset of it. But building TFX and deploying it at Google has had a profound impact on how fast product teams at Google can train machine learning models and deploy them in production, and on how ubiquitous machine learning has become at Google. You'll see later I have a slide to give you some sense of how widely TFX is being used. It really has accelerated all of our efforts to be an AI-first company and to use machine learning in all of our products. Now, we use TFX broadly at Google.
And we are very committed to making all of this available to you by open sourcing it. The boxes that are highlighted in blue are the components that we've already open sourced.

Now, I want to highlight an important thing. TFX is a real solution for real problems. Sometimes people ask me, well, is this the same code that you use at Google for production? Or did you just build something on the side and open source it? All of these components come from the same code base that we use internally for our production pipelines. Of course, there are some things that are Google-specific for our deployments. But all of the code that we open source is the same code that we actually run in our production systems. So it's really code that solves real problems for Google. The second thing to highlight is that so far we've only open sourced libraries. You can use each one of these libraries, but you still have to glue them together. You still have to write some code to make them work in a joint manner. That's just because we haven't open sourced the full platform yet. We're actively working on this, but I would say so far we're about 50% there. These blue components are the ones that I'm going to talk about today.

But first, let me talk about some of the principles that we followed when we developed TFX, because I think it's very informative to see how we think about these platforms, and how we think about having impact at Google.

The first principle is flexibility. And there's some history behind this. The short version of that history is that at Google, and I'm sure at other companies as well, there used to be problem-specific machine learning platforms. To be concrete: we had a platform that was specifically built for large-scale linear models. So if you had a linear model that you wanted to train at large scale, you used this piece of infrastructure. We had a different piece of infrastructure for large-scale neural networks. But product teams usually don't have just one kind of problem, and they usually want to train multiple types of models. So if they wanted to train linear and deep models, they had to use two entirely different technology stacks. Now, with TensorFlow, as I'm sure you know, we can express any kind of machine learning algorithm. We can train TensorFlow models that are linear, that are deep, unsupervised and supervised. We can train tree models. Any algorithm that you can think of either has already been implemented in TensorFlow, or can be implemented in TensorFlow. So building on top of that flexibility, we have one platform that supports all of these different use cases from all of our users, and they don't have to switch between platforms just because they want to implement different types of algorithms. (A minimal code sketch of this idea follows below.)

Another aspect of this is the input data. Product teams don't only have image data, or only text data. In some cases, they have both: models that take in both images and text and make a prediction. So we needed to make sure that the platform we built supports all of these input modalities and can deal with images, text, the sparse data that you will find in logs, even videos. With a platform as flexible as this, you can ensure that all of the users can represent all of their use cases on the same platform, and don't have to adopt different technologies.
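To make that flexibility point concrete, here is a minimal sketch using TensorFlow's canned estimators. The feature names are made up for illustration; the point is simply that the same feature definitions, on the same stack, can back a linear model, a deep model, or a combined one, rather than requiring separate infrastructure for each.

```python
import tensorflow as tf

# Hypothetical features for a mixed-input problem.
price = tf.feature_column.numeric_column('price')
category = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        'category', ['books', 'music', 'video']))

# The same feature definitions can back very different model types.
linear = tf.estimator.LinearClassifier(feature_columns=[price, category])
deep = tf.estimator.DNNClassifier(
    feature_columns=[price, category], hidden_units=[64, 32])
# Or a single wide-and-deep model, still on the same stack.
wide_and_deep = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[price, category],
    dnn_feature_columns=[price, category],
    dnn_hidden_units=[64, 32])
```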
The next aspect of flexibility is how you actually run these pipelines and how you train models. One very basic use case: you have all of your data available, you train your model once, and you're done. This works really well for stationary problems. A good example is a model that classifies whether there's a cat or a dog in an image. Cats and dogs have looked the same for quite a while, and they will look very much the same in 10 years. So that same model will probably still work well in a couple of years, and you don't need to keep it fresh. However, if you have a non-stationary problem where the data changes over time (recommendation systems have new types of products to recommend, new videos get uploaded all the time), you actually have to retrain these models to keep them fresh.

One way of doing this is to train a model on a subset of your data. Once you get new data, you throw that model away and train a new model, either on the superset (the old and the new data) or only on the fresh data, and so on. Now, that has a couple of disadvantages. One of them is that you throw away what previous models learned. In some cases you're wasting resources, because you have to retrain over the same data over and over again. And because a lot of these models are not deterministic, you may end up with vastly different models every time: because of the way they're initialized, you may end up in a different optimum on every training run.

A more advanced way of doing this is to start training with your data, and then initialize each new model from the previous model's weights and continue training. We call that warm starting of models. It may seem trivial, as if it were just a continuation of your training run where you added more data and continued. But depending on your model architecture, it's actually non-trivial. In some cases, you may only want to warm-start embeddings: transfer only the weights of the embeddings to the new model and initialize the rest of your network randomly. So there are a lot of different setups that you can achieve with this. (A code sketch of warm starting appears at the end of this section.) With this approach you can continuously update your models, you retain the learning from previous versions, and you can even, depending on how you set it up, bias your model toward the more recent data without throwing away the old data. And you always have a fresh model that's updated for production.

The second principle is portability. There are a few aspects to this. The first one is obvious: because we rely on TensorFlow, we inherit the properties of TensorFlow, which means you can already train your TensorFlow models in different environments and on different machines. You can train a TensorFlow model locally. You can distribute it in a cloud environment, and by cloud I mean any cluster setup; it doesn't have to be a managed cloud. You can train or perform inference with your TensorFlow models on the devices that you care about today, and you can also train and deploy them on devices that you may care about in the future.

Next is Apache Beam. When we open sourced a lot of our components, we faced the challenge that internally we use a data processing engine that allows us to run these large-scale data processing pipelines, but in the open source world, and in all of your companies, you may use different data processing systems. So we were looking for a portability layer, and Apache Beam provides us with that portability layer.
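As a rough illustration of why Beam works as a portability layer, here is a minimal pipeline sketch. It's a toy element count, not TFX code; by default it runs on Beam's local DirectRunner, and the same code can be submitted to a distributed runner (for example Dataflow) purely through pipeline options rather than code changes.

```python
import apache_beam as beam

# A toy pipeline: count occurrences of each element. The same code runs
# locally on the DirectRunner (the default) or on a distributed runner,
# selected via pipeline options instead of rewriting the pipeline.
with beam.Pipeline() as pipeline:
    (pipeline
     | 'CreateExamples' >> beam.Create(['cat', 'dog', 'cat'])
     | 'CountPerElement' >> beam.combiners.Count.PerElement()
     | 'Print' >> beam.Map(print))
```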
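And returning to the warm-starting setup described earlier, here is a minimal sketch of how the idea looks with TensorFlow's public Estimator API. The checkpoint path, feature, and variable-name regex are hypothetical placeholders, and this is just one way to express warm starting, not necessarily how TFX wires it up internally.

```python
import tensorflow as tf

# Hypothetical feature: a categorical id backed by a learned embedding.
ids = tf.feature_column.categorical_column_with_identity('id', num_buckets=1000)
emb = tf.feature_column.embedding_column(ids, dimension=16)

# Warm-start all weights from the previous model's checkpoint directory
# ('/tmp/prev_model' is a placeholder path) and continue training on
# fresh data instead of retraining from scratch.
warm_start_all = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from='/tmp/prev_model')

# Or transfer only the embedding weights and initialize the rest of the
# network randomly (the regex is an assumption about variable naming).
warm_start_embeddings = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from='/tmp/prev_model',
    vars_to_warm_start='.*embedding.*')

estimator = tf.estimator.DNNClassifier(
    feature_columns=[emb],
    hidden_units=[64, 32],
    warm_start_from=warm_start_embeddings)
```

Each retraining run then starts from the previous optimum rather than from random initialization, which is what lets you keep the model fresh without discarding what earlier versions learned.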