  • Hi, I'm Robert Crowe,

  • and today I'm going to be talking about TensorFlow Extended,

  • also known as TFX,

  • and how it helps you put your amazing machine learning models

  • into production.

  • This is Episode 4 of our five-part series

  • on Real World Machine Learning in Production.

  • We've covered a lot so far in Episodes 1-3,

  • so if you haven't seen those yet,

  • I'd really recommend watching them.

  • In today's episode,

  • we'll be talking about distributed processing and components.

  • Let's get started.

  • ♪ (music) ♪

  • Let's talk about the components that come standard with TFX.

  • But, before we talk about the standard components,

  • let's talk about Apache Beam.

  • To handle distributed processing of large amounts of data,

  • especially for compute-intensive jobs like ML workloads,

  • you really need a distributed processing pipeline framework

  • like Apache Spark, or Apache Flink, or Google Cloud Dataflow.

  • So, several of the TFX components run on top of Apache Beam,

  • which is a unified programming model

  • that can run on nearly any execution engine.

  • Beam allows you to use

  • the distributed processing framework you already have,

  • or choose one that you'd like,

  • rather than forcing you to use the one that we chose.

  • Currently, Beam Python can run on Flink, Spark, and Dataflow runners,

  • but new runners are being added.

  • It also includes a local runner,

  • which enables you to run a TFX pipeline in development

  • on your local system, like your laptop.

  • So, in the case of the Transform component,

  • for example,

  • we use Beam to perform feature-engineering transformations

  • like creating a vocabulary or doing PCA.

  • That could be running on your Flink or Spark cluster,

  • or on Google Cloud, using Dataflow,

  • and, because it uses Beam,

  • you could migrate between them without changing your code.

  • In the case of the Trainer component,

  • we're really just using TensorFlow.

  • Remember when all we were thinking about

  • was training our amazing model?

  • That's the code we're using here.

  • Note that, currently, TFX only supports TensorFlow 1.x models and TF Estimators.

  • In the case of some components,

  • we're really only writing Python.

  • The Pusher component, for example, only needs Python to do its job.

  • So, when we put all these together and manage it all with an orchestrator,

  • this is what it looks like.

  • On the left, we're ingesting our data,

  • and on the right, we're pushing our saved models

  • to one or more of our deployment targets.

  • That includes model repositories like TensorFlow Hub,

  • or JavaScript environments using TensorFlow.js,

  • or native mobile applications using TensorFlow Lite,

  • or server farms using TensorFlow Serving,

  • or all of the above.

  • So now let's look at each of these components

  • in a little more detail.

  • First, we ingest our input data using ExampleGen.

  • ExampleGen is one of the components that runs on Beam.

  • It reads in data,

  • splits it into training and eval,

  • and formats it as TF examples.

  • This is what the configuration looks like for ExampleGen.

  • Very simple. Just two lines of Python.
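
The configuration shown on the slide isn't captured in these subtitles. As a separate illustration of the train/eval split that ExampleGen performs, here's a minimal sketch in plain Python. It is not ExampleGen's implementation or API: the `assign_split` and `split_records` helpers, the `trip_id` key, and the one-third eval fraction are all assumptions made for the example.

```python
import hashlib


def assign_split(record_key: str, eval_fraction: float = 1 / 3) -> str:
    """Deterministically assign a record to 'train' or 'eval' by hashing its key."""
    digest = hashlib.sha256(record_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "eval" if bucket < eval_fraction else "train"


def split_records(records, key_field):
    """Group records into train and eval lists using the hashed key."""
    splits = {"train": [], "eval": []}
    for record in records:
        splits[assign_split(str(record[key_field]))].append(record)
    return splits


# Hypothetical records; only trip_start_hour is a feature named in the talk.
records = [{"trip_id": i, "trip_start_hour": i % 24} for i in range(1000)]
splits = split_records(records, key_field="trip_id")
print(len(splits["train"]), len(splits["eval"]))
```

Hashing a stable key, rather than drawing random numbers, means the same record lands in the same split every time the pipeline runs.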

  • Next, we have StatisticsGen.

  • StatisticsGen makes a full pass over the data, using Beam,

  • one full epoch,

  • and calculates descriptive statistics for each of our features.

  • To do that, it leverages the TensorFlow Data Validation library

  • which includes support for some visualization tools

  • that you can run in a Jupyter notebook.

  • That lets you explore and understand your data

  • and find any issues that you may have.

  • This is typical data-wrangling stuff,

  • the same thing we all do when we're preparing our data

  • to train our model.
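
To give a feel for what "one full pass, descriptive statistics per feature" means, here's a rough sketch in plain Python. It does not use the TensorFlow Data Validation library, and it computes far fewer statistics than the real component would. The function names and the `fare` and `payment_type` features are assumptions made for illustration.

```python
import math
from collections import Counter


def new_feature_stats():
    return {"count": 0, "missing": 0, "sum": 0.0, "sum_sq": 0.0,
            "min": math.inf, "max": -math.inf, "value_counts": Counter()}


def compute_statistics(examples):
    """Single pass over the examples, accumulating running totals per feature."""
    stats = {}
    for example in examples:
        for name, value in example.items():
            s = stats.setdefault(name, new_feature_stats())
            if value is None:
                s["missing"] += 1
            elif isinstance(value, (int, float)):
                s["count"] += 1
                s["sum"] += value
                s["sum_sq"] += value * value
                s["min"] = min(s["min"], value)
                s["max"] = max(s["max"], value)
            else:
                s["count"] += 1
                s["value_counts"][value] += 1
    return stats


def summarize(s):
    """Turn the running totals for one feature into descriptive statistics."""
    if s["value_counts"]:
        return {"kind": "categorical", "count": s["count"], "missing": s["missing"],
                "unique": len(s["value_counts"]),
                "top": s["value_counts"].most_common(1)[0][0]}
    if s["count"] == 0:
        return {"kind": "unknown", "count": 0, "missing": s["missing"]}
    mean = s["sum"] / s["count"]
    variance = max(s["sum_sq"] / s["count"] - mean * mean, 0.0)
    return {"kind": "numeric", "count": s["count"], "missing": s["missing"],
            "mean": mean, "std_dev": math.sqrt(variance),
            "min": s["min"], "max": s["max"]}


# Hypothetical examples for illustration.
examples = [
    {"fare": 10.0, "payment_type": "Cash"},
    {"fare": 20.0, "payment_type": "Credit Card"},
    {"fare": None, "payment_type": "Cash"},
]
stats = compute_statistics(examples)
for name, feature_stats in stats.items():
    print(name, summarize(feature_stats))
```

Because only running totals are kept, the same accumulation could be split across workers and merged, which is the kind of job a framework like Beam is built for.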

  • Here's a better look at the visualization tools.

  • Right away, we can see that we might have a problem

  • with our trip_start_hour feature at 6:00 a.m.,

  • where we don't have a lot of data to make predictions.

  • Our model's performance at that time of day

  • might not be so great,

  • unless we go get some new data.
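
The same kind of gap can be spotted without a chart by counting examples per value and flagging the thin ones. This is a minimal sketch in plain Python on synthetic data. The `sparse_buckets` helper and the 1% threshold are assumptions for illustration, not part of the visualization tools.

```python
from collections import Counter


def sparse_buckets(values, all_buckets, min_fraction=0.01):
    """Return the buckets that hold less than min_fraction of the examples."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return list(all_buckets)
    return [b for b in all_buckets if counts[b] / total < min_fraction]


# Synthetic data: 100 trips for every hour except 6:00 a.m., which has only 2.
hours = [h for h in range(24) if h != 6 for _ in range(100)] + [6, 6]
print(sparse_buckets(hours, range(24)))  # prints [6]
```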

  • Our next component, SchemaGen,

  • also uses the TensorFlow Data Validation library.

  • It looks at the statistics which were generated by StatisticsGen

  • and tries to infer the types for each of our features,

  • including the range of categories for categorical features.

  • We can adjust the schema as needed,

  • like adding new categories that we expect to see.
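
Here's a small sketch of the idea behind schema inference and curation, in plain Python. It is not how SchemaGen or the TensorFlow Data Validation library represent a schema. The dictionary layout, the type names, and the `fare` and `payment_type` features are assumptions, and unlike SchemaGen it reads raw examples rather than statistics, to keep the example short.

```python
def infer_schema(examples):
    """Infer a type for each feature, plus the set of categories for strings."""
    schema = {}
    for example in examples:
        for name, value in example.items():
            if value is None:
                continue
            feature = schema.setdefault(name, {"type": None, "domain": set()})
            if isinstance(value, str):
                feature["type"] = "string"
                feature["domain"].add(value)
            elif isinstance(value, float) or feature["type"] == "float":
                # Once any float is seen, the feature stays float.
                feature["type"] = "float"
            else:
                feature["type"] = "int"
    return schema


# Hypothetical examples for illustration.
examples = [
    {"trip_start_hour": 6, "fare": 12.5, "payment_type": "Cash"},
    {"trip_start_hour": 18, "fare": 7, "payment_type": "Credit Card"},
]
schema = infer_schema(examples)

# Manual curation: add a category we expect to see but haven't observed yet.
schema["payment_type"]["domain"].add("Mobile")
print(schema)
```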

  • Our next component, ExampleValidator,

  • takes the statistics from StatisticsGen,

  • and the schema, which may be the output of SchemaGen

  • or the results of user curation,

  • and looks for problems.

  • It looks for anomalies, missing values,

  • or values that don't match our schema,

  • and produces a report of what it finds.

  • Remember that we're taking in new data all the time,

  • so we need to be aware of problems when they pop up.
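
As a rough picture of that check, here's a sketch in plain Python that compares per-feature statistics against a schema and returns a list of anomalies. It is not ExampleValidator's implementation or output format. The dictionary layouts, the `validate_statistics` helper, and the sample values are all assumptions made for the example.

```python
def validate_statistics(statistics, schema):
    """Compare per-feature statistics against a schema; return (feature, problem) pairs."""
    anomalies = []
    for name, spec in schema.items():
        feature_stats = statistics.get(name)
        if feature_stats is None:
            anomalies.append((name, "feature is absent from the data"))
            continue
        missing = feature_stats.get("missing", 0)
        if spec.get("required", False) and missing > 0:
            anomalies.append((name, f"{missing} missing value(s)"))
        if "domain" in spec:
            seen = set(feature_stats.get("values", []))
            unexpected = sorted(seen - set(spec["domain"]))
            if unexpected:
                anomalies.append((name, f"unexpected categories: {unexpected}"))
        if "min" in spec and feature_stats.get("min", spec["min"]) < spec["min"]:
            anomalies.append((name, f"minimum {feature_stats['min']} is below {spec['min']}"))
        if "max" in spec and feature_stats.get("max", spec["max"]) > spec["max"]:
            anomalies.append((name, f"maximum {feature_stats['max']} is above {spec['max']}"))
    for name in statistics:
        if name not in schema:
            anomalies.append((name, "feature is not in the schema"))
    return anomalies


# Hypothetical schema and statistics for illustration.
schema = {
    "trip_start_hour": {"required": True, "min": 0, "max": 23},
    "payment_type": {"required": True, "domain": ["Cash", "Credit Card"]},
}
statistics = {
    "trip_start_hour": {"missing": 0, "min": 0, "max": 25},
    "payment_type": {"missing": 3, "values": ["Cash", "Prcard"]},
    "tips": {"missing": 0, "min": 0.0, "max": 9.0},
}
for feature, problem in validate_statistics(statistics, schema):
    print(f"{feature}: {problem}")
```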

  • Transform is one of the more complex components

  • and requires a bit more configuration as well as additional code.

  • Transform uses Beam to do feature engineering,

  • applying transformations to your features

  • to improve the performance of your model.

  • For example,

  • Transform can create vocabularies, or bucketize values,

  • or run PCA over your input.
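
To show what two of those transformations do, here's a minimal sketch of a vocabulary and a bucketizer in plain Python. These are not the Transform component's functions. The helper names, the out-of-vocabulary index of -1, and the sample values are assumptions for illustration.

```python
import bisect
from collections import Counter


def build_vocabulary(values, top_k=None):
    """Map each distinct value to an integer id, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    if top_k is not None:
        ordered = ordered[:top_k]
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(value, vocabulary, oov_index=-1):
    """Look up a value's id, falling back to oov_index for unseen values."""
    return vocabulary.get(value, oov_index)


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into."""
    return bisect.bisect_right(boundaries, value)


vocabulary = build_vocabulary(["Cash", "Credit Card", "Cash", "Unknown"])
print(vocabulary)
print(apply_vocabulary("Mobile", vocabulary))
print(bucketize(7.5, [5.0, 10.0, 20.0]))
```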

  • The code that you write depends on what feature engineering

  • you need to do for your model and dataset.

  • Transform will make a full pass over your data, one full epoch,

  • and create two different kinds of results.

  • For things like calculating the median

  • or standard deviation of a feature,

  • numbers which are the same for all examples,

  • Transform will output a constant.

  • For things like normalizing a value,

  • values which will be different for different examples,

  • Transform will output TensorFlow ops.

  • Transform will then output a TensorFlow graph

  • with those constants and ops.

  • That graph is hermetic,

  • so it contains all of the information you need

  • to apply those transformations,

  • and will form the input stage for your model.

  • That means that the same transformations are applied consistently

  • between training and serving,

  • which eliminates training/serving skew.
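
The split between constants and per-example ops can be sketched in plain Python. This is only an analogy for the idea, not TensorFlow code and not the Transform component's API. The `analyze` and `make_transform_fn` helpers and the sample fares are assumptions for illustration.

```python
import math


def analyze(training_values):
    """Full pass over the training data: compute the constants."""
    count = len(training_values)
    mean = sum(training_values) / count
    variance = sum((v - mean) ** 2 for v in training_values) / count
    return {"mean": mean, "std_dev": math.sqrt(variance)}


def make_transform_fn(constants):
    """Bundle the constants with the per-example operation into one function."""
    mean, std_dev = constants["mean"], constants["std_dev"]

    def transform(value):
        # Per-example op: the result differs for each example,
        # but the constants baked in are always the same.
        return (value - mean) / std_dev if std_dev else 0.0

    return transform


# Hypothetical training data for illustration.
training_fares = [10.0, 20.0, 10.0, 20.0]
constants = analyze(training_fares)
transform = make_transform_fn(constants)

# The same function object is used at training time and at serving time,
# so there is no second implementation that could drift.
training_features = [transform(fare) for fare in training_fares]
serving_feature = transform(25.0)
print(constants, training_features, serving_feature)
```

The returned function carries its constants with it, which is the plain-Python counterpart of a self-contained graph: nothing outside it is needed to apply the transformation.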

  • If, instead, you're moving your model from a training environment

  • into a serving environment or application,

  • and trying to apply the same feature engineering in both places,

  • you hope that the transformations are the same,

  • but sometimes you find that they're not.

  • We call that training/serving skew,

  • and Transform eliminates it

  • by using exactly the same code anywhere you run your model.

  • Now we're finally ready to train our model,

  • the part of the process that you often think about

  • when you think about machine learning.

  • Trainer takes in the Transform graph and data from Transform,

  • and the schema from SchemaGen,

  • and trains a model using your modeling code.

  • This is normal model training,

  • but when training is complete,

  • Trainer will save two different SavedModels.

  • One is a normal SavedModel that will be deployed to production,

  • and the other is an EvalSavedModel

  • that will be used for analyzing the performance of your model.

  • The configuration for Trainer is what you'd expect:

  • things like the number of steps,

  • and whether or not to use warm_starting.

  • The code that you create for Trainer is your modeling code,

  • so it can be as simple or complex as you need it to be.

  • To monitor and analyze the training process,

  • you can use TensorBoard, just like you would normally.

  • In this case, you can look at the current model-training run

  • or compare the results from multiple model-training runs.

  • This is only possible because of the ML-Metadata store,

  • which we talked about in our last episode.

  • TFX makes it fairly easy to do this kind of comparison,

  • which is often revealing.

  • Now that we've trained our model,

  • how do the results look?

  • The Evaluator component will take the EvalSavedModel

  • the Trainer created,

  • and the original input data,

  • and do deep analysis, using Beam

  • and the TensorFlow Model Analysis library.

  • It's not just looking at the top-level results