
KONSTANTINOS KATSIAPIS: Hello, everyone. Good morning. I'm Gus Katsiapis, and I'm a principal engineer on TFX.

ANUSHA RAMESH: Hi, everyone. I'm Anusha, and I'm a product manager on TFX.

KONSTANTINOS KATSIAPIS: Today, we'll talk to you about our end-to-end ML platform, TensorFlow Extended, otherwise known as TFX, on behalf of the TFX team.

So the discipline of software engineering has evolved over the last five-plus decades to a good level of maturity. If you think about it, this is both a blessing and a necessity, because our lives usually depend on it. At the same time, the popularity of ML has been increasing rapidly over the last two-plus decades, and over the last decade or so it has been used very actively in both experimentation and production settings. It is no longer uncommon for ML to power widely used applications that we use every day.

So much as was the case for software engineering, the wide use of ML technology necessitates the evolution of the discipline from ML coding to ML engineering. As most of you know, doing ML in production requires a lot more than just a trainer. For example, the trainer code in an ML production system is usually only 5% to 10% of the entire codebase. And similarly, the amount of time that engineers spend on the trainer is often dwarfed by the time they spend preparing the data: ensuring it is of good quality, ensuring it is unbiased, et cetera.

At the same time, research eventually makes its way into production, and ideally one wouldn't need to change stacks in order to evolve an idea into a product. So what is needed here is flexibility, robustness, and a consistent system that allows you to apply ML in a product. And remember that the ML code itself is a tiny piece of the entire puzzle.

ANUSHA RAMESH: Now, here is a concrete example of the difference between ML coding and ML engineering. As you can see in this use case, it took about three weeks to build the model, yet about a year later it is still not deployed in production. Similar stories used to be common at Google as well, but we have made things noticeably easier over the past decade by building ML platforms like TFX.

Now, ML platforms at Google are not a new thing. We have been building Google-scale machine learning platforms for quite a while. Sibyl, a precursor to TFX, started about 12 years ago, and a lot of the design, code, and best practices that came out of Sibyl have been incorporated into the design of TFX. While TFX shares several core principles with Sibyl, it also augments them along several important dimensions. This has made TFX the most widely used end-to-end ML platform at Alphabet, while also being available on premises and on GCP.

The vision of TFX is to provide an end-to-end ML platform for everyone. By providing this platform, our goal is to proliferate the practice of ML engineering and thereby improve ML-powered applications. But let's discuss what it means to be an ML platform and what parts are required to help us realize this vision.

KONSTANTINOS KATSIAPIS: So today, we're going to tell you a little bit more about how we enabled global-scale ML engineering at Google, from best practices and libraries all the way to a full-fledged end-to-end ML platform.

So let's start from the beginning. Machine learning is hard. Doing it well is harder. And doing it in production to power applications is even harder. We want to help others avoid the many, many pitfalls that we have encountered in the past.

And to that end, we publish papers, blog posts, and other material that capture a lot of our learnings and best practices. Here are but a few examples of our publications. They capture collective lessons learned from more than a decade of applied ML at Google, and several of them, like the "Rules of Machine Learning," are quite comprehensive. We won't have time to go into them today as part of this talk, obviously, but we encourage you to take a look when you get a chance.

ANUSHA RAMESH: While best practices are great, communicating best practices alone is not sufficient. It does not scale, because the practices do not get applied in code. So we want to capture our learnings and best practices in code. We want to enable our users to reuse these best practices and, at the same time, give them the ability to pick and choose. To that end, we offer standard, data-parallel libraries.

Now, here are a few examples of libraries that we offer our developers for different phases of machine learning. As you can see, we offer libraries for almost every step of your ML workflow, from data validation, to feature transformations, to analyzing the quality of a model, all the way to serving it in production.
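For instance, the data-validation step is covered by the TensorFlow Data Validation (TFDV) library. Here is a minimal sketch of how it can be used on its own; the CSV path is hypothetical.

```python
# A minimal sketch of standalone data validation with TensorFlow Data
# Validation (TFDV); the CSV path below is hypothetical.
import tensorflow_data_validation as tfdv

# Compute descriptive statistics over the raw training data.
stats = tfdv.generate_statistics_from_csv('/path/to/train.csv')

# Infer an initial schema from those statistics, then validate the
# statistics against the schema to surface anomalies.
schema = tfdv.infer_schema(stats)
anomalies = tfdv.validate_statistics(stats, schema)
tfdv.display_anomalies(anomalies)  # Renders a summary in a notebook.
```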

We also make transfer learning easy by providing TensorFlow Hub.
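As an illustration, reusing a pre-trained text embedding from TensorFlow Hub can look like the following sketch, assuming TF2 and the tensorflow_hub package; the module handle is one of the publicly available embeddings on tfhub.dev.

```python
# Illustrative transfer learning with a pre-trained text embedding from
# TensorFlow Hub, assuming TF2 and tensorflow_hub are installed.
import tensorflow as tf
import tensorflow_hub as hub

model = tf.keras.Sequential([
    # Reuse a published text-embedding module as a frozen first layer.
    hub.KerasLayer('https://tfhub.dev/google/nnlm-en-dim50/2',
                   input_shape=[], dtype=tf.string, trainable=False),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # Binary classifier head.
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```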

ML Metadata (MLMD) is a library for recording and retrieving metadata for ML workflows. Now, the best part about these libraries is that they are highly modular, which makes it easy to plug them into your existing ML infrastructure.
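To give a feel for the MLMD API, here is a minimal sketch that records and retrieves a single artifact using the in-memory database that ships with the library; the type and property names are illustrative.

```python
# A minimal sketch of recording and retrieving an artifact with
# ML Metadata (MLMD); the type and property names are illustrative.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.fake_database.SetInParent()  # In-memory store, for demonstration.
store = metadata_store.MetadataStore(config)

# Register an artifact type, then record one artifact of that type.
data_type = metadata_store_pb2.ArtifactType()
data_type.name = 'DataSet'
data_type.properties['split'] = metadata_store_pb2.STRING
type_id = store.put_artifact_type(data_type)

artifact = metadata_store_pb2.Artifact()
artifact.type_id = type_id
artifact.uri = '/path/to/examples'  # Hypothetical location on disk.
artifact.properties['split'].string_value = 'train'
[artifact_id] = store.put_artifacts([artifact])

# Retrieve the artifact later, e.g. for lineage inspection.
print(store.get_artifacts_by_id([artifact_id]))
```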

KONSTANTINOS KATSIAPIS: We have found that libraries are not enough within Alphabet, and we expect the same elsewhere. Not all users need or want the full flexibility, and some of them might actually be confused by it. Many users prefer out-of-the-box solutions. So what we do is manage the release of our libraries. We ensure they are nicely packaged and optimized, but importantly, we also offer higher-level APIs, and those frequently come in the form of binaries or containers.

ANUSHA RAMESH: Libraries and binaries provide a lot of flexibility to our users, but this is not sufficient for ML workflows. ML workflows typically involve inspecting and manipulating several types of artifacts. So we provide components, which interact through well-defined and strongly-typed artifact APIs. The components also understand the context and environment in which they operate, and they can be interconnected with one another. We also provide UI components for visualization of said artifacts.
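As a rough sketch of what this interconnection looks like in code, assuming a recent TFX release: each component exposes typed output channels that downstream components consume, so mismatched wiring is caught when the pipeline is constructed. The data path below is hypothetical.

```python
# A rough sketch of interconnecting TFX components through typed
# artifact channels; the data path below is hypothetical.
from tfx.components import CsvExampleGen, SchemaGen, StatisticsGen

example_gen = CsvExampleGen(input_base='/path/to/data')

# StatisticsGen consumes the Examples artifact that ExampleGen produces.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

# SchemaGen, in turn, consumes the ExampleStatistics artifact.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
```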

That brings us to a new functionality we're launching at TensorFlow World: you can run any TFX component in a notebook. As you can see here, you can run TFX components cell by cell. This example showcases a couple of components. The first one is ExampleGen, which ingests data into a TFX pipeline and is typically the first component that you use. The second one is StatisticsGen, which computes statistics for visualization and example validation. So when you run a component like StatisticsGen in a notebook, you can visualize something like this, which showcases stats on your data and helps you detect anomalies.
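A minimal notebook session along these lines might look like the following sketch, assuming a recent TFX release; InteractiveContext is the notebook-oriented runner, and the CSV path is hypothetical.

```python
# A minimal sketch of running TFX components cell by cell in a notebook;
# the CSV path below is hypothetical.
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration.experimental.interactive.interactive_context import (
    InteractiveContext)

context = InteractiveContext()  # Sets up an ephemeral pipeline and MLMD store.

# Cell 1: ingest CSV data into the pipeline as tf.Example records.
example_gen = CsvExampleGen(input_base='/path/to/data')
context.run(example_gen)

# Cell 2: compute statistics over the ingested examples.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)

# Cell 3: visualize the statistics inline to help spot anomalies.
context.show(statistics_gen.outputs['statistics'])
```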

The benefit of running TFX components in a notebook is twofold. First, it makes it easy for users to onboard onto TFX: it helps you understand the various components of TFX, how to connect them, and the order in which to run them. Second, it helps with debugging the various steps of your ML workflow as you go through the notebook.

KONSTANTINOS KATSIAPIS: Through our experience, though, we've learned that components aren't actually sufficient for production ML. Manually orchestrating components can become cumbersome and, importantly, error-prone. And then there is also understanding the lineage of all the artifacts that are produced by those components--