
  • FRANCOIS CHOLLET: Hello, everyone.

  • I'm Francois.

  • And I work on the Keras team.

  • I'm going to be talking about TensorFlow Keras.

  • So this talk will mix information

  • about how to use the Keras API in TensorFlow

  • and how the Keras API is implemented under the hood.

  • So we'll cover an overview of the Keras architecture.

  • We'll do a deep dive into the layer class and the model

  • class.

  • We'll have an overview of the functional API

  • and a number of features that are

  • specific to functional models.

  • We'll look at how training and inference work.

  • And finally, we'll look at custom losses and metrics.

  • So this is the overview of the Keras architecture and all

  • the different submodules and the different classes

  • you should know about.

  • The core of the Keras implementation is the engine

  • module, which contains the layer class--

  • the base layer class from which all layers inherit,

  • as well as the network class, which is--

  • basically a directed acyclic graph of layers--

  • as well as the model class, which takes the network

  • class and adds training and evaluation capabilities on top

  • of it; and also the sequential class, which is, again,

  • another type of model which just wraps a list of layers.

  • Then we have the layers module, where all the actual

  • usable instances of layers go.

  • We have losses and metrics with a base class for each,

  • and a number of concrete instances

  • that you can use in your models.

  • We have callbacks, optimizers, regularizers,

  • and constraints, which are similar, smaller modules.

  • So in this presentation, we go mostly

  • over what's going on in the Engine module, and also losses

  • and metrics, not so much callbacks, optimizers,

  • regularizers, and constraints.

  • So in general, for any of these topics,

  • you could easily do a one-hour talk.

  • So I'm just going to focus on the most important information.

  • So let's start with the layer class.

  • So the layer is the core abstraction in the Keras API.

  • I think if you want to have a simple API,

  • then you should have one abstraction

  • that everything is centered on.

  • And in the case of Keras, it's a layer.

  • Everything in Keras pretty much is a layer or something

  • that interacts closely with layers,

  • like models.

  • So a layer has a lot of responsibilities,

  • lots of built-in features.

  • At its core, a layer is a container for some computation.

  • So it's in charge of transforming a batch of inputs

  • into a batch of outputs.

  • Very importantly, this is batchwise computation,

  • meaning that you expect N samples as inputs,

  • and you're going to be returning N output samples.

  • And the computation should typically not

  • see any interaction between samples.

  • And it's meant to work with both eager execution

  • and graph execution.

  • All the built-in layers in Keras support both.

  • But user-written layers could be only eager, potentially.

  • We support having layers that have two different modes--

  • AUDIENCE: So this would mean that different layers

  • can support either graph or eager execution?

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: Yeah, OK.

  • FRANCOIS CHOLLET: That's right.

  • And typically, most layers are going to be supporting both.

  • If you only support eager, it typically

  • means that you're doing things that

  • are impossible to express as graphs,

  • such as recursive networks like Tree-LSTMs.

  • This is actually something that we'll

  • cover in this presentation.

  • So, yeah, layers also support two modes-- a training mode

  • and an inference mode--

  • and can do different things in each mode,

  • which matters for layers

  • like the dropout layer or the batch normalization layer.

  • There's support for built-in masking, which

  • is about specifying certain timesteps

  • in inputs that you want to ignore.

  • This is very useful, in particular,

  • if you're doing sequence processing with sequences where

  • you have padded time steps or where

  • you have missing time steps.
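To make the masking idea concrete, here is a small sketch (illustrative, not from the slides; vocabulary size and dimensions are made up): an Embedding layer with `mask_zero=True` marks padded timesteps so mask-consuming layers downstream can skip them.

```python
import tensorflow as tf

# With mask_zero=True, the Embedding layer generates a mask marking
# padded (zero) timesteps; mask-consuming layers such as LSTM will
# ignore those steps.
embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8,
                                      mask_zero=True)
padded = tf.constant([[3, 7, 0, 0],
                      [5, 0, 0, 0]])  # 0 marks padded timesteps
x = embedding(padded)                 # shape (2, 4, 8)
mask = embedding.compute_mask(padded)
# mask: [[True, True, False, False],
#        [True, False, False, False]]
```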

  • A layer is also a container for state, meaning variables.

  • So, in particular, a trainable state--

  • the trainable weights on the layer,

  • which is what parametrizes the computation of the layer

  • and that you update during back propagation;

  • and the nontrainable weights, which

  • could be anything else that is manually managed by the layer

  • implementer.

  • It's also potentially a container

  • that you can use to track losses and metrics that you define

  • on the fly during computation.

  • This is something we'll cover in detail.

  • Layers can also do a form of static type checking.

  • So they can check--

  • there is infrastructure that's built in

  • to check the assumptions that the layer is making

  • about its inputs, so that we can raise nice and helpful error

  • messages in case of user error.

  • We support state freezing for layers,

  • which is useful for things like fine-tuning,

  • and transfer learning, and GANs.
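A quick sketch of what freezing looks like in practice (illustrative, not slide code): setting `trainable = False` on a layer moves its variables out of `trainable_weights`, so they are excluded from gradient updates.

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(4)
layer.build((None, 8))   # create the kernel and bias variables
# Freezing: the variables move from trainable_weights to
# non_trainable_weights and are skipped during training.
layer.trainable = False
```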

  • We have infrastructure for serializing and deserializing

  • layers and saving and loading their state.

  • We have an API that you can use to build directed

  • acyclic graphs of layers.

  • It's called a functional API.

  • We'll cover it in detail.

  • And in the near future, layers will also

  • have built-in support for mixed precision.

  • So layers do lots of things.

  • They don't do everything.

  • They have some assumptions.

  • They have some restrictions.

  • In particular, gradients are not something

  • that you specify on the layer.

  • You cannot specify a custom backward pass on a layer,

  • but this is something we're actually considering adding,

  • potentially, something like a gradient method on the layer.

  • So it's not currently a feature.

  • They do not support most low-level considerations,

  • such as device placement, for instance.

  • They do not generally take into account distribution.

  • So they do not include distribution-specific logic.

  • At least, that should be true.

  • In practice, it's almost true.

  • So they're as distribution agnostic as possible.

  • And very importantly, they only support batchwise computation,

  • meaning that anything a layer does

  • should start with a tensor containing--

  • or a nested structure of tensors containing N samples

  • and should output also N samples.

  • That means, for instance, you're not

  • going to do non-batch computation, such as bucketing

  • samples of the same length.

  • When you're doing timeseries processing,

  • you're not going to process [INAUDIBLE] data

  • sets with layers.

  • You're not going to have layers that don't have an input

  • or don't have an output outside of a very specific case, which

  • is the input layer, which we will cover.

  • So this is the most basic layer

  • you could possibly write. It has a constructor in which you

  • create two tf.Variables.

  • And you say these variables are trainable.

  • And you assign them as attributes on the layer.

  • And then it has a call method, which takes

  • the batch of inputs and performs the computation, in this case,

  • just w x plus b.

  • So what happens when you instantiate

  • this layer is that it's going to create these two variables,

  • set them as attributes.

  • And they are automatically tracked into this list,

  • trainable_weights.
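A hedged reconstruction of that most basic layer (the class and attribute names here are illustrative; the exact slide code may differ):

```python
import tensorflow as tf

class Linear(tf.keras.layers.Layer):
    """Creates its weights eagerly in the constructor."""

    def __init__(self, units=32, input_dim=32):
        super().__init__()
        # Variables assigned as attributes are tracked by the layer.
        self.w = tf.Variable(tf.random.normal([input_dim, units]),
                             trainable=True)
        self.b = tf.Variable(tf.zeros([units]), trainable=True)

    def call(self, inputs):
        # Batchwise computation: N samples in, N samples out.
        return tf.matmul(inputs, self.w) + self.b

layer = Linear(units=4, input_dim=2)
y = layer(tf.ones([3, 2]))  # a batch of 3 samples, 2 features each
```

Calling `layer(...)` invokes `__call__`, which defers to `call`, and the two variables show up in `layer.trainable_weights`.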

  • And when you call the layer using the __call__ operator,

  • it's just going to defer to this call method.

  • So in practice, most layers you're going to write

  • are going to be a little bit more refined.

  • They're going to look like this.

  • So this is a lazy layer.

  • So in the constructor, you do not create weights.

  • And the reason you do not create weights

  • is because you want to be able to instantiate

  • your layer without knowing what the input shape is going to be.

  • Whereas in the previous case, here--

  • so this is the previous slide--

  • you had to pass the input dimension as a constructor

  • argument.

  • So in this case, you don't have to do this

  • because you're going to create the state in a build method,

  • which takes an input shape argument.

  • And when you instantiate the layer,

  • it does not have any weights.

  • And when you call it for the first time,

  • this __call__ operator is going to chain the build method that you

  • have here and the call method.

  • And in the build method, you see, we

  • use this add_weight shortcut.

  • So it's basically just a slightly shorter version

  • of creating a variable and assigning it on the layer.
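A sketch of that lazy version (again a reconstruction, not the exact slide code): weight creation moves into `build`, so the input dimension no longer needs to be a constructor argument.

```python
import tensorflow as tf

class Linear(tf.keras.layers.Layer):
    """Defers weight creation to build(), so input_dim isn't needed."""

    def __init__(self, units=32):
        super().__init__()
        self.units = units  # no weights yet: input shape is unknown

    def build(self, input_shape):
        # Called by __call__ the first time the layer is invoked.
        # add_weight is shorthand for creating a variable and
        # assigning it on the layer.
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal",
                                 trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="zeros",
                                 trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

layer = Linear(units=4)
assert not layer.weights    # unbuilt: no variables yet
y = layer(tf.ones([3, 2]))  # build() runs here, then call()
```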

  • OK.

  • Layers can also have nontrainable states.

  • So a trainable state is just variables

  • that are tracked in trainable_weights.

  • Nontrainable state is tracked in non_trainable_weights.

  • It's very simple.

  • So in this layer, in the constructor,

  • you create this self.total scalar variable

  • that starts at 0.

  • You specify it's nontrainable.

  • And in the computation method, you just update this variable.

  • And basically, you just keep track

  • of the total sum of the inputs seen by this layer.

  • So it's a kind of useless layer.

  • And as you see, every time you call

  • this layer, the value of this variable is updated.
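A sketch of such a layer with nontrainable state (the name ComputeSum and the scalar total are illustrative, following the description above):

```python
import tensorflow as tf

class ComputeSum(tf.keras.layers.Layer):
    """Keeps a running total of all inputs it has seen."""

    def __init__(self):
        super().__init__()
        # Nontrainable state: updated manually in call(), not by
        # backpropagation; tracked in non_trainable_weights.
        self.total = tf.Variable(0.0, trainable=False)

    def call(self, inputs):
        self.total.assign_add(tf.reduce_sum(inputs))
        return inputs

layer = ComputeSum()
layer(tf.ones([2, 2]))  # total is now 4.0
layer(tf.ones([2, 2]))  # total is now 8.0
```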

  • Layers can be nested.

  • So you can set layer instances as attributes of a layer,

  • even if they're unbuilt, like here.

  • And when you do that, the outer container-- so in this case,

  • the MLPBlock instance-- is going to be keeping

  • track of the trainable weights and nontrainable weights

  • of the underlying layers.

  • And all these layers, which in the constructor are unbuilt,

  • are going to be built, so have their variables

  • created the first time you call the outer instance.
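The nesting pattern described here, sketched (layer sizes are illustrative):

```python
import tensorflow as tf

class MLPBlock(tf.keras.layers.Layer):
    """An outer layer that nests three unbuilt Dense layers."""

    def __init__(self):
        super().__init__()
        # Sublayers are unbuilt here; their variables are created
        # the first time the outer layer is called.
        self.dense_1 = tf.keras.layers.Dense(32, activation="relu")
        self.dense_2 = tf.keras.layers.Dense(32, activation="relu")
        self.dense_3 = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = self.dense_1(inputs)
        x = self.dense_2(x)
        return self.dense_3(x)

mlp = MLPBlock()
y = mlp(tf.ones([2, 16]))
# The outer layer tracks sublayer weights: 3 kernels + 3 biases.
```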

  • And this is the most basic way in which

  • you can be using a layer.

  • You would just instantiate it,

  • grab some loss function,

  • which could be anything.

  • Grab some optimizer.

  • Iterate with some data--

  • so we have input data and targets.

  • Open the GradientTape.

  • Call the layer inside the GradientTape--

  • so that the operations done by the call method

  • are recorded on the tape.

  • Call your loss function to get some loss value.

  • And then you use the tape and the loss value

  • to retrieve the gradients of the trainable state of the layer.

  • Then you apply these gradients.

  • So this is a full, end-to-end, training loop.

  • By that point, you know about layers,

  • which are containers for state and computation;

  • you know about trainable state, nontrainable state.

  • You know about nesting layers.

  • And you know about training them with this kind of loop.

  • So typically, you would put part

  • of the loop-- everything starting

  • with opening the GradientTape and ending

  • with applying the gradients--

  • in a tf.function to get graph execution

  • and faster performance.
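The loop described above can be sketched as follows (a minimal, hedged example; the layer, data, and learning rate are illustrative, not from the slides):

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(1)
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

@tf.function  # compile the step into a graph for faster execution
def train_step(x, y):
    with tf.GradientTape() as tape:
        preds = layer(x)          # call-method ops recorded on the tape
        loss = loss_fn(y, preds)
    # Use the tape and the loss value to get gradients of the
    # trainable state, then apply them.
    grads = tape.gradient(loss, layer.trainable_weights)
    optimizer.apply_gradients(zip(grads, layer.trainable_weights))
    return loss

x = tf.ones([8, 4])       # dummy input batch
y = tf.zeros([8, 1])      # dummy targets
losses = [float(train_step(x, y)) for _ in range(20)]
```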

  • So when you know all these things,

  • you can use the Keras API to implement literally anything.

  • However, it's going to be a fairly low-level way

  • of implementing things.

  • So by that point, you know everything except you actually

  • know nothing.

  • So let's go further.

  • One neat feature is that you can use layers

  • to keep track of losses and also metrics

  • that you define on the fly during computation.

  • Let's say, for instance, with our linear layer, after we

  • compute w x plus b, we want to keep track--

  • we want to