
  • FRANCOIS CHOLLET: Hello, everyone.

  • I'm Francois.

  • And I work on the Keras team.

  • I'm going to be talking about TensorFlow Keras.

  • So this talk will mix information

  • about how to use the Keras API in TensorFlow

  • and how the Keras API is implemented under the hood.

  • So we'll cover an overview of the Keras architecture.

  • We'll do a deep dive into the layer class and the model

  • class.

  • We'll have an overview of the functional API

  • and a number of features that are

  • specific to functional models.

  • We'll look at how training and inference work.

  • And finally, we'll look at custom losses and metrics.

  • So this is the overview of the Keras architecture and all

  • the different submodules and the different classes

  • you should know about.

  • The core of the Keras implementation is the engine

  • module, which contains the layer class--

  • the base layer class from which all layers inherit,

  • as well as the network class, which basically

  • models a directed acyclic graph of layers;

  • as well as the model class, which takes the network class

  • but adds training and evaluation on top of it;

  • and also the sequential class, which is, again,

  • another type of model which just wraps a list of layers.

  • Then we have the layers module, where all the actual

  • usable instances of layers go.

  • We have losses and metrics with a base class for each,

  • and a number of concrete instances

  • that you can use in your models.

  • We have callbacks, optimizers, regularizers,

  • and constraints, which are structured much like the other modules.

  • So in this presentation, we go mostly

  • over what's going on in the Engine module, and also losses

  • and metrics, not so much callbacks, optimizers,

  • regularizers, and constraints.

  • So in general, for any of these topics,

  • you could easily do a one-hour talk.

  • So I'm just going to focus on the most important information.

  • So let's start with the layer class.

  • So the layer is the core abstraction in the Keras API.

  • I think if you want to have a simple API,

  • then you should have one abstraction

  • that everything is centered on.

  • And in the case of Keras, it's a layer.

  • Everything in Keras pretty much is a layer or something

  • that interacts closely with layers, like models.

  • So a layer has a lot of responsibilities,

  • lots of built-in features.

  • At its core, a layer is a container for some computation.

  • So it's in charge of transforming a batch of inputs

  • into a batch of outputs.

  • Very importantly, this is batchwise computation,

  • meaning that you expect N samples as inputs,

  • and you're going to be returning N output samples.

  • And the computation should typically not

  • see any interaction between samples.

  • And it's meant to work with both eager execution

  • and graph execution.

  • All the built-in layers in Keras support both.

  • But user-written layers could be only eager, potentially.

  • We support having layers that have two different modes--

  • AUDIENCE: So this would mean that different layers

  • can support either graph or eager?

  • FRANCOIS CHOLLET: Yes.

  • AUDIENCE: Yeah, OK.

  • FRANCOIS CHOLLET: That's right.

  • And typically, most layers are going to be supporting both.

  • If you only support eager, it typically

  • means that you're doing things that

  • are impossible to express as graphs,

  • such as recursive layers-- tree LSTMs, for instance.

  • This is actually something that we'll

  • cover in this presentation.

  • So, yeah, so layers also support two modes-- a training mode

  • and an inference mode--

  • and can do different things in each mode,

  • which is something you need for the dropout layer

  • or the batch normalization layer.

  • There's support for built-in masking, which

  • is about specifying certain time steps

  • in your inputs that you want to ignore.

  • This is very useful, in particular,

  • if you're doing sequence processing with sequences where

  • you have padded time steps or where

  • you have missing time steps.

  • A layer is also a container for state, meaning variables.

  • So, in particular, a trainable state--

  • the trainable weights on the layer,

  • which is what parametrizes the computation of the layer

  • and that you update during back propagation;

  • and the nontrainable weights, which

  • could be anything else that is manually managed by the layer

  • implementer.

  • It's also potentially a container

  • that you can use to track losses and metrics that you define

  • on the fly during computation.

  • This is something we'll cover in detail.

  • Layers can also do a form of static type checking.

  • There is infrastructure that's built in

  • to check the assumptions that the layer is making

  • about its inputs, so that we can raise nice and helpful error

  • messages in case of user error.

  • We support state freezing for layers,

  • which is useful for things like fine-tuning,

  • and transfer learning, and GANs.

  • You have infrastructure for serializing and deserializing

  • layers and saving and loading their state.

  • We have an API that you can use to build directed

  • acyclic graphs of layers.

  • It's called the functional API.

  • We'll cover it in detail.

  • And in the near future, layers will also

  • have built-in support for mixed precision.

  • So layers do lots of things.

  • They don't do everything.

  • They have some assumptions.

  • They have some restrictions.

  • In particular, gradients are not something

  • that you specify on the layer.

  • You cannot specify a custom backwards pass on a layer,

  • but this is something we're actually considering adding,

  • potentially, something like a gradient method on the layer.

  • So it's not currently a feature.

  • They do not support most low-level considerations,

  • such as device placement, for instance.

  • They do not generally take into account distribution.

  • So they do not include distribution-specific logic.

  • At least, that should be true.

  • In practice, it's almost true.

  • So they're as distribution agnostic as possible.

  • And very importantly, they only support batchwise computation,

  • meaning that anything a layer does

  • should start with a tensor containing--

  • or a nested structure of tensors containing N samples

  • and should output also N samples.

  • That means, for instance, you're not

  • going to do non-batch computation, such as bucketing

  • samples of the same length,

  • when you're doing sequence processing.

  • You're not going to process entire data

  • sets with layers.

  • You're not going to have layers that don't have an input

  • or don't have an output outside of a very specific case, which

  • is the input layer, which we will cover.

  • So this is the most basic layer

  • you could possibly write: it has a constructor in which you

  • create two tf.Variables.

  • And you say these variables are trainable.

  • And you assign them as attributes on the layer.

  • And then it has a call method, which essentially takes

  • the batch of inputs and does this batchwise computation,

  • in this case, just w x plus b.

  • So what happens when you instantiate

  • this layer is that it's going to create these two variables,

  • set them as attributes.

  • And they are automatically tracked in this list,

  • trainable_weights.

  • And when you call the layer using the __call__ operator,

  • it's just going to defer to this call method.
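A minimal sketch of that most basic layer, along the lines of the slide (the name `Linear` and the dimensions are illustrative):

```python
import tensorflow as tf

class Linear(tf.keras.layers.Layer):
    """Most basic layer: variables created eagerly in the constructor."""

    def __init__(self, units=32, input_dim=32):
        super().__init__()
        # Variables assigned as attributes are automatically tracked
        # in `self.trainable_weights`.
        self.w = tf.Variable(tf.random.normal([input_dim, units]),
                             trainable=True)
        self.b = tf.Variable(tf.zeros([units]), trainable=True)

    def call(self, inputs):
        # Batchwise computation: N samples in, N samples out.
        return tf.matmul(inputs, self.w) + self.b

layer = Linear(units=4, input_dim=2)
y = layer(tf.ones([3, 2]))  # __call__ defers to call()
assert len(layer.trainable_weights) == 2
```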

  • So in practice, most layers you're going to write

  • are going to be a little bit more refined.

  • They're going to look like this.

  • So this is a lazy layer.

  • So in the constructor, you do not create weights.

  • And the reason you do not create weights

  • is because you want to be able to instantiate

  • your layer without knowing what the input shape is going to be.

  • Whereas in the previous case, here--

  • so this is the previous slide--

  • you had to pass the input dimension as a constructor

  • argument.

  • So in this case, you don't have to do this

  • because you're going to create the state in a build method,

  • which takes an input shape argument.

  • And when you instantiate the layer,

  • it does not have any weights.

  • And when you call it for the first time,

  • the __call__ operator is going to chain the build method that you

  • have here and the call method.

  • And in the build method, you see, we

  • use this add_weight shortcut.

  • So it's basically just a slightly shorter version

  • of creating a variable and assigning it on the layer.

  • OK.
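A sketch of that lazy pattern (again, names are illustrative):

```python
class Linear(tf.keras.layers.Layer):
    """Lazy variant: weights are created in build(), on first call."""

    def __init__(self, units=32):
        super().__init__()
        self.units = units  # no input_dim needed at construction time

    def build(self, input_shape):
        # `add_weight` is a slightly shorter version of creating
        # a variable and assigning it on the layer.
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal",
                                 trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="zeros",
                                 trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w) + self.b

layer = Linear(4)           # no weights yet
y = layer(tf.ones([3, 2]))  # __call__ chains build(input_shape), then call()
```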

  • Layers can also have nontrainable states.

  • So a trainable state is just variables

  • that are tracked in trainable_weights.

  • Nontrainable state is tracked in non_trainable_weights.

  • It's very simple.

  • So in this layer, in the constructor,

  • you create this self.total scalar variable

  • that starts at 0.

  • You specify it's nontrainable.

  • And in the computation method, you just update this variable.

  • And basically, you just keep track

  • of the total sum of the inputs seen by this layer.

  • So it's a kind of useless layer.

  • And as you see, every time you call

  • this layer, the value of this variable is updated.
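A sketch of that layer (the slide's version starts the total at 0; here the variable is a vector rather than a scalar):

```python
class ComputeSum(tf.keras.layers.Layer):
    """Keeps a running total of its inputs in a nontrainable variable."""

    def __init__(self, input_dim):
        super().__init__()
        # trainable=False: tracked in `non_trainable_weights` instead.
        self.total = tf.Variable(tf.zeros([input_dim]), trainable=False)

    def call(self, inputs):
        # Update the variable as part of the computation.
        self.total.assign_add(tf.reduce_sum(inputs, axis=0))
        return self.total

my_sum = ComputeSum(2)
x = tf.ones([2, 2])
print(my_sum(x).numpy())  # [2. 2.]
print(my_sum(x).numpy())  # [4. 4.] -- updated on every call
```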

  • Layers can be nested.

  • So you can set layer instances as attributes of a layer,

  • even if they're unbuilt, like here.

  • And when you do that, the outer container-- in this case,

  • the MLPBlock instance-- is going to keep

  • track of the trainable weights and nontrainable weights

  • of the underlying layers.

  • And all these layers, which in the constructor are unbuilt,

  • are going to be built-- so have their variables

  • created-- the first time you call the outer instance.
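A sketch of that nesting (using built-in Dense layers for the inner blocks; the slide's exact code may differ):

```python
class MLPBlock(tf.keras.layers.Layer):
    """Outer layer that nests three unbuilt inner layers."""

    def __init__(self):
        super().__init__()
        # Unbuilt at this point; built (variables created) the first
        # time the outer instance is called.
        self.linear_1 = tf.keras.layers.Dense(32)
        self.linear_2 = tf.keras.layers.Dense(32)
        self.linear_3 = tf.keras.layers.Dense(1)

    def call(self, inputs):
        x = tf.nn.relu(self.linear_1(inputs))
        x = tf.nn.relu(self.linear_2(x))
        return self.linear_3(x)

mlp = MLPBlock()
y = mlp(tf.ones([3, 64]))               # builds all inner layers
assert len(mlp.trainable_weights) == 6  # kernel + bias per Dense layer
```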

  • And this is the most basic way in which

  • you can be using a layer.

  • You would just instantiate--

  • you know-- grab some loss function,

  • which could be anything.

  • Grab some optimizer.

  • Iterate over some data--

  • so we have input data and targets.

  • Open the GradientTape.

  • Call the layer inside the GradientTape--

  • so the operations done by the call method

  • are recorded on the tape.

  • Call your loss function to get some loss value.

  • And then you use the tape and the loss value

  • to retrieve the gradients of the trainable state of the layer.

  • Then you apply these gradients.

  • So this is a full, end-to-end, training loop.

  • By that point, you know about layers,

  • which are containers for state and computation;

  • you know about trainable state, nontrainable state.

  • You know about nesting layers.

  • And you know about training them with this kind of loop.

  • So typically, you would put part

  • of the loop-- everything starting

  • with opening the GradientTape and ending

  • with applying the gradients--

  • in a tf.function to get graph execution

  • and faster performance.
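Putting it together, a sketch of that end-to-end loop (reusing the MLPBlock from above; the loss, optimizer, and synthetic dataset are stand-ins):

```python
loss_fn = tf.keras.losses.MeanSquaredError()        # could be anything
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)

# Synthetic data: batches of (inputs, targets).
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([100, 64]), tf.random.normal([100, 1]))).batch(32)

@tf.function  # graph execution for the whole train step
def train_step(x, y):
    with tf.GradientTape() as tape:
        predictions = mlp(x)  # ops in call() are recorded on the tape
        loss = loss_fn(y, predictions)
    gradients = tape.gradient(loss, mlp.trainable_weights)
    optimizer.apply_gradients(zip(gradients, mlp.trainable_weights))
    return loss

for x_batch, y_batch in dataset:
    loss = train_step(x_batch, y_batch)
```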

  • So when you know all these things,

  • you can use the Keras API to implement literally anything.

  • However, it's going to be a fairly low-level way

  • of implementing things.

  • So by that point, you know everything except you actually

  • know nothing.

  • So let's go further.

  • One neat feature is that you can use layers

  • to keep track of losses and also metrics

  • that you define on the fly during computation.

  • Let's say, for instance, with our linear layer, after we

  • compute w x, we want to add an activity loss on the output,

  • which is just going to be the sum of this output

  • times some factor.

  • So it's not actually a great loss because it should probably

  • be the sum of the squares instead of just the sum, but whatever.

  • And when you have a layer like this, every time you call it,

  • the scalar tensor that you added here in this add_loss call

  • is going to be tracked in this layer.losses list.

  • And every time you call the layer, this gets reset.

  • When you have nested layers, then the outer container

  • is going to keep track of the losses of the inner layers.

  • And you can call the inner layers multiple times.

  • It's not going to reset the losses until you actually

  • call the outer container.

  • So the way you would use this feature is something like this.

  • After you open your GradientTape and you call your layer

  • and you compute the main loss value,

  • you would add to this loss value the sum of the losses collected

  • during the forward pass.
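A sketch of the pattern (the layer name and regularization factor are illustrative):

```python
class ActivityRegularized(tf.keras.layers.Layer):
    """Adds a loss term on the fly during the forward pass."""

    def __init__(self, units=32, rate=1e-2):
        super().__init__()
        self.dense = tf.keras.layers.Dense(units)
        self.rate = rate

    def call(self, inputs):
        outputs = self.dense(inputs)
        # Scalar tensors passed to add_loss are collected in `self.losses`.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(outputs)))
        return outputs

layer = ActivityRegularized()
with tf.GradientTape() as tape:
    out = layer(tf.ones([2, 4]))
    main_loss = tf.reduce_mean(out)             # stand-in for a real loss
    total_loss = main_loss + sum(layer.losses)  # add the collected losses
grads = tape.gradient(total_loss, layer.trainable_weights)
```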

  • So you can use this feature to do things

  • like weight regularization, activity regularization,

  • or computing things like the KL divergence--

  • all kinds of losses that basically

  • are easier to compute when you have access

  • to intermediate results in the forward pass.

  • AUDIENCE: Just a question.

  • So if you call add_loss in an inner layer--

  • and that layer is contained in another layer--

  • does it call add_loss on the outer layer too?

  • FRANCOIS CHOLLET: Yes.

  • So for instance, if you have a layer with multiple layers

  • inside it, when you retrieve the losses on the outer layer,

  • it's going to recursively retrieve all the losses,

  • including the top-level losses.

  • AUDIENCE: So, I guess my question

  • is, does a layer know that it's being called from inside the--

  • FRANCOIS CHOLLET: That's correct,

  • meaning that when it's called from inside,

  • it can contribute multiple loss terms.

  • It's not going to reset the losses.

  • The losses are only reset when you call the top level

  • container.

  • So there is a call context thing.

  • AUDIENCE: That's-- I would expect it to be reset every

  • time you call it, but the parents' losses [INAUDIBLE].

  • FRANCOIS CHOLLET: So if you do that,

  • you could not share a layer that creates

  • losses inside a bigger model.

  • AUDIENCE: I mean, I guess I was thinking

  • that the inner layer would reset,

  • but the outer layer would not reset.

  • So it would keep--

  • as long as all the inner layer losses [INAUDIBLE]..

  • FRANCOIS CHOLLET: So they're gathered on the fly.

  • So that's not exactly accurate.

  • But yeah, anyway, so yeah.

  • AUDIENCE: How does the resetting happen?

  • Can you explain?

  • FRANCOIS CHOLLET: Yeah, sure.

  • Basically, it's called at the end of __call__ for the outer

  • container.

  • And it's called recursively.

  • So it's going to clear the losses of all the inner layers.

  • If you want to do it manually, all layers and models

  • have a method-- reset losses, I believe it's

  • called-- that you can use to force-clear

  • all the losses, which could be useful, for instance,

  • if you have multiple calls of the same model.

  • AUDIENCE: Sorry, so I didn't understand

  • when reset losses is called.

  • How does a layer know that it's being called

  • from an outer layer?

  • AUDIENCE: In __call__, there's basically a context manager

  • that sort of says you're in __call__.

  • And so that's why as you go down the line, if you're

  • calling a layer that's already being called

  • inside another layer, it can use that context manager

  • to know whether it's the top-level call.

  • AUDIENCE: OK.

  • FRANCOIS CHOLLET: So, yeah.

  • So layers also support serialization.

  • So if you want to make a layer serializable,

  • you just implement a get_config method,

  • which typically just packs the constructor arguments

  • into a dictionary.

  • And when you've implemented this get_config method,

  • you can serialize your layer as this [INAUDIBLE] config dict,

  • which is JSON serializable.

  • And you can use it to re-instantiate the same layer.

  • So this does not keep track of the state of the layer,

  • meaning the value of the weight.

  • So this is done separately.
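A sketch of that serialization pattern (illustrative layer):

```python
class Linear(tf.keras.layers.Layer):
    def __init__(self, units=32, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="random_normal",
                                 trainable=True)

    def call(self, inputs):
        return tf.matmul(inputs, self.w)

    def get_config(self):
        # Pack the constructor arguments into a JSON-serializable dict.
        config = super().get_config()
        config.update({"units": self.units})
        return config

layer = Linear(64)
config = layer.get_config()             # configuration, not state
new_layer = Linear.from_config(config)  # same configuration, fresh weights
```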

  • And so layers also support two modes--

  • training mode and inference mode.

  • If you want to use this feature, you

  • would have a training argument in call.

  • So this is a very simple example of a BatchNormalization layer,

  • where, when you're in training mode,

  • you're going to be computing the mean and variance

  • of the current batch.

  • And you're going to use these statistics to normalize

  • your inputs.

  • And you're going to be updating the moving mean

  • and variance on the layer, which are nontrainable weights here.

  • And if you're in inference mode, you're

  • just going to use the moving statistics to normalize

  • your inputs.
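A heavily simplified sketch of that two-mode pattern (not the real BatchNormalization implementation; 2D inputs assumed):

```python
class SimpleBatchNorm(tf.keras.layers.Layer):
    def __init__(self, momentum=0.99, epsilon=1e-3):
        super().__init__()
        self.momentum = momentum
        self.epsilon = epsilon

    def build(self, input_shape):
        dim = input_shape[-1]
        # Moving statistics are nontrainable state.
        self.moving_mean = self.add_weight(shape=(dim,),
                                           initializer="zeros",
                                           trainable=False)
        self.moving_var = self.add_weight(shape=(dim,),
                                          initializer="ones",
                                          trainable=False)

    def call(self, inputs, training=False):
        if training:
            # Training mode: use batch statistics, update moving ones.
            mean = tf.reduce_mean(inputs, axis=0)
            var = tf.math.reduce_variance(inputs, axis=0)
            m = self.momentum
            self.moving_mean.assign(m * self.moving_mean + (1 - m) * mean)
            self.moving_var.assign(m * self.moving_var + (1 - m) * var)
        else:
            # Inference mode: use the moving statistics.
            mean, var = self.moving_mean, self.moving_var
        return (inputs - mean) / tf.sqrt(var + self.epsilon)
```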

  • So now let's move on to the model class.

  • You saw in one of the previous examples

  • that layers can be nested.

  • If you just switch this example so that the MLP class

  • inherits from the model class instead

  • of the layer class, then essentially nothing changes,

  • except that now you have access to a training, evaluation,

  • inference, and saving API.

  • So once you've inherited from model,

  • you can do things like mlp.compile

  • with an optimizer and a loss instance.

  • Then you can call fit, which is going to automatically iterate

  • over this data set and minimize this BinaryCrossentropy

  • from logits loss using the Adam optimizer.

  • It's going to iterate 10 times over the data set.

  • And you can also save the state of this model

  • with this mlp.save method.
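A sketch of that workflow (the architecture and synthetic dataset are stand-ins):

```python
class MLP(tf.keras.Model):
    """Same nesting as before, but inheriting from Model
    adds compile/fit/evaluate/predict and saving."""

    def __init__(self):
        super().__init__()
        self.dense_1 = tf.keras.layers.Dense(32, activation="relu")
        self.dense_2 = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense_2(self.dense_1(inputs))

x = tf.random.normal([100, 16])
y = tf.cast(tf.random.uniform([100, 1]) < 0.5, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

mlp = MLP()
mlp.compile(optimizer=tf.keras.optimizers.Adam(),
            loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
mlp.fit(dataset, epochs=10)  # iterates 10 times over the dataset
mlp.save("my_mlp")           # weights and optimizer state (+ config if available)
```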

  • So what's the difference between the layer and the model?

  • In short, it's that a model handles

  • top-level functionality.

  • So a model is a layer.

  • So it does everything a layer does in terms

  • of network construction.

  • It also has these compile, fit, evaluate, and predict methods.

  • It also handles saving.

  • So when you call save, that includes not

  • only the configuration of the model, like the

  • get_config thing we saw previously.

  • It also includes the state of the model

  • and the value of the weights.

  • It also includes the optimizer that the model was

  • compiled with and the state of the optimizer.

  • It also supports some basic forms of model summarization

  • and visualization.

  • You can call model.summary(), which is

  • going to give you a description of all the layers inside the model

  • and the number of parameters that the model uses and so on.

  • In short, the layer class corresponds

  • to what is usually referred to as a layer,

  • like when you talk about convolution

  • layer, recurrent layer, so on.

  • It can also be used to refer to what is sometimes

  • called a block, like a ResNet block or an Inception block.

  • So a layer is basically either a literal layer or a block

  • in a bigger model.

  • And the model is really like the top level of things--

  • the outer thing-- like what people refer

  • to as a model or a network.

  • So typically, what you will do is use the layer class

  • to define inner computation blocks

  • and use the model class to define the one outer model-- so

  • the thing that you're actually going

  • to be training and saving and exporting for production.

  • For instance, if you are working on the ResNet50 model,

  • you'd probably have several ResNet blocks

  • subclassing the layer class.

  • And then you would combine them into one big subclass

  • model on which you would be able to compile and fit and save

  • and so on.

  • One situation that's specific to TensorFlow 2

  • that not many people know about is

  • that by default, when you call compile and fit,

  • you're going to be using graph execution because it's faster.

  • However, if you want to execute your model eagerly,

  • you can pass this run_eagerly argument in compile.

  • You can also just set it directly

  • as an attribute on the model instance.

  • So when you do that, all your call methods--

  • in the top-level MLP model

  • and in all the inner layers-- are going to be executed eagerly.

  • If you don't do this, by default,

  • you're going to be generating a graph

  • function that does the computation, which is faster.
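Concretely, for any compiled Keras model (here called `model`):

```python
# Either at compile time...
model.compile(optimizer="adam", loss="mse", run_eagerly=True)

# ...or directly as an attribute on the model instance.
model.run_eagerly = True
```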

  • So--

  • AUDIENCE: Before you go on, I had a question.

  • Could you explain the difference between compile and fit?

  • Like what goes between-- what goes in compile

  • and what goes in fit?

  • I don't feel like that's a--

  • FRANCOIS CHOLLET: Right.

  • So compile is about configuring the training process.

  • So you specify which optimizer you want to use.

  • You specify which loss you want to minimize.

  • You specify which metrics you want to track.

  • And anything else is going to modify the way

  • the computation is done.

  • It's going to modify the execution function, which

  • is something we're going to go over in more detail.

  • And in fit, you're passing the data itself and information

  • about how you want this data to be processed,

  • like how many times you want to iterate over

  • the data, potentially the size of the batches you want to use

  • to iterate with the data, which callbacks

  • you want to be called at different stages of training,

  • and so on.

  • So it's configuration in compile,

  • and basically passing the data and related metadata in fit.

  • So typically, you compile your model once.

  • But you might be calling fit multiple times.

  • AUDIENCE: So are we to think about it as if, in TF1 style,

  • the stuff that goes in compile is

  • the stuff that you'd use when you're building your graph,

  • and the stuff that goes in fit is basically session.run?

  • FRANCOIS CHOLLET: Yes.

  • That's correct.

  • So let's move on to functional models.

  • So in the previous example, you saw a subclass model, so

  • essentially something that you wrote

  • subclassing the model class.

  • In practice, very often in the literature,

  • you see deep learning models that look

  • like this, that look like directed acyclic graphs.

  • So on top, you have [INAUDIBLE].

  • At the bottom, you have various bits of a Transformer.

  • So these are directed acyclic graphs

  • of layers that are connected with these arrows.

  • So there's an API in Keras for configuring

  • the connectivity of the directed acyclic graph of layers.

  • It's called the functional API.

  • It looks roughly like this.

  • You start with input nodes, which

  • are like these input objects.

  • And then you're going to be calling layer instances

  • on that object.

  • So you can think of this input object

  • as a spec describing a tensor.

  • So it doesn't actually contain any data.

  • It's just a spec specifying the shape of the input you're going

  • to be expecting, the data type of the input you're going to be

  • expecting-- maybe annotated with a name.

  • And every time you call a layer, that's

  • roughly the action of drawing an arrow from one layer

  • instance to the next.

  • So here, you're going to have one input node, two layers.

  • And here you are drawing an arrow from the input node

  • to the first layer.

  • Here you're drawing an arrow from previous layer

  • to this new layer, and doing that for the output layer

  • finally.

  • And then you can instantiate a model

  • by just specifying its inputs and its outputs.

  • And what it's going to do when you do this is basically

  • build this directed acyclic graph, right here.

  • So you can actually plot this graph.

  • You can call utils.plot_model on your model instance.

  • It's going to generate this image.
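A sketch of that flow (shapes and layer sizes are illustrative):

```python
inputs = tf.keras.Input(shape=(784,))  # a spec, not actual data
x = tf.keras.layers.Dense(64, activation="relu")(inputs)  # draw an arrow
x = tf.keras.layers.Dense(64, activation="relu")(x)       # another arrow
outputs = tf.keras.layers.Dense(10)(x)                    # output layer

# Instantiate the model by specifying its inputs and outputs.
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Requires pydot and graphviz to be installed.
tf.keras.utils.plot_model(model, "model.png", show_shapes=True)
```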

  • So a functional model is basically just

  • like any other model.

  • But it's just a model that you do not write yourself.

  • So it has a bunch of methods that are autogenerated.

  • In particular, the call method is autogenerated.

  • The build method is autogenerated.

  • And the serialization get_config is autogenerated.

  • Yes?

  • AUDIENCE: You said the input does not have data.

  • But could it have data?

  • Like if you wanted to be able to check

  • you work as you went along.

  • FRANCOIS CHOLLET: No.

  • It could not.

  • So when you define your model, you're not actually

  • executing anything.

  • You're just configuring a graph.

  • So it's going to be running a bunch of checks.

  • But they're only compatibility checks,

  • essentially checking that the layers that you are connecting

  • together are compatible, that the data can actually

  • be transmitted.

  • But there's no execution.

  • So there's no notion of data.

  • It's really just an API for building a DAG.

  • So, yeah, so for instance, the call method is autogenerated.

  • It's just going to be something that we call a graph-of-layers

  • executor.

  • And so when you call your model, it's

  • going to be basically running through this graph of layers,

  • calling each layer in succession with the proper inputs.

  • And likewise, assuming that each layer in your graph of layers

  • implements its get_config method, then

  • you can call get_config on your DAG

  • and get something that you can use to re-instantiate

  • the same model.

  • AUDIENCE: Excuse me.

  • So can we go back to [INAUDIBLE]?

  • So on the line, y equals model of x--

  • so is the x the output of the input x?

  • FRANCOIS CHOLLET: x is the output of the layer.

  • So you start by instantiating this input object

  • using tf.keras.Input, which is basically a spec for a tensor.

  • It has a--

  • AUDIENCE: This-- the x here is different than the x above.

  • FRANCOIS CHOLLET: Oh, right.

  • Yeah, sorry.

  • This is confusing.

  • Yeah.

  • Yeah.

  • This is supposed to be a tensor, like an eager tensor,

  • for instance.

  • Right.

  • Sorry.

  • Yeah.

  • It's a bit confusing.

  • Sorry.

  • Yeah, so you can call your model like you

  • would call any other model instance or any function

  • on a bit of data.

  • It's going to return a bit of data.

  • So what does a functional API do, roughly?

  • Well, it's really an API for configuring DAGs.

  • It's targeted more at users than developers.

  • People who are very comfortable with writing classes

  • and functions in Python will have

  • no issues using model subclassing

  • to create new models.

  • But when you want something more like a building-blocks API

  • and you want more handholding, then the functional API

  • is actually a lot easier to use.

  • And it maps very closely to how you think about your model,

  • even if you're not a Python developer,

  • because you typically think about your models like that,

  • in terms of DAGs of layers.

  • So it's declarative, meaning that you write no logic.

  • You write no functions.

  • You subclass nothing.

  • It's really just configuration level.

  • So all the logic that's going to be executed

  • when you actually call your model on some data

  • is contained inside the layers.

  • Typically, you don't write it.

  • So that means that all debugging that

  • happens when you build these models

  • is done statically at construction time in terms

  • of compatibility checks between layers.

  • That means that any functional model

  • that you can instantiate without the framework

  • raising an error as you build it is a model that's going to run.

  • So it means you don't write any actual Python code.

  • You're just connecting nodes in a DAG,

  • meaning you don't actually write bugs.

  • The only kind of bugs you can be writing

  • is misconfiguration of your DAG topology, which is what

  • they call topology debugging.

  • And that can be done visually by literally printing

  • your graphs of layers and looking

  • at how they're connected.

  • And on the [INAUDIBLE] side, these models

  • can be expressed as static data structures that

  • can generate code-- that can generate a call method-- meaning

  • that they're inspectable.

  • For instance, you can retrieve intermediate activations

  • in your DAG and use them to build a new model.

  • That's actually very useful when you

  • do transfer learning because it makes it very

  • easy to do feature extraction.

  • That means your models are also plottable.

  • You can actually generate these little graphs

  • because you have a literal data structure underlying the model.

  • And you can serialize them.

  • When you have a subclassed model, your model topology

  • is actually a bit of Python bytecode,

  • which is harder to serialize.

  • You have to [INAUDIBLE] it.

  • There's one very important restriction

  • when you're writing layers that should

  • be compatible with the functional API, which

  • is that all the inputs of your layer

  • should be contained in the first argument,

  • meaning that if you have multiple inputs,

  • you should use a list argument

  • or maybe a dictionary argument if you have many inputs

  • and they have names.

  • AUDIENCE: Tensor inputs?

  • FRANCOIS CHOLLET: Yes.

  • Tensor inputs.

  • So essentially, anything that comes from another layer.

  • So like anything that you want to transfer using these arrows.

  • Each arrow corresponds to the first argument

  • in the call of the layer.

  • So this is a restriction that also exists in Torch 7,

  • in case you know.

  • A lot of people who have been criticizing this restriction

  • are people who say Torch 7 is the pinnacle of deep learning

  • API.

  • So I think this is funny.

  • OK.

  • So what actually happens when you

  • build this functional model?

  • Let's go in detail.

  • When you instantiate this input object, it basically--

  • this spec object of the shape and dtype--

  • when you create it, it also creates an input layer,

  • which is basically a node in your graph of layers.

  • And this input object is going to have a _keras_history field

  • of metadata on it that tracks who created it,

  • where it comes from.

  • And every time you call a layer instance

  • on one of these spec objects, what it's going to be doing

  • is returning a new spec object--

  • nested structure spec object-- with the inferred shape

  • and dtype corresponding to the computation that would normally

  • be done by the layer.

  • In practice, no actual computation

  • is happening when you do this.

  • And this output has an updated _keras_history metadata that

  • tracks the node that you just created in your graph

  • of layers.

  • And finally, when you instantiate your model

  • from these inputs and outputs, you're

  • going to recursively reconstruct--

  • retrieve every node that has been created

  • and check that they actually do form

  • a valid DAG of layer calls.

  • AUDIENCE: So this _keras_history field,

  • does that have transitively a reference to all the layers

  • that were--

  • FRANCOIS CHOLLET: So these are lightweight coordinates.

  • So _keras_history is basically just the coordinates

  • of your tensor in a 3D space--

  • discrete 3D space.

  • It's basically a tuple--

  • a named tuple with three entries.

  • So the first one is the reference

  • to the layer that's created this tensor--

  • this spec.

  • The second one is the node_index because layers

  • can be called multiple times.

  • So it's not true that one node is

  • instantiated with one layer.

  • A node is instantiated with a layer call.

  • So if you call a layer instance multiple times,

  • there are multiple nodes.

  • So this node_index is basically the index of the node created

  • by the layer call, as referenced in layer._output_nodes.

  • And finally, there's a tensor_index.

  • So the tensor_index is basically to handle

  • the case of multioutput layers.

  • If you have a layer with a bunch of tensor outputs, what it's

  • going to do is deterministically flatten these outputs

  • into a list and then index this list.

  • And this tensor_index tells you the index

  • of this specific tensor among the outputs

  • that are returned by this layer call.

  • AUDIENCE: Can you [INAUDIBLE]--

  • if I just call, like, tf.relu on it,

  • will it still populate the _keras_history and--

  • FRANCOIS CHOLLET: Not immediately.

  • So the tf.relu, tf.nn.relu, is not going to create this

  • _keras_history object.

  • But when this object is seen again by a layer,

  • it's going to check--

  • it's going to walk the graph and check whether it was originally

  • coming from a layer.

  • And if it does, then it's going to take the various ops that

  • were not created inside layers,

  • and it's going to wrap them into new layers

  • so that you can actually step out

  • of the Keras graph of layers and insert any TensorFlow op.

  • And each TensorFlow op is going to be treated as its own nodes

  • in the graph of layers.

  • AUDIENCE: So two questions.

  • One is, let's say that happened twice.

  • Would there be two Relu layers created?

  • Or would it--

  • FRANCOIS CHOLLET: Yes.

  • So it's one layer per op.

  • AUDIENCE: --same node in the graph?

  • FRANCOIS CHOLLET: Yeah.

  • So one node corresponds to one call of the layer.

  • So every time you call a layer--

  • even if it's the same layer-- there's a new node,

  • because it has to be a DAG.

  • AUDIENCE: OK.

  • So I guess when you say Relu, it creates a tensor, right?

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: And that tensor is then passed to another layer?

  • FRANCOIS CHOLLET: Yeah.

  • AUDIENCE: At that point, does it create a layer for the Relu?

  • FRANCOIS CHOLLET: That's correct.

  • Yes.

  • AUDIENCE: So right then.

  • And then if I were to pass that output of the Relu

  • to another layer--

  • not the first layer--

  • would it create another layer for the--

  • [INTERPOSING VOICES]

  • --reuse the--

  • FRANCOIS CHOLLET: I believe it does not recreate the layer.

  • So it reuses the previous layer.

  • But I will have to check.

  • AUDIENCE: This is all making me wonder about the lifetimes

  • of all these layers.

  • Is there references to all these layers just kept forever?

  • Or is that--

  • FRANCOIS CHOLLET: Yes.

  • They are kept forever.

  • AUDIENCE: Layers are never garbage collected.

  • FRANCOIS CHOLLET: Yes.

  • There's actually a utility in the backend

  • to force-destroy all of this.

  • It's called clear_session.

  • So, yeah, so this illustrates the importance of having

  • these three coordinates.

  • You can have-- so this is some random variation of autoencoder

  • example I took from the internet.

  • So it's interesting because it shows

  • layers that have multiple inputs and multiple outputs.

  • And, like, for instance, the outputs

  • of this layer are going to be going into-- one of the outputs

  • is going to be going into this other layer.

  • One of the other outputs is going

  • to be going further downstream.

  • So with these three coordinates, you

  • can handle completely arbitrary graph topologies.

  • So there's a lot of Keras features

  • that are specific to these functional

  • models, in particular, the ability

  • to do static compatibility checks on the inputs

  • of a layer; the ability to do whole-model saving, meaning

  • saving a file that enables you to reinstantiate

  • the exact same Python objects with the same state;

  • the ability to do model plotting, which

  • is something we already just saw;

  • automatic masking, which we'll cover in detail;

  • and dynamic layers,

  • which is something that's not really relevant if you're not

  • using the functional API.

  • So let's talk about static type checking.

  • When you create a custom layer, you can set an input_spec field

  • on it, which is going to have to be an instance of this input

  • spec object or maybe a nested structure input

  • spec object, which describes the expectations that the layer has

  • with regard to which we calling it on.

  • And when you call your layer for the first time here--

  • you instantiate it here, then you call it--

  • first, it's going to check that this input is

  • compatible with the current input spec, which was set here

  • in the constructor, which just says the tensor should have

  • at least rank 2.

  • Then it's going to actually build the layer.

  • And here, in the build method, the input shape that's

  • passed in gives us more refined information

  • about the expectations of the layer.

  • So we update the input spec.

  • So not only should it be at least rank 2;

  • the last axis-- so axis minus 1, the last axis--

  • should have exactly this value.

  • So after build, it's going to recheck

  • that the updated input spec is compatible with this input--

  • that it has the right last dimension.

  • And finally, it's going to call the [INAUDIBLE].

  • So every time you call a layer in the functional API

  • to build these graphs, if the layer has set this input spec

  • object, it's going to be checking compatibility

  • and raising a very detailed-- and, therefore,

  • helpful-- error message in case of incompatibility.

  • It's going to tell you what you passed, what was expected,

  • and what you can do to fix it.
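A sketch of that two-stage input spec (illustrative layer):

```python
class Linear(tf.keras.layers.Layer):
    def __init__(self, units=32):
        super().__init__()
        self.units = units
        # Checked before build: inputs must be at least rank 2.
        self.input_spec = tf.keras.layers.InputSpec(min_ndim=2)

    def build(self, input_shape):
        last_dim = int(input_shape[-1])
        self.w = self.add_weight(shape=(last_dim, self.units),
                                 initializer="random_normal",
                                 trainable=True)
        # Refine the spec: the last axis must now have exactly this size.
        self.input_spec = tf.keras.layers.InputSpec(min_ndim=2,
                                                    axes={-1: last_dim})

    def call(self, inputs):
        return tf.matmul(inputs, self.w)
```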

  • You can also do whole-model saving,

  • meaning that you have this get_config method that

  • is autogenerated.

  • You can use it to recreate the same model, in this case,

  • without the state.

  • You can also just call save.

  • When you call load_model, then this object

  • is the exact same Python object pretty much

  • with the same topology, the same state.

  • And you can load it across platforms.

  • For instance, you can also load it in TensorFlow.js.

  • Yes.

  • AUDIENCE: If I created my own custom layer,

  • do I need to do something special--

  • FRANCOIS CHOLLET: Absolutely.

  • So if you want to write custom layers

  • and reload your model in a completely different

  • environment, that environment should have access

  • to some implementation of your custom layer.

  • So if it's a Python environment, then you basically

  • just need to make sure the code is available.

  • And you would wrap your load_model call inside a scope

  • where you specify the custom objects

  • you want to be taken into account

  • during the deserialization process.
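A sketch of that scope, assuming a model saved at "my_model" that uses a custom `Linear` layer like the one above:

```python
with tf.keras.utils.custom_object_scope({"Linear": Linear}):
    restored = tf.keras.models.load_model("my_model")

# Equivalent: pass the mapping directly to load_model.
restored = tf.keras.models.load_model(
    "my_model", custom_objects={"Linear": Linear})
```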

  • If you want to load your model into a JS, for instance,

  • you first have to write a JavaScript/TypeScript

  • implementation of your layer.

  • And, yeah, model plotting, which is something we already saw,

  • it's pretty useful when you want to check

  • the correctness of the DAGs that you're

  • building unless they're too large, in which case

  • it's not so great.

  • So this is just a very simple two input, two output model--

  • three input, sorry, two output model--

  • where you have some title field,

  • some body field, some tags field.

  • They're doing some processing.

  • And you end up with a priority prediction

  • and department predictions.

  • And this is just something from some random tutorial.

  • And then one neat feature is automatic masking.

  • So let's go over that in detail.

  • Here's just a simple end-to-end example.

  • You start from an input object that's

  • going to be a variable length sequence of ints.

  • It's called word sequence.

  • So it's just going to be a sequence of word indices.

  • You embed it with this embedding layer.

  • And in the embedding layer, you specify mask_zero equals true.

  • So this layer is going to be generating a mask from the zero

  • entries in any data you're going to be passing along this graph

  • connection.

  • And every time you call a layer that's compatible with masking,

  • it's going to pass through this mask.

  • And some layers, like the LSTM layer,

  • are going to be mask consumers.

  • So they're going to be looking at the mask that's passed

  • and use it to ignore the padded entries.

  • And some layers-- for instance, if you

  • have an LSTM layer that does not return sequences

  • and that just basically returns

  • a single vector per sample, encoding the entire sequence--

  • are going to be destroying the mask.

  • So the next layer is not going to be seeing the mask.

  • So when you do something like this,

  • you're automatically telling the LSTM layer, which

  • is significantly downstream from your embedding layer,

  • to do the right thing with your sequences.

  • So if you're not super-familiar with the specifics of masking,

  • this is very simple and magical.

  • You just say mask_zero at the start of your model.

  • And suddenly, all the layers that

  • should be aware of masking just do the right thing.
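A sketch of that end-to-end setup (vocabulary size and layer widths are illustrative; the Dense layer is assumed mask-compatible and passes the mask through):

```python
inputs = tf.keras.Input(shape=(None,), dtype="int32")  # variable-length ints
# mask_zero=True: zero entries are padding; a mask is generated
# and propagated along this graph connection.
x = tf.keras.layers.Embedding(input_dim=5000, output_dim=16,
                              mask_zero=True)(inputs)
x = tf.keras.layers.Dense(16)(x)       # passes the mask through
outputs = tf.keras.layers.LSTM(32)(x)  # mask consumer: skips padded steps
model = tf.keras.Model(inputs, outputs)
```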

  • AUDIENCE: A little more detail about what-- what is masking?

  • Is it like-- is there a vector of Booleans or something?

  • FRANCOIS CHOLLET: Yes.

  • Absolutely.

  • So here's the detail.

  • A mask is, indeed, a vector--

  • a tensor of Booleans.

  • Each sample has its own mask.

  • And each mask is basically just a plain vector

  • of ones and zeros.

  • It does not make assumptions about things like padding,

  • for instance.

  • So you could mask completely arbitrary time steps.

  • Yeah.

  • So you have three types of layers that interact with mask.

  • You have layers that will consume a mask.

  • So in order to consume a mask, just specify this mask argument

  • in the call signature.

  • And this will be your batch of Boolean vectors.

  • You can also pass through a mask.

  • There's almost nothing you need to do.

  • But it's opt-in.

  • You need to explicitly say it supports masking.

  • The reason why this is opt-in is because many layers

  • are going to be changing the shape of the inputs that

  • are going to be returning outputs that are not

  • the same shape as the inputs.

  • And this interacts with masking.

  • So it's typically not safe to assume-- for instance, here--

  • that the mask that needs to be consumed by this LSTM layer

  • is the same one that was generated by the embedding layer.

  • In this case, in practice, it is.

  • But if this dense layer had been doing anything

  • to change the shape of the inputs,

  • it would have been different.

  • And then you have mask-generating layers.

  • So for instance, the embedding layer does something like this.

  • It looks at the inputs and gives you a mask of Boolean entries

  • that is one for all non-zero entries in your input.

  • And in case you have a layer that

  • modifies the shape of your inputs, it should also--

  • if it wants to be compatible with masking--

  • implement this compute_mask method.

  • For instance, if you have a concatenate

  • layer that takes multiple inputs--

  • some of which may be masked, some of which may not--

  • the output mask should do the concatenation

  • of the different masks.
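Minimal sketches of the three patterns just described (names are illustrative):

```python
class MaskConsumer(tf.keras.layers.Layer):
    """Consumes a mask: just declare a `mask` argument in call()."""
    def call(self, inputs, mask=None):
        if mask is not None:
            # mask: one Boolean vector of time steps per sample
            inputs *= tf.cast(mask, inputs.dtype)[..., tf.newaxis]
        return inputs

class MaskPassThrough(tf.keras.layers.Layer):
    """Passes the mask through: opt in via `supports_masking`."""
    def __init__(self):
        super().__init__()
        self.supports_masking = True
    def call(self, inputs):
        return tf.nn.relu(inputs)  # shape-preserving, so the mask stays valid

class MaskGenerator(tf.keras.layers.Layer):
    """Generates a mask, like Embedding(mask_zero=True)."""
    def call(self, inputs):
        return inputs
    def compute_mask(self, inputs, mask=None):
        # One for all non-zero entries in the input.
        return tf.not_equal(inputs, 0)
```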

  • AUDIENCE: So that compute_mask method

  • didn't use the mask argument.

  • But normally you would?

  • FRANCOIS CHOLLET: Sorry, what?

  • Oh, yeah, yeah.

  • So you can-- yes.

  • So for instance, the embedding layer

  • is going to ignore the mask argument

  • and just generate the mask based on inputs.

  • But if instead you have a concatenate layer,

  • it's going to ignore the inputs and just

  • generate the mask based on the prior mask argument.

  • So let's look in a lot of detail at how all of this

  • is implemented.

  • So what happens when you call a layer instance

  • on a bunch of symbolic inputs?

  • So first thing we do is static type checking--

  • determining whether all the inputs

  • are compatible with this layer.

  • If the layer is unbuilt, we're going to build it--

  • potentially check input compatibility, again,

  • in case input spec was updated.

  • We're going to check whether this layer is a mask consumer.

  • Does it have a mask argument in its call method?

  • If yes, we're going to retrieve the mask associated

  • with the inputs of the layer, which we do via metadata.

  • We're going to open a graph scope,

  • check if our layer is graphable.

  • So this is a concept we're going to look at in more detail

  • afterwards.

  • If the layer can be turned into a graph,

  • we're going to autograph the call method automatically--

  • this is in order to convert if statements, for instance,

  • or for loops, into symbolic conditionals.

  • We're going to call the call method that was autographed,

  • using the proper mask and training

  • arguments that we retrieved from the current context.

  • If the layer happens to be dynamic,

  • meaning that you cannot convert it to a graph,

  • you're just going to return brand new symbolic tensors.

  • And in order to know what shape and dtype

  • these symbolic tensors should be,

  • you're going to use the static shape inference

  • method of your layer.

  • Meaning that if you have a layer that's

  • dynamic, that's nongraphable, and you

  • want to use it in the functional API,

  • it should implement this static shape inference method,

  • compute_output_shape.

  • For no other use case are you going

  • to need compute_output_shape.

  • So finally, you create the nodes in the graph

  • of layers from this call.

  • You set the metadata on the outputs, which are either

  • brand new symbolic tensors created

  • using static shape inference, or the outputs

  • of the actual graph-mode call,

  • with the metadata about the node.

  • And finally, you set the mask metadata,

  • which is what the next layer is going

  • to retrieve in case that layer is a mask consumer.

  • AUDIENCE: So what's happening in step 5?

  • What is the graph scope that you're talking about?

  • FRANCOIS CHOLLET: So Keras maintains its own graph,

  • which is a FuncGraph object.

  • And when it creates a symbolic tensor,

  • it's always in that graph.

  • So before you call the graph mode call

  • or before you instantiate new symbolic tensors, which

  • are basically placeholders, first you

  • need to open the graph scope.

  • AUDIENCE: Slight correction.

  • It will enter that graph unless a different one

  • has been specified.

  • FRANCOIS CHOLLET: Yes, which is only valid in V1.

  • In V2, typically, we only ever--

  • like the only graph you're going to be manipulating

  • is the Keras graph.

  • Everything else is going to be either eager or TF function.

  • So we mentioned the notion of dynamic layers.

  • So what's a dynamic layer?

  • Well, maybe you remember this BatchNormalization example.

  • There's actually something very subtle going on with this

  • BatchNormalization example, which means that it cannot be

  • turned into a static graph.

  • And that is because--

  • so it uses this if/else statement.

  • And inside one of the branches, it

  • does variable updates and the other branch, it does not.

  • And this actually does not play well with autograph.

  • So this is actually something that's

  • fairly easy to fix by simply having

  • symmetrical conditional branches, where

  • you have the same statements assigning

  • nothing in the other branch.

  • However, for the sake of argument,

  • let's say we cannot graph this layer.

  • What are we going to do?

  • Well, we are going to pass this dynamic equals true argument

  • in the constructor.

  • And that tells the framework that this layer is not

  • compatible with graph execution.

  • It should never be executed in a graph.

  • And when you build a functional API model using this,

  • it's going to do what we were mentioning in step 6.

  • It's just going to use static shape inference

  • to compute the outputs.

  • And when you call fit, it's going

  • to be using pure eager execution without forcing

  • you to specify run_eagerly equals true in compile.

  • It's just automatically set to the right default.
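A sketch of that pattern (illustrative layer; the body stands in for eager-only logic):

```python
class DynamicLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        # Tells the framework this layer must never run in a graph.
        super().__init__(dynamic=True, **kwargs)

    def call(self, inputs):
        # Arbitrary eager-only Python logic could go here.
        return tf.nn.relu(inputs)

    def compute_output_shape(self, input_shape):
        # Needed for functional-API use: static shape inference
        # stands in for actually running the layer.
        return input_shape
```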

  • AUDIENCE: So I think this one actually

  • works because it should retrace for different values

  • of training.

  • FRANCOIS CHOLLET: This?

  • AUDIENCE: This particular example should work in a graph

  • with--

  • FRANCOIS CHOLLET: Last time I checked,

  • it would not work with autograph unless--

  • AUDIENCE: It's fine if training is a Python Boolean.

  • But if it's a tensor, it's not.

  • FRANCOIS CHOLLET: Implicitly, it's actually a tensor because--

  • so the reason why training is usually a tensor

  • is because we use the same graph when you do fit and evaluate.

  • And we change the value of training by feeding the value.

  • Training is always symbolic.

  • But yeah, you're right.

  • If it's a Python Boolean, this works fine.

  • In fact, autograph is not even relevant

  • because the if statements are just going

  • to be ignored by autograph.

  • AUDIENCE: What's the actual problem with [INAUDIBLE]?

  • I thought it was fine to have the [INAUDIBLE]

  • up on one side of the [INAUDIBLE].

  • AUDIENCE: It's that there isn't a corresponding output.

  • And so they're-- like autograph can't match the function

  • signatures?

  • AUDIENCE: Aren't those not outputs of each branch, though?

  • So it would just be the [INAUDIBLE] in--

  • It's like an API issue with [INAUDIBLE] or something.

  • AUDIENCE: Yes.

  • It's a built-in issue.

  • There's a long thread on this.

  • FRANCOIS CHOLLET: Potentially fixable.

  • It's potentially fixable.

  • Anyway, this was just for the sake of example.

  • There are also layers that are more fundamentally

  • nongraphable, like a tree LSTM, for instance.

  • AUDIENCE: [INAUDIBLE] to that, basically, the second question

  • is, why is it important that you use the same graph for both fit

  • and evaluate?

  • Given that, for instance, in graph mode, the training

  • versus inference flag, like in the olden days,

  • I think that was a Python Boolean, right?

  • FRANCOIS CHOLLET: Yes, that's correct.

  • That's the reason why, historically, you

  • had one graph for training and one graph for inference

  • is because you would do inference

  • on a separate machine that would load a checkpoint

  • and run asynchronously compared to the main training.

  • That's what we're used to at Google.

  • In Keras, however, you're going to be running evaluation

  • at the end of each epoch.

  • So potentially, you're going to be running evaluation

  • very often.

  • And you're going to be doing that on the same machine.

  • So it's actually quite a bit more

  • efficient to use a single graph to do this instead

  • of keeping two graphs around.

  • AUDIENCE: If you have something and you

  • call batch norm on it, then if they're in the same graph--

  • for instance, you don't have to declare at that time

  • whether you're doing training and inference--

  • you can just have your tensor.

  • You can do whatever you want downstream with it.

  • Whereas, if you have separate graphs,

  • if you, for instance, output like a result in training mode

  • and a result in inference mode, then the user

  • has to track that.

  • And it's just-- it's not as pleasant.

  • AUDIENCE: Certainly, it should be

  • maybe different graphs that share all of the variables

  • or even different functions.

  • AUDIENCE: Different--

  • AUDIENCE: Yeah.

  • AUDIENCE: Is this something that's

  • like maybe an implementation detail that--

  • FRANCOIS CHOLLET: It is very much an implementation detail.

  • We could choose to generate a new graph

  • when you do evaluation.

  • Again, the reason this decision was made initially

  • is really efficiency, because two graphs is actually

  • [INAUDIBLE] a graph this big.

  • And even though the two graphs are almost entirely redundant,

  • they only differ by a few ops for the dropout

  • layers and batch norm layers.

  • So we saw that having a symbolic training argument

  • is actually much more efficient.
