
  • ALEXANDRE PASSOS: Hi.

  • My name is Alex.

  • And I'm here to tell you today about resources and variables.

  • And really this is a talk about state in TensorFlow

  • and stuff that got accidentally represented in state

  • in TensorFlow for far too long.

  • So what is state?

  • I would love to be able to stand here, or rather

  • sit here, and tell you that an operation is stateful

  • if either executing it has a side effect,

  • or if its output depends on something

  • other than the value of its input.

  • But this is not what TensorFlow means by statefulness.

  • Sadly, TensorFlow goes by the Wittgensteinian notion

  • that the meaning of a word is defined by its usage.

  • So state in TensorFlow is defined

  • by this one bit that gets flipped and means all sorts

  • of very interesting things.

  • So, for example, this slide is wrong.

  • tf.print is stateful.

  • It has a side effect.

  • Yay.

  • tf.data.Dataset.from_tensor_slices has no side effects,

  • because the data set operations are value types,

  • and they're stateless.

  • And yet, that kernel is marked as stateful.
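
A minimal Python sketch of that distinction (assuming TF 2.x eager execution; the values are just for illustration):

```python
import tensorflow as tf

tf.print("hello")  # genuine side effect: writes to the log/stderr

# Building a dataset has no side effect; it is a value describing a pipeline.
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# The iterator, by contrast, really is stateful: advancing it mutates it.
it = iter(ds)
print(next(it).numpy(), next(it).numpy())  # 1 2
```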

  • Because one of the effects of marking something

  • as stateful in TensorFlow is that it

  • disables constant folding.

  • And constant folding can be buggy with data sets.

  • Iterators, on the other hand, are stateful.

  • This might lead you to think that there

  • is some meaning to this.

  • But there are also some things in TensorFlow

  • that could go either way.

  • So to differentiate while loops, we have stacks.

  • So that when you're doing the forward pass of the loop,

  • you push things into a stack.

  • And when you're doing the backward pass,

  • you pop things from the stack.

  • So you can look at intermediate activations and stuff.

  • And those things were stateful in tf V1,

  • but they're stateless in tf V2.

  • Tensor lists that you can use to aggregate stuff

  • from many iterations of a loop into a single view,

  • or do the reverse, they're also stateful in tf V1 and stateless

  • in tf V2.
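
For instance, in TF 2.x the tf.TensorArray API is backed by these now-stateless tensor list ops; a rough sketch (the function below is made up for illustration):

```python
import tensorflow as tf

@tf.function
def squares(n):
    # Accumulate one value per loop iteration into a single tensor.
    ta = tf.TensorArray(tf.int32, size=n)
    for i in tf.range(n):
        ta = ta.write(i, i * i)  # write() returns a new TensorArray value
    return ta.stack()

print(squares(5))  # [ 0  1  4  9 16]
```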

  • AUDIENCE: Is that because we didn't invent the stateless way

  • until later?

  • ALEXANDRE PASSOS: Because we did not invent the stateless way

  • until later.

  • Yes.

  • So I want to spend the rest of the talk talking about how

  • statefulness is represented in tf V1, some of the problems

  • with that, how we're fixing those problems in tf V2,

  • and how we can deal with state, and also with things

  • that are not necessarily easily representable

  • with dense tensors.

  • So how is statefulness represented?

  • In one of two ways--

  • the most obvious way is that if you go into the TensorFlow

  • source code, and you find where ops are registered,

  • you will see this bit.

  • SetIsStateful().

  • And the definition of state in TensorFlow

  • is that op defs that have this bit set are stateful.

  • And all sorts of places in the runtime

  • are going to look for that bit and behave differently

  • if that bit is set.

  • And people set the bit because they

  • want any of those behaviors.
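
One way to see that single bit from Python is to inspect the OpDef of the ops recorded in a graph; a small sketch, assuming TF 2.x:

```python
import tensorflow as tf

@tf.function
def f():
    tf.print("side effect")
    return tf.random.normal([2]) + tf.constant(1.0)

graph = f.get_concrete_function().graph
for op in graph.get_operations():
    print(op.type, op.op_def.is_stateful)
# PrintV2 and RandomStandardNormal report True; AddV2 and Const report False.
```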

  • And this is something we need to clean up.

  • And I think we might have a chance to clean this up

  • with the MLIR dialect of TensorFlow, which

  • is going to have finer-grained bits.

  • But until then, we're stuck with this one bit

  • that is too coarse-grained.

  • So among other things, what does this bit mean?

  • It means that TensorFlow will not do constant folding.

  • This includes the two or three separate systems

  • in TensorFlow that do constant folding.

  • All of them know how to bypass stateful operations.

  • Similarly, there are at least two different places

  • in TensorFlow that do common subexpression elimination.

  • And they refuse to do common subexpression elimination

  • of stateful operations, which is very good, because if you were

  • to do that, and you have a neural network

  • with many layers, and your layers are initialized

  • from a random op, all of the layers with the same shape

  • would be initialized with exactly the same random values.
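
A quick sketch of why that matters, assuming TF 2.x (the two random ops below are textually identical, so an unguarded CSE pass would merge them):

```python
import tensorflow as tf

@tf.function
def init_two_layers():
    w1 = tf.random.normal([4, 4])
    w2 = tf.random.normal([4, 4])  # same op type, inputs, and attrs as w1
    return w1, w2

a, b = init_two_layers()
# Because the random ops are stateful, they are not collapsed into one node,
# so the two "layers" get different initial values.
print(bool(tf.reduce_all(tf.equal(a, b))))  # False (with overwhelming probability)
```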

  • AUDIENCE: And all your prints would potentially

  • be collapsed into a single print.

  • ALEXANDRE PASSOS: Only prints of the identical string

  • would be collapsed into a single print.

  • Because otherwise we would have enough information

  • to disambiguate those.

  • But statefulness also means some things that

  • are not very obvious at all, like the op kernel

  • instances that the runtime uses to represent the computation

  • to run are reused across sessions for stateful ops

  • that have the same name.

  • And there is also a somewhat long tail of obscure behavior

  • changes, like parallel_for behaving

  • slightly differently for stateful operations.

  • And people are known to set a stateful bit for any one

  • of these reasons and more.

  • The other way of representing state in tf

  • that we're trying to get rid of in tf V2

  • is the notion of a ref tensor.

  • And going back to the variable op,

  • it is this thing here, where you can

  • say that a tensor is either of a dtype, or of a

  • ref of that dtype.
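
That ref dtype is still visible from Python through the compat.v1 API; a small sketch, assuming a TF 2.x install building a v1-style graph:

```python
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    # use_resource=False asks for the legacy ref-variable kernel.
    v = tf.compat.v1.Variable(1.0, use_resource=False)
    print(v.dtype)               # <dtype: 'float32_ref'>
    print(tf.identity(v).dtype)  # <dtype: 'float32'> -- the ref is dropped
```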

  • And the reason why we did that

  • is that it's very convenient in many cases

  • to be able to keep information in the runtime that persists

  • across calls to session.run.

  • Specifically, the variables-- if you

  • had to write your code like this, where every session.run

  • you'd feed your variables and then you'd fetch them back,

  • and you were doing some kind of distributed training,

  • you would have so many network round

  • trips and so much extra latency for this,

  • it would be completely impractical.
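
A sketch of what that hypothetical world would look like, using TF 1.x-style sessions via tf.compat.v1 (the toy update rule is made up for illustration):

```python
import numpy as np
import tensorflow.compat.v1 as tf1

tf1.disable_eager_execution()

x = tf1.placeholder(tf1.float32, [])
w_in = tf1.placeholder(tf1.float32, [])  # parameters fed in by the client
w_out = w_in - 0.1 * w_in * x            # one toy "training" step

w = np.float32(1.0)
with tf1.Session() as sess:
    for _ in range(3):
        # Every step round-trips the whole parameter state through the client.
        w = sess.run(w_out, feed_dict={x: 2.0, w_in: w})
print(w)  # 0.512
```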

  • So the idea of the variable op, which

  • is the thing that motivated the ref tensor, is like a constant,

  • but mutable.

  • And if you try to dig for the runtime,

  • you'll find this piece of code, which

  • I think is the most concise representation I could find

  • of how we represent the distinction between a ref

  • tensor and an ordinary (non-ref) tensor.

  • This is what the input to an op kernel looks like.

  • And it's essentially a manually implemented one-of (think absl::variant),

  • where it's either a manually constructed tensor--

  • and the manual constructor is there

  • just so we don't try to initialize it

  • in case we're not going to need it--

  • or the pair of a pointer to a tensor

  • and a pointer to a mutex.

  • And if you've ever programmed in C++,

  • you should be terrified right now, because you see a pointer,

  • and you see no comment about who owns this pointer,

  • and what is the lifetime of that pointer?

  • And a good third of the issues of ref variables

  • come from the fact that it's been impossible or very

  • hard to retrofit into the system a coherent notion of ownership

  • of this pointer that's going to be memory safe.

  • But that's not all.

  • The way the ref variables work is

  • that you have a graph that looks like this.

  • You have this variable node whose output

  • is a tensor that can change, and you can feed it

  • to an operation that mutates it, like assign,

  • or you can feed it to an operation that does not

  • mutate it, like identity.

  • If you feed it to an operation that does not mutate it,

  • like identity, the TensorFlow runtime

  • will silently cast that Tensor* to a Tensor.

  • So it makes another Tensor object that

  • aliases the buffer pointed to by that tensor,

  • and just keeps going.

  • So the reason why I like this graph is that it's short.

  • It's simple.

  • If you look at every single gradient update

  • that we use for training, it kind of looks like this.
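
A minimal sketch of that graph in code, using tf.compat.v1 on a TF 2.x install (the node names are mine, not from the slide):

```python
import tensorflow.compat.v1 as tf1

tf1.disable_eager_execution()

v = tf1.Variable(0, use_resource=False)  # legacy ref variable
assign = tf1.assign(v, 1)
read = tf1.identity(v)  # the ref output is silently turned into a plain tensor

with tf1.Session() as sess:
    sess.run(v.initializer)
    # Nothing in this graph orders the read relative to the assign.
    print(sess.run([assign, read]))
```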

  • But it's also kind of tricky.

  • So we have, I don't know, like 20, 30 people in the room now.

  • Can I get a show of hands on who thinks

  • that the result of the print is the value after the assign?

  • No one.

  • AUDIENCE: What do you mean?

  • The print?