ALEXANDRE PASSOS: Hi. My name is Alex, and I'm here to tell you today about resources and variants. Really, this is a talk about state in TensorFlow, and about stuff that got accidentally represented as state in TensorFlow for far too long.

So what is state? I would love to be able to stand here, or rather sit here, and tell you that an operation is stateful if either executing it has a side effect, or its output depends on something other than the value of its input. But this is not what TensorFlow means by stateful. Sadly, TensorFlow goes by the [INAUDIBLE] notion that the meaning of a word is defined by its usage. So state in TensorFlow is defined by this one bit that gets flipped and means all sorts of very interesting things.

So, for example, this slide is wrong. tf.print is stateful; it has a side effect. Yay. tf.data.Dataset.from_tensor_slices has no side effects, because dataset operations are value types and they're stateless. And yet that kernel is marked as stateful, because one of the effects of marking something as stateful in TensorFlow is that it disables constant folding, and constant folding can be buggy with datasets. Iterators, on the other hand, are stateful.

This might lead you to think that there is some meaning to all of this. But there are also some things in TensorFlow that could go either way. To differentiate while loops, we have stacks, so that when you're doing the forward pass of the loop, you push things onto a stack, and when you're doing the backward pass, you pop things from the stack. So you can look at intermediate activations and stuff. Those things were stateful in TF v1, but they're stateless in TF v2. Tensor lists, which you can use to aggregate values from many iterations of a loop into a single tensor, or do the reverse, are also stateful in TF v1 and stateless in TF v2.

AUDIENCE: Is that because we didn't invent the stateless way until later?

ALEXANDRE PASSOS: Because we did not invent the stateless way until later. Yes.

So I want to spend the rest of the talk discussing how statefulness is represented in TF v1, some of the problems with that, how we're fixing those problems in TF v2, and how we can deal with state, and also with things that are not necessarily easily representable as dense tensors.

So how is statefulness represented? In one of two ways. The most obvious way is that if you go into the TensorFlow source code and find where ops are registered, you will see this bit: SetIsStateful. The definition of state in TensorFlow is that op defs that have this bit set are stateful. All sorts of places in the runtime look for that bit and behave differently if it is set, and people set the bit because they want any of those behaviors. This is something we need to clean up. I think we might have a chance to clean it up with the MLIR dialect of TensorFlow, which is going to have finer-grained bits. But until then, we're stuck with this one bit that carries too little precision.

So among other things, what does this bit mean? It means that TensorFlow will not do constant folding. This includes the two or three separate systems in TensorFlow that do constant folding; all of them know how to bypass stateful operations. Similarly, there are at least two different places in TensorFlow that do common subexpression elimination, and they refuse to do common subexpression elimination of stateful operations. Which is very good, because if you were to do that, and you have a neural network with many layers whose weights are initialized from a random op, all of the layers with the same shape would be initialized with exactly the same random values.

AUDIENCE: And all your prints would potentially be collapsed into a single print.

ALEXANDRE PASSOS: Only prints of the identical string would be collapsed into a single print, because otherwise we would have enough information to disambiguate them.
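To make that concrete, here is a minimal sketch in TF v1-style Python (the shapes and variable names are just for illustration, not from the talk): the two random initializer nodes are structurally identical, but because the random op is stateful they must not be deduplicated by common subexpression elimination.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

graph = tf.Graph()
with graph.as_default():
    # Two layers with the same shape, each initialized from its own random op.
    # RandomUniform is marked stateful, so CSE must not merge the two
    # initializer nodes; if it did, both layers would start training from
    # identical weights.
    w1 = tf.Variable(tf.random_uniform([256, 256]), name="layer1/kernel")
    w2 = tf.Variable(tf.random_uniform([256, 256]), name="layer2/kernel")
    init = tf.global_variables_initializer()

with tf.Session(graph=graph) as sess:
    sess.run(init)
    v1, v2 = sess.run([w1, w2])
    print("identical initializations?", (v1 == v2).all())  # expected: False
```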
But statefulness also means some things that are not very obvious at all. For example, the op kernel instances that the runtime uses to represent the computation to run are reused across sessions for stateful ops that have the same name. And there is also a somewhat long tail of obscure behavior changes; parallel_for, for instance, behaves slightly differently for stateful operations. People are known to set the stateful bit for any one of these reasons, and more.

The other way of representing state in TF, which we're trying to get rid of in TF v2, is the notion of a ref tensor. Going back to the variable op, it is this thing here, where you can say that a tensor is either of a dtype or of a ref of that dtype. The reason why we did that is that it's very convenient, in many cases, to be able to keep information in the runtime that persists across calls to session.run. Specifically, the variables: if you had to write your code like this, where on every session.run you'd feed your variables in and then fetch them back out, and you were doing some kind of distributed training, you would have so many network round trips and so much extra latency that it would be completely impractical. So the idea of the variable op, which is the thing that motivated the ref tensor, is that it's like a constant, but mutable.
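As a rough illustration of the pattern being described (hypothetical shapes and names, not the actual slide code), the first loop below round-trips the full parameter tensor through the client on every step, while the variable-based version keeps the parameters inside the runtime between calls to session.run.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Impractical pattern: the client feeds the parameters in and fetches them
# back out on every session.run call, so the full parameter tensor crosses
# the wire twice per step in a distributed setting.
w_in = tf.placeholder(tf.float32, [1000, 1000])
step = w_in - 0.01 * tf.random_uniform([1000, 1000])  # stand-in for a gradient step

# Variable pattern: the parameters live inside the runtime across calls to
# session.run, so only what you explicitly fetch crosses the wire.
w_var = tf.Variable(tf.zeros([1000, 1000]))
train = tf.assign_sub(w_var, 0.01 * tf.random_uniform([1000, 1000]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    w = np.zeros((1000, 1000), np.float32)
    for _ in range(3):
        w = sess.run(step, feed_dict={w_in: w})  # feed/fetch round trip each step

    for _ in range(3):
        sess.run(train)  # state stays inside the runtime
```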
And if you dig into the runtime, you'll find this piece of code, which I think is the most concise representation I could find of how we represent the distinction between a ref tensor and an ordinary tensor. This is what the input to an op kernel looks like. It's essentially a manually implemented one-of: either a manually constructed tensor-- it's manually constructed just so we don't initialize it in case we're not going to need it-- or a pair of a pointer to a tensor and a pointer to a mutex. And if you've ever programmed in C++, you should be terrified right now, because you see a pointer, and you see no comment about who owns this pointer and what its lifetime is. A good third of the issues with ref variables come from the fact that it's been impossible, or very hard, to retrofit into the system a coherent, memory-safe notion of ownership for this pointer.

But that's not all. The way ref variables work is that you have a graph that looks like this. You have this variable node whose output is a tensor that can change, and you can feed it to an operation that mutates it, like assign, or to an operation that does not mutate it, like identity. If you feed it to an operation that does not mutate it, like identity, the TensorFlow runtime will silently cast that Tensor* to a Tensor-- that is, make another tensor object that aliases the buffer pointed to by that tensor-- and just keep going.

So the reason why I like this graph is that it's short. It's simple. If you look at every single gradient update that we use for training, it kind of looks like this. But it's also kind of tricky. So we have, I don't know, 20 or 30 people in the room now. Can I get a show of hands on who thinks that the result of the print is the value after the assign? No one.

AUDIENCE: What do you mean? The print?
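For reference, here is a minimal TF v1-style sketch of the kind of graph being discussed (the values and names are illustrative): nothing in the graph orders the identity read relative to the assign, which is exactly why the show-of-hands question has no safe answer.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

v = tf.Variable(1.0)              # ref variable, TF v1 style
assign = tf.assign_add(v, 1.0)    # mutates the variable's buffer in place
read = tf.identity(v)             # silently converts the ref to a plain tensor
read = tf.Print(read, [read], "value seen by identity: ")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Nothing orders `read` relative to `assign`, so the printed value may be
    # 1.0 or 2.0 depending on which op the executor happens to run first (and
    # since identity aliases the variable's buffer, the read can even race
    # with the write).
    sess.run([assign, read])
```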